* [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection
@ 2025-09-10 0:10 Mukesh Rathor
2025-09-10 0:10 ` [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function Mukesh Rathor
` (5 more replies)
0 siblings, 6 replies; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10 0:10 UTC (permalink / raw)
To: linux-hyperv, linux-kernel, linux-arch
Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
hpa, arnd
This patch series implements hypervisor core collection when running
linux as root (aka dom0). By default, the initial hypervisor ram is
already mapped into linux as reserved. Further, any ram deposited to the
hypervisor comes from the linux memory heap. The hypervisor locks all
that ram to protect it from dom0 or any other domains. At a high level,
the methodology involves devirtualizing the system on the fly upon
either a linux crash or a hypervisor crash, then collecting ram as
usual. This means hypervisor ram is automatically collected into the
vmcore. Hypervisor pages are then accessible via the crash command
(using raw mem dump) or windbg, which has the ability to read the
hypervisor pdb symbol file.
V1:
o Describe changes in imperative mood. Remove "This commit"
o Remove pr_emerg: causing unnecessary review noise
o Add missing kexec_crash_loaded()
o Remove leftover unnecessary memcpy in hv_crash_setup_trampdata
o Address objtool warnings via annotations
Mukesh Rathor (6):
x86/hyperv: Rename guest crash shutdown function
hyperv: Add two new hypercall numbers to guest ABI public header
hyperv: Add definitions for hypervisor crash dump support
x86/hyperv: Add trampoline asm code to transition from hypervisor
x86/hyperv: Implement hypervisor ram collection into vmcore
x86/hyperv: Enable build of hypervisor crashdump collection files
arch/x86/hyperv/Makefile | 6 +
arch/x86/hyperv/hv_crash.c | 622 ++++++++++++++++++++++++++++++++
arch/x86/hyperv/hv_init.c | 1 +
arch/x86/hyperv/hv_trampoline.S | 105 ++++++
arch/x86/kernel/cpu/mshyperv.c | 5 +-
include/asm-generic/mshyperv.h | 9 +
include/hyperv/hvgdk_mini.h | 2 +
include/hyperv/hvhdk_mini.h | 55 +++
8 files changed, 803 insertions(+), 2 deletions(-)
create mode 100644 arch/x86/hyperv/hv_crash.c
create mode 100644 arch/x86/hyperv/hv_trampoline.S
--
2.36.1.vfs.0.0
^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 2/6] hyperv: Add two new hypercall numbers to guest ABI public header Mukesh Rathor
  ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

Rename hv_machine_crash_shutdown to the more appropriate
hv_guest_crash_shutdown and make it applicable to guests only. This is
in preparation for the subsequent hypervisor root/dom0 crash support
patches.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 25773af116bc..1c6ec9b6107f 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -219,7 +219,7 @@ static void hv_machine_shutdown(void)
 #endif /* CONFIG_KEXEC_CORE */
 
 #ifdef CONFIG_CRASH_DUMP
-static void hv_machine_crash_shutdown(struct pt_regs *regs)
+static void hv_guest_crash_shutdown(struct pt_regs *regs)
 {
 	if (hv_crash_handler)
 		hv_crash_handler(regs);
@@ -562,7 +562,8 @@ static void __init ms_hyperv_init_platform(void)
 		machine_ops.shutdown = hv_machine_shutdown;
 #endif
 #if defined(CONFIG_CRASH_DUMP)
-		machine_ops.crash_shutdown = hv_machine_crash_shutdown;
+		if (!hv_root_partition())
+			machine_ops.crash_shutdown = hv_guest_crash_shutdown;
 #endif
 #endif
 	/*
-- 
2.36.1.vfs.0.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* [PATCH v1 2/6] hyperv: Add two new hypercall numbers to guest ABI public header
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support Mukesh Rathor
  ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

In preparation for the subsequent crashdump patches, copy two hypercall
numbers to the guest ABI header published by Hyper-V: one to notify the
hypervisor of an event that occurs in the root partition, and the other
to ask the hypervisor to disable itself.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 include/hyperv/hvgdk_mini.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 1be7f6a02304..5441bf47059a 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -469,6 +469,7 @@ union hv_vp_assist_msr_contents {	/* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_MAP_DEVICE_INTERRUPT		0x007c
 #define HVCALL_UNMAP_DEVICE_INTERRUPT		0x007d
 #define HVCALL_RETARGET_INTERRUPT		0x007e
+#define HVCALL_NOTIFY_PARTITION_EVENT		0x0087
 #define HVCALL_NOTIFY_PORT_RING_EMPTY		0x008b
 #define HVCALL_REGISTER_INTERCEPT_RESULT	0x0091
 #define HVCALL_ASSERT_VIRTUAL_INTERRUPT		0x0094
@@ -492,6 +493,7 @@ union hv_vp_assist_msr_contents {	/* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_GET_VP_CPUID_VALUES		0x00f4
 #define HVCALL_MMIO_READ			0x0106
 #define HVCALL_MMIO_WRITE			0x0107
+#define HVCALL_DISABLE_HYP_EX			0x010f
 
 /* HV_HYPERCALL_INPUT */
 #define HV_HYPERCALL_RESULT_MASK	GENMASK_ULL(15, 0)
-- 
2.36.1.vfs.0.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 2/6] hyperv: Add two new hypercall numbers to guest ABI public header Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-15 17:54   ` Michael Kelley
  2025-09-10  0:10 ` [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor Mukesh Rathor
  ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

Add data structures for hypervisor crash dump support to the hypervisor
host ABI header file. Details of their usages are in subsequent commits.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 include/hyperv/hvhdk_mini.h | 55 +++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 858f6a3925b3..ad9a8048fb4e 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -116,6 +116,17 @@ enum hv_system_property {
 	/* Add more values when needed */
 	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
 	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
+	HV_SYSTEM_PROPERTY_CRASHDUMPAREA = 47,
+};
+
+#define HV_PFN_RANGE_PGBITS	24	/* HV_SPA_PAGE_RANGE_ADDITIONAL_PAGES_BITS */
+union hv_pfn_range {		/* HV_SPA_PAGE_RANGE */
+	u64 as_uint64;
+	struct {
+		/* 39:0: base pfn.  63:40: additional pages */
+		u64 base_pfn : 64 - HV_PFN_RANGE_PGBITS;
+		u64 add_pfns : HV_PFN_RANGE_PGBITS;
+	} __packed;
 };
 
 enum hv_dynamic_processor_feature_property {
@@ -142,6 +153,8 @@ struct hv_output_get_system_property {
 #if IS_ENABLED(CONFIG_X86)
 		u64 hv_processor_feature_value;
 #endif
+		union hv_pfn_range hv_cda_info;	/* CrashdumpAreaAddress */
+		u64 hv_tramp_pa;		/* CrashdumpTrampolineAddress */
 	};
 } __packed;
 
@@ -234,6 +247,48 @@ union hv_gpa_page_access_state {
 	u8 as_uint8;
 } __packed;
 
+enum hv_crashdump_action {
+	HV_CRASHDUMP_NONE = 0,
+	HV_CRASHDUMP_SUSPEND_ALL_VPS,
+	HV_CRASHDUMP_PREPARE_FOR_STATE_SAVE,
+	HV_CRASHDUMP_STATE_SAVED,
+	HV_CRASHDUMP_ENTRY,
+};
+
+struct hv_partition_event_root_crashdump_input {
+	u32 crashdump_action;	/* enum hv_crashdump_action */
+} __packed;
+
+struct hv_input_disable_hyp_ex {	/* HV_X64_INPUT_DISABLE_HYPERVISOR_EX */
+	u64 rip;
+	u64 arg;
+} __packed;
+
+struct hv_crashdump_area {		/* HV_CRASHDUMP_AREA */
+	u32 version;
+	union {
+		u32 flags_as_uint32;
+		struct {
+			u32 cda_valid : 1;
+			u32 cda_unused : 31;
+		} __packed;
+	};
+	/* more unused fields */
+} __packed;
+
+union hv_partition_event_input {
+	struct hv_partition_event_root_crashdump_input crashdump_input;
+};
+
+enum hv_partition_event {
+	HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
+};
+
+struct hv_input_notify_partition_event {
+	u32 event;		/* enum hv_partition_event */
+	union hv_partition_event_input input;
+} __packed;
+
 struct hv_lp_startup_status {
 	u64 hv_status;
 	u64 substatus1;
-- 
2.36.1.vfs.0.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* RE: [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support 2025-09-10 0:10 ` [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support Mukesh Rathor @ 2025-09-15 17:54 ` Michael Kelley 2025-09-16 1:15 ` Mukesh R 0 siblings, 1 reply; 29+ messages in thread From: Michael Kelley @ 2025-09-15 17:54 UTC (permalink / raw) To: Mukesh Rathor, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM > > Add data structures for hypervisor crash dump support to the hypervisor > host ABI header file. Details of their usages are in subsequent commits. > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> > --- > include/hyperv/hvhdk_mini.h | 55 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 55 insertions(+) > > diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h > index 858f6a3925b3..ad9a8048fb4e 100644 > --- a/include/hyperv/hvhdk_mini.h > +++ b/include/hyperv/hvhdk_mini.h > @@ -116,6 +116,17 @@ enum hv_system_property { > /* Add more values when needed */ > HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15, > HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21, > + HV_SYSTEM_PROPERTY_CRASHDUMPAREA = 47, > +}; > + > +#define HV_PFN_RANGE_PGBITS 24 /* HV_SPA_PAGE_RANGE_ADDITIONAL_PAGES_BITS */ > +union hv_pfn_range { /* HV_SPA_PAGE_RANGE */ > + u64 as_uint64; > + struct { > + /* 39:0: base pfn. 
63:40: additional pages */ > + u64 base_pfn : 64 - HV_PFN_RANGE_PGBITS; > + u64 add_pfns : HV_PFN_RANGE_PGBITS; > + } __packed; > }; > > enum hv_dynamic_processor_feature_property { > @@ -142,6 +153,8 @@ struct hv_output_get_system_property { > #if IS_ENABLED(CONFIG_X86) > u64 hv_processor_feature_value; > #endif > + union hv_pfn_range hv_cda_info; /* CrashdumpAreaAddress */ > + u64 hv_tramp_pa; /* CrashdumpTrampolineAddress */ > }; > } __packed; > > @@ -234,6 +247,48 @@ union hv_gpa_page_access_state { > u8 as_uint8; > } __packed; > > +enum hv_crashdump_action { > + HV_CRASHDUMP_NONE = 0, > + HV_CRASHDUMP_SUSPEND_ALL_VPS, > + HV_CRASHDUMP_PREPARE_FOR_STATE_SAVE, > + HV_CRASHDUMP_STATE_SAVED, > + HV_CRASHDUMP_ENTRY, > +}; Nit: Since these values are part of the ABI, it's probably better to assign explicit values to each enum member in order to ward off any mistaken reordering or additions in the middle of the list. > + > +struct hv_partition_event_root_crashdump_input { > + u32 crashdump_action; /* enum hv_crashdump_action */ > +} __packed; > + > +struct hv_input_disable_hyp_ex { /* HV_X64_INPUT_DISABLE_HYPERVISOR_EX */ > + u64 rip; > + u64 arg; > +} __packed; > + > +struct hv_crashdump_area { /* HV_CRASHDUMP_AREA */ > + u32 version; > + union { > + u32 flags_as_uint32; > + struct { > + u32 cda_valid : 1; > + u32 cda_unused : 31; > + } __packed; > + }; > + /* more unused fields */ > +} __packed; > + > +union hv_partition_event_input { > + struct hv_partition_event_root_crashdump_input crashdump_input; > +}; > + > +enum hv_partition_event { > + HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2, > +}; > + > +struct hv_input_notify_partition_event { > + u32 event; /* enum hv_partition_event */ > + union hv_partition_event_input input; > +} __packed; > + > struct hv_lp_startup_status { > u64 hv_status; > u64 substatus1; > -- > 2.36.1.vfs.0.0 > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support 2025-09-15 17:54 ` Michael Kelley @ 2025-09-16 1:15 ` Mukesh R 2025-09-18 23:52 ` Michael Kelley 0 siblings, 1 reply; 29+ messages in thread From: Mukesh R @ 2025-09-16 1:15 UTC (permalink / raw) To: Michael Kelley, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de On 9/15/25 10:54, Michael Kelley wrote: > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM >> >> Add data structures for hypervisor crash dump support to the hypervisor >> host ABI header file. Details of their usages are in subsequent commits. >> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> >> --- >> include/hyperv/hvhdk_mini.h | 55 +++++++++++++++++++++++++++++++++++++ >> 1 file changed, 55 insertions(+) >> >> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h >> index 858f6a3925b3..ad9a8048fb4e 100644 >> --- a/include/hyperv/hvhdk_mini.h >> +++ b/include/hyperv/hvhdk_mini.h >> @@ -116,6 +116,17 @@ enum hv_system_property { >> /* Add more values when needed */ >> HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15, >> HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21, >> + HV_SYSTEM_PROPERTY_CRASHDUMPAREA = 47, >> +}; >> + >> +#define HV_PFN_RANGE_PGBITS 24 /* HV_SPA_PAGE_RANGE_ADDITIONAL_PAGES_BITS */ >> +union hv_pfn_range { /* HV_SPA_PAGE_RANGE */ >> + u64 as_uint64; >> + struct { >> + /* 39:0: base pfn. 
63:40: additional pages */ >> + u64 base_pfn : 64 - HV_PFN_RANGE_PGBITS; >> + u64 add_pfns : HV_PFN_RANGE_PGBITS; >> + } __packed; >> }; >> >> enum hv_dynamic_processor_feature_property { >> @@ -142,6 +153,8 @@ struct hv_output_get_system_property { >> #if IS_ENABLED(CONFIG_X86) >> u64 hv_processor_feature_value; >> #endif >> + union hv_pfn_range hv_cda_info; /* CrashdumpAreaAddress */ >> + u64 hv_tramp_pa; /* CrashdumpTrampolineAddress */ >> }; >> } __packed; >> >> @@ -234,6 +247,48 @@ union hv_gpa_page_access_state { >> u8 as_uint8; >> } __packed; >> >> +enum hv_crashdump_action { >> + HV_CRASHDUMP_NONE = 0, >> + HV_CRASHDUMP_SUSPEND_ALL_VPS, >> + HV_CRASHDUMP_PREPARE_FOR_STATE_SAVE, >> + HV_CRASHDUMP_STATE_SAVED, >> + HV_CRASHDUMP_ENTRY, >> +}; > > Nit: Since these values are part of the ABI, it's probably better > to assign explicit values to each enum member in order to > ward off any mistaken reordering or additions in the middle > of the list. No, like I have mentioned in the past, we are mirroring hyp headers with the eventual goal of just consuming from there directly. Each change in ABI header is very carefully examined, we now have a process for it. 
>> + >> +struct hv_partition_event_root_crashdump_input { >> + u32 crashdump_action; /* enum hv_crashdump_action */ >> +} __packed; >> + >> +struct hv_input_disable_hyp_ex { /* HV_X64_INPUT_DISABLE_HYPERVISOR_EX */ >> + u64 rip; >> + u64 arg; >> +} __packed; >> + >> +struct hv_crashdump_area { /* HV_CRASHDUMP_AREA */ >> + u32 version; >> + union { >> + u32 flags_as_uint32; >> + struct { >> + u32 cda_valid : 1; >> + u32 cda_unused : 31; >> + } __packed; >> + }; >> + /* more unused fields */ >> +} __packed; >> + >> +union hv_partition_event_input { >> + struct hv_partition_event_root_crashdump_input crashdump_input; >> +}; >> + >> +enum hv_partition_event { >> + HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2, >> +}; >> + >> +struct hv_input_notify_partition_event { >> + u32 event; /* enum hv_partition_event */ >> + union hv_partition_event_input input; >> +} __packed; >> + >> struct hv_lp_startup_status { >> u64 hv_status; >> u64 substatus1; >> -- >> 2.36.1.vfs.0.0 >> > ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support 2025-09-16 1:15 ` Mukesh R @ 2025-09-18 23:52 ` Michael Kelley 0 siblings, 0 replies; 29+ messages in thread From: Michael Kelley @ 2025-09-18 23:52 UTC (permalink / raw) To: Mukesh R, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de From: Mukesh R <mrathor@linux.microsoft.com> Sent: Monday, September 15, 2025 6:15 PM > > On 9/15/25 10:54, Michael Kelley wrote: > > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM > >> > >> Add data structures for hypervisor crash dump support to the hypervisor > >> host ABI header file. Details of their usages are in subsequent commits. > >> > >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> > >> --- > >> include/hyperv/hvhdk_mini.h | 55 +++++++++++++++++++++++++++++++++++++ > >> 1 file changed, 55 insertions(+) > >> > >> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h > >> index 858f6a3925b3..ad9a8048fb4e 100644 > >> --- a/include/hyperv/hvhdk_mini.h > >> +++ b/include/hyperv/hvhdk_mini.h > >> [snip] > >> +enum hv_crashdump_action { > >> + HV_CRASHDUMP_NONE = 0, > >> + HV_CRASHDUMP_SUSPEND_ALL_VPS, > >> + HV_CRASHDUMP_PREPARE_FOR_STATE_SAVE, > >> + HV_CRASHDUMP_STATE_SAVED, > >> + HV_CRASHDUMP_ENTRY, > >> +}; > > > > Nit: Since these values are part of the ABI, it's probably better > > to assign explicit values to each enum member in order to > > ward off any mistaken reordering or additions in the middle > > of the list. > > No, like I have mentioned in the past, we are mirroring hyp headers > with the eventual goal of just consuming from there directly. 
> Each change in ABI header is very carefully examined, we now have > a process for it. > Acknowledged. I keep wanting to tighten up the ABI specification, and sometimes forget that there are constraints on doing so. Michael ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
  ` (2 preceding siblings ...)
  2025-09-10  0:10 ` [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-15 17:55   ` Michael Kelley
  2025-09-10  0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
  5 siblings, 1 reply; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

Introduce a small asm stub to transition from the hypervisor to linux
upon devirtualization. At a high level, during panic of either the
hypervisor or the dom0 (aka root), the nmi handler asks hypervisor to
devirtualize. As part of that, the arguments include an entry point to
return back to linux. This asm stub implements that entry point.

The stub is entered in protected mode, uses temporary gdt and page table
to enable long mode and get to kernel entry point which then restores
full kernel context to resume execution to kexec.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)
 create mode 100644 arch/x86/hyperv/hv_trampoline.S

diff --git a/arch/x86/hyperv/hv_trampoline.S b/arch/x86/hyperv/hv_trampoline.S
new file mode 100644
index 000000000000..27a755401a42
--- /dev/null
+++ b/arch/x86/hyperv/hv_trampoline.S
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * X86 specific Hyper-V kdump/crash related code.
+ *
+ * Copyright (C) 2025, Microsoft, Inc.
+ *
+ */
+#include <linux/linkage.h>
+#include <asm/alternative.h>
+#include <asm/msr.h>
+#include <asm/processor-flags.h>
+#include <asm/nospec-branch.h>
+
+/*
+ * void noreturn hv_crash_asm32(arg1)
+ *   arg1 == edi == 32bit PA of struct hv_crash_trdata
+ *
+ * The hypervisor jumps here upon devirtualization in protected mode. This
+ * code gets copied to a page in the low 4G ie, 32bit space so it can run
+ * in the protected mode. Hence we cannot use any compile/link time offsets or
+ * addresses. It restores long mode via temporary gdt and page tables and
+ * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry.
+ *
+ * PreCondition (ie, Hypervisor call back ABI):
+ *  o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled
+ *  o CR4 is set to 0x0
+ *  o IA32_EFER is set to 0x901 (SCE and NXE are set)
+ *  o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX.
+ *  o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF
+ *  o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF
+ *  o LDTR is initialized as invalid (limit of 0)
+ *  o MSR PAT is power on default.
+ *  o Other state/registers are cleared. All TLBs flushed.
+ *
+ * See Intel SDM 10.8.5
+ */
+
+#define HV_CRASHDATA_OFFS_TRAMPCR3	0x0	/* 0 */
+#define HV_CRASHDATA_OFFS_KERNCR3	0x8	/* 8 */
+#define HV_CRASHDATA_OFFS_GDTRLIMIT	0x12	/* 18 */
+#define HV_CRASHDATA_OFFS_CS_JMPTGT	0x28	/* 40 */
+#define HV_CRASHDATA_OFFS_C_entry	0x30	/* 48 */
+#define HV_CRASHDATA_TRAMPOLINE_CS	0x8
+
+	.text
+	.code32
+
+SYM_CODE_START(hv_crash_asm32)
+	UNWIND_HINT_UNDEFINED
+	ANNOTATE_NOENDBR
+	movl	$X86_CR4_PAE, %ecx
+	movl	%ecx, %cr4
+
+	movl	%edi, %ebx
+	add	$HV_CRASHDATA_OFFS_TRAMPCR3, %ebx
+	movl	%cs:(%ebx), %eax
+	movl	%eax, %cr3
+
+	# Setup EFER for long mode now.
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	btsl	$_EFER_LME, %eax
+	wrmsr
+
+	# Turn paging on using the temp 32bit trampoline page table.
+	movl	%cr0, %eax
+	orl	$(X86_CR0_PG), %eax
+	movl	%eax, %cr0
+
+	/* since kernel cr3 could be above 4G, we need to be in the long mode
+	 * before we can load 64bits of the kernel cr3. We use a temp gdt for
+	 * that with CS.L=1 and CS.D=0 */
+	mov	%edi, %eax
+	add	$HV_CRASHDATA_OFFS_GDTRLIMIT, %eax
+	lgdtl	%cs:(%eax)
+
+	/* not done yet, restore CS now to switch to CS.L=1 */
+	mov	%edi, %eax
+	add	$HV_CRASHDATA_OFFS_CS_JMPTGT, %eax
+	ljmp	%cs:*(%eax)
+SYM_CODE_END(hv_crash_asm32)
+
+	/* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */
+	.code64
+	.balign 8
+SYM_CODE_START(hv_crash_asm64)
+	UNWIND_HINT_UNDEFINED
+	ANNOTATE_NOENDBR
+SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL)
+	/* restore kernel page tables so we can jump to kernel code */
+	mov	%edi, %eax
+	add	$HV_CRASHDATA_OFFS_KERNCR3, %eax
+	movq	%cs:(%eax), %rbx
+	movq	%rbx, %cr3
+
+	mov	%edi, %eax
+	add	$HV_CRASHDATA_OFFS_C_entry, %eax
+	movq	%cs:(%eax), %rbx
+	ANNOTATE_RETPOLINE_SAFE
+	jmp	*%rbx
+
+	int	$3
+
+SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL)
+SYM_CODE_END(hv_crash_asm64)
-- 
2.36.1.vfs.0.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* RE: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor 2025-09-10 0:10 ` [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor Mukesh Rathor @ 2025-09-15 17:55 ` Michael Kelley 2025-09-16 21:30 ` Mukesh R 0 siblings, 1 reply; 29+ messages in thread From: Michael Kelley @ 2025-09-15 17:55 UTC (permalink / raw) To: Mukesh Rathor, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM > > Introduce a small asm stub to transition from the hypervisor to linux I'd argue for capitalizing "Linux" here and in other places in commit text and code comments throughout this patch set. > upon devirtualization. In this patch and subsequent patches, you've used the phrase "upon devirtualization", which seems a little vague to me. Does this mean "when devirtualization is complete" or perhaps "when the hypervisor completes devirtualization"? Since there's no spec on any of this, being as precise as possible will help future readers. > > At a high level, during panic of either the hypervisor or the dom0 (aka > root), the nmi handler asks hypervisor to devirtualize. Suggest: At a high level, during panic of either the hypervisor or Linux running in dom0 (a.k.a. the root partition), the Linux NMI handler asks the hypervisor to devirtualize. > As part of that, > the arguments include an entry point to return back to linux. This asm > stub implements that entry point. > > The stub is entered in protected mode, uses temporary gdt and page table > to enable long mode and get to kernel entry point which then restores full > kernel context to resume execution to kexec. 
> > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> > --- > arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++ > 1 file changed, 105 insertions(+) > create mode 100644 arch/x86/hyperv/hv_trampoline.S > > diff --git a/arch/x86/hyperv/hv_trampoline.S b/arch/x86/hyperv/hv_trampoline.S > new file mode 100644 > index 000000000000..27a755401a42 > --- /dev/null > +++ b/arch/x86/hyperv/hv_trampoline.S > @@ -0,0 +1,105 @@ > +/* SPDX-License-Identifier: GPL-2.0-only */ > +/* > + * X86 specific Hyper-V kdump/crash related code. Add a qualification that this is for root partition only, and not for general guests? > + * > + * Copyright (C) 2025, Microsoft, Inc. > + * > + */ > +#include <linux/linkage.h> > +#include <asm/alternative.h> > +#include <asm/msr.h> > +#include <asm/processor-flags.h> > +#include <asm/nospec-branch.h> > + > +/* > + * void noreturn hv_crash_asm32(arg1) > + * arg1 == edi == 32bit PA of struct hv_crash_trdata I think this is "struct hv_crash_tramp_data". > + * > + * The hypervisor jumps here upon devirtualization in protected mode. This > + * code gets copied to a page in the low 4G ie, 32bit space so it can run > + * in the protected mode. Hence we cannot use any compile/link time offsets or > + * addresses. It restores long mode via temporary gdt and page tables and > + * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry. > + * > + * PreCondition (ie, Hypervisor call back ABI): > + * o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled > + * o CR4 is set to 0x0 > + * o IA32_EFER is set to 0x901 (SCE and NXE are set) > + * o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX. > + * o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF > + * o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF > + * o LDTR is initialized as invalid (limit of 0) > + * o MSR PAT is power on default. > + * o Other state/registers are cleared. All TLBs flushed. 
Clarification about "Other state/registers are cleared": What about processor features that Linux may have enabled or disabled during its initial boot? Are those still in the states Linux set? Or are they reset to power-on defaults? For example, if Linux enabled x2apic, is x2apic still enabled when the stub is entered? > + * > + * See Intel SDM 10.8.5 Hmmm. I downloaded the latest combined SDM, and section 10.8.5 in Volume 3A is about Microcode Update Resources, which doesn't seem applicable here. Other volumes don't have a section 10.8.5. > + */ > + > +#define HV_CRASHDATA_OFFS_TRAMPCR3 0x0 /* 0 */ > +#define HV_CRASHDATA_OFFS_KERNCR3 0x8 /* 8 */ > +#define HV_CRASHDATA_OFFS_GDTRLIMIT 0x12 /* 18 */ > +#define HV_CRASHDATA_OFFS_CS_JMPTGT 0x28 /* 40 */ > +#define HV_CRASHDATA_OFFS_C_entry 0x30 /* 48 */ It seems like these offsets should go in a #include file along with the definition of struct hv_crash_tramp_data. Then the BUILD_BUG_ON() calls in hv_crash_setup_trampdata() could check against these symbolic names instead of hardcoding numbers that must match these. > +#define HV_CRASHDATA_TRAMPOLINE_CS 0x8 This #define isn't used anywhere. > + > + .text > + .code32 > + > +SYM_CODE_START(hv_crash_asm32) > + UNWIND_HINT_UNDEFINED > + ANNOTATE_NOENDBR No ENDBR here, presumably because this function is entered via other than an indirect CALL or JMP instruction. Right? > + movl $X86_CR4_PAE, %ecx > + movl %ecx, %cr4 > + > + movl %edi, %ebx > + add $HV_CRASHDATA_OFFS_TRAMPCR3, %ebx > + movl %cs:(%ebx), %eax > + movl %eax, %cr3 > + > + # Setup EFER for long mode now. > + movl $MSR_EFER, %ecx > + rdmsr > + btsl $_EFER_LME, %eax > + wrmsr > + > + # Turn paging on using the temp 32bit trampoline page table. > + movl %cr0, %eax > + orl $(X86_CR0_PG), %eax > + movl %eax, %cr0 > + > + /* since kernel cr3 could be above 4G, we need to be in the long mode > + * before we can load 64bits of the kernel cr3. 
We use a temp gdt for > + * that with CS.L=1 and CS.D=0 */ > + mov %edi, %eax > + add $HV_CRASHDATA_OFFS_GDTRLIMIT, %eax > + lgdtl %cs:(%eax) > + > + /* not done yet, restore CS now to switch to CS.L=1 */ > + mov %edi, %eax > + add $HV_CRASHDATA_OFFS_CS_JMPTGT, %eax > + ljmp %cs:*(%eax) > +SYM_CODE_END(hv_crash_asm32) > + > + /* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */ > + .code64 > + .balign 8 > +SYM_CODE_START(hv_crash_asm64) > + UNWIND_HINT_UNDEFINED > + ANNOTATE_NOENDBR But this *is* entered via an indirect JMP, right? So back to my earlier question about the state of processor feature enablement. If Linux enabled IBT, is it still enabled after devirtualization and the hypervisor invokes this entry point? Linux guests on Hyper-V have historically not enabled IBT, but patches that enable it are now in linux-next, and will go into the 6.18 kernel. So maybe this needs an ENDBR64. > +SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL) > + /* restore kernel page tables so we can jump to kernel code */ > + mov %edi, %eax > + add $HV_CRASHDATA_OFFS_KERNCR3, %eax > + movq %cs:(%eax), %rbx > + movq %rbx, %cr3 > + > + mov %edi, %eax > + add $HV_CRASHDATA_OFFS_C_entry, %eax > + movq %cs:(%eax), %rbx > + ANNOTATE_RETPOLINE_SAFE > + jmp *%rbx > + > + int $3 > + > +SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL) > +SYM_CODE_END(hv_crash_asm64) > -- > 2.36.1.vfs.0.0 > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor 2025-09-15 17:55 ` Michael Kelley @ 2025-09-16 21:30 ` Mukesh R 2025-09-18 23:52 ` Michael Kelley 0 siblings, 1 reply; 29+ messages in thread From: Mukesh R @ 2025-09-16 21:30 UTC (permalink / raw) To: Michael Kelley, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de On 9/15/25 10:55, Michael Kelley wrote: > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM >> >> Introduce a small asm stub to transition from the hypervisor to linux > > I'd argue for capitalizing "Linux" here and in other places in commit > text and code comments throughout this patch set. I'd argue against it. A quick grep indicates it is a common practice, and in the code world goes easy on the eyes :). >> upon devirtualization. > > In this patch and subsequent patches, you've used the phrase "upon > devirtualization", which seems a little vague to me. Does this mean > "when devirtualization is complete" or perhaps "when the hypervisor > completes devirtualization"? Since there's no spec on any of this, > being as precise as possible will help future readers. since control comes back to linux at the callback here, i fail to understand what is vague about it. when hyp completes devirt, devirt is complete. >> >> At a high level, during panic of either the hypervisor or the dom0 (aka >> root), the nmi handler asks hypervisor to devirtualize. > > Suggest: > > At a high level, during panic of either the hypervisor or Linux running > in dom0 (a.k.a. the root partition), the Linux NMI handler asks the > hypervisor to devirtualize. > >> As part of that, >> the arguments include an entry point to return back to linux. 
This asm >> stub implements that entry point. >> >> The stub is entered in protected mode, uses temporary gdt and page table >> to enable long mode and get to kernel entry point which then restores full >> kernel context to resume execution to kexec. >> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> >> --- >> arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++ >> 1 file changed, 105 insertions(+) >> create mode 100644 arch/x86/hyperv/hv_trampoline.S >> >> diff --git a/arch/x86/hyperv/hv_trampoline.S b/arch/x86/hyperv/hv_trampoline.S >> new file mode 100644 >> index 000000000000..27a755401a42 >> --- /dev/null >> +++ b/arch/x86/hyperv/hv_trampoline.S >> @@ -0,0 +1,105 @@ >> +/* SPDX-License-Identifier: GPL-2.0-only */ >> +/* >> + * X86 specific Hyper-V kdump/crash related code. > > Add a qualification that this is for root partition only, and not for > general guests? i don't think it is needed, it would be odd for guests to collect hyp core. besides makefile/kconfig shows this is root vm only >> + * >> + * Copyright (C) 2025, Microsoft, Inc. >> + * >> + */ >> +#include <linux/linkage.h> >> +#include <asm/alternative.h> >> +#include <asm/msr.h> >> +#include <asm/processor-flags.h> >> +#include <asm/nospec-branch.h> >> + >> +/* >> + * void noreturn hv_crash_asm32(arg1) >> + * arg1 == edi == 32bit PA of struct hv_crash_trdata > > I think this is "struct hv_crash_tramp_data". correct >> + * >> + * The hypervisor jumps here upon devirtualization in protected mode. This >> + * code gets copied to a page in the low 4G ie, 32bit space so it can run >> + * in the protected mode. Hence we cannot use any compile/link time offsets or >> + * addresses. It restores long mode via temporary gdt and page tables and >> + * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry. 
>> + * >> + * PreCondition (ie, Hypervisor call back ABI): >> + * o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled >> + * o CR4 is set to 0x0 >> + * o IA32_EFER is set to 0x901 (SCE and NXE are set) >> + * o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX. >> + * o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF >> + * o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF >> + * o LDTR is initialized as invalid (limit of 0) >> + * o MSR PAT is power on default. >> + * o Other state/registers are cleared. All TLBs flushed. > > Clarification about "Other state/registers are cleared": What about > processor features that Linux may have enabled or disabled during its > initial boot? Are those still in the states Linux set? Or are they reset to > power-on defaults? For example, if Linux enabled x2apic, is x2apic > still enabled when the stub is entered? correct, if linux set x2apic, x2apic would still be enabled. >> + * >> + * See Intel SDM 10.8.5 > > Hmmm. I downloaded the latest combined SDM, and section 10.8.5 > in Volume 3A is about Microcode Update Resources, which doesn't > seem applicable here. Other volumes don't have a section 10.8.5. google ai found it right away upon searching: intel sdm 10.8.5 ia-32e >> + */ >> + >> +#define HV_CRASHDATA_OFFS_TRAMPCR3 0x0 /* 0 */ >> +#define HV_CRASHDATA_OFFS_KERNCR3 0x8 /* 8 */ >> +#define HV_CRASHDATA_OFFS_GDTRLIMIT 0x12 /* 18 */ >> +#define HV_CRASHDATA_OFFS_CS_JMPTGT 0x28 /* 40 */ >> +#define HV_CRASHDATA_OFFS_C_entry 0x30 /* 48 */ > > It seems like these offsets should go in a #include file along > with the definition of struct hv_crash_tramp_data. Then the > BUILD_BUG_ON() calls in hv_crash_setup_trampdata() could > check against these symbolic names instead of hardcoding > numbers that must match these. yeah, that works too and was the first cut. 
but given the small number of these, and that they are not used/needed anywhere else, and that they will almost never change, creating another tiny header in a non-driver directory didn't seem worth it.. but i could go either way. >> +#define HV_CRASHDATA_TRAMPOLINE_CS 0x8 > > This #define isn't used anywhere. removed >> + >> + .text >> + .code32 >> + >> +SYM_CODE_START(hv_crash_asm32) >> + UNWIND_HINT_UNDEFINED >> + ANNOTATE_NOENDBR > > No ENDBR here, presumably because this function is entered via other > than an indirect CALL or JMP instruction. Right? > >> + movl $X86_CR4_PAE, %ecx >> + movl %ecx, %cr4 >> + >> + movl %edi, %ebx >> + add $HV_CRASHDATA_OFFS_TRAMPCR3, %ebx >> + movl %cs:(%ebx), %eax >> + movl %eax, %cr3 >> + >> + # Setup EFER for long mode now. >> + movl $MSR_EFER, %ecx >> + rdmsr >> + btsl $_EFER_LME, %eax >> + wrmsr >> + >> + # Turn paging on using the temp 32bit trampoline page table. >> + movl %cr0, %eax >> + orl $(X86_CR0_PG), %eax >> + movl %eax, %cr0 >> + >> + /* since kernel cr3 could be above 4G, we need to be in the long mode >> + * before we can load 64bits of the kernel cr3. We use a temp gdt for >> + * that with CS.L=1 and CS.D=0 */ >> + mov %edi, %eax >> + add $HV_CRASHDATA_OFFS_GDTRLIMIT, %eax >> + lgdtl %cs:(%eax) >> + >> + /* not done yet, restore CS now to switch to CS.L=1 */ >> + mov %edi, %eax >> + add $HV_CRASHDATA_OFFS_CS_JMPTGT, %eax >> + ljmp %cs:*(%eax) >> +SYM_CODE_END(hv_crash_asm32) >> + >> + /* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */ >> + .code64 >> + .balign 8 >> +SYM_CODE_START(hv_crash_asm64) >> + UNWIND_HINT_UNDEFINED >> + ANNOTATE_NOENDBR > > But this *is* entered via an indirect JMP, right? So back to my > earlier question about the state of processor feature enablement. > If Linux enabled IBT, is it still enabled after devirtualization and > the hypervisor invokes this entry point? 
Linux guests on Hyper-V > have historically not enabled IBT, but patches that enable it are > now in linux-next, and will go into the 6.18 kernel. So maybe > this needs an ENDBR64. IBT would be disabled in the transition here.... so doesn't really matter. ENDBR ok too.. >> +SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL) >> + /* restore kernel page tables so we can jump to kernel code */ >> + mov %edi, %eax >> + add $HV_CRASHDATA_OFFS_KERNCR3, %eax >> + movq %cs:(%eax), %rbx >> + movq %rbx, %cr3 >> + >> + mov %edi, %eax >> + add $HV_CRASHDATA_OFFS_C_entry, %eax >> + movq %cs:(%eax), %rbx >> + ANNOTATE_RETPOLINE_SAFE >> + jmp *%rbx >> + >> + int $3 >> + >> +SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL) >> +SYM_CODE_END(hv_crash_asm64) >> -- >> 2.36.1.vfs.0.0 >> > ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor 2025-09-16 21:30 ` Mukesh R @ 2025-09-18 23:52 ` Michael Kelley 2025-09-19 9:06 ` Borislav Petkov 0 siblings, 1 reply; 29+ messages in thread From: Michael Kelley @ 2025-09-18 23:52 UTC (permalink / raw) To: Mukesh R, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 2:31 PM > > On 9/15/25 10:55, Michael Kelley wrote: > > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM > >> > >> Introduce a small asm stub to transition from the hypervisor to linux > > > > I'd argue for capitalizing "Linux" here and in other places in commit > > text and code comments throughout this patch set. > > I'd argue against it. A quick grep indicates it is a common practice, > and in the code world goes easy on the eyes :). I'll offer a final comment on this topic, and then let it be. There's a history of Greg K-H, Marc Zyngier, Boris Petkov, Sean Christopherson, and other maintainers giving comments to use the capitalized form of "Linux", "MSR", "RAM", etc. See: https://lore.kernel.org/lkml/Y+4WHGNdWTZ5Hc6Y@kroah.com/ https://lore.kernel.org/lkml/86o7u0dqzj.wl-maz@kernel.org/ https://lore.kernel.org/lkml/408e68d0-1ae1-6d56-d008-61de14214326@linaro.org/ https://lore.kernel.org/lkml/20250819215304.GMaKTyQBWi6YzqZ0bW@fat_crate.local/ https://lore.kernel.org/lkml/Y0CAHch5UR2Lp0tU@google.com/ https://lore.kernel.org/lkml/20240126214336.GA453589@bhelgaas/ https://lore.kernel.org/lkml/20161117155543.vg3domfqm3bhp4f7@pd.tnic/ > > >> upon devirtualization. 
> > > > In this patch and subsequent patches, you've used the phrase "upon > > devirtualization", which seems a little vague to me. Does this mean > > "when devirtualization is complete" or perhaps "when the hypervisor > > completes devirtualization"? Since there's no spec on any of this, > > being as precise as possible will help future readers. > > since control comes back to linux at the callback here, i fail to > understand what is vague about it. when hyp completes devirt, > devirt is complete. To me, the word "upon" is less precise than just "after". In temporal contexts, "upon" might mean "at the same time as" or it might mean "immediately after". I wrote this comment as I was trying to figure out how the entire devirtualization process works. Eventually it became clear and the ambiguity was resolved, but initially I was uncertain. See some broader thoughts in my reply on Patch 5 of the series. > > >> > >> At a high level, during panic of either the hypervisor or the dom0 (aka > >> root), the nmi handler asks hypervisor to devirtualize. > > > > Suggest: > > > > At a high level, during panic of either the hypervisor or Linux running > > in dom0 (a.k.a. the root partition), the Linux NMI handler asks the > > hypervisor to devirtualize. > > > >> As part of that, > >> the arguments include an entry point to return back to linux. This asm > >> stub implements that entry point. > >> > >> The stub is entered in protected mode, uses temporary gdt and page table > >> to enable long mode and get to kernel entry point which then restores full > >> kernel context to resume execution to kexec. 
> >> > >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> > >> --- > >> arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++ > >> 1 file changed, 105 insertions(+) > >> create mode 100644 arch/x86/hyperv/hv_trampoline.S > >> > >> diff --git a/arch/x86/hyperv/hv_trampoline.S b/arch/x86/hyperv/hv_trampoline.S > >> new file mode 100644 > >> index 000000000000..27a755401a42 > >> --- /dev/null > >> +++ b/arch/x86/hyperv/hv_trampoline.S > >> @@ -0,0 +1,105 @@ > >> +/* SPDX-License-Identifier: GPL-2.0-only */ > >> +/* > >> + * X86 specific Hyper-V kdump/crash related code. > > > > Add a qualification that this is for root partition only, and not for > > general guests? > > i don't think it is needed, it would be odd for guests to collect hyp > core. besides makefile/kconfig shows this is root vm only > > >> + * > >> + * Copyright (C) 2025, Microsoft, Inc. > >> + * > >> + */ > >> +#include <linux/linkage.h> > >> +#include <asm/alternative.h> > >> +#include <asm/msr.h> > >> +#include <asm/processor-flags.h> > >> +#include <asm/nospec-branch.h> > >> + > >> +/* > >> + * void noreturn hv_crash_asm32(arg1) > >> + * arg1 == edi == 32bit PA of struct hv_crash_trdata > > > > I think this is "struct hv_crash_tramp_data". > > correct > > >> + * > >> + * The hypervisor jumps here upon devirtualization in protected mode. This > >> + * code gets copied to a page in the low 4G ie, 32bit space so it can run > >> + * in the protected mode. Hence we cannot use any compile/link time offsets or > >> + * addresses. It restores long mode via temporary gdt and page tables and > >> + * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry. > >> + * > >> + * PreCondition (ie, Hypervisor call back ABI): > >> + * o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled > >> + * o CR4 is set to 0x0 > >> + * o IA32_EFER is set to 0x901 (SCE and NXE are set) > >> + * o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX. 
> >> + * o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF > >> + * o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF > >> + * o LDTR is initialized as invalid (limit of 0) > >> + * o MSR PAT is power on default. > >> + * o Other state/registers are cleared. All TLBs flushed. > > > > Clarification about "Other state/registers are cleared": What about > > processor features that Linux may have enabled or disabled during its > > initial boot? Are those still in the states Linux set? Or are they reset to > > power-on defaults? For example, if Linux enabled x2apic, is x2apic > > still enabled when the stub is entered? > > correct, if linux set x2apic, x2apic would still be enabled. > > >> + * > >> + * See Intel SDM 10.8.5 > > > > Hmmm. I downloaded the latest combined SDM, and section 10.8.5 > > in Volume 3A is about Microcode Update Resources, which doesn't > > seem applicable here. Other volumes don't have a section 10.8.5. > > google ai found it right away upon searching: intel sdm 10.8.5 ia-32e Unfortunately, Intel doesn't necessarily maintain the section numbering across revisions of the SDM. This web page: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html has a link to download the "Combined Volume Set", and currently provides the version dated June 2025. The section "Initializing IA-32e Mode" is numbered 11.8.5. The December 2024 version has the same 11.8.5 numbering. Are you finding an older version? Presumably the section title is less likely to change unless Intel does a major rewrite. 
So something like this would be more durable: * See Intel SDM Volume 3A section "Initializing IA-32e Mode" (numbered 11.8.5 in the June 2025 version) > > >> + */ > >> + > >> +#define HV_CRASHDATA_OFFS_TRAMPCR3 0x0 /* 0 */ > >> +#define HV_CRASHDATA_OFFS_KERNCR3 0x8 /* 8 */ > >> +#define HV_CRASHDATA_OFFS_GDTRLIMIT 0x12 /* 18 */ > >> +#define HV_CRASHDATA_OFFS_CS_JMPTGT 0x28 /* 40 */ > >> +#define HV_CRASHDATA_OFFS_C_entry 0x30 /* 48 */ > > > > It seems like these offsets should go in a #include file along > > with the definition of struct hv_crash_tramp_data. Then the > > BUILD_BUG_ON() calls in hv_crash_setup_trampdata() could > > check against these symbolic names instead of hardcoding > > numbers that must match these. > > yeah, that works too and was the first cut. but given the small > number of these, and that they are not used/needed anywhere else, > and that they will almost never change, creating another tiny header > in a non-driver directory didn't seem worth it.. but i could go > either way. > > >> +#define HV_CRASHDATA_TRAMPOLINE_CS 0x8 > > > > This #define isn't used anywhere. > > removed > > >> + > >> + .text > >> + .code32 > >> + > >> +SYM_CODE_START(hv_crash_asm32) > >> + UNWIND_HINT_UNDEFINED > >> + ANNOTATE_NOENDBR > > > > No ENDBR here, presumably because this function is entered via other > > than an indirect CALL or JMP instruction. Right? > > > >> + movl $X86_CR4_PAE, %ecx > >> + movl %ecx, %cr4 > >> + > >> + movl %edi, %ebx > >> + add $HV_CRASHDATA_OFFS_TRAMPCR3, %ebx > >> + movl %cs:(%ebx), %eax > >> + movl %eax, %cr3 > >> + > >> + # Setup EFER for long mode now. > >> + movl $MSR_EFER, %ecx > >> + rdmsr > >> + btsl $_EFER_LME, %eax > >> + wrmsr > >> + > >> + # Turn paging on using the temp 32bit trampoline page table. > >> + movl %cr0, %eax > >> + orl $(X86_CR0_PG), %eax > >> + movl %eax, %cr0 > >> + > >> + /* since kernel cr3 could be above 4G, we need to be in the long mode > >> + * before we can load 64bits of the kernel cr3. 
We use a temp gdt for > >> + * that with CS.L=1 and CS.D=0 */ > >> + mov %edi, %eax > >> + add $HV_CRASHDATA_OFFS_GDTRLIMIT, %eax > >> + lgdtl %cs:(%eax) > >> + > >> + /* not done yet, restore CS now to switch to CS.L=1 */ > >> + mov %edi, %eax > >> + add $HV_CRASHDATA_OFFS_CS_JMPTGT, %eax > >> + ljmp %cs:*(%eax) > >> +SYM_CODE_END(hv_crash_asm32) > >> + > >> + /* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */ > >> + .code64 > >> + .balign 8 > >> +SYM_CODE_START(hv_crash_asm64) > >> + UNWIND_HINT_UNDEFINED > >> + ANNOTATE_NOENDBR > > > > But this *is* entered via an indirect JMP, right? So back to my > > earlier question about the state of processor feature enablement. > > If Linux enabled IBT, is it still enabled after devirtualization and > > the hypervisor invokes this entry point? Linux guests on Hyper-V > > have historically not enabled IBT, but patches that enable it are > > now in linux-next, and will go into the 6.18 kernel. So maybe > > this needs an ENDBR64. > > IBT would be disabled in the transition here.... so doesn't really > matter. ENDBR ok too.. So does Hyper-V explicitly disable IBT before making the callback? Or is the IBT disabling somehow a processor side effect of going back to protected mode? I don't see anything in the SDM about the latter. Not having a Hyper-V spec for all this is frustrating ... Doing the ENDBR64 here might be safer in the long run in case we ever do end up here with IBT enabled. 
> > >> +SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL) > >> + /* restore kernel page tables so we can jump to kernel code */ > >> + mov %edi, %eax > >> + add $HV_CRASHDATA_OFFS_KERNCR3, %eax > >> + movq %cs:(%eax), %rbx > >> + movq %rbx, %cr3 > >> + > >> + mov %edi, %eax > >> + add $HV_CRASHDATA_OFFS_C_entry, %eax > >> + movq %cs:(%eax), %rbx > >> + ANNOTATE_RETPOLINE_SAFE > >> + jmp *%rbx > >> + > >> + int $3 > >> + > >> +SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL) > >> +SYM_CODE_END(hv_crash_asm64) > >> -- > >> 2.36.1.vfs.0.0 > >> > > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor 2025-09-18 23:52 ` Michael Kelley @ 2025-09-19 9:06 ` Borislav Petkov 2025-09-19 19:09 ` Mukesh R 0 siblings, 1 reply; 29+ messages in thread From: Borislav Petkov @ 2025-09-19 9:06 UTC (permalink / raw) To: Michael Kelley, Mukesh R Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de On Thu, Sep 18, 2025 at 11:52:35PM +0000, Michael Kelley wrote: > From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 2:31 PM > > > > On 9/15/25 10:55, Michael Kelley wrote: > > > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM > > >> > > >> Introduce a small asm stub to transition from the hypervisor to linux > > > > > > I'd argue for capitalizing "Linux" here and in other places in commit > > > text and code comments throughout this patch set. > > > > I'd argue against it. A quick grep indicates it is a common practice, > > and in the code world goes easy on the eyes :). But not in commit messages. Commit messages should be maximally readable and things should start in capital letters if that is their common spelling. When it comes to "Linux", yeah, that's so widespread so you have both. If I'm referring to what Linux does as a policy or in general or so on, I'd spell it capitalized but I don't think we've enforced that too strictly... > I'll offer a final comment on this topic, and then let it be. There's > a history of Greg K-H, Marc Zyngier, Boris Petkov, Sean Christopherson, > and other maintainers giving comments to use the capitalized form > of "Linux", "MSR", "RAM", etc. See: MSR, RAM and other abbreviations are capitalized and that's the only correct way to spell them. 
> > >> upon devirtualization. What is "devirtualization"? > > since control comes back to linux at the callback here, i fail to > > understand what is vague about it. when hyp completes devirt, > > devirt is complete. This "speak" is what gets on my nerves. You're writing here as if everyone is in your head and everyone knows what "hyp" and "devirt" is. Commit messages are not code and they should be maximally readable and accessible to the widest audience, not only to the three people who develop the feature. If this patch were aimed at the things I maintain, it'll need a serious commit message scrubbing and sanitizing first. HTH. -- Regards/Gruss, Boris. https://people.kernel.org/tglx/notes-about-netiquette ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor 2025-09-19 9:06 ` Borislav Petkov @ 2025-09-19 19:09 ` Mukesh R 0 siblings, 0 replies; 29+ messages in thread From: Mukesh R @ 2025-09-19 19:09 UTC (permalink / raw) To: Borislav Petkov, Michael Kelley Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org, kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de On 9/19/25 02:06, Borislav Petkov wrote: > On Thu, Sep 18, 2025 at 11:52:35PM +0000, Michael Kelley wrote: >> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 2:31 PM >>> >>> On 9/15/25 10:55, Michael Kelley wrote: >>>> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM >>>>> >>>>> Introduce a small asm stub to transition from the hypervisor to linux >>>> >>>> I'd argue for capitalizing "Linux" here and in other places in commit >>>> text and code comments throughout this patch set. >>> >>> I'd argue against it. A quick grep indicates it is a common practice, >>> and in the code world goes easy on the eyes :). > > But not in commit messages. > > Commit messages should be maximally readable and things should start in > capital letters if that is their common spelling. > > When it comes to "Linux", yeah, that's so widespread so you have both. If I'm > referring to what Linux does as a policy or in general or so on, I'd spell it > capitalized but I don't think we've enforced that too strictly... > >> I'll offer a final comment on this topic, and then let it be. There's >> a history of Greg K-H, Marc Zyngier, Boris Petkov, Sean Christopherson, >> and other maintainers giving comments to use the capitalized form >> of "Linux", "MSR", "RAM", etc. 
See: > > MSR, RAM and other abbreviations are capitalized and that's the only correct > way to spell them. > >>>>> upon devirtualization. > > What is "devirtualization"? Hypervisor is disabled, and it transfers control to the root/dom0 partition, so essentially hypervisor is gone when control comes back to root/dom0 Linux. >>> since control comes back to linux at the callback here, i fail to >>> understand what is vague about it. when hyp completes devirt, >>> devirt is complete. > > This "speak" is what gets on my nerves. You're writing here as if everyone is > in your head and everyone knows what "hyp" and "devirt" is. that's just follow-up conversation, commit comment says "hypervisor" and "devirtualization". > Commit messages are not code and they should be maximally readable and > accessible to the widest audience, not only to the three people who develop > the feature. > > If this patch were aimed at the things I maintain, it'll need a serious commit > message scrubbing and sanitizing first. > > HTH. > ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore 2025-09-10 0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor ` (3 preceding siblings ...) 2025-09-10 0:10 ` [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor Mukesh Rathor @ 2025-09-10 0:10 ` Mukesh Rathor 2025-09-15 17:55 ` Michael Kelley 2025-09-18 17:11 ` Stanislav Kinsburskii 2025-09-10 0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor 5 siblings, 2 replies; 29+ messages in thread From: Mukesh Rathor @ 2025-09-10 0:10 UTC (permalink / raw) To: linux-hyperv, linux-kernel, linux-arch Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86, hpa, arnd Introduce a new file to implement collection of hypervisor ram into the vmcore collected by linux. By default, the hypervisor ram is locked, ie, protected via hw page table. Hyper-V implements a disable hypercall which essentially devirtualizes the system on the fly. This mechanism makes the hypervisor ram accessible to linux. Because the hypervisor ram is already mapped into linux address space (as reserved ram), it is automatically collected into the vmcore without extra work. More details of the implementation are available in the file prologue. Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> --- arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++ 1 file changed, 622 insertions(+) create mode 100644 arch/x86/hyperv/hv_crash.c diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c new file mode 100644 index 000000000000..531bac79d598 --- /dev/null +++ b/arch/x86/hyperv/hv_crash.c @@ -0,0 +1,622 @@ +// SPDX-License-Identifier: GPL-2.0-only +/* + * X86 specific Hyper-V kdump/crash support module + * + * Copyright (C) 2025, Microsoft, Inc. + * + * This module implements hypervisor ram collection into vmcore for both + * cases of the hypervisor crash and linux dom0/root crash. 
Hyper-V implements + * a devirtualization hypercall with a 32bit protected mode ABI callback. This + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram + * is already mapped in linux, it is automatically collected into linux vmcore, + * and can be examined by the crash command (raw ram dump) or windbg. + * + * At a high level: + * + * Hypervisor Crash: + * Upon crash, hypervisor goes into an emergency minimal dispatch loop, a + * restrictive mode with very limited hypercall and msr support. Each cpu + * then injects NMIs into dom0/root vcpus. A shared page is used to check + * by linux in the nmi handler if the hypervisor has crashed. This shared + * page is set up in hv_root_crash_init during boot. + * + * Linux Crash: + * In case of linux crash, the callback hv_crash_stop_other_cpus will send + * NMIs to all cpus, then proceed to the crash_nmi_callback where it waits + * for all cpus to be in NMI. + * + * NMI Handler (upon quorum): + * Eventually, in both cases, all cpus will end up in the nmi handler. + * Hyper-V requires that disabling the hypervisor be done from the bsp. So + * the bsp nmi handler saves current context, does some fixups and makes + * the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor + * at that point will suspend all vcpus (except the bsp), unlock all its + * ram, and return to linux at the 32bit mode entry RIP. + * + * Linux 32bit entry trampoline will then restore long mode and call C + * function here to restore context and continue execution to crash kexec.
+ */ + +#include <linux/delay.h> +#include <linux/kexec.h> +#include <linux/crash_dump.h> +#include <linux/panic.h> +#include <asm/apic.h> +#include <asm/desc.h> +#include <asm/page.h> +#include <asm/pgalloc.h> +#include <asm/mshyperv.h> +#include <asm/nmi.h> +#include <asm/idtentry.h> +#include <asm/reboot.h> +#include <asm/intel_pt.h> + +int hv_crash_enabled; +EXPORT_SYMBOL_GPL(hv_crash_enabled); + +struct hv_crash_ctxt { + ulong rsp; + ulong cr0; + ulong cr2; + ulong cr4; + ulong cr8; + + u16 cs; + u16 ss; + u16 ds; + u16 es; + u16 fs; + u16 gs; + + u16 gdt_fill; + struct desc_ptr gdtr; + char idt_fill[6]; + struct desc_ptr idtr; + + u64 gsbase; + u64 efer; + u64 pat; +}; +static struct hv_crash_ctxt hv_crash_ctxt; + +/* Shared hypervisor page that contains crash dump area we peek into. + * NB: windbg looks for "hv_cda" symbol so don't change it. + */ +static struct hv_crashdump_area *hv_cda; + +static u32 trampoline_pa, devirt_cr3arg; +static atomic_t crash_cpus_wait; +static void *hv_crash_ptpgs[4]; +static int hv_has_crashed, lx_has_crashed; + +/* This cannot be inlined as it needs stack */ +static noinline __noclone void hv_crash_restore_tss(void) +{ + load_TR_desc(); +} + +/* This cannot be inlined as it needs stack */ +static noinline void hv_crash_clear_kernpt(void) +{ + pgd_t *pgd; + p4d_t *p4d; + + /* Clear entry so it's not confusing to someone looking at the core */ + pgd = pgd_offset_k(trampoline_pa); + p4d = p4d_offset(pgd, trampoline_pa); + native_p4d_clear(p4d); +} + +/* + * This is the C entry point from the asm glue code after the devirt hypercall. + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel + * page tables with our below 4G page identity mapped, but using a temporary + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not + * available. We restore kernel GDT, and rest of the context, and continue + * to kexec. 
+ */ +static asmlinkage void __noreturn hv_crash_c_entry(void) +{ + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; + + /* first thing, restore kernel gdt */ + native_load_gdt(&ctxt->gdtr); + + asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss)); + asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp)); + + asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds)); + asm volatile("movw %%ax, %%es" : : "a"(ctxt->es)); + asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs)); + asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs)); + + native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat); + asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0)); + + asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8)); + asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4)); + asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr2)); + + native_load_idt(&ctxt->idtr); + native_wrmsrq(MSR_GS_BASE, ctxt->gsbase); + native_wrmsrq(MSR_EFER, ctxt->efer); + + /* restore the original kernel CS now via far return */ + asm volatile("movzwq %0, %%rax\n\t" + "pushq %%rax\n\t" + "pushq $1f\n\t" + "lretq\n\t" + "1:nop\n\t" : : "m"(ctxt->cs) : "rax"); + + /* We are in asmlinkage without stack frame, hence make a C function + * call which will set up a stack frame to restore the tss or clear PT entry. + */ + hv_crash_restore_tss(); + hv_crash_clear_kernpt(); + + /* we are now fully in devirtualized normal kernel mode */ + __crash_kexec(NULL); + + for (;;) + cpu_relax(); +} +/* Tell gcc we are using lretq long jump in the above function intentionally */ +STACK_FRAME_NON_STANDARD(hv_crash_c_entry); + +static void hv_mark_tss_not_busy(void) +{ + struct desc_struct *desc = get_current_gdt_rw(); + tss_desc tss; + + memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc)); + tss.type = 0x9; /* available 64-bit TSS.
0xB is busy TSS */ + write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS); +} + +/* Save essential context */ +static void hv_hvcrash_ctxt_save(void) +{ + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; + + asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp)); + + ctxt->cr0 = native_read_cr0(); + ctxt->cr4 = native_read_cr4(); + + asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2)); + asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8)); + + asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs)); + asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss)); + asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds)); + asm volatile("movl %%es, %%eax" : "=a"(ctxt->es)); + asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs)); + asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs)); + + native_store_gdt(&ctxt->gdtr); + store_idt(&ctxt->idtr); + + ctxt->gsbase = __rdmsr(MSR_GS_BASE); + ctxt->efer = __rdmsr(MSR_EFER); + ctxt->pat = __rdmsr(MSR_IA32_CR_PAT); +} + +/* Add trampoline page to the kernel pagetable for transition to kernel PT */ +static void hv_crash_fixup_kernpt(void) +{ + pgd_t *pgd; + p4d_t *p4d; + + pgd = pgd_offset_k(trampoline_pa); + p4d = p4d_offset(pgd, trampoline_pa); + + /* trampoline_pa is below 4G, so no pre-existing entry to clobber */ + p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]); + p4d->p4d = p4d->p4d & ~(_PAGE_NX); /* enable execute */ +} + +/* + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has + * crashed and will collect core. This will cause the hyp to quiesce and + * suspend all VPs except the bsp. Called if linux crashed and not the hyp. 
+ */
+static void hv_notify_prepare_hyp(void)
+{
+	u64 status;
+	struct hv_input_notify_partition_event *input;
+	struct hv_partition_event_root_crashdump_input *cda;
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	cda = &input->input.crashdump_input;
+	memset(input, 0, sizeof(*input));
+	input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP;
+
+	cda->crashdump_action = HV_CRASHDUMP_ENTRY;
+	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
+	if (!hv_result_success(status))
+		return;
+
+	cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS;
+	hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
+}
+
+/*
+ * Common function for all cpus before devirtualization.
+ *
+ * Hypervisor crash: all cpus get here in nmi context.
+ * Linux crash: the panicking cpu gets here at base level, all others in nmi
+ * context. Note, the panicking cpu may not be the bsp.
+ *
+ * The function is not inlined so it will show on the stack. It is named so
+ * because the crash cmd looks for certain well known function names on the
+ * stack before looking into the cpu saved note in the elf section, and
+ * that work is currently incomplete.
+ *
+ * Notes:
+ * Hypervisor crash:
+ *  - the hypervisor is in a very restrictive mode at this point and any
+ *    vmexit it cannot handle would result in reboot. For example, console
+ *    output from here would result in synic ipi hcall, which would result
+ *    in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
+ *
+ * Devirtualization is supported from the bsp only.
+ */
+static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
+{
+	struct hv_input_disable_hyp_ex *input;
+	u64 status;
+	int msecs = 1000, ccpu = smp_processor_id();
+
+	if (ccpu == 0) {
+		/* crash_save_cpu() will be done in the kexec path */
+		cpu_emergency_stop_pt();	/* disable performance trace */
+		atomic_inc(&crash_cpus_wait);
+	} else {
+		crash_save_cpu(regs, ccpu);
+		cpu_emergency_stop_pt();	/* disable performance trace */
+		atomic_inc(&crash_cpus_wait);
+		for (;;);			/* cause no vmexits */
+	}
+
+	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
+		mdelay(1);
+
+	stop_nmi();
+	if (!hv_has_crashed)
+		hv_notify_prepare_hyp();
+
+	if (crashing_cpu == -1)
+		crashing_cpu = ccpu;		/* crash cmd uses this */
+
+	hv_hvcrash_ctxt_save();
+	hv_mark_tss_not_busy();
+	hv_crash_fixup_kernpt();
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
+	input->arg = devirt_cr3arg;	/* PA of struct hv_crash_tramp_data */
+
+	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
+
+	/* Devirt failed, just reboot as things are in very bad state now */
+	native_wrmsrq(HV_X64_MSR_RESET, 1);	/* get hv to reboot */
+}
+
+/*
+ * Generic nmi callback handler: could be called without any crash also.
+ * hv crash: hypervisor injects nmi's into all cpus
+ * lx crash: panicking cpu sends nmi to all but self via crash_stop_other_cpus
+ */
+static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
+{
+	int ccpu = smp_processor_id();
+
+	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
+		hv_has_crashed = 1;
+
+	if (!hv_has_crashed && !lx_has_crashed)
+		return NMI_DONE;		/* ignore the nmi */
+
+	if (hv_has_crashed) {
+		if (!kexec_crash_loaded() || !hv_crash_enabled) {
+			if (ccpu == 0) {
+				native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */
+			} else
+				for (;;);	/* cause no vmexits */
+		}
+	}
+
+	crash_nmi_callback(regs);
+
+	return NMI_DONE;
+}
+
+/*
+ * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus
+ *
+ * On normal linux panic, this is called twice: first from panic and then again
+ * from native_machine_crash_shutdown.
+ *
+ * In case of mshv, 3 ways to get here:
+ *  1. hv crash (only bsp will get here):
+ *     BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry
+ *           -> __crash_kexec -> native_machine_crash_shutdown
+ *           -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus
+ *  linux panic:
+ *  2. panic cpu x: panic() -> crash_smp_send_stop
+ *                  -> smp_ops.crash_stop_other_cpus
+ *  3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop
+ *
+ * NB: noclone and non standard stack because of call to crash_setup_regs().
+ */
+static void __noclone hv_crash_stop_other_cpus(void)
+{
+	static int crash_stop_done;
+	struct pt_regs lregs;
+	int ccpu = smp_processor_id();
+
+	if (hv_has_crashed)
+		return;		/* all cpus already in nmi handler path */
+
+	if (!kexec_crash_loaded())
+		return;
+
+	if (crash_stop_done)
+		return;
+	crash_stop_done = 1;
+
+	/* linux has crashed: hv is healthy, we can ipi safely */
+	lx_has_crashed = 1;
+	wmb();		/* nmi handlers look at lx_has_crashed */
+
+	apic->send_IPI_allbutself(NMI_VECTOR);
+
+	if (crashing_cpu == -1)
+		crashing_cpu = ccpu;	/* crash cmd uses this */
+
+	/* crash_setup_regs() happens in kexec also, but for the kexec cpu which
+	 * is the bsp. We could be here on non-bsp cpu, collect regs if so.
+	 */
+	if (ccpu)
+		crash_setup_regs(&lregs, NULL);
+
+	crash_nmi_callback(&lregs);
+}
+STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
+
+/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
+struct hv_gdtreg_32 {
+	u16 fill;
+	u16 limit;
+	u32 address;
+} __packed;
+
+/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
+struct hv_crash_tramp_gdt {
+	u64 null;	/* index 0, selector 0, null selector */
+	u64 cs64;	/* index 1, selector 8, cs64 selector */
+} __packed;
+
+/* No stack, so jump via far ptr in memory to load the 64bit CS */
+struct hv_cs_jmptgt {
+	u32 address;
+	u16 csval;
+	u16 fill;
+} __packed;
+
+/* This trampoline data is copied onto the trampoline page after the asm code */
+struct hv_crash_tramp_data {
+	u64 tramp32_cr3;
+	u64 kernel_cr3;
+	struct hv_gdtreg_32 gdtr32;
+	struct hv_crash_tramp_gdt tramp_gdt;
+	struct hv_cs_jmptgt cs_jmptgt;
+	u64 c_entry_addr;
+} __packed;
+
+/*
+ * Setup a temporary gdt to allow the asm code to switch to the long mode.
+ * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
+ * relative addressing, hence we must use trampoline_pa here. Also, save other
+ * info like jmp and C entry targets for same reasons.
+ *
+ * Returns: 0 on success, -1 on error
+ */
+static int hv_crash_setup_trampdata(u64 trampoline_va)
+{
+	int size, offs;
+	void *dest;
+	struct hv_crash_tramp_data *tramp;
+
+	/* These must match exactly the ones in the corresponding asm file */
+	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
+	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
+	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
+	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
+			      cs_jmptgt.address) != 40);
+
+	/* hv_crash_asm_end is beyond last byte by 1 */
+	size = &hv_crash_asm_end - &hv_crash_asm32;
+	if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
+		pr_err("%s: trampoline page overflow\n", __func__);
+		return -1;
+	}
+
+	dest = (void *)trampoline_va;
+	memcpy(dest, &hv_crash_asm32, size);
+
+	dest += size;
+	dest = (void *)round_up((ulong)dest, 16);
+	tramp = (struct hv_crash_tramp_data *)dest;
+
+	/* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
+	 * non-PCID-aware users". Build cr3 with pcid 0
+	 */
+	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
+
+	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
+	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
+
+	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
+	tramp->gdtr32.address = trampoline_pa +
+				(ulong)&tramp->tramp_gdt - trampoline_va;
+
+	/* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
+	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
+
+	tramp->cs_jmptgt.csval = 0x8;
+	offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
+	tramp->cs_jmptgt.address = trampoline_pa + offs;
+
+	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
+
+	devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
+
+	return 0;
+}
+
+/*
+ * Build 32bit trampoline page table for transition from protected mode
+ * non-paging to long-mode paging. This transition needs pagetables below 4G.
+ */
+static void hv_crash_build_tramp_pt(void)
+{
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	u64 pa, addr = trampoline_pa;
+
+	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
+	pa = virt_to_phys(hv_crash_ptpgs[1]);
+	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
+	p4d->p4d &= ~(_PAGE_NX);	/* disable no execute */
+
+	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
+	pa = virt_to_phys(hv_crash_ptpgs[2]);
+	set_pud(pud, __pud(_PAGE_TABLE | pa));
+
+	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
+	pa = virt_to_phys(hv_crash_ptpgs[3]);
+	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
+
+	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
+	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
+}
+
+/*
+ * Setup trampoline for devirtualization:
+ *  - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to
+ *    in protected mode.
+ *  - 4 pages for a temporary page table that asm code uses to turn paging on
+ *  - a temporary gdt to use in the compat mode.
+ *
+ * Returns: 0 on success
+ */
+static int hv_crash_trampoline_setup(void)
+{
+	int i, rc, order;
+	struct page *page;
+	u64 trampoline_va;
+	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
+
+	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
+	page = alloc_page(flags32);
+	if (page == NULL) {
+		pr_err("%s: failed to alloc asm stub page\n", __func__);
+		return -1;
+	}
+
+	trampoline_va = (u64)page_to_virt(page);
+	trampoline_pa = (u32)page_to_phys(page);
+
+	order = 2;	/* alloc 2^2 pages */
+	page = alloc_pages(flags32, order);
+	if (page == NULL) {
+		pr_err("%s: failed to alloc pt pages\n", __func__);
+		free_page(trampoline_va);
+		return -1;
+	}
+
+	for (i = 0; i < 4; i++, page++)
+		hv_crash_ptpgs[i] = page_to_virt(page);
+
+	hv_crash_build_tramp_pt();
+
+	rc = hv_crash_setup_trampdata(trampoline_va);
+	if (rc)
+		goto errout;
+
+	return 0;
+
+errout:
+	free_page(trampoline_va);
+	free_pages((ulong)hv_crash_ptpgs[0], order);
+
+	return rc;
+}
+
+/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
+void hv_root_crash_init(void)
+{
+	int rc;
+	struct hv_input_get_system_property *input;
+	struct hv_output_get_system_property *output;
+	unsigned long flags;
+	u64 status;
+	union hv_pfn_range cda_info;
+
+	if (pgtable_l5_enabled()) {
+		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
+		return;
+	}
+
+	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
+				  "hv_crash_nmi");
+	if (rc) {
+		pr_err("Hyper-V: failed to register crash nmi handler\n");
+		return;
+	}
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+	memset(input, 0, sizeof(*input));
+	memset(output, 0, sizeof(*output));
+	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
+
+	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
+	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status)) {
+		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
+		       input->property_id, hv_result_to_string(status));
+		goto err_out;
+	}
+
+	if (cda_info.base_pfn == 0) {
+		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
+		goto err_out;
+	}
+
+	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
+
+	rc = hv_crash_trampoline_setup();
+	if (rc)
+		goto err_out;
+
+	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
+
+	crash_kexec_post_notifiers = true;
+	hv_crash_enabled = 1;
+	pr_info("Hyper-V: linux and hv kdump support enabled\n");
+
+	return;
+
+err_out:
+	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
+	pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
+}
-- 
2.36.1.vfs.0.0

^ permalink raw reply related	[flat|nested] 29+ messages in thread
* RE: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore 2025-09-10 0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor @ 2025-09-15 17:55 ` Michael Kelley 2025-09-17 1:13 ` Mukesh R 2025-09-18 17:11 ` Stanislav Kinsburskii 1 sibling, 1 reply; 29+ messages in thread From: Michael Kelley @ 2025-09-15 17:55 UTC (permalink / raw) To: Mukesh Rathor, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM > > Introduce a new file to implement collection of hypervisor ram into the s/ram/RAM/ (multiple places) > vmcore collected by linux. By default, the hypervisor ram is locked, ie, > protected via hw page table. Hyper-V implements a disable hypercall which The terminology here is a bit confusing since you have two names for the same thing: "disable" hypervisor, and "devirtualize". Is it possible to just use "devirtualize" everywhere, and drop the "disable" terminology? > essentially devirtualizes the system on the fly. This mechanism makes the > hypervisor ram accessible to linux. Because the hypervisor ram is already > mapped into linux address space (as reserved ram), Is the hypervisor RAM mapped into the VMM process user address space, or somewhere in the kernel address space? If the latter, where in the kernel code, or what mechanism, does that? Just curious, as I wasn't aware that this is happening .... > it is automatically > collected into the vmcore without extra work. More details of the > implementation are available in the file prologue. 
> > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> > --- > arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 622 insertions(+) > create mode 100644 arch/x86/hyperv/hv_crash.c > > diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c > new file mode 100644 > index 000000000000..531bac79d598 > --- /dev/null > +++ b/arch/x86/hyperv/hv_crash.c > @@ -0,0 +1,622 @@ > +// SPDX-License-Identifier: GPL-2.0-only > +/* > + * X86 specific Hyper-V kdump/crash support module > + * > + * Copyright (C) 2025, Microsoft, Inc. > + * > + * This module implements hypervisor ram collection into vmcore for both > + * cases of the hypervisor crash and linux dom0/root crash. For a hypervisor crash, does any of this apply to general guest VMs? I'm thinking it does not. Hypervisor RAM is collected only into the vmcore for the root partition, right? Maybe some additional clarification could be added so there's no confusion in this regard. And what *does* happen to guest VMs after a hypervisor crash? > + * Hyper-V implements > + * a devirtualization hypercall with a 32bit protected mode ABI callback. This > + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram > + * is already mapped in linux, it is automatically collected into linux vmcore, > + * and can be examined by the crash command (raw ram dump) or windbg. > + * > + * At a high level: > + * > + * Hypervisor Crash: > + * Upon crash, hypervisor goes into an emergency minimal dispatch loop, a > + * restrictive mode with very limited hypercall and msr support. s/msr/MSR/ > + * Each cpu then injects NMIs into dom0/root vcpus. The "Each cpu" part of this sentence is confusing to me -- which CPUs does this refer to? Maybe it would be better to say "It then injects an NMI into each dom0/root partition vCPU." without being specific as to which CPUs do the injecting since that seems more like a hypervisor implementation detail that's not relevant here. 
> + * A shared page is used to check > + * by linux in the nmi handler if the hypervisor has crashed. This shared s/nmi/NMI/ (multiple places) > + * page is setup in hv_root_crash_init during boot. > + * > + * Linux Crash: > + * In case of linux crash, the callback hv_crash_stop_other_cpus will send > + * NMIs to all cpus, then proceed to the crash_nmi_callback where it waits > + * for all cpus to be in NMI. > + * > + * NMI Handler (upon quorum): > + * Eventually, in both cases, all cpus wil end up in the nmi hanlder. s/hanlder/handler/ And maybe just drop the word "wil" (which is misspelled). > + * Hyper-V requires the disable hypervisor must be done from the bsp. So s/bsp/BSP (multiple places) > + * the bsp nmi handler saves current context, does some fixups and makes > + * the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor > + * at that point will suspend all vcpus (except the bsp), unlock all its > + * ram, and return to linux at the 32bit mode entry RIP. > + * > + * Linux 32bit entry trampoline will then restore long mode and call C > + * function here to restore context and continue execution to crash kexec. > + */ > + > +#include <linux/delay.h> > +#include <linux/kexec.h> > +#include <linux/crash_dump.h> > +#include <linux/panic.h> > +#include <asm/apic.h> > +#include <asm/desc.h> > +#include <asm/page.h> > +#include <asm/pgalloc.h> > +#include <asm/mshyperv.h> > +#include <asm/nmi.h> > +#include <asm/idtentry.h> > +#include <asm/reboot.h> > +#include <asm/intel_pt.h> > + > +int hv_crash_enabled; Seems like this is conceptually a "bool", not an "int". 
> +EXPORT_SYMBOL_GPL(hv_crash_enabled); > + > +struct hv_crash_ctxt { > + ulong rsp; > + ulong cr0; > + ulong cr2; > + ulong cr4; > + ulong cr8; > + > + u16 cs; > + u16 ss; > + u16 ds; > + u16 es; > + u16 fs; > + u16 gs; > + > + u16 gdt_fill; > + struct desc_ptr gdtr; > + char idt_fill[6]; > + struct desc_ptr idtr; > + > + u64 gsbase; > + u64 efer; > + u64 pat; > +}; > +static struct hv_crash_ctxt hv_crash_ctxt; > + > +/* Shared hypervisor page that contains crash dump area we peek into. > + * NB: windbg looks for "hv_cda" symbol so don't change it. > + */ > +static struct hv_crashdump_area *hv_cda; > + > +static u32 trampoline_pa, devirt_cr3arg; > +static atomic_t crash_cpus_wait; > +static void *hv_crash_ptpgs[4]; > +static int hv_has_crashed, lx_has_crashed; These are conceptually "bool" as well. > + > +/* This cannot be inlined as it needs stack */ > +static noinline __noclone void hv_crash_restore_tss(void) > +{ > + load_TR_desc(); > +} > + > +/* This cannot be inlined as it needs stack */ > +static noinline void hv_crash_clear_kernpt(void) > +{ > + pgd_t *pgd; > + p4d_t *p4d; > + > + /* Clear entry so it's not confusing to someone looking at the core */ > + pgd = pgd_offset_k(trampoline_pa); > + p4d = p4d_offset(pgd, trampoline_pa); > + native_p4d_clear(p4d); > +} > + > +/* > + * This is the C entry point from the asm glue code after the devirt hypercall. > + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel > + * page tables with our below 4G page identity mapped, but using a temporary > + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not > + * available. We restore kernel GDT, and rest of the context, and continue > + * to kexec. 
> + */ > +static asmlinkage void __noreturn hv_crash_c_entry(void) > +{ > + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; > + > + /* first thing, restore kernel gdt */ > + native_load_gdt(&ctxt->gdtr); > + > + asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss)); > + asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp)); > + > + asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds)); > + asm volatile("movw %%ax, %%es" : : "a"(ctxt->es)); > + asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs)); > + asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs)); > + > + native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat); > + asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0)); > + > + asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8)); > + asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4)); > + asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4)); > + > + native_load_idt(&ctxt->idtr); > + native_wrmsrq(MSR_GS_BASE, ctxt->gsbase); > + native_wrmsrq(MSR_EFER, ctxt->efer); > + > + /* restore the original kernel CS now via far return */ > + asm volatile("movzwq %0, %%rax\n\t" > + "pushq %%rax\n\t" > + "pushq $1f\n\t" > + "lretq\n\t" > + "1:nop\n\t" : : "m"(ctxt->cs) : "rax"); > + > + /* We are in asmlinkage without stack frame, hence make a C function > + * call which will buy stack frame to restore the tss or clear PT entry. > + */ > + hv_crash_restore_tss(); > + hv_crash_clear_kernpt(); > + > + /* we are now fully in devirtualized normal kernel mode */ > + __crash_kexec(NULL); The comments for __crash_kexec() say that "panic_cpu" should be set to the current CPU. I don't see that such is the case here. > + > + for (;;) > + cpu_relax(); Is the intent that __crash_kexec() should never return, on any of the vCPUs, because devirtualization isn't done unless there's a valid kdump image loaded? I wonder if native_wrmsrq(HV_X64_MSR_RESET, 1); would be better than looping forever in case __crash_kexec() fails somewhere along the way even if there's a kdump image loaded. 
> +} > +/* Tell gcc we are using lretq long jump in the above function intentionally */ > +STACK_FRAME_NON_STANDARD(hv_crash_c_entry); > + > +static void hv_mark_tss_not_busy(void) > +{ > + struct desc_struct *desc = get_current_gdt_rw(); > + tss_desc tss; > + > + memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc)); > + tss.type = 0x9; /* available 64-bit TSS. 0xB is busy TSS */ > + write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS); > +} > + > +/* Save essential context */ > +static void hv_hvcrash_ctxt_save(void) > +{ > + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; > + > + asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp)); > + > + ctxt->cr0 = native_read_cr0(); > + ctxt->cr4 = native_read_cr4(); > + > + asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2)); > + asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8)); > + > + asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs)); > + asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss)); > + asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds)); > + asm volatile("movl %%es, %%eax" : "=a"(ctxt->es)); > + asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs)); > + asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs)); > + > + native_store_gdt(&ctxt->gdtr); > + store_idt(&ctxt->idtr); > + > + ctxt->gsbase = __rdmsr(MSR_GS_BASE); > + ctxt->efer = __rdmsr(MSR_EFER); > + ctxt->pat = __rdmsr(MSR_IA32_CR_PAT); > +} > + > +/* Add trampoline page to the kernel pagetable for transition to kernel PT */ > +static void hv_crash_fixup_kernpt(void) > +{ > + pgd_t *pgd; > + p4d_t *p4d; > + > + pgd = pgd_offset_k(trampoline_pa); > + p4d = p4d_offset(pgd, trampoline_pa); > + > + /* trampoline_pa is below 4G, so no pre-existing entry to clobber */ > + p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]); > + p4d->p4d = p4d->p4d & ~(_PAGE_NX); /* enable execute */ > +} > + > +/* > + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has > + * crashed and will collect core. 
This will cause the hyp to quiesce and > + * suspend all VPs except the bsp. Called if linux crashed and not the hyp. > + */ > +static void hv_notify_prepare_hyp(void) > +{ > + u64 status; > + struct hv_input_notify_partition_event *input; > + struct hv_partition_event_root_crashdump_input *cda; > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + cda = &input->input.crashdump_input; The code ordering here is a bit weird. I'd expect this line to be grouped with cda->crashdump_action being set. > + memset(input, 0, sizeof(*input)); > + input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP; > + > + cda->crashdump_action = HV_CRASHDUMP_ENTRY; > + status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); > + if (!hv_result_success(status)) > + return; > + > + cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS; > + hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); > +} > + > +/* > + * Common function for all cpus before devirtualization. > + * > + * Hypervisor crash: all cpus get here in nmi context. > + * Linux crash: the panicing cpu gets here at base level, all others in nmi > + * context. Note, panicing cpu may not be the bsp. > + * > + * The function is not inlined so it will show on the stack. It is named so > + * because the crash cmd looks for certain well known function names on the > + * stack before looking into the cpu saved note in the elf section, and > + * that work is currently incomplete. > + * > + * Notes: > + * Hypervisor crash: > + * - the hypervisor is in a very restrictive mode at this point and any > + * vmexit it cannot handle would result in reboot. For example, console > + * output from here would result in synic ipi hcall, which would result > + * in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible. > + * > + * Devirtualization is supported from the bsp only. 
> + */ > +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs) > +{ > + struct hv_input_disable_hyp_ex *input; > + u64 status; > + int msecs = 1000, ccpu = smp_processor_id(); > + > + if (ccpu == 0) { > + /* crash_save_cpu() will be done in the kexec path */ > + cpu_emergency_stop_pt(); /* disable performance trace */ > + atomic_inc(&crash_cpus_wait); > + } else { > + crash_save_cpu(regs, ccpu); > + cpu_emergency_stop_pt(); /* disable performance trace */ > + atomic_inc(&crash_cpus_wait); > + for (;;); /* cause no vmexits */ > + } > + > + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--) > + mdelay(1); > + > + stop_nmi(); > + if (!hv_has_crashed) > + hv_notify_prepare_hyp(); > + > + if (crashing_cpu == -1) > + crashing_cpu = ccpu; /* crash cmd uses this */ Could just be "crashing_cpu = 0" since only the BSP gets here. > + > + hv_hvcrash_ctxt_save(); > + hv_mark_tss_not_busy(); > + hv_crash_fixup_kernpt(); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + memset(input, 0, sizeof(*input)); > + input->rip = trampoline_pa; /* PA of hv_crash_asm32 */ > + input->arg = devirt_cr3arg; /* PA of trampoline page table L4 */ Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data? And just for clarification, Hyper-V treats this "arg" value as opaque and does not access it. It only provides it in EDI when it invokes the trampoline function, right? > + > + status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL); > + > + /* Devirt failed, just reboot as things are in very bad state now */ > + native_wrmsrq(HV_X64_MSR_RESET, 1); /* get hv to reboot */ > +} > + > +/* > + * Generic nmi callback handler: could be called without any crash also. 
> + * hv crash: hypervisor injects nmi's into all cpus > + * lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus > + */ > +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs) > +{ > + int ccpu = smp_processor_id(); > + > + if (!hv_has_crashed && hv_cda && hv_cda->cda_valid) > + hv_has_crashed = 1; > + > + if (!hv_has_crashed && !lx_has_crashed) > + return NMI_DONE; /* ignore the nmi */ > + > + if (hv_has_crashed) { > + if (!kexec_crash_loaded() || !hv_crash_enabled) { > + if (ccpu == 0) { > + native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */ > + } else > + for (;;); /* cause no vmexits */ > + } > + } > + > + crash_nmi_callback(regs); > + > + return NMI_DONE; crash_nmi_callback() should never return, right? Normally one would expect to return NMI_HANDLED here, but I guess it doesn't matter if the return is never executed. > +} > + > +/* > + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus > + * > + * On normal linux panic, this is called twice: first from panic and then again > + * from native_machine_crash_shutdown. > + * > + * In case of mshv, 3 ways to get here: > + * 1. hv crash (only bsp will get here): > + * BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry > + * -> __crash_kexec -> native_machine_crash_shutdown > + * -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus > + * linux panic: > + * 2. panic cpu x: panic() -> crash_smp_send_stop > + * -> smp_ops.crash_stop_other_cpus > + * 3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop > + * > + * NB: noclone and non standard stack because of call to crash_setup_regs(). 
> + */ > +static void __noclone hv_crash_stop_other_cpus(void) > +{ > + static int crash_stop_done; > + struct pt_regs lregs; > + int ccpu = smp_processor_id(); > + > + if (hv_has_crashed) > + return; /* all cpus already in nmi handler path */ > + > + if (!kexec_crash_loaded()) > + return; If we're in a normal panic path (your Case #2 above) with no kdump kernel loaded, why leave the other vCPUs running? Seems like that could violate expectations in vpanic(), where it calls panic_other_cpus_shutdown() and thereafter assumes other vCPUs are not running. > + > + if (crash_stop_done) > + return; > + crash_stop_done = 1; Is crash_stop_done necessary? hv_crash_stop_other_cpus() is called from crash_smp_send_stop(), which has its own static variable "cpus_stopped" that does the same thing. > + > + /* linux has crashed: hv is healthy, we can ipi safely */ > + lx_has_crashed = 1; > + wmb(); /* nmi handlers look at lx_has_crashed */ > + > + apic->send_IPI_allbutself(NMI_VECTOR); The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus(). In addition to sending the NMI IPI, it does disable_local_APIC(). I don't know, but should disable_local_APIC() be done somewhere here as well? > + > + if (crashing_cpu == -1) > + crashing_cpu = ccpu; /* crash cmd uses this */ > + > + /* crash_setup_regs() happens in kexec also, but for the kexec cpu which > + * is the bsp. We could be here on non-bsp cpu, collect regs if so. 
> + */
> +	if (ccpu)
> +		crash_setup_regs(&lregs, NULL);
> +
> +	crash_nmi_callback(&lregs);
> +}
> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
> +
> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
> +struct hv_gdtreg_32 {
> +	u16 fill;
> +	u16 limit;
> +	u32 address;
> +} __packed;
> +
> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
> +struct hv_crash_tramp_gdt {
> +	u64 null;	/* index 0, selector 0, null selector */
> +	u64 cs64;	/* index 1, selector 8, cs64 selector */
> +} __packed;
> +
> +/* No stack, so jump via far ptr in memory to load the 64bit CS */
> +struct hv_cs_jmptgt {
> +	u32 address;
> +	u16 csval;
> +	u16 fill;
> +} __packed;
> +
> +/* This trampoline data is copied onto the trampoline page after the asm code */
> +struct hv_crash_tramp_data {
> +	u64 tramp32_cr3;
> +	u64 kernel_cr3;
> +	struct hv_gdtreg_32 gdtr32;
> +	struct hv_crash_tramp_gdt tramp_gdt;
> +	struct hv_cs_jmptgt cs_jmptgt;
> +	u64 c_entry_addr;
> +} __packed;
> +
> +/*
> + * Setup a temporary gdt to allow the asm code to switch to the long mode.
> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
> + * relative addressing, hence we must use trampoline_pa here. Also, save other
> + * info like jmp and C entry targets for same reasons.
> + *
> + * Returns: 0 on success, -1 on error
> + */
> +static int hv_crash_setup_trampdata(u64 trampoline_va)
> +{
> +	int size, offs;
> +	void *dest;
> +	struct hv_crash_tramp_data *tramp;
> +
> +	/* These must match exactly the ones in the corresponding asm file */
> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
> +			      cs_jmptgt.address) != 40);

It would be nice to pick up the constants from a #include file that is
shared with the asm code in Patch 4 of the series.

> +
> +	/* hv_crash_asm_end is beyond last byte by 1 */
> +	size = &hv_crash_asm_end - &hv_crash_asm32;
> +	if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
> +		pr_err("%s: trampoline page overflow\n", __func__);
> +		return -1;
> +	}
> +
> +	dest = (void *)trampoline_va;
> +	memcpy(dest, &hv_crash_asm32, size);
> +
> +	dest += size;
> +	dest = (void *)round_up((ulong)dest, 16);
> +	tramp = (struct hv_crash_tramp_data *)dest;
> +
> +	/* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
> +	 * non-PCID-aware users". Build cr3 with pcid 0
> +	 */
> +	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
> +
> +	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
> +	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
> +
> +	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
> +	tramp->gdtr32.address = trampoline_pa +
> +				(ulong)&tramp->tramp_gdt - trampoline_va;
> +
> +	/* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
> +	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
> +
> +	tramp->cs_jmptgt.csval = 0x8;
> +	offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
> +	tramp->cs_jmptgt.address = trampoline_pa + offs;
> +
> +	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
> +
> +	devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
> +
> +	return 0;
> +}
> +
> +/*
> + * Build 32bit trampoline page table for transition from protected mode
> + * non-paging to long-mode paging. This transition needs pagetables below 4G.
> + */
> +static void hv_crash_build_tramp_pt(void)
> +{
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	u64 pa, addr = trampoline_pa;
> +
> +	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
> +	pa = virt_to_phys(hv_crash_ptpgs[1]);
> +	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
> +	p4d->p4d &= ~(_PAGE_NX);	/* disable no execute */
> +
> +	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
> +	pa = virt_to_phys(hv_crash_ptpgs[2]);
> +	set_pud(pud, __pud(_PAGE_TABLE | pa));
> +
> +	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
> +	pa = virt_to_phys(hv_crash_ptpgs[3]);
> +	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
> +
> +	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
> +	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
> +}
> +
> +/*
> + * Setup trampoline for devirtualization:
> + *  - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to
> + *    in protected mode.
> + *  - 4 pages for a temporary page table that asm code uses to turn paging on
> + *  - a temporary gdt to use in the compat mode.
> + *
> + * Returns: 0 on success
> + */
> +static int hv_crash_trampoline_setup(void)
> +{
> +	int i, rc, order;
> +	struct page *page;
> +	u64 trampoline_va;
> +	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
> +
> +	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
> +	page = alloc_page(flags32);
> +	if (page == NULL) {
> +		pr_err("%s: failed to alloc asm stub page\n", __func__);
> +		return -1;
> +	}
> +
> +	trampoline_va = (u64)page_to_virt(page);
> +	trampoline_pa = (u32)page_to_phys(page);
> +
> +	order = 2;	/* alloc 2^2 pages */
> +	page = alloc_pages(flags32, order);
> +	if (page == NULL) {
> +		pr_err("%s: failed to alloc pt pages\n", __func__);
> +		free_page(trampoline_va);
> +		return -1;
> +	}
> +
> +	for (i = 0; i < 4; i++, page++)
> +		hv_crash_ptpgs[i] = page_to_virt(page);
> +
> +	hv_crash_build_tramp_pt();
> +
> +	rc = hv_crash_setup_trampdata(trampoline_va);
> +	if (rc)
> +		goto errout;
> +
> +	return 0;
> +
> +errout:
> +	free_page(trampoline_va);
> +	free_pages((ulong)hv_crash_ptpgs[0], order);
> +
> +	return rc;
> +}
> +
> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
> +void hv_root_crash_init(void)
> +{
> +	int rc;
> +	struct hv_input_get_system_property *input;
> +	struct hv_output_get_system_property *output;
> +	unsigned long flags;
> +	u64 status;
> +	union hv_pfn_range cda_info;
> +
> +	if (pgtable_l5_enabled()) {
> +		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
> +		return;
> +	}
> +
> +	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
> +				  "hv_crash_nmi");
> +	if (rc) {
> +		pr_err("Hyper-V: failed to register crash nmi handler\n");
> +		return;
> +	}
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	memset(input, 0, sizeof(*input));
> +	memset(output, 0, sizeof(*output));

Why zero the output area? This is one of those hypercall things that we're
inconsistent about. A few hypercall call sites zero the output area, and it's
not clear why they do. Hyper-V should be responsible for properly filling in
the output area. Linux should not need to do this zero'ing, unless there's some
known bug in Hyper-V for certain hypercalls, in which case there should be
a code comment stating "why".

> +	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
> +
> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
> +	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status)) {
> +		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
> +		       input->property_id, hv_result_to_string(status));
> +		goto err_out;
> +	}
> +
> +	if (cda_info.base_pfn == 0) {
> +		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
> +		goto err_out;
> +	}
> +
> +	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);

Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in
terms of the Hyper-V page size, which isn't necessarily the guest page size.
Yes, on x86 there's no difference, but for future robustness ....

> +
> +	rc = hv_crash_trampoline_setup();
> +	if (rc)
> +		goto err_out;
> +
> +	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
> +
> +	crash_kexec_post_notifiers = true;
> +	hv_crash_enabled = 1;
> +	pr_info("Hyper-V: linux and hv kdump support enabled\n");

This message and the message below aren't consistent. One refers to
"hv kdump" and the other to "hyp kdump".

> +
> +	return;
> +
> +err_out:
> +	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
> +	pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
> +}
> --
> 2.36.1.vfs.0.0
>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-15 17:55                 ` Michael Kelley
@ 2025-09-17  1:13                   ` Mukesh R
  2025-09-17 20:37                     ` Mukesh R
  2025-09-18 23:53                     ` Michael Kelley
  0 siblings, 2 replies; 29+ messages in thread
From: Mukesh R @ 2025-09-17  1:13 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

On 9/15/25 10:55, Michael Kelley wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>
>> Introduce a new file to implement collection of hypervisor ram into the
>
> s/ram/RAM/ (multiple places)

a quick grep indicates saying ram is common, i like ram over RAM

>> vmcore collected by linux. By default, the hypervisor ram is locked, ie,
>> protected via hw page table. Hyper-V implements a disable hypercall which
>
> The terminology here is a bit confusing since you have two names for
> the same thing: "disable" hypervisor, and "devirtualize". Is it possible to
> just use "devirtualize" everywhere, and drop the "disable" terminology?

The concept is devirtualize and the actual hypercall was originally
named disable, so intermixing is natural imo.

>> essentially devirtualizes the system on the fly. This mechanism makes the
>> hypervisor ram accessible to linux. Because the hypervisor ram is already
>> mapped into linux address space (as reserved ram),
>
> Is the hypervisor RAM mapped into the VMM process user address space,
> or somewhere in the kernel address space? If the latter, where in the kernel
> code, or what mechanism, does that? Just curious, as I wasn't aware that
> this is happening ....

mapped in kernel as normal ram and we reserve it very early in boot.
i see that patch has not made it here yet, should be coming very soon.

>> it is automatically
>> collected into the vmcore without extra work. More details of the
>> implementation are available in the file prologue.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>>  arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++
>>  1 file changed, 622 insertions(+)
>>  create mode 100644 arch/x86/hyperv/hv_crash.c
>>
>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
>> new file mode 100644
>> index 000000000000..531bac79d598
>> --- /dev/null
>> +++ b/arch/x86/hyperv/hv_crash.c
>> @@ -0,0 +1,622 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * X86 specific Hyper-V kdump/crash support module
>> + *
>> + * Copyright (C) 2025, Microsoft, Inc.
>> + *
>> + * This module implements hypervisor ram collection into vmcore for both
>> + * cases of the hypervisor crash and linux dom0/root crash.
>
> For a hypervisor crash, does any of this apply to general guest VMs? I'm
> thinking it does not. Hypervisor RAM is collected only into the vmcore
> for the root partition, right? Maybe some additional clarification could be
> added so there's no confusion in this regard.

it would be odd for guests to collect hyp core, and target audience is
assumed to be those who are somewhat familiar with basic concepts before
getting here.

> And what *does* happen to guest VMs after a hypervisor crash?

they are gone... what else could we do?

>> + * Hyper-V implements
>> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This
>> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram
>> + * is already mapped in linux, it is automatically collected into linux vmcore,
>> + * and can be examined by the crash command (raw ram dump) or windbg.
>> + *
>> + * At a high level:
>> + *
>> + * Hypervisor Crash:
>> + *    Upon crash, hypervisor goes into an emergency minimal dispatch loop, a
>> + *    restrictive mode with very limited hypercall and msr support.
>
> s/msr/MSR/

msr is used all over, seems acceptable.

>> + *    Each cpu then injects NMIs into dom0/root vcpus.
>
> The "Each cpu" part of this sentence is confusing to me -- which CPUs does
> this refer to? Maybe it would be better to say "It then injects an NMI into
> each dom0/root partition vCPU." without being specific as to which CPUs do
> the injecting since that seems more like a hypervisor implementation detail
> that's not relevant here.

all cpus in the system. there is a dedicated/pinned dom0 vcpu for each cpu.

>> + *    A shared page is used to check
>> + *    by linux in the nmi handler if the hypervisor has crashed. This shared
>
> s/nmi/NMI/ (multiple places)

>> + *    page is setup in hv_root_crash_init during boot.
>> + *
>> + * Linux Crash:
>> + *    In case of linux crash, the callback hv_crash_stop_other_cpus will send
>> + *    NMIs to all cpus, then proceed to the crash_nmi_callback where it waits
>> + *    for all cpus to be in NMI.
>> + *
>> + * NMI Handler (upon quorum):
>> + *    Eventually, in both cases, all cpus wil end up in the nmi hanlder.
>
> s/hanlder/handler/
>
> And maybe just drop the word "wil" (which is misspelled).
>
>> + *    Hyper-V requires the disable hypervisor must be done from the bsp. So
>
> s/bsp/BSP (multiple places)
>
>> + *    the bsp nmi handler saves current context, does some fixups and makes
>> + *    the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor
>> + *    at that point will suspend all vcpus (except the bsp), unlock all its
>> + *    ram, and return to linux at the 32bit mode entry RIP.
>> + *
>> + * Linux 32bit entry trampoline will then restore long mode and call C
>> + * function here to restore context and continue execution to crash kexec.
>> + */
>> +
>> +#include <linux/delay.h>
>> +#include <linux/kexec.h>
>> +#include <linux/crash_dump.h>
>> +#include <linux/panic.h>
>> +#include <asm/apic.h>
>> +#include <asm/desc.h>
>> +#include <asm/page.h>
>> +#include <asm/pgalloc.h>
>> +#include <asm/mshyperv.h>
>> +#include <asm/nmi.h>
>> +#include <asm/idtentry.h>
>> +#include <asm/reboot.h>
>> +#include <asm/intel_pt.h>
>> +
>> +int hv_crash_enabled;
>
> Seems like this is conceptually a "bool", not an "int".

yeah, can change it to bool if i do another iteration.

>> +EXPORT_SYMBOL_GPL(hv_crash_enabled);
>> +
>> +struct hv_crash_ctxt {
>> +	ulong rsp;
>> +	ulong cr0;
>> +	ulong cr2;
>> +	ulong cr4;
>> +	ulong cr8;
>> +
>> +	u16 cs;
>> +	u16 ss;
>> +	u16 ds;
>> +	u16 es;
>> +	u16 fs;
>> +	u16 gs;
>> +
>> +	u16 gdt_fill;
>> +	struct desc_ptr gdtr;
>> +	char idt_fill[6];
>> +	struct desc_ptr idtr;
>> +
>> +	u64 gsbase;
>> +	u64 efer;
>> +	u64 pat;
>> +};
>> +static struct hv_crash_ctxt hv_crash_ctxt;
>> +
>> +/* Shared hypervisor page that contains crash dump area we peek into.
>> + * NB: windbg looks for "hv_cda" symbol so don't change it.
>> + */
>> +static struct hv_crashdump_area *hv_cda;
>> +
>> +static u32 trampoline_pa, devirt_cr3arg;
>> +static atomic_t crash_cpus_wait;
>> +static void *hv_crash_ptpgs[4];
>> +static int hv_has_crashed, lx_has_crashed;
>
> These are conceptually "bool" as well.
>
>> +
>> +/* This cannot be inlined as it needs stack */
>> +static noinline __noclone void hv_crash_restore_tss(void)
>> +{
>> +	load_TR_desc();
>> +}
>> +
>> +/* This cannot be inlined as it needs stack */
>> +static noinline void hv_crash_clear_kernpt(void)
>> +{
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +
>> +	/* Clear entry so it's not confusing to someone looking at the core */
>> +	pgd = pgd_offset_k(trampoline_pa);
>> +	p4d = p4d_offset(pgd, trampoline_pa);
>> +	native_p4d_clear(p4d);
>> +}
>> +
>> +/*
>> + * This is the C entry point from the asm glue code after the devirt hypercall.
>> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
>> + * page tables with our below 4G page identity mapped, but using a temporary
>> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not
>> + * available. We restore kernel GDT, and rest of the context, and continue
>> + * to kexec.
>> + */
>> +static asmlinkage void __noreturn hv_crash_c_entry(void)
>> +{
>> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>> +
>> +	/* first thing, restore kernel gdt */
>> +	native_load_gdt(&ctxt->gdtr);
>> +
>> +	asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>> +	asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>> +
>> +	asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
>> +	asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
>> +	asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
>> +	asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
>> +
>> +	native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
>> +	asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
>> +
>> +	asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
>> +	asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
>> +	asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
>> +
>> +	native_load_idt(&ctxt->idtr);
>> +	native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
>> +	native_wrmsrq(MSR_EFER, ctxt->efer);
>> +
>> +	/* restore the original kernel CS now via far return */
>> +	asm volatile("movzwq %0, %%rax\n\t"
>> +		     "pushq %%rax\n\t"
>> +		     "pushq $1f\n\t"
>> +		     "lretq\n\t"
>> +		     "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
>> +
>> +	/* We are in asmlinkage without stack frame, hence make a C function
>> +	 * call which will buy stack frame to restore the tss or clear PT entry.
>> +	 */
>> +	hv_crash_restore_tss();
>> +	hv_crash_clear_kernpt();
>> +
>> +	/* we are now fully in devirtualized normal kernel mode */
>> +	__crash_kexec(NULL);
>
> The comments for __crash_kexec() say that "panic_cpu" should be set to
> the current CPU. I don't see that such is the case here.

if linux panic, it would be set by vpanic, if hyp crash, that is irrelevant.

>> +
>> +	for (;;)
>> +		cpu_relax();
>
> Is the intent that __crash_kexec() should never return, on any of the vCPUs,
> because devirtualization isn't done unless there's a valid kdump image loaded?
> I wonder if
>
>     native_wrmsrq(HV_X64_MSR_RESET, 1);
>
> would be better than looping forever in case __crash_kexec() fails
> somewhere along the way even if there's a kdump image loaded.

yeah, i've gone thru all 3 possibilities here:
  o loop forever
  o reset
  o BUG() : this was in V0

reset is just bad because system would just reboot without any indication
if hyp crashes. with loop at least there is a hang, and one could make
note of it, and if internal, attach debugger. BUG is best imo because
with hyp gone linux will try to redo panic and we would print something
extra to help. I think i'll just go back to my V0: BUG()

>> +}
>> +/* Tell gcc we are using lretq long jump in the above function intentionally */
>> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
>> +
>> +static void hv_mark_tss_not_busy(void)
>> +{
>> +	struct desc_struct *desc = get_current_gdt_rw();
>> +	tss_desc tss;
>> +
>> +	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
>> +	tss.type = 0x9;	/* available 64-bit TSS. 0xB is busy TSS */
>> +	write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);
>> +}
>> +
>> +/* Save essential context */
>> +static void hv_hvcrash_ctxt_save(void)
>> +{
>> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>> +
>> +	asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp));
>> +
>> +	ctxt->cr0 = native_read_cr0();
>> +	ctxt->cr4 = native_read_cr4();
>> +
>> +	asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2));
>> +	asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8));
>> +
>> +	asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs));
>> +	asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss));
>> +	asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds));
>> +	asm volatile("movl %%es, %%eax" : "=a"(ctxt->es));
>> +	asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs));
>> +	asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs));
>> +
>> +	native_store_gdt(&ctxt->gdtr);
>> +	store_idt(&ctxt->idtr);
>> +
>> +	ctxt->gsbase = __rdmsr(MSR_GS_BASE);
>> +	ctxt->efer = __rdmsr(MSR_EFER);
>> +	ctxt->pat = __rdmsr(MSR_IA32_CR_PAT);
>> +}
>> +
>> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */
>> +static void hv_crash_fixup_kernpt(void)
>> +{
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +
>> +	pgd = pgd_offset_k(trampoline_pa);
>> +	p4d = p4d_offset(pgd, trampoline_pa);
>> +
>> +	/* trampoline_pa is below 4G, so no pre-existing entry to clobber */
>> +	p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]);
>> +	p4d->p4d = p4d->p4d & ~(_PAGE_NX);	/* enable execute */
>> +}
>> +
>> +/*
>> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has
>> + * crashed and will collect core. This will cause the hyp to quiesce and
>> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp.
>> + */
>> +static void hv_notify_prepare_hyp(void)
>> +{
>> +	u64 status;
>> +	struct hv_input_notify_partition_event *input;
>> +	struct hv_partition_event_root_crashdump_input *cda;
>> +
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	cda = &input->input.crashdump_input;
>
> The code ordering here is a bit weird. I'd expect this line to be grouped
> with cda->crashdump_action being set.

we are setting two pointers, and using them later. setting pointers up
front is pretty normal.

>> +	memset(input, 0, sizeof(*input));
>> +	input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP;
>> +
>> +	cda->crashdump_action = HV_CRASHDUMP_ENTRY;
>> +	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>> +	if (!hv_result_success(status))
>> +		return;
>> +
>> +	cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS;
>> +	hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>> +}
>> +
>> +/*
>> + * Common function for all cpus before devirtualization.
>> + *
>> + * Hypervisor crash: all cpus get here in nmi context.
>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
>> + *		context. Note, panicing cpu may not be the bsp.
>> + *
>> + * The function is not inlined so it will show on the stack. It is named so
>> + * because the crash cmd looks for certain well known function names on the
>> + * stack before looking into the cpu saved note in the elf section, and
>> + * that work is currently incomplete.
>> + *
>> + * Notes:
>> + *  Hypervisor crash:
>> + *   - the hypervisor is in a very restrictive mode at this point and any
>> + *     vmexit it cannot handle would result in reboot. For example, console
>> + *     output from here would result in synic ipi hcall, which would result
>> + *     in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
>> + *
>> + *  Devirtualization is supported from the bsp only.
>> + */
>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
>> +{
>> +	struct hv_input_disable_hyp_ex *input;
>> +	u64 status;
>> +	int msecs = 1000, ccpu = smp_processor_id();
>> +
>> +	if (ccpu == 0) {
>> +		/* crash_save_cpu() will be done in the kexec path */
>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>> +		atomic_inc(&crash_cpus_wait);
>> +	} else {
>> +		crash_save_cpu(regs, ccpu);
>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>> +		atomic_inc(&crash_cpus_wait);
>> +		for (;;);	/* cause no vmexits */
>> +	}
>> +
>> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
>> +		mdelay(1);
>> +
>> +	stop_nmi();
>> +	if (!hv_has_crashed)
>> +		hv_notify_prepare_hyp();
>> +
>> +	if (crashing_cpu == -1)
>> +		crashing_cpu = ccpu;	/* crash cmd uses this */
>
> Could just be "crashing_cpu = 0" since only the BSP gets here.

a code change request has been open for a while to remove the requirement
of bsp..

>> +
>> +	hv_hvcrash_ctxt_save();
>> +	hv_mark_tss_not_busy();
>> +	hv_crash_fixup_kernpt();
>> +
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	memset(input, 0, sizeof(*input));
>> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
>> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
>
> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
> And just for clarification, Hyper-V treats this "arg" value as opaque and does
> not access it. It only provides it in EDI when it invokes the trampoline
> function, right?

comment is correct. cr3 always points to l4 (or l5 if 5 level page tables).
right, comes in edi, i don't know what EDI is (just kidding!)...

>> +
>> +	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
>> +
>> +	/* Devirt failed, just reboot as things are in very bad state now */
>> +	native_wrmsrq(HV_X64_MSR_RESET, 1);	/* get hv to reboot */
>> +}
>> +
>> +/*
>> + * Generic nmi callback handler: could be called without any crash also.
>> + *  hv crash: hypervisor injects nmi's into all cpus
>> + *  lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
>> + */
>> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
>> +{
>> +	int ccpu = smp_processor_id();
>> +
>> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
>> +		hv_has_crashed = 1;
>> +
>> +	if (!hv_has_crashed && !lx_has_crashed)
>> +		return NMI_DONE;	/* ignore the nmi */
>> +
>> +	if (hv_has_crashed) {
>> +		if (!kexec_crash_loaded() || !hv_crash_enabled) {
>> +			if (ccpu == 0) {
>> +				native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */
>> +			} else
>> +				for (;;);	/* cause no vmexits */
>> +		}
>> +	}
>> +
>> +	crash_nmi_callback(regs);
>> +
>> +	return NMI_DONE;
>
> crash_nmi_callback() should never return, right? Normally one would
> expect to return NMI_HANDLED here, but I guess it doesn't matter
> if the return is never executed.

correct.

>> +}
>> +
>> +/*
>> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus
>> + *
>> + * On normal linux panic, this is called twice: first from panic and then again
>> + * from native_machine_crash_shutdown.
>> + *
>> + * In case of mshv, 3 ways to get here:
>> + *   1. hv crash (only bsp will get here):
>> + *	  BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry
>> + *	       -> __crash_kexec -> native_machine_crash_shutdown
>> + *	       -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus
>> + *   linux panic:
>> + *   2. panic cpu x: panic() -> crash_smp_send_stop
>> + *		      -> smp_ops.crash_stop_other_cpus
>> + *   3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop
>> + *
>> + * NB: noclone and non standard stack because of call to crash_setup_regs().
>> + */
>> +static void __noclone hv_crash_stop_other_cpus(void)
>> +{
>> +	static int crash_stop_done;
>> +	struct pt_regs lregs;
>> +	int ccpu = smp_processor_id();
>> +
>> +	if (hv_has_crashed)
>> +		return;	/* all cpus already in nmi handler path */
>> +
>> +	if (!kexec_crash_loaded())
>> +		return;
>
> If we're in a normal panic path (your Case #2 above) with no kdump kernel
> loaded, why leave the other vCPUs running? Seems like that could violate
> expectations in vpanic(), where it calls panic_other_cpus_shutdown() and
> thereafter assumes other vCPUs are not running.

no, there is lots of complexity here! if we hang vcpus here, hyp will
note and may trigger its own watchdog. also, machine_crash_shutdown()
does another ipi.

I think the best thing to do here is go back to my V0 which did not have
check for kexec_crash_loaded(), but had this in hv_crash_c_entry:

+	/* we are now fully in devirtualized normal kernel mode */
+	__crash_kexec(NULL);
+
+	BUG();

this way hyp would be disabled, ie, system devirtualized, and
__crash_kexec() will return, resulting in BUG() that will cause it to go
thru panic and honor panic= parameter with either hang or reset. instead
of bug, i could just call panic() also.

>> +
>> +	if (crash_stop_done)
>> +		return;
>> +	crash_stop_done = 1;
>
> Is crash_stop_done necessary? hv_crash_stop_other_cpus() is called
> from crash_smp_send_stop(), which has its own static variable
> "cpus_stopped" that does the same thing.

yes, for error paths.

>> +
>> +	/* linux has crashed: hv is healthy, we can ipi safely */
>> +	lx_has_crashed = 1;
>> +	wmb();	/* nmi handlers look at lx_has_crashed */
>> +
>> +	apic->send_IPI_allbutself(NMI_VECTOR);
>
> The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus().
> In addition to sending the NMI IPI, it does disable_local_APIC(). I don't know, but
> should disable_local_APIC() be done somewhere here as well?

no, hyp does that.
>> +
>> +	if (crashing_cpu == -1)
>> +		crashing_cpu = ccpu;	/* crash cmd uses this */
>> +
>> +	/* crash_setup_regs() happens in kexec also, but for the kexec cpu which
>> +	 * is the bsp. We could be here on non-bsp cpu, collect regs if so.
>> +	 */
>> +	if (ccpu)
>> +		crash_setup_regs(&lregs, NULL);
>> +
>> +	crash_nmi_callback(&lregs);
>> +}
>> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
>> +
>> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
>> +struct hv_gdtreg_32 {
>> +	u16 fill;
>> +	u16 limit;
>> +	u32 address;
>> +} __packed;
>> +
>> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
>> +struct hv_crash_tramp_gdt {
>> +	u64 null;	/* index 0, selector 0, null selector */
>> +	u64 cs64;	/* index 1, selector 8, cs64 selector */
>> +} __packed;
>> +
>> +/* No stack, so jump via far ptr in memory to load the 64bit CS */
>> +struct hv_cs_jmptgt {
>> +	u32 address;
>> +	u16 csval;
>> +	u16 fill;
>> +} __packed;
>> +
>> +/* This trampoline data is copied onto the trampoline page after the asm code */
>> +struct hv_crash_tramp_data {
>> +	u64 tramp32_cr3;
>> +	u64 kernel_cr3;
>> +	struct hv_gdtreg_32 gdtr32;
>> +	struct hv_crash_tramp_gdt tramp_gdt;
>> +	struct hv_cs_jmptgt cs_jmptgt;
>> +	u64 c_entry_addr;
>> +} __packed;
>> +
>> +/*
>> + * Setup a temporary gdt to allow the asm code to switch to the long mode.
>> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
>> + * relative addressing, hence we must use trampoline_pa here. Also, save other
>> + * info like jmp and C entry targets for same reasons.
>> + *
>> + * Returns: 0 on success, -1 on error
>> + */
>> +static int hv_crash_setup_trampdata(u64 trampoline_va)
>> +{
>> +	int size, offs;
>> +	void *dest;
>> +	struct hv_crash_tramp_data *tramp;
>> +
>> +	/* These must match exactly the ones in the corresponding asm file */
>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
>> +			      cs_jmptgt.address) != 40);
>
> It would be nice to pick up the constants from a #include file that is
> shared with the asm code in Patch 4 of the series.

yeah, could go either way, some don't like tiny headers... if there are
no objections to new header for this, i could go that way too.

>> +
>> +	/* hv_crash_asm_end is beyond last byte by 1 */
>> +	size = &hv_crash_asm_end - &hv_crash_asm32;
>> +	if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
>> +		pr_err("%s: trampoline page overflow\n", __func__);
>> +		return -1;
>> +	}
>> +
>> +	dest = (void *)trampoline_va;
>> +	memcpy(dest, &hv_crash_asm32, size);
>> +
>> +	dest += size;
>> +	dest = (void *)round_up((ulong)dest, 16);
>> +	tramp = (struct hv_crash_tramp_data *)dest;
>> +
>> +	/* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
>> +	 * non-PCID-aware users". Build cr3 with pcid 0
>> +	 */
>> +	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
>> +
>> +	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
>> +	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
>> +
>> +	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
>> +	tramp->gdtr32.address = trampoline_pa +
>> +				(ulong)&tramp->tramp_gdt - trampoline_va;
>> +
>> +	/* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
>> +	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
>> +
>> +	tramp->cs_jmptgt.csval = 0x8;
>> +	offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
>> +	tramp->cs_jmptgt.address = trampoline_pa + offs;
>> +
>> +	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
>> +
>> +	devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Build 32bit trampoline page table for transition from protected mode
>> + * non-paging to long-mode paging. This transition needs pagetables below 4G.
>> + */
>> +static void hv_crash_build_tramp_pt(void)
>> +{
>> +	p4d_t *p4d;
>> +	pud_t *pud;
>> +	pmd_t *pmd;
>> +	pte_t *pte;
>> +	u64 pa, addr = trampoline_pa;
>> +
>> +	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
>> +	pa = virt_to_phys(hv_crash_ptpgs[1]);
>> +	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
>> +	p4d->p4d &= ~(_PAGE_NX);	/* disable no execute */
>> +
>> +	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
>> +	pa = virt_to_phys(hv_crash_ptpgs[2]);
>> +	set_pud(pud, __pud(_PAGE_TABLE | pa));
>> +
>> +	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
>> +	pa = virt_to_phys(hv_crash_ptpgs[3]);
>> +	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
>> +
>> +	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
>> +	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
>> +}
>> +
>> +/*
>> + * Setup trampoline for devirtualization:
>> + *  - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to
>> + *    in protected mode.
>> + *  - 4 pages for a temporary page table that asm code uses to turn paging on
>> + *  - a temporary gdt to use in the compat mode.
>> + *
>> + * Returns: 0 on success
>> + */
>> +static int hv_crash_trampoline_setup(void)
>> +{
>> +	int i, rc, order;
>> +	struct page *page;
>> +	u64 trampoline_va;
>> +	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
>> +
>> +	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
>> +	page = alloc_page(flags32);
>> +	if (page == NULL) {
>> +		pr_err("%s: failed to alloc asm stub page\n", __func__);
>> +		return -1;
>> +	}
>> +
>> +	trampoline_va = (u64)page_to_virt(page);
>> +	trampoline_pa = (u32)page_to_phys(page);
>> +
>> +	order = 2;	/* alloc 2^2 pages */
>> +	page = alloc_pages(flags32, order);
>> +	if (page == NULL) {
>> +		pr_err("%s: failed to alloc pt pages\n", __func__);
>> +		free_page(trampoline_va);
>> +		return -1;
>> +	}
>> +
>> +	for (i = 0; i < 4; i++, page++)
>> +		hv_crash_ptpgs[i] = page_to_virt(page);
>> +
>> +	hv_crash_build_tramp_pt();
>> +
>> +	rc = hv_crash_setup_trampdata(trampoline_va);
>> +	if (rc)
>> +		goto errout;
>> +
>> +	return 0;
>> +
>> +errout:
>> +	free_page(trampoline_va);
>> +	free_pages((ulong)hv_crash_ptpgs[0], order);
>> +
>> +	return rc;
>> +}
>> +
>> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
>> +void hv_root_crash_init(void)
>> +{
>> +	int rc;
>> +	struct hv_input_get_system_property *input;
>> +	struct hv_output_get_system_property *output;
>> +	unsigned long flags;
>> +	u64 status;
>> +	union hv_pfn_range cda_info;
>> +
>> +	if (pgtable_l5_enabled()) {
>> +		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
>> +		return;
>> +	}
>> +
>> +	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
>> +				  "hv_crash_nmi");
>> +	if (rc) {
>> +		pr_err("Hyper-V: failed to register crash nmi handler\n");
>> +		return;
>> +	}
>> +
>> +	local_irq_save(flags);
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
>> +
>> +	memset(input, 0, sizeof(*input));
>> +	memset(output, 0, sizeof(*output));
>
> Why zero the output area? This is one of those hypercall things that we're
> inconsistent about. A few hypercall call sites zero the output area, and it's
> not clear why they do. Hyper-V should be responsible for properly filling in
> the output area. Linux should not need to do this zero'ing, unless there's some
> known bug in Hyper-V for certain hypercalls, in which case there should be
> a code comment stating "why".

for the same reason sometimes you see char *p = NULL, either leftover
code or someone was debugging or just copy and paste. this is just copy
paste.

i agree in general that we don't need to clear it at all, in fact, i'd
like to remove them all! but i also understand people with different
skills and junior members find it easier to debug, and also we were in
early product development. for that reason, it doesn't have to be
consistent either, if some complex hypercalls are failing repeatedly,
just for ease of debug, one might leave it there temporarily. but now
that things are stable, i think we should just remove them all and get
used to a bit more inconvenient debugging...

>> +	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
>> +
>> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
>> +	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
>> +	local_irq_restore(flags);
>> +
>> +	if (!hv_result_success(status)) {
>> +		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
>> +		       input->property_id, hv_result_to_string(status));
>> +		goto err_out;
>> +	}
>> +
>> +	if (cda_info.base_pfn == 0) {
>> +		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
>> +		goto err_out;
>> +	}
>> +
>> +	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
>
> Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in
> terms of the Hyper-V page size, which isn't necessarily the guest page size.
> Yes, on x86 there's no difference, but for future robustness ....

i don't know about guests, but we won't even boot if dom0 pg size didn't
match.. but easier to change than to make the case..

>> +
>> +	rc = hv_crash_trampoline_setup();
>> +	if (rc)
>> +		goto err_out;
>> +
>> +	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
>> +
>> +	crash_kexec_post_notifiers = true;
>> +	hv_crash_enabled = 1;
>> +	pr_info("Hyper-V: linux and hv kdump support enabled\n");
>
> This message and the message below aren't consistent. One refers
> to "hv kdump" and the other to "hyp kdump".

>> +
>> +	return;
>> +
>> +err_out:
>> +	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
>> +	pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
>> +}
>> --
>> 2.36.1.vfs.0.0
>>
>

^ permalink raw reply	[flat|nested] 29+ messages in thread
* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore 2025-09-17 1:13 ` Mukesh R @ 2025-09-17 20:37 ` Mukesh R 2025-09-18 23:53 ` Michael Kelley 1 sibling, 0 replies; 29+ messages in thread From: Mukesh R @ 2025-09-17 20:37 UTC (permalink / raw) To: Michael Kelley, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de On 9/16/25 18:13, Mukesh R wrote: > On 9/15/25 10:55, Michael Kelley wrote: >> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM >>> >>> Introduce a new file to implement collection of hypervisor ram into the >> >> s/ram/RAM/ (multiple places) > > a quick grep indicates using saying ram is common, i like ram over RAM > >>> vmcore collected by linux. By default, the hypervisor ram is locked, ie, >>> protected via hw page table. Hyper-V implements a disable hypercall which >> >> The terminology here is a bit confusing since you have two names for >> the same thing: "disable" hypervisor, and "devirtualize". Is it possible to >> just use "devirtualize" everywhere, and drop the "disable" terminology? > > The concept is devirtualize and the actual hypercall was originally named > disable. so intermixing is natural imo. [snip] >>> + >>> +/* >>> + * Setup a temporary gdt to allow the asm code to switch to the long mode. >>> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip >>> + * relative addressing, hence we must use trampoline_pa here. Also, save other >>> + * info like jmp and C entry targets for same reasons. 
>>> + * >>> + * Returns: 0 on success, -1 on error >>> + */ >>> +static int hv_crash_setup_trampdata(u64 trampoline_va) >>> +{ >>> + int size, offs; >>> + void *dest; >>> + struct hv_crash_tramp_data *tramp; >>> + >>> + /* These must match exactly the ones in the corresponding asm file */ >>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0); >>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8); >>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18); >>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, >>> + cs_jmptgt.address) != 40); >> >> It would be nice to pick up the constants from a #include file that is >> shared with the asm code in Patch 4 of the series. > > yeah, could go either way, some don't like tiny headers... if there are > no objections to new header for this, i could go that way too. yeah, i experimented with creating a new header or try to add to existing. new header doesn't make sense for just 5 #defines, adding C struct there is not a great idea given it's scope is limited to the specific function in the c file. adding to another header results in ifdefs for ASM/KERNEL, so not really worth it. I think for now it is ok, we can live with it. If arm ends up adding more declarations, we can look into it. Thanks, -Mukesh [ .. deleted.. ] ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore 2025-09-17 1:13 ` Mukesh R 2025-09-17 20:37 ` Mukesh R @ 2025-09-18 23:53 ` Michael Kelley 2025-09-19 2:32 ` Mukesh R 1 sibling, 1 reply; 29+ messages in thread From: Michael Kelley @ 2025-09-18 23:53 UTC (permalink / raw) To: Mukesh R, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM > > On 9/15/25 10:55, Michael Kelley wrote: > > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM > >> > >> Introduce a new file to implement collection of hypervisor ram into the > > > > s/ram/RAM/ (multiple places) > > a quick grep indicates using saying ram is common, i like ram over RAM > > >> vmcore collected by linux. By default, the hypervisor ram is locked, ie, > >> protected via hw page table. Hyper-V implements a disable hypercall which > > > > The terminology here is a bit confusing since you have two names for > > the same thing: "disable" hypervisor, and "devirtualize". Is it possible to > > just use "devirtualize" everywhere, and drop the "disable" terminology? > > The concept is devirtualize and the actual hypercall was originally named > disable. so intermixing is natural imo. > > >> essentially devirtualizes the system on the fly. This mechanism makes the > >> hypervisor ram accessible to linux. Because the hypervisor ram is already > >> mapped into linux address space (as reserved ram), > > > > Is the hypervisor RAM mapped into the VMM process user address space, > > or somewhere in the kernel address space? If the latter, where in the kernel > > code, or what mechanism, does that? 
Just curious, as I wasn't aware that > > this is happening .... > > mapped in kernel as normal ram and we reserve it very early in boot. i > see that patch has not made it here yet, should be coming very soon. OK, that's fine. The answer to my question is coming soon .... > > >> it is automatically > >> collected into the vmcore without extra work. More details of the > >> implementation are available in the file prologue. > >> > >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> > >> --- > >> arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++ > >> 1 file changed, 622 insertions(+) > >> create mode 100644 arch/x86/hyperv/hv_crash.c > >> > >> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c > >> new file mode 100644 > >> index 000000000000..531bac79d598 > >> --- /dev/null > >> +++ b/arch/x86/hyperv/hv_crash.c > >> @@ -0,0 +1,622 @@ > >> +// SPDX-License-Identifier: GPL-2.0-only > >> +/* > >> + * X86 specific Hyper-V kdump/crash support module > >> + * > >> + * Copyright (C) 2025, Microsoft, Inc. > >> + * > >> + * This module implements hypervisor ram collection into vmcore for both > >> + * cases of the hypervisor crash and linux dom0/root crash. > > > > For a hypervisor crash, does any of this apply to general guest VMs? I'm > > thinking it does not. Hypervisor RAM is collected only into the vmcore > > for the root partition, right? Maybe some additional clarification could be > > added so there's no confusion in this regard. > > it would be odd for guests to collect hyp core, and target audience is > assumed to be those who are somewhat familiar with basic concepts before > getting here. I was unsure because I had not seen any code that adds the hypervisor memory to the Linux memory map. Thought maybe something was going on I hadn’t heard about, so I didn't know the scope of it. Of course, I'm one of those people who was *not* familiar with the basic concepts before getting here. 
And given that there's no spec available from Hyper-V, the comments in this patch set are all there is for anyone outside of Microsoft. In that vein, I think it's reasonable to provide some description of how this all works in the code comments. And you've done that, which is very helpful. But I encountered a few places where I was confused or unclear, and my suggestions here and in Patch 4 are just about making things as precise as possible without adding a huge amount of additional verbiage. For someone new, English text descriptions that the code can be checked against are helpful, and drawing hard boundaries ("this is only applicable to the root partition") is helpful. If you don't want to deal with it now, I could provide a follow-on patch later that tweaks or augments the wording a bit to clarify some of these places. You can review, like with any patch. I've done wording work over the years to many places in the VMBus code, and more broadly in providing most of the documentation in Documentation/virt/hyperv. > > > And what *does* happen to guest VMs after a hypervisor crash? > > they are gone... what else could we do? > > >> + * Hyper-V implements > >> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This > >> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram > >> + * is already mapped in linux, it is automatically collected into linux vmcore, > >> + * and can be examined by the crash command (raw ram dump) or windbg. > >> + * > >> + * At a high level: > >> + * > >> + * Hypervisor Crash: > >> + * Upon crash, hypervisor goes into an emergency minimal dispatch loop, a > >> + * restrictive mode with very limited hypercall and msr support. > > > > s/msr/MSR/ > > msr is used all over, seems acceptable. > > >> + * Each cpu then injects NMIs into dom0/root vcpus. > > > > The "Each cpu" part of this sentence is confusing to me -- which CPUs does > > this refer to? 
Maybe it would be better to say "It then injects an NMI into > > each dom0/root partition vCPU." without being specific as to which CPUs do > > the injecting since that seems more like a hypervisor implementation detail > > that's not relevant here. > > all cpus in the system. there is a dedicated/pinned dom0 vcpu for each cpu. OK, that makes sense now that I think about it. Each physical CPU in the host has a corresponding vCPU in the dom0/root partition. And each of the vCPUs gets an NMI that sends it to the Linux-in-dom0 NMI handler, even if it was off running a vCPU in some guest VM. > > >> + * A shared page is used to check > >> + * by linux in the nmi handler if the hypervisor has crashed. This shared > > > > s/nmi/NMI/ (multiple places) > > >> + * page is setup in hv_root_crash_init during boot. > >> + * > >> + * Linux Crash: > >> + * In case of linux crash, the callback hv_crash_stop_other_cpus will send > >> + * NMIs to all cpus, then proceed to the crash_nmi_callback where it waits > >> + * for all cpus to be in NMI. > >> + * > >> + * NMI Handler (upon quorum): > >> + * Eventually, in both cases, all cpus wil end up in the nmi hanlder. > > > > s/hanlder/handler/ > > > > And maybe just drop the word "wil" (which is misspelled). > > > >> + * Hyper-V requires the disable hypervisor must be done from the bsp. So > > > > s/bsp/BSP (multiple places) > > > >> + * the bsp nmi handler saves current context, does some fixups and makes > >> + * the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor > >> + * at that point will suspend all vcpus (except the bsp), unlock all its > >> + * ram, and return to linux at the 32bit mode entry RIP. > >> + * > >> + * Linux 32bit entry trampoline will then restore long mode and call C > >> + * function here to restore context and continue execution to crash kexec. 
> >> + */ > >> + > >> +#include <linux/delay.h> > >> +#include <linux/kexec.h> > >> +#include <linux/crash_dump.h> > >> +#include <linux/panic.h> > >> +#include <asm/apic.h> > >> +#include <asm/desc.h> > >> +#include <asm/page.h> > >> +#include <asm/pgalloc.h> > >> +#include <asm/mshyperv.h> > >> +#include <asm/nmi.h> > >> +#include <asm/idtentry.h> > >> +#include <asm/reboot.h> > >> +#include <asm/intel_pt.h> > >> + > >> +int hv_crash_enabled; > > > > Seems like this is conceptually a "bool", not an "int". > > yeah, can change it to bool if i do another iteration. > > >> +EXPORT_SYMBOL_GPL(hv_crash_enabled); > >> + > >> +struct hv_crash_ctxt { > >> + ulong rsp; > >> + ulong cr0; > >> + ulong cr2; > >> + ulong cr4; > >> + ulong cr8; > >> + > >> + u16 cs; > >> + u16 ss; > >> + u16 ds; > >> + u16 es; > >> + u16 fs; > >> + u16 gs; > >> + > >> + u16 gdt_fill; > >> + struct desc_ptr gdtr; > >> + char idt_fill[6]; > >> + struct desc_ptr idtr; > >> + > >> + u64 gsbase; > >> + u64 efer; > >> + u64 pat; > >> +}; > >> +static struct hv_crash_ctxt hv_crash_ctxt; > >> + > >> +/* Shared hypervisor page that contains crash dump area we peek into. > >> + * NB: windbg looks for "hv_cda" symbol so don't change it. > >> + */ > >> +static struct hv_crashdump_area *hv_cda; > >> + > >> +static u32 trampoline_pa, devirt_cr3arg; > >> +static atomic_t crash_cpus_wait; > >> +static void *hv_crash_ptpgs[4]; > >> +static int hv_has_crashed, lx_has_crashed; > > > > These are conceptually "bool" as well. 
> > > >> + > >> +/* This cannot be inlined as it needs stack */ > >> +static noinline __noclone void hv_crash_restore_tss(void) > >> +{ > >> + load_TR_desc(); > >> +} > >> + > >> +/* This cannot be inlined as it needs stack */ > >> +static noinline void hv_crash_clear_kernpt(void) > >> +{ > >> + pgd_t *pgd; > >> + p4d_t *p4d; > >> + > >> + /* Clear entry so it's not confusing to someone looking at the core */ > >> + pgd = pgd_offset_k(trampoline_pa); > >> + p4d = p4d_offset(pgd, trampoline_pa); > >> + native_p4d_clear(p4d); > >> +} > >> + > >> +/* > >> + * This is the C entry point from the asm glue code after the devirt hypercall. > >> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel > >> + * page tables with our below 4G page identity mapped, but using a temporary > >> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not > >> + * available. We restore kernel GDT, and rest of the context, and continue > >> + * to kexec. > >> + */ > >> +static asmlinkage void __noreturn hv_crash_c_entry(void) > >> +{ > >> + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; > >> + > >> + /* first thing, restore kernel gdt */ > >> + native_load_gdt(&ctxt->gdtr); > >> + > >> + asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss)); > >> + asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp)); > >> + > >> + asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds)); > >> + asm volatile("movw %%ax, %%es" : : "a"(ctxt->es)); > >> + asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs)); > >> + asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs)); > >> + > >> + native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat); > >> + asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0)); > >> + > >> + asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8)); > >> + asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4)); > >> + asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4)); > >> + > >> + native_load_idt(&ctxt->idtr); > >> + native_wrmsrq(MSR_GS_BASE, ctxt->gsbase); > >> + native_wrmsrq(MSR_EFER, 
ctxt->efer); > >> + > >> + /* restore the original kernel CS now via far return */ > >> + asm volatile("movzwq %0, %%rax\n\t" > >> + "pushq %%rax\n\t" > >> + "pushq $1f\n\t" > >> + "lretq\n\t" > >> + "1:nop\n\t" : : "m"(ctxt->cs) : "rax"); > >> + > >> + /* We are in asmlinkage without stack frame, hence make a C function > >> + * call which will buy stack frame to restore the tss or clear PT entry. > >> + */ > >> + hv_crash_restore_tss(); > >> + hv_crash_clear_kernpt(); > >> + > >> + /* we are now fully in devirtualized normal kernel mode */ > >> + __crash_kexec(NULL); > > > > The comments for __crash_kexec() say that "panic_cpu" should be set to > > the current CPU. I don't see that such is the case here. > > if linux panic, it would be set by vpanic, if hyp crash, that is > irrelevant. > > >> + > >> + for (;;) > >> + cpu_relax(); > > > > Is the intent that __crash_kexec() should never return, on any of the vCPUs, > > because devirtualization isn't done unless there's a valid kdump image loaded? > > I wonder if > > > > native_wrmsrq(HV_X64_MSR_RESET, 1); > > > > would be better than looping forever in case __crash_kexec() fails > > somewhere along the way even if there's a kdump image loaded. > > yeah, i've gone thru all 3 possibilities here: > o loop forever > o reset > o BUG() : this was in V0 > > reset is just bad because system would just reboot without any indication > if hyp crashes. with loop at least there is a hang, and one could make > note of it, and if internal, attach debugger. > > BUG is best imo because with hyp gone linux will try to redo panic > and we would print something extra to help. 
I think i'll just go > back to my V0: BUG() > > >> +} > >> +/* Tell gcc we are using lretq long jump in the above function intentionally */ > >> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry); > >> + > >> +static void hv_mark_tss_not_busy(void) > >> +{ > >> + struct desc_struct *desc = get_current_gdt_rw(); > >> + tss_desc tss; > >> + > >> + memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc)); > >> + tss.type = 0x9; /* available 64-bit TSS. 0xB is busy TSS */ > >> + write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS); > >> +} > >> + > >> +/* Save essential context */ > >> +static void hv_hvcrash_ctxt_save(void) > >> +{ > >> + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; > >> + > >> + asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp)); > >> + > >> + ctxt->cr0 = native_read_cr0(); > >> + ctxt->cr4 = native_read_cr4(); > >> + > >> + asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2)); > >> + asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8)); > >> + > >> + asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs)); > >> + asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss)); > >> + asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds)); > >> + asm volatile("movl %%es, %%eax" : "=a"(ctxt->es)); > >> + asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs)); > >> + asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs)); > >> + > >> + native_store_gdt(&ctxt->gdtr); > >> + store_idt(&ctxt->idtr); > >> + > >> + ctxt->gsbase = __rdmsr(MSR_GS_BASE); > >> + ctxt->efer = __rdmsr(MSR_EFER); > >> + ctxt->pat = __rdmsr(MSR_IA32_CR_PAT); > >> +} > >> + > >> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */ > >> +static void hv_crash_fixup_kernpt(void) > >> +{ > >> + pgd_t *pgd; > >> + p4d_t *p4d; > >> + > >> + pgd = pgd_offset_k(trampoline_pa); > >> + p4d = p4d_offset(pgd, trampoline_pa); > >> + > >> + /* trampoline_pa is below 4G, so no pre-existing entry to clobber */ > >> + p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]); > >> + p4d->p4d = p4d->p4d & ~(_PAGE_NX); /* enable 
execute */ > >> +} > >> + > >> +/* > >> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has > >> + * crashed and will collect core. This will cause the hyp to quiesce and > >> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp. > >> + */ > >> +static void hv_notify_prepare_hyp(void) > >> +{ > >> + u64 status; > >> + struct hv_input_notify_partition_event *input; > >> + struct hv_partition_event_root_crashdump_input *cda; > >> + > >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > >> + cda = &input->input.crashdump_input; > > > > The code ordering here is a bit weird. I'd expect this line to be grouped > > with cda->crashdump_action being set. > > we are setting two pointers, and using them later. setting pointers > up front is pretty normal. > > >> + memset(input, 0, sizeof(*input)); > >> + input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP; > >> + > >> + cda->crashdump_action = HV_CRASHDUMP_ENTRY; > >> + status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); > >> + if (!hv_result_success(status)) > >> + return; > >> + > >> + cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS; > >> + hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); > >> +} > >> + > >> +/* > >> + * Common function for all cpus before devirtualization. > >> + * > >> + * Hypervisor crash: all cpus get here in nmi context. > >> + * Linux crash: the panicing cpu gets here at base level, all others in nmi > >> + * context. Note, panicing cpu may not be the bsp. > >> + * > >> + * The function is not inlined so it will show on the stack. It is named so > >> + * because the crash cmd looks for certain well known function names on the > >> + * stack before looking into the cpu saved note in the elf section, and > >> + * that work is currently incomplete. 
> >> + * > >> + * Notes: > >> + * Hypervisor crash: > >> + * - the hypervisor is in a very restrictive mode at this point and any > >> + * vmexit it cannot handle would result in reboot. For example, console > >> + * output from here would result in synic ipi hcall, which would result > >> + * in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible. > >> + * > >> + * Devirtualization is supported from the bsp only. > >> + */ > >> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs) > >> +{ > >> + struct hv_input_disable_hyp_ex *input; > >> + u64 status; > >> + int msecs = 1000, ccpu = smp_processor_id(); > >> + > >> + if (ccpu == 0) { > >> + /* crash_save_cpu() will be done in the kexec path */ > >> + cpu_emergency_stop_pt(); /* disable performance trace */ > >> + atomic_inc(&crash_cpus_wait); > >> + } else { > >> + crash_save_cpu(regs, ccpu); > >> + cpu_emergency_stop_pt(); /* disable performance trace */ > >> + atomic_inc(&crash_cpus_wait); > >> + for (;;); /* cause no vmexits */ > >> + } > >> + > >> + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--) > >> + mdelay(1); > >> + > >> + stop_nmi(); > >> + if (!hv_has_crashed) > >> + hv_notify_prepare_hyp(); > >> + > >> + if (crashing_cpu == -1) > >> + crashing_cpu = ccpu; /* crash cmd uses this */ > > > > Could just be "crashing_cpu = 0" since only the BSP gets here. > > a code change request has been open for while to remove the requirement > of bsp.. > > >> + > >> + hv_hvcrash_ctxt_save(); > >> + hv_mark_tss_not_busy(); > >> + hv_crash_fixup_kernpt(); > >> + > >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > >> + memset(input, 0, sizeof(*input)); > >> + input->rip = trampoline_pa; /* PA of hv_crash_asm32 */ > >> + input->arg = devirt_cr3arg; /* PA of trampoline page table L4 */ > > > > Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data? > > And just for clarification, Hyper-V treats this "arg" value as opaque and does > > not access it. 
It only provides it in EDI when it invokes the trampoline > > function, right? > > comment is correct. cr3 always points to l4 (or l5 if 5 level page tables). Yes, the comment matches the name of the "devirt_cr3arg" variable. Unfortunately my previous comment was incomplete because the value stored in the static variable "devirt_cr3arg" isn’t the address of an L4 page table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the PA of struct hv_crash_tramp_data. The CR3 value is stored in the tramp32_cr3 field (at offset 0) of that structure, so there's an additional level of indirection. The (corrected) comment in the header to hv_crash_asm32() describes EDI as containing "PA of struct hv_crash_tramp_data", which ought to match what is described here. I'd say that "devirt_cr3arg" ought to be renamed to "tramp_data_pa" or something else parallel to "trampoline_pa". > > right, comes in edi, i don't know what EDI is (just kidding!)... > > >> + > >> + status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL); > >> + > >> + /* Devirt failed, just reboot as things are in very bad state now */ > >> + native_wrmsrq(HV_X64_MSR_RESET, 1); /* get hv to reboot */ > >> +} > >> + > >> +/* > >> + * Generic nmi callback handler: could be called without any crash also. 
> >> + * hv crash: hypervisor injects nmi's into all cpus > >> + * lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus > >> + */ > >> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs) > >> +{ > >> + int ccpu = smp_processor_id(); > >> + > >> + if (!hv_has_crashed && hv_cda && hv_cda->cda_valid) > >> + hv_has_crashed = 1; > >> + > >> + if (!hv_has_crashed && !lx_has_crashed) > >> + return NMI_DONE; /* ignore the nmi */ > >> + > >> + if (hv_has_crashed) { > >> + if (!kexec_crash_loaded() || !hv_crash_enabled) { > >> + if (ccpu == 0) { > >> + native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */ > >> + } else > >> + for (;;); /* cause no vmexits */ > >> + } > >> + } > >> + > >> + crash_nmi_callback(regs); > >> + > >> + return NMI_DONE; > > > > crash_nmi_callback() should never return, right? Normally one would > > expect to return NMI_HANDLED here, but I guess it doesn't matter > > if the return is never executed. > > correct. > > >> +} > >> + > >> +/* > >> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus > >> + * > >> + * On normal linux panic, this is called twice: first from panic and then again > >> + * from native_machine_crash_shutdown. > >> + * > >> + * In case of mshv, 3 ways to get here: > >> + * 1. hv crash (only bsp will get here): > >> + * BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry > >> + * -> __crash_kexec -> native_machine_crash_shutdown > >> + * -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus > >> + * linux panic: > >> + * 2. panic cpu x: panic() -> crash_smp_send_stop > >> + * -> smp_ops.crash_stop_other_cpus > >> + * 3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop > >> + * > >> + * NB: noclone and non standard stack because of call to crash_setup_regs(). 
> >> + */ > >> +static void __noclone hv_crash_stop_other_cpus(void) > >> +{ > >> + static int crash_stop_done; > >> + struct pt_regs lregs; > >> + int ccpu = smp_processor_id(); > >> + > >> + if (hv_has_crashed) > >> + return; /* all cpus already in nmi handler path */ > >> + > >> + if (!kexec_crash_loaded()) > >> + return; > > > > If we're in a normal panic path (your Case #2 above) with no kdump kernel > > loaded, why leave the other vCPUs running? Seems like that could violate > > expectations in vpanic(), where it calls panic_other_cpus_shutdown() and > > thereafter assumes other vCPUs are not running. > > no, there is lots of complexity here! > > if we hang vcpus here, hyp will note and may trigger its own watchdog. > also, machine_crash_shutdown() does another ipi. > > I think the best thing to do here is go back to my V0 which did not > have check for kexec_crash_loaded(), but had this in hv_crash_c_entry: > > + /* we are now fully in devirtualized normal kernel mode */ > + __crash_kexec(NULL); > + > + BUG(); > > > this way hyp would be disabled, ie, system devirtualized, and > __crash_kernel() will return, resulting in BUG() that will cause > it to go thru panic and honor panic= parameter with either hang > or reset. instead of bug, i could just call panic() also. > > >> + > >> + if (crash_stop_done) > >> + return; > >> + crash_stop_done = 1; > > > > Is crash_stop_done necessary? hv_crash_stop_other_cpus() is called > > from crash_smp_send_stop(), which has its own static variable > > "cpus_stopped" that does the same thing. > > yes. for error paths. > > >> + > >> + /* linux has crashed: hv is healthy, we can ipi safely */ > >> + lx_has_crashed = 1; > >> + wmb(); /* nmi handlers look at lx_has_crashed */ > >> + > >> + apic->send_IPI_allbutself(NMI_VECTOR); > > > > The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus(). > > In addition to sending the NMI IPI, it does disable_local_APIC(). 
I don't know, but > > should disable_local_APIC() be done somewhere here as well? > > no, hyp does that. As part of the devirt operation initiated by the HVCALL_DISABLE_HYP_EX hypercall in crash_nmi_callback()? This gets back to an earlier question/comment where I was trying to figure out if the APIC is still enabled, and in what mode, when hv_crash_asm32() is invoked. > > >> + > >> + if (crashing_cpu == -1) > >> + crashing_cpu = ccpu; /* crash cmd uses this */ > >> + > >> + /* crash_setup_regs() happens in kexec also, but for the kexec cpu which > >> + * is the bsp. We could be here on non-bsp cpu, collect regs if so. > >> + */ > >> + if (ccpu) > >> + crash_setup_regs(&lregs, NULL); > >> + > >> + crash_nmi_callback(&lregs); > >> +} > >> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus); > >> + > >> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */ > >> +struct hv_gdtreg_32 { > >> + u16 fill; > >> + u16 limit; > >> + u32 address; > >> +} __packed; > >> + > >> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */ > >> +struct hv_crash_tramp_gdt { > >> + u64 null; /* index 0, selector 0, null selector */ > >> + u64 cs64; /* index 1, selector 8, cs64 selector */ > >> +} __packed; > >> + > >> +/* No stack, so jump via far ptr in memory to load the 64bit CS */ > >> +struct hv_cs_jmptgt { > >> + u32 address; > >> + u16 csval; > >> + u16 fill; > >> +} __packed; > >> + > >> +/* This trampoline data is copied onto the trampoline page after the asm code */ > >> +struct hv_crash_tramp_data { > >> + u64 tramp32_cr3; > >> + u64 kernel_cr3; > >> + struct hv_gdtreg_32 gdtr32; > >> + struct hv_crash_tramp_gdt tramp_gdt; > >> + struct hv_cs_jmptgt cs_jmptgt; > >> + u64 c_entry_addr; > >> +} __packed; > >> + > >> +/* > >> + * Setup a temporary gdt to allow the asm code to switch to the long mode. 
> >> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip > >> + * relative addressing, hence we must use trampoline_pa here. Also, save other > >> + * info like jmp and C entry targets for same reasons. > >> + * > >> + * Returns: 0 on success, -1 on error > >> + */ > >> +static int hv_crash_setup_trampdata(u64 trampoline_va) > >> +{ > >> + int size, offs; > >> + void *dest; > >> + struct hv_crash_tramp_data *tramp; > >> + > >> + /* These must match exactly the ones in the corresponding asm file */ > >> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0); > >> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8); > >> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18); > >> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, > >> + cs_jmptgt.address) != 40); > > > > It would be nice to pick up the constants from a #include file that is > > shared with the asm code in Patch 4 of the series. > > yeah, could go either way, some don't like tiny headers... if there are > no objections to new header for this, i could go that way too. Saw your follow-on comments about this as well. The tiny header is ugly. It's a judgment call that can go either way, so go with your preference. > > >> + > >> + /* hv_crash_asm_end is beyond last byte by 1 */ > >> + size = &hv_crash_asm_end - &hv_crash_asm32; > >> + if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) { > >> + pr_err("%s: trampoline page overflow\n", __func__); > >> + return -1; > >> + } > >> + > >> + dest = (void *)trampoline_va; > >> + memcpy(dest, &hv_crash_asm32, size); > >> + > >> + dest += size; > >> + dest = (void *)round_up((ulong)dest, 16); > >> + tramp = (struct hv_crash_tramp_data *)dest; > >> + > >> + /* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by > >> + * non-PCID-aware users". 
Build cr3 with pcid 0 > >> + */ > >> + tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]); > >> + > >> + /* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */ > >> + tramp->kernel_cr3 = __sme_pa(init_mm.pgd); > >> + > >> + tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt); > >> + tramp->gdtr32.address = trampoline_pa + > >> + (ulong)&tramp->tramp_gdt - trampoline_va; > >> + > >> + /* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */ > >> + tramp->tramp_gdt.cs64 = 0x00af9a000000ffff; > >> + > >> + tramp->cs_jmptgt.csval = 0x8; > >> + offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32; > >> + tramp->cs_jmptgt.address = trampoline_pa + offs; > >> + > >> + tramp->c_entry_addr = (u64)&hv_crash_c_entry; > >> + > >> + devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va; > >> + > >> + return 0; > >> +} > >> + > >> +/* > >> + * Build 32bit trampoline page table for transition from protected mode > >> + * non-paging to long-mode paging. This transition needs pagetables below 4G. 
> >> + */ > >> +static void hv_crash_build_tramp_pt(void) > >> +{ > >> + p4d_t *p4d; > >> + pud_t *pud; > >> + pmd_t *pmd; > >> + pte_t *pte; > >> + u64 pa, addr = trampoline_pa; > >> + > >> + p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d); > >> + pa = virt_to_phys(hv_crash_ptpgs[1]); > >> + set_p4d(p4d, __p4d(_PAGE_TABLE | pa)); > >> + p4d->p4d &= ~(_PAGE_NX); /* disable no execute */ > >> + > >> + pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud); > >> + pa = virt_to_phys(hv_crash_ptpgs[2]); > >> + set_pud(pud, __pud(_PAGE_TABLE | pa)); > >> + > >> + pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd); > >> + pa = virt_to_phys(hv_crash_ptpgs[3]); > >> + set_pmd(pmd, __pmd(_PAGE_TABLE | pa)); > >> + > >> + pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte); > >> + set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC)); > >> +} > >> + > >> +/* > >> + * Setup trampoline for devirtualization: > >> + * - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to > >> + * in protected mode. > >> + * - 4 pages for a temporary page table that asm code uses to turn paging on > >> + * - a temporary gdt to use in the compat mode. 
> >> + * > >> + * Returns: 0 on success > >> + */ > >> +static int hv_crash_trampoline_setup(void) > >> +{ > >> + int i, rc, order; > >> + struct page *page; > >> + u64 trampoline_va; > >> + gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO; > >> + > >> + /* page for 32bit trampoline assembly code + hv_crash_tramp_data */ > >> + page = alloc_page(flags32); > >> + if (page == NULL) { > >> + pr_err("%s: failed to alloc asm stub page\n", __func__); > >> + return -1; > >> + } > >> + > >> + trampoline_va = (u64)page_to_virt(page); > >> + trampoline_pa = (u32)page_to_phys(page); > >> + > >> + order = 2; /* alloc 2^2 pages */ > >> + page = alloc_pages(flags32, order); > >> + if (page == NULL) { > >> + pr_err("%s: failed to alloc pt pages\n", __func__); > >> + free_page(trampoline_va); > >> + return -1; > >> + } > >> + > >> + for (i = 0; i < 4; i++, page++) > >> + hv_crash_ptpgs[i] = page_to_virt(page); > >> + > >> + hv_crash_build_tramp_pt(); > >> + > >> + rc = hv_crash_setup_trampdata(trampoline_va); > >> + if (rc) > >> + goto errout; > >> + > >> + return 0; > >> + > >> +errout: > >> + free_page(trampoline_va); > >> + free_pages((ulong)hv_crash_ptpgs[0], order); > >> + > >> + return rc; > >> +} > >> + > >> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */ > >> +void hv_root_crash_init(void) > >> +{ > >> + int rc; > >> + struct hv_input_get_system_property *input; > >> + struct hv_output_get_system_property *output; > >> + unsigned long flags; > >> + u64 status; > >> + union hv_pfn_range cda_info; > >> + > >> + if (pgtable_l5_enabled()) { > >> + pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n"); > >> + return; > >> + } > >> + > >> + rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST, > >> + "hv_crash_nmi"); > >> + if (rc) { > >> + pr_err("Hyper-V: failed to register crash nmi handler\n"); > >> + return; > >> + } > >> + > >> + local_irq_save(flags); > >> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > >> 
+ output = *this_cpu_ptr(hyperv_pcpu_output_arg); > >> + > >> + memset(input, 0, sizeof(*input)); > >> + memset(output, 0, sizeof(*output)); > > > > Why zero the output area? This is one of those hypercall things that we're > > inconsistent about. A few hypercall call sites zero the output area, and it's > > not clear why they do. Hyper-V should be responsible for properly filling in > > the output area. Linux should not need to do this zero'ing, unless there's some > > known bug in Hyper-V for certain hypercalls, in which case there should be > > a code comment stating "why". > > for the same reason sometimes you see char *p = NULL, either leftover > code or someone was debugging or just copy and paste. this is just copy > paste. i agree in general that we don't need to clear it at all, in fact, > i'd like to remove them all! but i also understand people with different > skills and junior members find it easier to debug, and also we were in > early product development. for that reason, it doesn't have to be > consistent either, if some complex hypercalls are failing repeatedly, > just for ease of debug, one might leave it there temporarily. but > now that things are stable, i think we should just remove them all and > get used to a bit more inconvenient debugging... I see your point about debugging, but on balance I agree that they should all be removed. If there's some debug case, add it back temporarily to debug, but leave upstream without it. The zero'ing is also unnecessary code in the interrupt disabled window, which you have expressed concern about in a different thread. 
> > >> + input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA; > >> + > >> + status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output); > >> + cda_info.as_uint64 = output->hv_cda_info.as_uint64; > >> + local_irq_restore(flags); > >> + > >> + if (!hv_result_success(status)) { > >> + pr_err("Hyper-V: %s: property:%d %s\n", __func__, > >> + input->property_id, hv_result_to_string(status)); > >> + goto err_out; > >> + } > >> + > >> + if (cda_info.base_pfn == 0) { > >> + pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n"); > >> + goto err_out; > >> + } > >> + > >> + hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT); > > > > Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in > > terms of the Hyper-V page size, which isn't necessarily the guest page size. > > Yes, on x86 there's no difference, but for future robustness .... > > i don't know about guests, but we won't even boot if dom0 pg size > didn't match.. but easier to change than to make the case.. FWIW, a normal Linux guest on ARM64 works just fine with a page size of 16K or 64K, even though the underlying Hyper-V page size is only 4K. That's why we have HV_HYP_PAGE_SHIFT and related in the first place. Using it properly really matters for normal guests. (Having the guest page size smaller than the Hyper-V page size does *not* work, but there are no such use cases.) Even on ARM64, I know the root partition page size is required to match the Hyper-V page size. But using HV_HYP_PAGE_SIZE is still appropriate just to not leave code that will go wrong if the match requirement should ever change. > > >> + > >> + rc = hv_crash_trampoline_setup(); > >> + if (rc) > >> + goto err_out; > >> + > >> + smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus; > >> + > >> + crash_kexec_post_notifiers = true; > >> + hv_crash_enabled = 1; > >> + pr_info("Hyper-V: linux and hv kdump support enabled\n"); > > > > This message and the message below aren't consistent. 
One refers > > to "hv kdump" and the other to "hyp kdump". > > >> + > >> + return; > >> + > >> +err_out: > >> + unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi"); > >> + pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n"); > >> +} > >> -- > >> 2.36.1.vfs.0.0 > >> > > ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore 2025-09-18 23:53 ` Michael Kelley @ 2025-09-19 2:32 ` Mukesh R 2025-09-19 19:48 ` Michael Kelley 2025-09-20 1:42 ` Mukesh R 0 siblings, 2 replies; 29+ messages in thread From: Mukesh R @ 2025-09-19 2:32 UTC (permalink / raw) To: Michael Kelley, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de On 9/18/25 16:53, Michael Kelley wrote: > From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM >> >> On 9/15/25 10:55, Michael Kelley wrote: >>> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM >>>> >>>> Introduce a new file to implement collection of hypervisor ram into the >>> >>> s/ram/RAM/ (multiple places) >> >> a quick grep indicates using saying ram is common, i like ram over RAM >> >>>> vmcore collected by linux. By default, the hypervisor ram is locked, ie, >>>> protected via hw page table. Hyper-V implements a disable hypercall which >>> >>> The terminology here is a bit confusing since you have two names for >>> the same thing: "disable" hypervisor, and "devirtualize". Is it possible to >>> just use "devirtualize" everywhere, and drop the "disable" terminology? >> >> The concept is devirtualize and the actual hypercall was originally named >> disable. so intermixing is natural imo. >> >>>> essentially devirtualizes the system on the fly. This mechanism makes the >>>> hypervisor ram accessible to linux. Because the hypervisor ram is already >>>> mapped into linux address space (as reserved ram), >>> >>> Is the hypervisor RAM mapped into the VMM process user address space, >>> or somewhere in the kernel address space? 
If the latter, where in the kernel >>> code, or what mechanism, does that? Just curious, as I wasn't aware that >>> this is happening .... >> >> mapped in kernel as normal ram and we reserve it very early in boot. i >> see that patch has not made it here yet, should be coming very soon. > > OK, that's fine. The answer to my question is coming soon .... > >> >>>> it is automatically >>>> collected into the vmcore without extra work. More details of the >>>> implementation are available in the file prologue. >>>> >>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> >>>> --- >>>> arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++ >>>> 1 file changed, 622 insertions(+) >>>> create mode 100644 arch/x86/hyperv/hv_crash.c >>>> >>>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c >>>> new file mode 100644 >>>> index 000000000000..531bac79d598 >>>> --- /dev/null >>>> +++ b/arch/x86/hyperv/hv_crash.c >>>> @@ -0,0 +1,622 @@ >>>> +// SPDX-License-Identifier: GPL-2.0-only >>>> +/* >>>> + * X86 specific Hyper-V kdump/crash support module >>>> + * >>>> + * Copyright (C) 2025, Microsoft, Inc. >>>> + * >>>> + * This module implements hypervisor ram collection into vmcore for both >>>> + * cases of the hypervisor crash and linux dom0/root crash. >>> >>> For a hypervisor crash, does any of this apply to general guest VMs? I'm >>> thinking it does not. Hypervisor RAM is collected only into the vmcore >>> for the root partition, right? Maybe some additional clarification could be >>> added so there's no confusion in this regard. >> >> it would be odd for guests to collect hyp core, and target audience is >> assumed to be those who are somewhat familiar with basic concepts before >> getting here. > > I was unsure because I had not seen any code that adds the hypervisor memory > to the Linux memory map. Thought maybe something was going on I hadn't > heard about, so I didn't know the scope of it. 

> > Of course, I'm one of those people who was *not* familiar with the basic concepts > before getting here. And given that there's no spec available from Hyper-V, > the comments in this patch set are all there is for anyone outside of Microsoft. > In that vein, I think it's reasonable to provide some description of how this > all works in the code comments. And you've done that, which is very > helpful. But I encountered a few places where I was confused or unclear, and > my suggestions here and in Patch 4 are just about making things as precise as > possible without adding a huge amount of additional verbiage. For someone > new, English text descriptions that the code can be checked against are > helpful, and drawing hard boundaries ("this is only applicable to the root > partition") is helpful. > > If you don't want to deal with it now, I could provide a follow-on patch later > that tweaks or augments the wording a bit to clarify some of these places. > You can review, like with any patch. I've done wording work over the years > to many places in the VMBus code, and more broadly in providing most of > the documentation in Documentation/virt/hyperv. with time, things will start making sense... i find comment pretty clear that it collects core for both cases of hv crash and dom0 crash, and no mention of guest implies has nothing to do with guests. >> >>> And what *does* happen to guest VMs after a hypervisor crash? >> >> they are gone... what else could we do? >> >>>> + * Hyper-V implements >>>> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This >>>> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram >>>> + * is already mapped in linux, it is automatically collected into linux vmcore, >>>> + * and can be examined by the crash command (raw ram dump) or windbg. 
>>>> + * >>>> + * At a high level: >>>> + * >>>> + * Hypervisor Crash: >>>> + * Upon crash, hypervisor goes into an emergency minimal dispatch loop, a >>>> + * restrictive mode with very limited hypercall and msr support. >>> >>> s/msr/MSR/ >> >> msr is used all over, seems acceptable. >> >>>> + * Each cpu then injects NMIs into dom0/root vcpus. >>> >>> The "Each cpu" part of this sentence is confusing to me -- which CPUs does >>> this refer to? Maybe it would be better to say "It then injects an NMI into >>> each dom0/root partition vCPU." without being specific as to which CPUs do >>> the injecting since that seems more like a hypervisor implementation detail >>> that's not relevant here. >> >> all cpus in the system. there is a dedicated/pinned dom0 vcpu for each cpu. > > OK, that makes sense now that I think about it. Each physical CPU in the host > has a corresponding vCPU in the dom0/root partition. And each of the vCPUs > gets an NMI that sends it to the Linux-in-dom0 NMI handler, even if it was off > running a vCPU in some guest VM. > >> >>>> + * A shared page is used to check >>>> + * by linux in the nmi handler if the hypervisor has crashed. This shared >>> >>> s/nmi/NMI/ (multiple places) >> >>>> + * page is setup in hv_root_crash_init during boot. >>>> + * >>>> + * Linux Crash: >>>> + * In case of linux crash, the callback hv_crash_stop_other_cpus will send >>>> + * NMIs to all cpus, then proceed to the crash_nmi_callback where it waits >>>> + * for all cpus to be in NMI. >>>> + * >>>> + * NMI Handler (upon quorum): >>>> + * Eventually, in both cases, all cpus wil end up in the nmi hanlder. >>> >>> s/hanlder/handler/ >>> >>> And maybe just drop the word "wil" (which is misspelled). >>> >>>> + * Hyper-V requires the disable hypervisor must be done from the bsp. So >>> >>> s/bsp/BSP (multiple places) >>> >>>> + * the bsp nmi handler saves current context, does some fixups and makes >>>> + * the hypercall to disable the hypervisor, ie, devirtualize. 
Hypervisor >>>> + * at that point will suspend all vcpus (except the bsp), unlock all its >>>> + * ram, and return to linux at the 32bit mode entry RIP. >>>> + * >>>> + * Linux 32bit entry trampoline will then restore long mode and call C >>>> + * function here to restore context and continue execution to crash kexec. >>>> + */ >>>> + >>>> +#include <linux/delay.h> >>>> +#include <linux/kexec.h> >>>> +#include <linux/crash_dump.h> >>>> +#include <linux/panic.h> >>>> +#include <asm/apic.h> >>>> +#include <asm/desc.h> >>>> +#include <asm/page.h> >>>> +#include <asm/pgalloc.h> >>>> +#include <asm/mshyperv.h> >>>> +#include <asm/nmi.h> >>>> +#include <asm/idtentry.h> >>>> +#include <asm/reboot.h> >>>> +#include <asm/intel_pt.h> >>>> + >>>> +int hv_crash_enabled; >>> >>> Seems like this is conceptually a "bool", not an "int". >> >> yeah, can change it to bool if i do another iteration. >> >>>> +EXPORT_SYMBOL_GPL(hv_crash_enabled); >>>> + >>>> +struct hv_crash_ctxt { >>>> + ulong rsp; >>>> + ulong cr0; >>>> + ulong cr2; >>>> + ulong cr4; >>>> + ulong cr8; >>>> + >>>> + u16 cs; >>>> + u16 ss; >>>> + u16 ds; >>>> + u16 es; >>>> + u16 fs; >>>> + u16 gs; >>>> + >>>> + u16 gdt_fill; >>>> + struct desc_ptr gdtr; >>>> + char idt_fill[6]; >>>> + struct desc_ptr idtr; >>>> + >>>> + u64 gsbase; >>>> + u64 efer; >>>> + u64 pat; >>>> +}; >>>> +static struct hv_crash_ctxt hv_crash_ctxt; >>>> + >>>> +/* Shared hypervisor page that contains crash dump area we peek into. >>>> + * NB: windbg looks for "hv_cda" symbol so don't change it. >>>> + */ >>>> +static struct hv_crashdump_area *hv_cda; >>>> + >>>> +static u32 trampoline_pa, devirt_cr3arg; >>>> +static atomic_t crash_cpus_wait; >>>> +static void *hv_crash_ptpgs[4]; >>>> +static int hv_has_crashed, lx_has_crashed; >>> >>> These are conceptually "bool" as well. 
>>> >>>> + >>>> +/* This cannot be inlined as it needs stack */ >>>> +static noinline __noclone void hv_crash_restore_tss(void) >>>> +{ >>>> + load_TR_desc(); >>>> +} >>>> + >>>> +/* This cannot be inlined as it needs stack */ >>>> +static noinline void hv_crash_clear_kernpt(void) >>>> +{ >>>> + pgd_t *pgd; >>>> + p4d_t *p4d; >>>> + >>>> + /* Clear entry so it's not confusing to someone looking at the core */ >>>> + pgd = pgd_offset_k(trampoline_pa); >>>> + p4d = p4d_offset(pgd, trampoline_pa); >>>> + native_p4d_clear(p4d); >>>> +} >>>> + >>>> +/* >>>> + * This is the C entry point from the asm glue code after the devirt hypercall. >>>> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel >>>> + * page tables with our below 4G page identity mapped, but using a temporary >>>> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not >>>> + * available. We restore kernel GDT, and rest of the context, and continue >>>> + * to kexec. >>>> + */ >>>> +static asmlinkage void __noreturn hv_crash_c_entry(void) >>>> +{ >>>> + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; >>>> + >>>> + /* first thing, restore kernel gdt */ >>>> + native_load_gdt(&ctxt->gdtr); >>>> + >>>> + asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss)); >>>> + asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp)); >>>> + >>>> + asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds)); >>>> + asm volatile("movw %%ax, %%es" : : "a"(ctxt->es)); >>>> + asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs)); >>>> + asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs)); >>>> + >>>> + native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat); >>>> + asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0)); >>>> + >>>> + asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8)); >>>> + asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4)); >>>> + asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4)); >>>> + >>>> + native_load_idt(&ctxt->idtr); >>>> + native_wrmsrq(MSR_GS_BASE, ctxt->gsbase); >>>> + native_wrmsrq(MSR_EFER, 
ctxt->efer); >>>> + >>>> + /* restore the original kernel CS now via far return */ >>>> + asm volatile("movzwq %0, %%rax\n\t" >>>> + "pushq %%rax\n\t" >>>> + "pushq $1f\n\t" >>>> + "lretq\n\t" >>>> + "1:nop\n\t" : : "m"(ctxt->cs) : "rax"); >>>> + >>>> + /* We are in asmlinkage without stack frame, hence make a C function >>>> + * call which will buy stack frame to restore the tss or clear PT entry. >>>> + */ >>>> + hv_crash_restore_tss(); >>>> + hv_crash_clear_kernpt(); >>>> + >>>> + /* we are now fully in devirtualized normal kernel mode */ >>>> + __crash_kexec(NULL); >>> >>> The comments for __crash_kexec() say that "panic_cpu" should be set to >>> the current CPU. I don't see that such is the case here. >> >> if linux panic, it would be set by vpanic, if hyp crash, that is >> irrelevant. >> >>>> + >>>> + for (;;) >>>> + cpu_relax(); >>> >>> Is the intent that __crash_kexec() should never return, on any of the vCPUs, >>> because devirtualization isn't done unless there's a valid kdump image loaded? >>> I wonder if >>> >>> native_wrmsrq(HV_X64_MSR_RESET, 1); >>> >>> would be better than looping forever in case __crash_kexec() fails >>> somewhere along the way even if there's a kdump image loaded. >> >> yeah, i've gone thru all 3 possibilities here: >> o loop forever >> o reset >> o BUG() : this was in V0 >> >> reset is just bad because system would just reboot without any indication >> if hyp crashes. with loop at least there is a hang, and one could make >> note of it, and if internal, attach debugger. >> >> BUG is best imo because with hyp gone linux will try to redo panic >> and we would print something extra to help. 
I think i'll just go >> back to my V0: BUG() >> >>>> +} >>>> +/* Tell gcc we are using lretq long jump in the above function intentionally */ >>>> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry); >>>> + >>>> +static void hv_mark_tss_not_busy(void) >>>> +{ >>>> + struct desc_struct *desc = get_current_gdt_rw(); >>>> + tss_desc tss; >>>> + >>>> + memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc)); >>>> + tss.type = 0x9; /* available 64-bit TSS. 0xB is busy TSS */ >>>> + write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS); >>>> +} >>>> + >>>> +/* Save essential context */ >>>> +static void hv_hvcrash_ctxt_save(void) >>>> +{ >>>> + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; >>>> + >>>> + asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp)); >>>> + >>>> + ctxt->cr0 = native_read_cr0(); >>>> + ctxt->cr4 = native_read_cr4(); >>>> + >>>> + asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2)); >>>> + asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8)); >>>> + >>>> + asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs)); >>>> + asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss)); >>>> + asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds)); >>>> + asm volatile("movl %%es, %%eax" : "=a"(ctxt->es)); >>>> + asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs)); >>>> + asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs)); >>>> + >>>> + native_store_gdt(&ctxt->gdtr); >>>> + store_idt(&ctxt->idtr); >>>> + >>>> + ctxt->gsbase = __rdmsr(MSR_GS_BASE); >>>> + ctxt->efer = __rdmsr(MSR_EFER); >>>> + ctxt->pat = __rdmsr(MSR_IA32_CR_PAT); >>>> +} >>>> + >>>> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */ >>>> +static void hv_crash_fixup_kernpt(void) >>>> +{ >>>> + pgd_t *pgd; >>>> + p4d_t *p4d; >>>> + >>>> + pgd = pgd_offset_k(trampoline_pa); >>>> + p4d = p4d_offset(pgd, trampoline_pa); >>>> + >>>> + /* trampoline_pa is below 4G, so no pre-existing entry to clobber */ >>>> + p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]); >>>> + p4d->p4d = p4d->p4d & ~(_PAGE_NX); /* enable 
execute */ >>>> +} >>>> + >>>> +/* >>>> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has >>>> + * crashed and will collect core. This will cause the hyp to quiesce and >>>> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp. >>>> + */ >>>> +static void hv_notify_prepare_hyp(void) >>>> +{ >>>> + u64 status; >>>> + struct hv_input_notify_partition_event *input; >>>> + struct hv_partition_event_root_crashdump_input *cda; >>>> + >>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); >>>> + cda = &input->input.crashdump_input; >>> >>> The code ordering here is a bit weird. I'd expect this line to be grouped >>> with cda->crashdump_action being set. >> >> we are setting two pointers, and using them later. setting pointers >> up front is pretty normal. >> >>>> + memset(input, 0, sizeof(*input)); >>>> + input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP; >>>> + >>>> + cda->crashdump_action = HV_CRASHDUMP_ENTRY; >>>> + status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); >>>> + if (!hv_result_success(status)) >>>> + return; >>>> + >>>> + cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS; >>>> + hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); >>>> +} >>>> + >>>> +/* >>>> + * Common function for all cpus before devirtualization. >>>> + * >>>> + * Hypervisor crash: all cpus get here in nmi context. >>>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi >>>> + * context. Note, panicing cpu may not be the bsp. >>>> + * >>>> + * The function is not inlined so it will show on the stack. It is named so >>>> + * because the crash cmd looks for certain well known function names on the >>>> + * stack before looking into the cpu saved note in the elf section, and >>>> + * that work is currently incomplete. 
>>>> + * >>>> + * Notes: >>>> + * Hypervisor crash: >>>> + * - the hypervisor is in a very restrictive mode at this point and any >>>> + * vmexit it cannot handle would result in reboot. For example, console >>>> + * output from here would result in synic ipi hcall, which would result >>>> + * in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible. >>>> + * >>>> + * Devirtualization is supported from the bsp only. >>>> + */ >>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs) >>>> +{ >>>> + struct hv_input_disable_hyp_ex *input; >>>> + u64 status; >>>> + int msecs = 1000, ccpu = smp_processor_id(); >>>> + >>>> + if (ccpu == 0) { >>>> + /* crash_save_cpu() will be done in the kexec path */ >>>> + cpu_emergency_stop_pt(); /* disable performance trace */ >>>> + atomic_inc(&crash_cpus_wait); >>>> + } else { >>>> + crash_save_cpu(regs, ccpu); >>>> + cpu_emergency_stop_pt(); /* disable performance trace */ >>>> + atomic_inc(&crash_cpus_wait); >>>> + for (;;); /* cause no vmexits */ >>>> + } >>>> + >>>> + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--) >>>> + mdelay(1); >>>> + >>>> + stop_nmi(); >>>> + if (!hv_has_crashed) >>>> + hv_notify_prepare_hyp(); >>>> + >>>> + if (crashing_cpu == -1) >>>> + crashing_cpu = ccpu; /* crash cmd uses this */ >>> >>> Could just be "crashing_cpu = 0" since only the BSP gets here. >> >> a code change request has been open for while to remove the requirement >> of bsp.. >> >>>> + >>>> + hv_hvcrash_ctxt_save(); >>>> + hv_mark_tss_not_busy(); >>>> + hv_crash_fixup_kernpt(); >>>> + >>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); >>>> + memset(input, 0, sizeof(*input)); >>>> + input->rip = trampoline_pa; /* PA of hv_crash_asm32 */ >>>> + input->arg = devirt_cr3arg; /* PA of trampoline page table L4 */ >>> >>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data? 
>>> And just for clarification, Hyper-V treats this "arg" value as opaque and does >>> not access it. It only provides it in EDI when it invokes the trampoline >>> function, right? >> >> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables). > > Yes, the comment matches the name of the "devirt_cr3arg" variable. > Unfortunately my previous comment was incomplete because the value > stored in the static variable "devirt_cr3arg" isn't the address of an L4 page > table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the > PA of struct hv_crash_tramp_data. The CR3 value is stored in the > tramp32_cr3 field (at offset 0) of that structure, so there's an additional level > of indirection. The (corrected) comment in the header to hv_crash_asm32() > describes EDI as containing "PA of struct hv_crash_tramp_data", which > ought to match what is described here. I'd say that "devirt_cr3arg" ought > to be renamed to "tramp_data_pa" or something else parallel to > "trampoline_pa". hyp needs trampoline cr3 for transition, we pass it as an arg. we piggy back extra information for ourselves needed in trampoline.S. so it's all good. >> >> right, comes in edi, i don't know what EDI is (just kidding!)... >> >>>> + >>>> + status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL); >>>> + >>>> + /* Devirt failed, just reboot as things are in very bad state now */ >>>> + native_wrmsrq(HV_X64_MSR_RESET, 1); /* get hv to reboot */ >>>> +} >>>> + >>>> +/* >>>> + * Generic nmi callback handler: could be called without any crash also. 
>>>> + * hv crash: hypervisor injects nmi's into all cpus >>>> + * lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus >>>> + */ >>>> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs) >>>> +{ >>>> + int ccpu = smp_processor_id(); >>>> + >>>> + if (!hv_has_crashed && hv_cda && hv_cda->cda_valid) >>>> + hv_has_crashed = 1; >>>> + >>>> + if (!hv_has_crashed && !lx_has_crashed) >>>> + return NMI_DONE; /* ignore the nmi */ >>>> + >>>> + if (hv_has_crashed) { >>>> + if (!kexec_crash_loaded() || !hv_crash_enabled) { >>>> + if (ccpu == 0) { >>>> + native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */ >>>> + } else >>>> + for (;;); /* cause no vmexits */ >>>> + } >>>> + } >>>> + >>>> + crash_nmi_callback(regs); >>>> + >>>> + return NMI_DONE; >>> >>> crash_nmi_callback() should never return, right? Normally one would >>> expect to return NMI_HANDLED here, but I guess it doesn't matter >>> if the return is never executed. >> >> correct. >> >>>> +} >>>> + >>>> +/* >>>> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus >>>> + * >>>> + * On normal linux panic, this is called twice: first from panic and then again >>>> + * from native_machine_crash_shutdown. >>>> + * >>>> + * In case of mshv, 3 ways to get here: >>>> + * 1. hv crash (only bsp will get here): >>>> + * BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry >>>> + * -> __crash_kexec -> native_machine_crash_shutdown >>>> + * -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus >>>> + * linux panic: >>>> + * 2. panic cpu x: panic() -> crash_smp_send_stop >>>> + * -> smp_ops.crash_stop_other_cpus >>>> + * 3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop >>>> + * >>>> + * NB: noclone and non standard stack because of call to crash_setup_regs(). 
>>>> + */ >>>> +static void __noclone hv_crash_stop_other_cpus(void) >>>> +{ >>>> + static int crash_stop_done; >>>> + struct pt_regs lregs; >>>> + int ccpu = smp_processor_id(); >>>> + >>>> + if (hv_has_crashed) >>>> + return; /* all cpus already in nmi handler path */ >>>> + >>>> + if (!kexec_crash_loaded()) >>>> + return; >>> >>> If we're in a normal panic path (your Case #2 above) with no kdump kernel >>> loaded, why leave the other vCPUs running? Seems like that could violate >>> expectations in vpanic(), where it calls panic_other_cpus_shutdown() and >>> thereafter assumes other vCPUs are not running. >> >> no, there is lots of complexity here! >> >> if we hang vcpus here, hyp will note and may trigger its own watchdog. >> also, machine_crash_shutdown() does another ipi. >> >> I think the best thing to do here is go back to my V0 which did not >> have check for kexec_crash_loaded(), but had this in hv_crash_c_entry: >> >> + /* we are now fully in devirtualized normal kernel mode */ >> + __crash_kexec(NULL); >> + >> + BUG(); >> >> >> this way hyp would be disabled, ie, system devirtualized, and >> __crash_kernel() will return, resulting in BUG() that will cause >> it to go thru panic and honor panic= parameter with either hang >> or reset. instead of bug, i could just call panic() also. >> >>>> + >>>> + if (crash_stop_done) >>>> + return; >>>> + crash_stop_done = 1; >>> >>> Is crash_stop_done necessary? hv_crash_stop_other_cpus() is called >>> from crash_smp_send_stop(), which has its own static variable >>> "cpus_stopped" that does the same thing. >> >> yes. for error paths. >> >>>> + >>>> + /* linux has crashed: hv is healthy, we can ipi safely */ >>>> + lx_has_crashed = 1; >>>> + wmb(); /* nmi handlers look at lx_has_crashed */ >>>> + >>>> + apic->send_IPI_allbutself(NMI_VECTOR); >>> >>> The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus(). >>> In addition to sending the NMI IPI, it does disable_local_APIC(). 
I don't know, but >>> should disable_local_APIC() be done somewhere here as well? >> >> no, hyp does that. > > As part of the devirt operation initiated by the HVCALL_DISABLE_HYP_EX > hypercall in crash_nmi_callback()? This gets back to an earlier question/comment > where I was trying to figure out if the APIC is still enabled, and in what mode, > when hv_crash_asm32() is invoked. >> >>>> + >>>> + if (crashing_cpu == -1) >>>> + crashing_cpu = ccpu; /* crash cmd uses this */ >>>> + >>>> + /* crash_setup_regs() happens in kexec also, but for the kexec cpu which >>>> + * is the bsp. We could be here on non-bsp cpu, collect regs if so. >>>> + */ >>>> + if (ccpu) >>>> + crash_setup_regs(&lregs, NULL); >>>> + >>>> + crash_nmi_callback(&lregs); >>>> +} >>>> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus); >>>> + >>>> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */ >>>> +struct hv_gdtreg_32 { >>>> + u16 fill; >>>> + u16 limit; >>>> + u32 address; >>>> +} __packed; >>>> + >>>> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */ >>>> +struct hv_crash_tramp_gdt { >>>> + u64 null; /* index 0, selector 0, null selector */ >>>> + u64 cs64; /* index 1, selector 8, cs64 selector */ >>>> +} __packed; >>>> + >>>> +/* No stack, so jump via far ptr in memory to load the 64bit CS */ >>>> +struct hv_cs_jmptgt { >>>> + u32 address; >>>> + u16 csval; >>>> + u16 fill; >>>> +} __packed; >>>> + >>>> +/* This trampoline data is copied onto the trampoline page after the asm code */ >>>> +struct hv_crash_tramp_data { >>>> + u64 tramp32_cr3; >>>> + u64 kernel_cr3; >>>> + struct hv_gdtreg_32 gdtr32; >>>> + struct hv_crash_tramp_gdt tramp_gdt; >>>> + struct hv_cs_jmptgt cs_jmptgt; >>>> + u64 c_entry_addr; >>>> +} __packed; >>>> + >>>> +/* >>>> + * Setup a temporary gdt to allow the asm code to switch to the long mode. 
>>>> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip >>>> + * relative addressing, hence we must use trampoline_pa here. Also, save other >>>> + * info like jmp and C entry targets for same reasons. >>>> + * >>>> + * Returns: 0 on success, -1 on error >>>> + */ >>>> +static int hv_crash_setup_trampdata(u64 trampoline_va) >>>> +{ >>>> + int size, offs; >>>> + void *dest; >>>> + struct hv_crash_tramp_data *tramp; >>>> + >>>> + /* These must match exactly the ones in the corresponding asm file */ >>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0); >>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8); >>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18); >>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, >>>> + cs_jmptgt.address) != 40); >>> >>> It would be nice to pick up the constants from a #include file that is >>> shared with the asm code in Patch 4 of the series. >> >> yeah, could go either way, some don't like tiny headers... if there are >> no objections to new header for this, i could go that way too. > > Saw your follow-on comments about this as well. The tiny header > is ugly. It's a judgment call that can go either way, so go with your > preference. > >> >>>> + >>>> + /* hv_crash_asm_end is beyond last byte by 1 */ >>>> + size = &hv_crash_asm_end - &hv_crash_asm32; >>>> + if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) { >>>> + pr_err("%s: trampoline page overflow\n", __func__); >>>> + return -1; >>>> + } >>>> + >>>> + dest = (void *)trampoline_va; >>>> + memcpy(dest, &hv_crash_asm32, size); >>>> + >>>> + dest += size; >>>> + dest = (void *)round_up((ulong)dest, 16); >>>> + tramp = (struct hv_crash_tramp_data *)dest; >>>> + >>>> + /* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by >>>> + * non-PCID-aware users". 
Build cr3 with pcid 0 >>>> + */ >>>> + tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]); >>>> + >>>> + /* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */ >>>> + tramp->kernel_cr3 = __sme_pa(init_mm.pgd); >>>> + >>>> + tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt); >>>> + tramp->gdtr32.address = trampoline_pa + >>>> + (ulong)&tramp->tramp_gdt - trampoline_va; >>>> + >>>> + /* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */ >>>> + tramp->tramp_gdt.cs64 = 0x00af9a000000ffff; >>>> + >>>> + tramp->cs_jmptgt.csval = 0x8; >>>> + offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32; >>>> + tramp->cs_jmptgt.address = trampoline_pa + offs; >>>> + >>>> + tramp->c_entry_addr = (u64)&hv_crash_c_entry; >>>> + >>>> + devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va; >>>> + >>>> + return 0; >>>> +} >>>> + >>>> +/* >>>> + * Build 32bit trampoline page table for transition from protected mode >>>> + * non-paging to long-mode paging. This transition needs pagetables below 4G. 
>>>> + */ >>>> +static void hv_crash_build_tramp_pt(void) >>>> +{ >>>> + p4d_t *p4d; >>>> + pud_t *pud; >>>> + pmd_t *pmd; >>>> + pte_t *pte; >>>> + u64 pa, addr = trampoline_pa; >>>> + >>>> + p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d); >>>> + pa = virt_to_phys(hv_crash_ptpgs[1]); >>>> + set_p4d(p4d, __p4d(_PAGE_TABLE | pa)); >>>> + p4d->p4d &= ~(_PAGE_NX); /* disable no execute */ >>>> + >>>> + pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud); >>>> + pa = virt_to_phys(hv_crash_ptpgs[2]); >>>> + set_pud(pud, __pud(_PAGE_TABLE | pa)); >>>> + >>>> + pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd); >>>> + pa = virt_to_phys(hv_crash_ptpgs[3]); >>>> + set_pmd(pmd, __pmd(_PAGE_TABLE | pa)); >>>> + >>>> + pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte); >>>> + set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC)); >>>> +} >>>> + >>>> +/* >>>> + * Setup trampoline for devirtualization: >>>> + * - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to >>>> + * in protected mode. >>>> + * - 4 pages for a temporary page table that asm code uses to turn paging on >>>> + * - a temporary gdt to use in the compat mode. 
>>>> + * >>>> + * Returns: 0 on success >>>> + */ >>>> +static int hv_crash_trampoline_setup(void) >>>> +{ >>>> + int i, rc, order; >>>> + struct page *page; >>>> + u64 trampoline_va; >>>> + gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO; >>>> + >>>> + /* page for 32bit trampoline assembly code + hv_crash_tramp_data */ >>>> + page = alloc_page(flags32); >>>> + if (page == NULL) { >>>> + pr_err("%s: failed to alloc asm stub page\n", __func__); >>>> + return -1; >>>> + } >>>> + >>>> + trampoline_va = (u64)page_to_virt(page); >>>> + trampoline_pa = (u32)page_to_phys(page); >>>> + >>>> + order = 2; /* alloc 2^2 pages */ >>>> + page = alloc_pages(flags32, order); >>>> + if (page == NULL) { >>>> + pr_err("%s: failed to alloc pt pages\n", __func__); >>>> + free_page(trampoline_va); >>>> + return -1; >>>> + } >>>> + >>>> + for (i = 0; i < 4; i++, page++) >>>> + hv_crash_ptpgs[i] = page_to_virt(page); >>>> + >>>> + hv_crash_build_tramp_pt(); >>>> + >>>> + rc = hv_crash_setup_trampdata(trampoline_va); >>>> + if (rc) >>>> + goto errout; >>>> + >>>> + return 0; >>>> + >>>> +errout: >>>> + free_page(trampoline_va); >>>> + free_pages((ulong)hv_crash_ptpgs[0], order); >>>> + >>>> + return rc; >>>> +} >>>> + >>>> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */ >>>> +void hv_root_crash_init(void) >>>> +{ >>>> + int rc; >>>> + struct hv_input_get_system_property *input; >>>> + struct hv_output_get_system_property *output; >>>> + unsigned long flags; >>>> + u64 status; >>>> + union hv_pfn_range cda_info; >>>> + >>>> + if (pgtable_l5_enabled()) { >>>> + pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n"); >>>> + return; >>>> + } >>>> + >>>> + rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST, >>>> + "hv_crash_nmi"); >>>> + if (rc) { >>>> + pr_err("Hyper-V: failed to register crash nmi handler\n"); >>>> + return; >>>> + } >>>> + >>>> + local_irq_save(flags); >>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); >>>> 
+ output = *this_cpu_ptr(hyperv_pcpu_output_arg); >>>> + >>>> + memset(input, 0, sizeof(*input)); >>>> + memset(output, 0, sizeof(*output)); >>> >>> Why zero the output area? This is one of those hypercall things that we're >>> inconsistent about. A few hypercall call sites zero the output area, and it's >>> not clear why they do. Hyper-V should be responsible for properly filling in >>> the output area. Linux should not need to do this zero'ing, unless there's some >>> known bug in Hyper-V for certain hypercalls, in which case there should be >>> a code comment stating "why". >> >> for the same reason sometimes you see char *p = NULL, either leftover >> code or someone was debugging or just copy and paste. this is just copy >> paste. i agree in general that we don't need to clear it at all, in fact, >> i'd like to remove them all! but i also understand people with different >> skills and junior members find it easier to debug, and also we were in >> early product development. for that reason, it doesn't have to be >> consistent either, if some complex hypercalls are failing repeatedly, >> just for ease of debug, one might leave it there temporarily. but >> now that things are stable, i think we should just remove them all and >> get used to a bit more inconvenient debugging... > > I see your point about debugging, but on balance I agree that they > should all be removed. If there's some debug case, add it back > temporarily to debug, but leave upstream without it. The zero'ing is > also unnecessary code in the interrupt disabled window, which you > have expressed concern about in a different thread. yeah, i've been extremely busy so not able to pay much attention to upstreaming, but imo they should have been removed before upstreaming. a simple patch that just removes memset of output would be welcome. 
>> >>>> + input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA; >>>> + >>>> + status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output); >>>> + cda_info.as_uint64 = output->hv_cda_info.as_uint64; >>>> + local_irq_restore(flags); >>>> + >>>> + if (!hv_result_success(status)) { >>>> + pr_err("Hyper-V: %s: property:%d %s\n", __func__, >>>> + input->property_id, hv_result_to_string(status)); >>>> + goto err_out; >>>> + } >>>> + >>>> + if (cda_info.base_pfn == 0) { >>>> + pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n"); >>>> + goto err_out; >>>> + } >>>> + >>>> + hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT); >>> >>> Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in >>> terms of the Hyper-V page size, which isn't necessarily the guest page size. >>> Yes, on x86 there's no difference, but for future robustness .... >> >> i don't know about guests, but we won't even boot if dom0 pg size >> didn't match.. but easier to change than to make the case.. > > FWIW, a normal Linux guest on ARM64 works just fine with a page > size of 16K or 64K, even though the underlying Hyper-V page size > is only 4K. That's why we have HV_HYP_PAGE_SHIFT and related in > the first place. Using it properly really matters for normal guests. > (Having the guest page size smaller than the Hyper-V page size > does *not* work, but there are no such use cases.) > > Even on ARM64, I know the root partition page size is required to > match the Hyper-V page size. But using HV_HYP_PAGE_SIZE is > still appropriate just to not leave code that will go wrong if the > match requirement should ever change. > >> >>>> + >>>> + rc = hv_crash_trampoline_setup(); >>>> + if (rc) >>>> + goto err_out; >>>> + >>>> + smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus; >>>> + >>>> + crash_kexec_post_notifiers = true; >>>> + hv_crash_enabled = 1; >>>> + pr_info("Hyper-V: linux and hv kdump support enabled\n"); >>> >>> This message and the message below aren't consistent. 
One refers >>> to "hv kdump" and the other to "hyp kdump". >> >>>> + >>>> + return; >>>> + >>>> +err_out: >>>> + unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi"); >>>> + pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n"); >>>> +} >>>> -- >>>> 2.36.1.vfs.0.0 >>>> >>> > ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore 2025-09-19 2:32 ` Mukesh R @ 2025-09-19 19:48 ` Michael Kelley 2025-09-20 1:42 ` Mukesh R 1 sibling, 0 replies; 29+ messages in thread From: Michael Kelley @ 2025-09-19 19:48 UTC (permalink / raw) To: Mukesh R, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de From: Mukesh R <mrathor@linux.microsoft.com> Sent: Thursday, September 18, 2025 7:32 PM > > On 9/18/25 16:53, Michael Kelley wrote: > > From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM > >> > >> On 9/15/25 10:55, Michael Kelley wrote: > >>> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM [snip] > >>>> + > >>>> +/* > >>>> + * Common function for all cpus before devirtualization. > >>>> + * > >>>> + * Hypervisor crash: all cpus get here in nmi context. > >>>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi > >>>> + * context. Note, panicing cpu may not be the bsp. > >>>> + * > >>>> + * The function is not inlined so it will show on the stack. It is named so > >>>> + * because the crash cmd looks for certain well known function names on the > >>>> + * stack before looking into the cpu saved note in the elf section, and > >>>> + * that work is currently incomplete. > >>>> + * > >>>> + * Notes: > >>>> + * Hypervisor crash: > >>>> + * - the hypervisor is in a very restrictive mode at this point and any > >>>> + * vmexit it cannot handle would result in reboot. For example, console > >>>> + * output from here would result in synic ipi hcall, which would result > >>>> + * in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible. 
> >>>> + * > >>>> + * Devirtualization is supported from the bsp only. > >>>> + */ > >>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs) > >>>> +{ > >>>> + struct hv_input_disable_hyp_ex *input; > >>>> + u64 status; > >>>> + int msecs = 1000, ccpu = smp_processor_id(); > >>>> + > >>>> + if (ccpu == 0) { > >>>> + /* crash_save_cpu() will be done in the kexec path */ > >>>> + cpu_emergency_stop_pt(); /* disable performance trace */ > >>>> + atomic_inc(&crash_cpus_wait); > >>>> + } else { > >>>> + crash_save_cpu(regs, ccpu); > >>>> + cpu_emergency_stop_pt(); /* disable performance trace */ > >>>> + atomic_inc(&crash_cpus_wait); > >>>> + for (;;); /* cause no vmexits */ > >>>> + } > >>>> + > >>>> + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--) > >>>> + mdelay(1); > >>>> + > >>>> + stop_nmi(); > >>>> + if (!hv_has_crashed) > >>>> + hv_notify_prepare_hyp(); > >>>> + > >>>> + if (crashing_cpu == -1) > >>>> + crashing_cpu = ccpu; /* crash cmd uses this */ > >>> > >>> Could just be "crashing_cpu = 0" since only the BSP gets here. > >> > >> a code change request has been open for while to remove the requirement > >> of bsp.. > >> > >>>> + > >>>> + hv_hvcrash_ctxt_save(); > >>>> + hv_mark_tss_not_busy(); > >>>> + hv_crash_fixup_kernpt(); > >>>> + > >>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > >>>> + memset(input, 0, sizeof(*input)); > >>>> + input->rip = trampoline_pa; /* PA of hv_crash_asm32 */ > >>>> + input->arg = devirt_cr3arg; /* PA of trampoline page table L4 */ > >>> > >>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data? > >>> And just for clarification, Hyper-V treats this "arg" value as opaque and does > >>> not access it. It only provides it in EDI when it invokes the trampoline > >>> function, right? > >> > >> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables). > > > > Yes, the comment matches the name of the "devirt_cr3arg" variable. 
> > Unfortunately my previous comment was incomplete because the value > > stored in the static variable "devirt_cr3arg" isn't the address of an L4 page > > table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the > > PA of struct hv_crash_tramp_data. The CR3 value is stored in the > > tramp32_cr3 field (at offset 0) of that structure, so there's an additional level > > of indirection. The (corrected) comment in the header to hv_crash_asm32() > > describes EDI as containing "PA of struct hv_crash_tramp_data", which > > ought to match what is described here. I'd say that "devirt_cr3arg" ought > > to be renamed to "tramp_data_pa" or something else parallel to > > "trampoline_pa". > > hyp needs trampoline cr3 for transition, we pass it as an arg. we piggy > back extra information for ourselves needed in trampoline.S. so it's > all good. > That's a pretty important "detail" that hasn't heretofore been mentioned. It means the layout of struct hv_crash_tramp_data is not entirely at Linux's discretion. The tramp32_cr3 field must be first so the hypervisor finds it where it expects it. Please add code comments describing that the hypervisor uses the tramp32_cr3 field. With this new information, I agree the code works. But the devirt_cr3arg variable is still named incorrectly, and the "PA of trampoline page table L4" comment is still incorrect. The value in "devirt_cr3arg" is the PA of a memory location in the trampoline page that contains the devirt CR3 (which itself is the PA of trampoline page table L4). The CR3 value is in the tramp32_cr3 field of struct hv_crash_tramp_data in the trampoline page. The CR3 value is not in static variable devirt_cr3arg, which is why I object to the naming of that variable. So rename devirt_cr3arg to devirt_cr3arg_pa. And the comment becomes "PA of PA of trampoline page table L4", which is rather unwieldy, so could be shortened to "PA of devirt CR3 value" or something similar. 
You could also use "PA of struct hv_crash_tramp_data" as the comment, as I suggested previously. Michael ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore 2025-09-19 2:32 ` Mukesh R 2025-09-19 19:48 ` Michael Kelley @ 2025-09-20 1:42 ` Mukesh R 2025-09-23 1:35 ` Michael Kelley 1 sibling, 1 reply; 29+ messages in thread From: Mukesh R @ 2025-09-20 1:42 UTC (permalink / raw) To: Michael Kelley, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de On 9/18/25 19:32, Mukesh R wrote: > On 9/18/25 16:53, Michael Kelley wrote: >> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM >>> >>> On 9/15/25 10:55, Michael Kelley wrote: >>>> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM >>>>> >>>>> Introduce a new file to implement collection of hypervisor ram into the >>>> >>>> s/ram/RAM/ (multiple places) >>> >>> a quick grep indicates using saying ram is common, i like ram over RAM >>> >>>>> vmcore collected by linux. By default, the hypervisor ram is locked, ie, >>>>> protected via hw page table. Hyper-V implements a disable hypercall which >>>> >>>> The terminology here is a bit confusing since you have two names for >>>> the same thing: "disable" hypervisor, and "devirtualize". Is it possible to >>>> just use "devirtualize" everywhere, and drop the "disable" terminology? >>> >>> The concept is devirtualize and the actual hypercall was originally named >>> disable. so intermixing is natural imo. >>> >>>>> essentially devirtualizes the system on the fly. This mechanism makes the >>>>> hypervisor ram accessible to linux. 
Because the hypervisor ram is already >>>>> mapped into linux address space (as reserved ram), >>>> >>>> Is the hypervisor RAM mapped into the VMM process user address space, >>>> or somewhere in the kernel address space? If the latter, where in the kernel >>>> code, or what mechanism, does that? Just curious, as I wasn't aware that >>>> this is happening .... >>> >>> mapped in kernel as normal ram and we reserve it very early in boot. i >>> see that patch has not made it here yet, should be coming very soon. >> >> OK, that's fine. The answer to my question is coming soon .... >> >>> >>>>> it is automatically >>>>> collected into the vmcore without extra work. More details of the >>>>> implementation are available in the file prologue. >>>>> >>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com> >>>>> --- >>>>> arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++ >>>>> 1 file changed, 622 insertions(+) >>>>> create mode 100644 arch/x86/hyperv/hv_crash.c >>>>> >>>>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c >>>>> new file mode 100644 >>>>> index 000000000000..531bac79d598 >>>>> --- /dev/null >>>>> +++ b/arch/x86/hyperv/hv_crash.c >>>>> @@ -0,0 +1,622 @@ >>>>> +// SPDX-License-Identifier: GPL-2.0-only >>>>> +/* >>>>> + * X86 specific Hyper-V kdump/crash support module >>>>> + * >>>>> + * Copyright (C) 2025, Microsoft, Inc. >>>>> + * >>>>> + * This module implements hypervisor ram collection into vmcore for both >>>>> + * cases of the hypervisor crash and linux dom0/root crash. >>>> >>>> For a hypervisor crash, does any of this apply to general guest VMs? I'm >>>> thinking it does not. Hypervisor RAM is collected only into the vmcore >>>> for the root partition, right? Maybe some additional clarification could be >>>> added so there's no confusion in this regard. 
>>> it would be odd for guests to collect hyp core, and target audience is >>> assumed to be those who are somewhat familiar with basic concepts before >>> getting here. >> >> I was unsure because I had not seen any code that adds the hypervisor memory >> to the Linux memory map. Thought maybe something was going on I hadn't >> heard about, so I didn't know the scope of it. >> >> Of course, I'm one of those people who was *not* familiar with the basic concepts >> before getting here. And given that there's no spec available from Hyper-V, >> the comments in this patch set are all there is for anyone outside of Microsoft. >> In that vein, I think it's reasonable to provide some description of how this >> all works in the code comments. And you've done that, which is very >> helpful. But I encountered a few places where I was confused or unclear, and >> my suggestions here and in Patch 4 are just about making things as precise as >> possible without adding a huge amount of additional verbiage. For someone >> new, English text descriptions that the code can be checked against are >> helpful, and drawing hard boundaries ("this is only applicable to the root >> partition") is helpful. >> >> If you don't want to deal with it now, I could provide a follow-on patch later >> that tweaks or augments the wording a bit to clarify some of these places. >> You can review, like with any patch. I've done wording work over the years >> to many places in the VMBus code, and more broadly in providing most of >> the documentation in Documentation/virt/hyperv. > > with time, things will start making sense... i find comment pretty clear > that it collects core for both cases of hv crash and dom0 crash, and no > mention of guest implies has nothing to do with guests. > >>> >>>> And what *does* happen to guest VMs after a hypervisor crash? >>> >>> they are gone... what else could we do? 
>>> >>>>> + * Hyper-V implements >>>>> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This >>>>> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram >>>>> + * is already mapped in linux, it is automatically collected into linux vmcore, >>>>> + * and can be examined by the crash command (raw ram dump) or windbg. >>>>> + * >>>>> + * At a high level: >>>>> + * >>>>> + * Hypervisor Crash: >>>>> + * Upon crash, hypervisor goes into an emergency minimal dispatch loop, a >>>>> + * restrictive mode with very limited hypercall and msr support. >>>> >>>> s/msr/MSR/ >>> >>> msr is used all over, seems acceptable. >>> >>>>> + * Each cpu then injects NMIs into dom0/root vcpus. >>>> >>>> The "Each cpu" part of this sentence is confusing to me -- which CPUs does >>>> this refer to? Maybe it would be better to say "It then injects an NMI into >>>> each dom0/root partition vCPU." without being specific as to which CPUs do >>>> the injecting since that seems more like a hypervisor implementation detail >>>> that's not relevant here. >>> >>> all cpus in the system. there is a dedicated/pinned dom0 vcpu for each cpu. >> >> OK, that makes sense now that I think about it. Each physical CPU in the host >> has a corresponding vCPU in the dom0/root partition. And each of the vCPUs >> gets an NMI that sends it to the Linux-in-dom0 NMI handler, even if it was off >> running a vCPU in some guest VM. >> >>> >>>>> + * A shared page is used to check >>>>> + * by linux in the nmi handler if the hypervisor has crashed. This shared >>>> >>>> s/nmi/NMI/ (multiple places) >>> >>>>> + * page is setup in hv_root_crash_init during boot. >>>>> + * >>>>> + * Linux Crash: >>>>> + * In case of linux crash, the callback hv_crash_stop_other_cpus will send >>>>> + * NMIs to all cpus, then proceed to the crash_nmi_callback where it waits >>>>> + * for all cpus to be in NMI. 
>>>>> + * >>>>> + * NMI Handler (upon quorum): >>>>> + * Eventually, in both cases, all cpus wil end up in the nmi hanlder. >>>> >>>> s/hanlder/handler/ >>>> >>>> And maybe just drop the word "wil" (which is misspelled). >>>> >>>>> + * Hyper-V requires the disable hypervisor must be done from the bsp. So >>>> >>>> s/bsp/BSP (multiple places) >>>> >>>>> + * the bsp nmi handler saves current context, does some fixups and makes >>>>> + * the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor >>>>> + * at that point will suspend all vcpus (except the bsp), unlock all its >>>>> + * ram, and return to linux at the 32bit mode entry RIP. >>>>> + * >>>>> + * Linux 32bit entry trampoline will then restore long mode and call C >>>>> + * function here to restore context and continue execution to crash kexec. >>>>> + */ >>>>> + >>>>> +#include <linux/delay.h> >>>>> +#include <linux/kexec.h> >>>>> +#include <linux/crash_dump.h> >>>>> +#include <linux/panic.h> >>>>> +#include <asm/apic.h> >>>>> +#include <asm/desc.h> >>>>> +#include <asm/page.h> >>>>> +#include <asm/pgalloc.h> >>>>> +#include <asm/mshyperv.h> >>>>> +#include <asm/nmi.h> >>>>> +#include <asm/idtentry.h> >>>>> +#include <asm/reboot.h> >>>>> +#include <asm/intel_pt.h> >>>>> + >>>>> +int hv_crash_enabled; >>>> >>>> Seems like this is conceptually a "bool", not an "int". >>> >>> yeah, can change it to bool if i do another iteration. 
>>> >>>>> +EXPORT_SYMBOL_GPL(hv_crash_enabled); >>>>> + >>>>> +struct hv_crash_ctxt { >>>>> + ulong rsp; >>>>> + ulong cr0; >>>>> + ulong cr2; >>>>> + ulong cr4; >>>>> + ulong cr8; >>>>> + >>>>> + u16 cs; >>>>> + u16 ss; >>>>> + u16 ds; >>>>> + u16 es; >>>>> + u16 fs; >>>>> + u16 gs; >>>>> + >>>>> + u16 gdt_fill; >>>>> + struct desc_ptr gdtr; >>>>> + char idt_fill[6]; >>>>> + struct desc_ptr idtr; >>>>> + >>>>> + u64 gsbase; >>>>> + u64 efer; >>>>> + u64 pat; >>>>> +}; >>>>> +static struct hv_crash_ctxt hv_crash_ctxt; >>>>> + >>>>> +/* Shared hypervisor page that contains crash dump area we peek into. >>>>> + * NB: windbg looks for "hv_cda" symbol so don't change it. >>>>> + */ >>>>> +static struct hv_crashdump_area *hv_cda; >>>>> + >>>>> +static u32 trampoline_pa, devirt_cr3arg; >>>>> +static atomic_t crash_cpus_wait; >>>>> +static void *hv_crash_ptpgs[4]; >>>>> +static int hv_has_crashed, lx_has_crashed; >>>> >>>> These are conceptually "bool" as well. >>>> >>>>> + >>>>> +/* This cannot be inlined as it needs stack */ >>>>> +static noinline __noclone void hv_crash_restore_tss(void) >>>>> +{ >>>>> + load_TR_desc(); >>>>> +} >>>>> + >>>>> +/* This cannot be inlined as it needs stack */ >>>>> +static noinline void hv_crash_clear_kernpt(void) >>>>> +{ >>>>> + pgd_t *pgd; >>>>> + p4d_t *p4d; >>>>> + >>>>> + /* Clear entry so it's not confusing to someone looking at the core */ >>>>> + pgd = pgd_offset_k(trampoline_pa); >>>>> + p4d = p4d_offset(pgd, trampoline_pa); >>>>> + native_p4d_clear(p4d); >>>>> +} >>>>> + >>>>> +/* >>>>> + * This is the C entry point from the asm glue code after the devirt hypercall. >>>>> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel >>>>> + * page tables with our below 4G page identity mapped, but using a temporary >>>>> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not >>>>> + * available. We restore kernel GDT, and rest of the context, and continue >>>>> + * to kexec. 
>>>>> + */ >>>>> +static asmlinkage void __noreturn hv_crash_c_entry(void) >>>>> +{ >>>>> + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; >>>>> + >>>>> + /* first thing, restore kernel gdt */ >>>>> + native_load_gdt(&ctxt->gdtr); >>>>> + >>>>> + asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss)); >>>>> + asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp)); >>>>> + >>>>> + asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds)); >>>>> + asm volatile("movw %%ax, %%es" : : "a"(ctxt->es)); >>>>> + asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs)); >>>>> + asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs)); >>>>> + >>>>> + native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat); >>>>> + asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0)); >>>>> + >>>>> + asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8)); >>>>> + asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4)); >>>>> + asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4)); >>>>> + >>>>> + native_load_idt(&ctxt->idtr); >>>>> + native_wrmsrq(MSR_GS_BASE, ctxt->gsbase); >>>>> + native_wrmsrq(MSR_EFER, ctxt->efer); >>>>> + >>>>> + /* restore the original kernel CS now via far return */ >>>>> + asm volatile("movzwq %0, %%rax\n\t" >>>>> + "pushq %%rax\n\t" >>>>> + "pushq $1f\n\t" >>>>> + "lretq\n\t" >>>>> + "1:nop\n\t" : : "m"(ctxt->cs) : "rax"); >>>>> + >>>>> + /* We are in asmlinkage without stack frame, hence make a C function >>>>> + * call which will buy stack frame to restore the tss or clear PT entry. >>>>> + */ >>>>> + hv_crash_restore_tss(); >>>>> + hv_crash_clear_kernpt(); >>>>> + >>>>> + /* we are now fully in devirtualized normal kernel mode */ >>>>> + __crash_kexec(NULL); >>>> >>>> The comments for __crash_kexec() say that "panic_cpu" should be set to >>>> the current CPU. I don't see that such is the case here. >>> >>> if linux panic, it would be set by vpanic, if hyp crash, that is >>> irrelevant. 
>>> >>>>> + >>>>> + for (;;) >>>>> + cpu_relax(); >>>> >>>> Is the intent that __crash_kexec() should never return, on any of the vCPUs, >>>> because devirtualization isn't done unless there's a valid kdump image loaded? >>>> I wonder if >>>> >>>> native_wrmsrq(HV_X64_MSR_RESET, 1); >>>> >>>> would be better than looping forever in case __crash_kexec() fails >>>> somewhere along the way even if there's a kdump image loaded. >>> >>> yeah, i've gone thru all 3 possibilities here: >>> o loop forever >>> o reset >>> o BUG() : this was in V0 >>> >>> reset is just bad because system would just reboot without any indication >>> if hyp crashes. with loop at least there is a hang, and one could make >>> note of it, and if internal, attach debugger. >>> >>> BUG is best imo because with hyp gone linux will try to redo panic >>> and we would print something extra to help. I think i'll just go >>> back to my V0: BUG() >>> >>>>> +} >>>>> +/* Tell gcc we are using lretq long jump in the above function intentionally */ >>>>> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry); >>>>> + >>>>> +static void hv_mark_tss_not_busy(void) >>>>> +{ >>>>> + struct desc_struct *desc = get_current_gdt_rw(); >>>>> + tss_desc tss; >>>>> + >>>>> + memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc)); >>>>> + tss.type = 0x9; /* available 64-bit TSS. 
0xB is busy TSS */ >>>>> + write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS); >>>>> +} >>>>> + >>>>> +/* Save essential context */ >>>>> +static void hv_hvcrash_ctxt_save(void) >>>>> +{ >>>>> + struct hv_crash_ctxt *ctxt = &hv_crash_ctxt; >>>>> + >>>>> + asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp)); >>>>> + >>>>> + ctxt->cr0 = native_read_cr0(); >>>>> + ctxt->cr4 = native_read_cr4(); >>>>> + >>>>> + asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2)); >>>>> + asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8)); >>>>> + >>>>> + asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs)); >>>>> + asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss)); >>>>> + asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds)); >>>>> + asm volatile("movl %%es, %%eax" : "=a"(ctxt->es)); >>>>> + asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs)); >>>>> + asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs)); >>>>> + >>>>> + native_store_gdt(&ctxt->gdtr); >>>>> + store_idt(&ctxt->idtr); >>>>> + >>>>> + ctxt->gsbase = __rdmsr(MSR_GS_BASE); >>>>> + ctxt->efer = __rdmsr(MSR_EFER); >>>>> + ctxt->pat = __rdmsr(MSR_IA32_CR_PAT); >>>>> +} >>>>> + >>>>> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */ >>>>> +static void hv_crash_fixup_kernpt(void) >>>>> +{ >>>>> + pgd_t *pgd; >>>>> + p4d_t *p4d; >>>>> + >>>>> + pgd = pgd_offset_k(trampoline_pa); >>>>> + p4d = p4d_offset(pgd, trampoline_pa); >>>>> + >>>>> + /* trampoline_pa is below 4G, so no pre-existing entry to clobber */ >>>>> + p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]); >>>>> + p4d->p4d = p4d->p4d & ~(_PAGE_NX); /* enable execute */ >>>>> +} >>>>> + >>>>> +/* >>>>> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has >>>>> + * crashed and will collect core. This will cause the hyp to quiesce and >>>>> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp. 
>>>>> + */ >>>>> +static void hv_notify_prepare_hyp(void) >>>>> +{ >>>>> + u64 status; >>>>> + struct hv_input_notify_partition_event *input; >>>>> + struct hv_partition_event_root_crashdump_input *cda; >>>>> + >>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); >>>>> + cda = &input->input.crashdump_input; >>>> >>>> The code ordering here is a bit weird. I'd expect this line to be grouped >>>> with cda->crashdump_action being set. >>> >>> we are setting two pointers, and using them later. setting pointers >>> up front is pretty normal. >>> >>>>> + memset(input, 0, sizeof(*input)); >>>>> + input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP; >>>>> + >>>>> + cda->crashdump_action = HV_CRASHDUMP_ENTRY; >>>>> + status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); >>>>> + if (!hv_result_success(status)) >>>>> + return; >>>>> + >>>>> + cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS; >>>>> + hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL); >>>>> +} >>>>> + >>>>> +/* >>>>> + * Common function for all cpus before devirtualization. >>>>> + * >>>>> + * Hypervisor crash: all cpus get here in nmi context. >>>>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi >>>>> + * context. Note, panicing cpu may not be the bsp. >>>>> + * >>>>> + * The function is not inlined so it will show on the stack. It is named so >>>>> + * because the crash cmd looks for certain well known function names on the >>>>> + * stack before looking into the cpu saved note in the elf section, and >>>>> + * that work is currently incomplete. >>>>> + * >>>>> + * Notes: >>>>> + * Hypervisor crash: >>>>> + * - the hypervisor is in a very restrictive mode at this point and any >>>>> + * vmexit it cannot handle would result in reboot. For example, console >>>>> + * output from here would result in synic ipi hcall, which would result >>>>> + * in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible. 
>>>>> + * >>>>> + * Devirtualization is supported from the bsp only. >>>>> + */ >>>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs) >>>>> +{ >>>>> + struct hv_input_disable_hyp_ex *input; >>>>> + u64 status; >>>>> + int msecs = 1000, ccpu = smp_processor_id(); >>>>> + >>>>> + if (ccpu == 0) { >>>>> + /* crash_save_cpu() will be done in the kexec path */ >>>>> + cpu_emergency_stop_pt(); /* disable performance trace */ >>>>> + atomic_inc(&crash_cpus_wait); >>>>> + } else { >>>>> + crash_save_cpu(regs, ccpu); >>>>> + cpu_emergency_stop_pt(); /* disable performance trace */ >>>>> + atomic_inc(&crash_cpus_wait); >>>>> + for (;;); /* cause no vmexits */ >>>>> + } >>>>> + >>>>> + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--) >>>>> + mdelay(1); >>>>> + >>>>> + stop_nmi(); >>>>> + if (!hv_has_crashed) >>>>> + hv_notify_prepare_hyp(); >>>>> + >>>>> + if (crashing_cpu == -1) >>>>> + crashing_cpu = ccpu; /* crash cmd uses this */ >>>> >>>> Could just be "crashing_cpu = 0" since only the BSP gets here. >>> >>> a code change request has been open for while to remove the requirement >>> of bsp.. >>> >>>>> + >>>>> + hv_hvcrash_ctxt_save(); >>>>> + hv_mark_tss_not_busy(); >>>>> + hv_crash_fixup_kernpt(); >>>>> + >>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); >>>>> + memset(input, 0, sizeof(*input)); >>>>> + input->rip = trampoline_pa; /* PA of hv_crash_asm32 */ >>>>> + input->arg = devirt_cr3arg; /* PA of trampoline page table L4 */ >>>> >>>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data? >>>> And just for clarification, Hyper-V treats this "arg" value as opaque and does >>>> not access it. It only provides it in EDI when it invokes the trampoline >>>> function, right? >>> >>> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables). >> >> Yes, the comment matches the name of the "devirt_cr3arg" variable. 
>> Unfortunately my previous comment was incomplete because the value
>> stored in the static variable "devirt_cr3arg" isn't the address of an L4 page
>> table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the
>> PA of struct hv_crash_tramp_data. The CR3 value is stored in the
>> tramp32_cr3 field (at offset 0) of that structure, so there's an additional level
>> of indirection. The (corrected) comment in the header to hv_crash_asm32()
>> describes EDI as containing "PA of struct hv_crash_tramp_data", which
>> ought to match what is described here. I'd say that "devirt_cr3arg" ought
>> to be renamed to "tramp_data_pa" or something else parallel to
>> "trampoline_pa".
>
> hyp needs trampoline cr3 for transition, we pass it as an arg. we piggy
> back extra information for ourselves needed in trampoline.S. so it's
> all good.

actually, what i said earlier was true, not above. that the arg is
opaque and hyp does not use it (we are transitioning paging off after
all!). i did this all almost two years ago, so had vague recollections
but finally had time today to go back to square one and old notes,
and remember things now. so final answer:

the hypercall calls it TrampolineCr3, i guess this is how windows uses it
(they have customized kernel code for core collection). doing that was
becoming too intrusive on linux, so i decided to use the arg to pass the
info i needed in the trampoline code. Since the hypercall calls the arg
TrampolineCr3, i must have just used that name for the arg to match it,
probably falsely assuming hypervisor somehow looked at it. (actually,
the windows hypercall wrapper does look at it to make sure it is a
ram address).

since the hypercall doesn't use the arg, it could just call it
devirtArg, but maybe in the past they used it somehow. in my latest
version, i just call it devirt_arg.

>>> right, comes in edi, i don't know what EDI is (just kidding!)...
>>> >>>>> + >>>>> + status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL); >>>>> + >>>>> + /* Devirt failed, just reboot as things are in very bad state now */ >>>>> + native_wrmsrq(HV_X64_MSR_RESET, 1); /* get hv to reboot */ >>>>> +} >>>>> + >>>>> +/* >>>>> + * Generic nmi callback handler: could be called without any crash also. >>>>> + * hv crash: hypervisor injects nmi's into all cpus >>>>> + * lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus >>>>> + */ >>>>> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs) >>>>> +{ >>>>> + int ccpu = smp_processor_id(); >>>>> + >>>>> + if (!hv_has_crashed && hv_cda && hv_cda->cda_valid) >>>>> + hv_has_crashed = 1; >>>>> + >>>>> + if (!hv_has_crashed && !lx_has_crashed) >>>>> + return NMI_DONE; /* ignore the nmi */ >>>>> + >>>>> + if (hv_has_crashed) { >>>>> + if (!kexec_crash_loaded() || !hv_crash_enabled) { >>>>> + if (ccpu == 0) { >>>>> + native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */ >>>>> + } else >>>>> + for (;;); /* cause no vmexits */ >>>>> + } >>>>> + } >>>>> + >>>>> + crash_nmi_callback(regs); >>>>> + >>>>> + return NMI_DONE; >>>> >>>> crash_nmi_callback() should never return, right? Normally one would >>>> expect to return NMI_HANDLED here, but I guess it doesn't matter >>>> if the return is never executed. >>> >>> correct. >>> >>>>> +} >>>>> + >>>>> +/* >>>>> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus >>>>> + * >>>>> + * On normal linux panic, this is called twice: first from panic and then again >>>>> + * from native_machine_crash_shutdown. >>>>> + * >>>>> + * In case of mshv, 3 ways to get here: >>>>> + * 1. hv crash (only bsp will get here): >>>>> + * BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry >>>>> + * -> __crash_kexec -> native_machine_crash_shutdown >>>>> + * -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus >>>>> + * linux panic: >>>>> + * 2. 
panic cpu x: panic() -> crash_smp_send_stop >>>>> + * -> smp_ops.crash_stop_other_cpus >>>>> + * 3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop >>>>> + * >>>>> + * NB: noclone and non standard stack because of call to crash_setup_regs(). >>>>> + */ >>>>> +static void __noclone hv_crash_stop_other_cpus(void) >>>>> +{ >>>>> + static int crash_stop_done; >>>>> + struct pt_regs lregs; >>>>> + int ccpu = smp_processor_id(); >>>>> + >>>>> + if (hv_has_crashed) >>>>> + return; /* all cpus already in nmi handler path */ >>>>> + >>>>> + if (!kexec_crash_loaded()) >>>>> + return; >>>> >>>> If we're in a normal panic path (your Case #2 above) with no kdump kernel >>>> loaded, why leave the other vCPUs running? Seems like that could violate >>>> expectations in vpanic(), where it calls panic_other_cpus_shutdown() and >>>> thereafter assumes other vCPUs are not running. >>> >>> no, there is lots of complexity here! >>> >>> if we hang vcpus here, hyp will note and may trigger its own watchdog. >>> also, machine_crash_shutdown() does another ipi. >>> >>> I think the best thing to do here is go back to my V0 which did not >>> have check for kexec_crash_loaded(), but had this in hv_crash_c_entry: >>> >>> + /* we are now fully in devirtualized normal kernel mode */ >>> + __crash_kexec(NULL); >>> + >>> + BUG(); >>> >>> >>> this way hyp would be disabled, ie, system devirtualized, and >>> __crash_kernel() will return, resulting in BUG() that will cause >>> it to go thru panic and honor panic= parameter with either hang >>> or reset. instead of bug, i could just call panic() also. >>> >>>>> + >>>>> + if (crash_stop_done) >>>>> + return; >>>>> + crash_stop_done = 1; >>>> >>>> Is crash_stop_done necessary? hv_crash_stop_other_cpus() is called >>>> from crash_smp_send_stop(), which has its own static variable >>>> "cpus_stopped" that does the same thing. >>> >>> yes. for error paths. 
>>> >>>>> + >>>>> + /* linux has crashed: hv is healthy, we can ipi safely */ >>>>> + lx_has_crashed = 1; >>>>> + wmb(); /* nmi handlers look at lx_has_crashed */ >>>>> + >>>>> + apic->send_IPI_allbutself(NMI_VECTOR); >>>> >>>> The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus(). >>>> In addition to sending the NMI IPI, it does disable_local_APIC(). I don't know, but >>>> should disable_local_APIC() be done somewhere here as well? >>> >>> no, hyp does that. >> >> As part of the devirt operation initiated by the HVCALL_DISABLE_HYP_EX >> hypercall in crash_nmi_callback()? This gets back to an earlier question/comment >> where I was trying to figure out if the APIC is still enabled, and in what mode, >> when hv_crash_asm32() is invoked. > >>> >>>>> + >>>>> + if (crashing_cpu == -1) >>>>> + crashing_cpu = ccpu; /* crash cmd uses this */ >>>>> + >>>>> + /* crash_setup_regs() happens in kexec also, but for the kexec cpu which >>>>> + * is the bsp. We could be here on non-bsp cpu, collect regs if so. 
>>>>> + */ >>>>> + if (ccpu) >>>>> + crash_setup_regs(&lregs, NULL); >>>>> + >>>>> + crash_nmi_callback(&lregs); >>>>> +} >>>>> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus); >>>>> + >>>>> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */ >>>>> +struct hv_gdtreg_32 { >>>>> + u16 fill; >>>>> + u16 limit; >>>>> + u32 address; >>>>> +} __packed; >>>>> + >>>>> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */ >>>>> +struct hv_crash_tramp_gdt { >>>>> + u64 null; /* index 0, selector 0, null selector */ >>>>> + u64 cs64; /* index 1, selector 8, cs64 selector */ >>>>> +} __packed; >>>>> + >>>>> +/* No stack, so jump via far ptr in memory to load the 64bit CS */ >>>>> +struct hv_cs_jmptgt { >>>>> + u32 address; >>>>> + u16 csval; >>>>> + u16 fill; >>>>> +} __packed; >>>>> + >>>>> +/* This trampoline data is copied onto the trampoline page after the asm code */ >>>>> +struct hv_crash_tramp_data { >>>>> + u64 tramp32_cr3; >>>>> + u64 kernel_cr3; >>>>> + struct hv_gdtreg_32 gdtr32; >>>>> + struct hv_crash_tramp_gdt tramp_gdt; >>>>> + struct hv_cs_jmptgt cs_jmptgt; >>>>> + u64 c_entry_addr; >>>>> +} __packed; >>>>> + >>>>> +/* >>>>> + * Setup a temporary gdt to allow the asm code to switch to the long mode. >>>>> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip >>>>> + * relative addressing, hence we must use trampoline_pa here. Also, save other >>>>> + * info like jmp and C entry targets for same reasons. 
>>>>> + * >>>>> + * Returns: 0 on success, -1 on error >>>>> + */ >>>>> +static int hv_crash_setup_trampdata(u64 trampoline_va) >>>>> +{ >>>>> + int size, offs; >>>>> + void *dest; >>>>> + struct hv_crash_tramp_data *tramp; >>>>> + >>>>> + /* These must match exactly the ones in the corresponding asm file */ >>>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0); >>>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8); >>>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18); >>>>> + BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, >>>>> + cs_jmptgt.address) != 40); >>>> >>>> It would be nice to pick up the constants from a #include file that is >>>> shared with the asm code in Patch 4 of the series. >>> >>> yeah, could go either way, some don't like tiny headers... if there are >>> no objections to new header for this, i could go that way too. >> >> Saw your follow-on comments about this as well. The tiny header >> is ugly. It's a judgment call that can go either way, so go with your >> preference. >> >>> >>>>> + >>>>> + /* hv_crash_asm_end is beyond last byte by 1 */ >>>>> + size = &hv_crash_asm_end - &hv_crash_asm32; >>>>> + if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) { >>>>> + pr_err("%s: trampoline page overflow\n", __func__); >>>>> + return -1; >>>>> + } >>>>> + >>>>> + dest = (void *)trampoline_va; >>>>> + memcpy(dest, &hv_crash_asm32, size); >>>>> + >>>>> + dest += size; >>>>> + dest = (void *)round_up((ulong)dest, 16); >>>>> + tramp = (struct hv_crash_tramp_data *)dest; >>>>> + >>>>> + /* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by >>>>> + * non-PCID-aware users". 
Build cr3 with pcid 0 >>>>> + */ >>>>> + tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]); >>>>> + >>>>> + /* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */ >>>>> + tramp->kernel_cr3 = __sme_pa(init_mm.pgd); >>>>> + >>>>> + tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt); >>>>> + tramp->gdtr32.address = trampoline_pa + >>>>> + (ulong)&tramp->tramp_gdt - trampoline_va; >>>>> + >>>>> + /* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */ >>>>> + tramp->tramp_gdt.cs64 = 0x00af9a000000ffff; >>>>> + >>>>> + tramp->cs_jmptgt.csval = 0x8; >>>>> + offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32; >>>>> + tramp->cs_jmptgt.address = trampoline_pa + offs; >>>>> + >>>>> + tramp->c_entry_addr = (u64)&hv_crash_c_entry; >>>>> + >>>>> + devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va; >>>>> + >>>>> + return 0; >>>>> +} >>>>> + >>>>> +/* >>>>> + * Build 32bit trampoline page table for transition from protected mode >>>>> + * non-paging to long-mode paging. This transition needs pagetables below 4G. 
>>>>> + */ >>>>> +static void hv_crash_build_tramp_pt(void) >>>>> +{ >>>>> + p4d_t *p4d; >>>>> + pud_t *pud; >>>>> + pmd_t *pmd; >>>>> + pte_t *pte; >>>>> + u64 pa, addr = trampoline_pa; >>>>> + >>>>> + p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d); >>>>> + pa = virt_to_phys(hv_crash_ptpgs[1]); >>>>> + set_p4d(p4d, __p4d(_PAGE_TABLE | pa)); >>>>> + p4d->p4d &= ~(_PAGE_NX); /* disable no execute */ >>>>> + >>>>> + pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud); >>>>> + pa = virt_to_phys(hv_crash_ptpgs[2]); >>>>> + set_pud(pud, __pud(_PAGE_TABLE | pa)); >>>>> + >>>>> + pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd); >>>>> + pa = virt_to_phys(hv_crash_ptpgs[3]); >>>>> + set_pmd(pmd, __pmd(_PAGE_TABLE | pa)); >>>>> + >>>>> + pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte); >>>>> + set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC)); >>>>> +} >>>>> + >>>>> +/* >>>>> + * Setup trampoline for devirtualization: >>>>> + * - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to >>>>> + * in protected mode. >>>>> + * - 4 pages for a temporary page table that asm code uses to turn paging on >>>>> + * - a temporary gdt to use in the compat mode. 
>>>>> + * >>>>> + * Returns: 0 on success >>>>> + */ >>>>> +static int hv_crash_trampoline_setup(void) >>>>> +{ >>>>> + int i, rc, order; >>>>> + struct page *page; >>>>> + u64 trampoline_va; >>>>> + gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO; >>>>> + >>>>> + /* page for 32bit trampoline assembly code + hv_crash_tramp_data */ >>>>> + page = alloc_page(flags32); >>>>> + if (page == NULL) { >>>>> + pr_err("%s: failed to alloc asm stub page\n", __func__); >>>>> + return -1; >>>>> + } >>>>> + >>>>> + trampoline_va = (u64)page_to_virt(page); >>>>> + trampoline_pa = (u32)page_to_phys(page); >>>>> + >>>>> + order = 2; /* alloc 2^2 pages */ >>>>> + page = alloc_pages(flags32, order); >>>>> + if (page == NULL) { >>>>> + pr_err("%s: failed to alloc pt pages\n", __func__); >>>>> + free_page(trampoline_va); >>>>> + return -1; >>>>> + } >>>>> + >>>>> + for (i = 0; i < 4; i++, page++) >>>>> + hv_crash_ptpgs[i] = page_to_virt(page); >>>>> + >>>>> + hv_crash_build_tramp_pt(); >>>>> + >>>>> + rc = hv_crash_setup_trampdata(trampoline_va); >>>>> + if (rc) >>>>> + goto errout; >>>>> + >>>>> + return 0; >>>>> + >>>>> +errout: >>>>> + free_page(trampoline_va); >>>>> + free_pages((ulong)hv_crash_ptpgs[0], order); >>>>> + >>>>> + return rc; >>>>> +} >>>>> + >>>>> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */ >>>>> +void hv_root_crash_init(void) >>>>> +{ >>>>> + int rc; >>>>> + struct hv_input_get_system_property *input; >>>>> + struct hv_output_get_system_property *output; >>>>> + unsigned long flags; >>>>> + u64 status; >>>>> + union hv_pfn_range cda_info; >>>>> + >>>>> + if (pgtable_l5_enabled()) { >>>>> + pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n"); >>>>> + return; >>>>> + } >>>>> + >>>>> + rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST, >>>>> + "hv_crash_nmi"); >>>>> + if (rc) { >>>>> + pr_err("Hyper-V: failed to register crash nmi handler\n"); >>>>> + return; >>>>> + } >>>>> + >>>>> + 
local_irq_save(flags); >>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); >>>>> + output = *this_cpu_ptr(hyperv_pcpu_output_arg); >>>>> + >>>>> + memset(input, 0, sizeof(*input)); >>>>> + memset(output, 0, sizeof(*output)); >>>> >>>> Why zero the output area? This is one of those hypercall things that we're >>>> inconsistent about. A few hypercall call sites zero the output area, and it's >>>> not clear why they do. Hyper-V should be responsible for properly filling in >>>> the output area. Linux should not need to do this zero'ing, unless there's some >>>> known bug in Hyper-V for certain hypercalls, in which case there should be >>>> a code comment stating "why". >>> >>> for the same reason sometimes you see char *p = NULL, either leftover >>> code or someone was debugging or just copy and paste. this is just copy >>> paste. i agree in general that we don't need to clear it at all, in fact, >>> i'd like to remove them all! but i also understand people with different >>> skills and junior members find it easier to debug, and also we were in >>> early product development. for that reason, it doesn't have to be >>> consistent either, if some complex hypercalls are failing repeatedly, >>> just for ease of debug, one might leave it there temporarily. but >>> now that things are stable, i think we should just remove them all and >>> get used to a bit more inconvenient debugging... >> >> I see your point about debugging, but on balance I agree that they >> should all be removed. If there's some debug case, add it back >> temporarily to debug, but leave upstream without it. The zero'ing is >> also unnecessary code in the interrupt disabled window, which you >> have expressed concern about in a different thread. > > yeah, i've been extremely busy so not able to pay much attention to > upstreaming, but imo they should have been removed before upstreaming. > a simple patch that just removes memset of output would be welcome. 
> >>> >>>>> + input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA; >>>>> + >>>>> + status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output); >>>>> + cda_info.as_uint64 = output->hv_cda_info.as_uint64; >>>>> + local_irq_restore(flags); >>>>> + >>>>> + if (!hv_result_success(status)) { >>>>> + pr_err("Hyper-V: %s: property:%d %s\n", __func__, >>>>> + input->property_id, hv_result_to_string(status)); >>>>> + goto err_out; >>>>> + } >>>>> + >>>>> + if (cda_info.base_pfn == 0) { >>>>> + pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n"); >>>>> + goto err_out; >>>>> + } >>>>> + >>>>> + hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT); >>>> >>>> Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in >>>> terms of the Hyper-V page size, which isn't necessarily the guest page size. >>>> Yes, on x86 there's no difference, but for future robustness .... >>> >>> i don't know about guests, but we won't even boot if dom0 pg size >>> didn't match.. but easier to change than to make the case.. >> >> FWIW, a normal Linux guest on ARM64 works just fine with a page >> size of 16K or 64K, even though the underlying Hyper-V page size >> is only 4K. That's why we have HV_HYP_PAGE_SHIFT and related in >> the first place. Using it properly really matters for normal guests. >> (Having the guest page size smaller than the Hyper-V page size >> does *not* work, but there are no such use cases.) >> >> Even on ARM64, I know the root partition page size is required to >> match the Hyper-V page size. But using HV_HYP_PAGE_SIZE is >> still appropriate just to not leave code that will go wrong if the >> match requirement should ever change. 
>> >>> >>>>> + >>>>> + rc = hv_crash_trampoline_setup(); >>>>> + if (rc) >>>>> + goto err_out; >>>>> + >>>>> + smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus; >>>>> + >>>>> + crash_kexec_post_notifiers = true; >>>>> + hv_crash_enabled = 1; >>>>> + pr_info("Hyper-V: linux and hv kdump support enabled\n"); >>>> >>>> This message and the message below aren't consistent. One refers >>>> to "hv kdump" and the other to "hyp kdump". >>> >>>>> + >>>>> + return; >>>>> + >>>>> +err_out: >>>>> + unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi"); >>>>> + pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n"); >>>>> +} >>>>> -- >>>>> 2.36.1.vfs.0.0 ^ permalink raw reply [flat|nested] 29+ messages in thread
* RE: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore 2025-09-20 1:42 ` Mukesh R @ 2025-09-23 1:35 ` Michael Kelley 0 siblings, 0 replies; 29+ messages in thread From: Michael Kelley @ 2025-09-23 1:35 UTC (permalink / raw) To: Mukesh R, linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com, bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com, arnd@arndb.de From: Mukesh R <mrathor@linux.microsoft.com> Sent: Friday, September 19, 2025 6:43 PM > > On 9/18/25 19:32, Mukesh R wrote: > > On 9/18/25 16:53, Michael Kelley wrote: > >> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM > >>> > >>> On 9/15/25 10:55, Michael Kelley wrote: [snip] > >>>>> +/* > >>>>> + * Common function for all cpus before devirtualization. > >>>>> + * > >>>>> + * Hypervisor crash: all cpus get here in nmi context. > >>>>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi > >>>>> + * context. Note, panicing cpu may not be the bsp. > >>>>> + * > >>>>> + * The function is not inlined so it will show on the stack. It is named so > >>>>> + * because the crash cmd looks for certain well known function names on the > >>>>> + * stack before looking into the cpu saved note in the elf section, and > >>>>> + * that work is currently incomplete. > >>>>> + * > >>>>> + * Notes: > >>>>> + * Hypervisor crash: > >>>>> + * - the hypervisor is in a very restrictive mode at this point and any > >>>>> + * vmexit it cannot handle would result in reboot. For example, console > >>>>> + * output from here would result in synic ipi hcall, which would result > >>>>> + * in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible. > >>>>> + * > >>>>> + * Devirtualization is supported from the bsp only. 
> >>>>> + */ > >>>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs) > >>>>> +{ > >>>>> + struct hv_input_disable_hyp_ex *input; > >>>>> + u64 status; > >>>>> + int msecs = 1000, ccpu = smp_processor_id(); > >>>>> + > >>>>> + if (ccpu == 0) { > >>>>> + /* crash_save_cpu() will be done in the kexec path */ > >>>>> + cpu_emergency_stop_pt(); /* disable performance trace */ > >>>>> + atomic_inc(&crash_cpus_wait); > >>>>> + } else { > >>>>> + crash_save_cpu(regs, ccpu); > >>>>> + cpu_emergency_stop_pt(); /* disable performance trace */ > >>>>> + atomic_inc(&crash_cpus_wait); > >>>>> + for (;;); /* cause no vmexits */ > >>>>> + } > >>>>> + > >>>>> + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--) > >>>>> + mdelay(1); > >>>>> + > >>>>> + stop_nmi(); > >>>>> + if (!hv_has_crashed) > >>>>> + hv_notify_prepare_hyp(); > >>>>> + > >>>>> + if (crashing_cpu == -1) > >>>>> + crashing_cpu = ccpu; /* crash cmd uses this */ > >>>> > >>>> Could just be "crashing_cpu = 0" since only the BSP gets here. > >>> > >>> a code change request has been open for while to remove the requirement > >>> of bsp.. > >>> > >>>>> + > >>>>> + hv_hvcrash_ctxt_save(); > >>>>> + hv_mark_tss_not_busy(); > >>>>> + hv_crash_fixup_kernpt(); > >>>>> + > >>>>> + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > >>>>> + memset(input, 0, sizeof(*input)); > >>>>> + input->rip = trampoline_pa; /* PA of hv_crash_asm32 */ > >>>>> + input->arg = devirt_cr3arg; /* PA of trampoline page table L4 */ > >>>> > >>>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data? > >>>> And just for clarification, Hyper-V treats this "arg" value as opaque and does > >>>> not access it. It only provides it in EDI when it invokes the trampoline > >>>> function, right? > >>> > >>> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables). > >> > >> Yes, the comment matches the name of the "devirt_cr3arg" variable. 
> >> Unfortunately my previous comment was incomplete because the value
> >> stored in the static variable "devirt_cr3arg" isn't the address of an L4 page
> >> table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the
> >> PA of struct hv_crash_tramp_data. The CR3 value is stored in the
> >> tramp32_cr3 field (at offset 0) of that structure, so there's an additional level
> >> of indirection. The (corrected) comment in the header to hv_crash_asm32()
> >> describes EDI as containing "PA of struct hv_crash_tramp_data", which
> >> ought to match what is described here. I'd say that "devirt_cr3arg" ought
> >> to be renamed to "tramp_data_pa" or something else parallel to
> >> "trampoline_pa".
> >
> > hyp needs trampoline cr3 for transition, we pass it as an arg. we piggy
> > back extra information for ourselves needed in trampoline.S. so it's
> > all good.
>
> actually, what i said earlier was true, not above. that the arg is
> opaque and hyp does not use it (we are transitioning paging off after
> all!). i did this all almost two years ago, so had vague recollections
> but finally had time today to go back to square one and old notes,
> and remember things now. so final answer:
>
> the hypercall calls it TrampolineCr3, i guess this is how windows uses it
> (they have customized kernel code for core collection). doing that was
> becoming too intrusive on linux, so i decided to use the arg to pass the
> info i needed in the trampoline code. Since the hypercall calls the arg
> TrampolineCr3, i must have just used that name for the arg to match it,
> probably falsely assuming hypervisor somehow looked at it. (actually,
> the windows hypercall wrapper does look at it to make sure it is a
> ram address).
>
> since the hypercall doesn't use the arg, it could just call it
> devirtArg, but maybe in the past they used it somehow. in my latest
> version, i just call it devirt_arg.

OK. Good to get this all straightened out.
Please leave a code comment to the effect that the hypercall doesn't use
the arg, and that the value is provided solely to be passed to
hv_crash_asm32() for it to use. That means that struct hv_crash_tramp_data
is owned by Linux and can be changed/updated as needed. The assignment
statement to the hypercall input could look like:

	input->arg = devirt_arg;	/* PA of struct hv_crash_tramp_data */

which would align with the comment in the header of hv_crash_asm32().

Michael
* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore 2025-09-10 0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor 2025-09-15 17:55 ` Michael Kelley @ 2025-09-18 17:11 ` Stanislav Kinsburskii 1 sibling, 0 replies; 29+ messages in thread From: Stanislav Kinsburskii @ 2025-09-18 17:11 UTC (permalink / raw) To: Mukesh Rathor Cc: linux-hyperv, linux-kernel, linux-arch, kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86, hpa, arnd On Tue, Sep 09, 2025 at 05:10:08PM -0700, Mukesh Rathor wrote: <snip> > +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs) > +{ > + struct hv_input_disable_hyp_ex *input; > + u64 status; > + int msecs = 1000, ccpu = smp_processor_id(); > + > + if (ccpu == 0) { > + /* crash_save_cpu() will be done in the kexec path */ > + cpu_emergency_stop_pt(); /* disable performance trace */ > + atomic_inc(&crash_cpus_wait); > + } else { > + crash_save_cpu(regs, ccpu); > + cpu_emergency_stop_pt(); /* disable performance trace */ > + atomic_inc(&crash_cpus_wait); > + for (;;); /* cause no vmexits */ > + } > + > + while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--) > + mdelay(1); > + > + stop_nmi(); > + if (!hv_has_crashed) > + hv_notify_prepare_hyp(); > + > + if (crashing_cpu == -1) > + crashing_cpu = ccpu; /* crash cmd uses this */ > + > + hv_hvcrash_ctxt_save(); > + hv_mark_tss_not_busy(); > + hv_crash_fixup_kernpt(); > + > + input = *this_cpu_ptr(hyperv_pcpu_input_arg); > + memset(input, 0, sizeof(*input)); > + input->rip = trampoline_pa; /* PA of hv_crash_asm32 */ > + input->arg = devirt_cr3arg; /* PA of trampoline page table L4 */ > + > + status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL); > + > + /* Devirt failed, just reboot as things are in very bad state now */ > + native_wrmsrq(HV_X64_MSR_RESET, 1); /* get hv to reboot */ AFAIU here ... 
> +}
> +
> +/*
> + * Generic nmi callback handler: could be called without any crash also.
> + * hv crash: hypervisor injects nmi's into all cpus
> + * lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
> + */
> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
> +{
> +	int ccpu = smp_processor_id();
> +
> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
> +		hv_has_crashed = 1;
> +
> +	if (!hv_has_crashed && !lx_has_crashed)
> +		return NMI_DONE;	/* ignore the nmi */
> +
> +	if (hv_has_crashed) {
> +		if (!kexec_crash_loaded() || !hv_crash_enabled) {
> +			if (ccpu == 0) {
> +				native_wrmsrq(HV_X64_MSR_RESET, 1);	/* reboot */

... and here the machine will be reset, which in both cases won't allow
collecting the VMRS file, making it impossible to debug nested hypervisor
failures.

Perhaps it's worth preserving the state in any case (not just nested), but
the nested state in particular should be preserved.

Thanks,
Stanislav

> --
> 2.36.1.vfs.0.0
>
* [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
                   ` (4 preceding siblings ...)
  2025-09-10  0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-13  4:53   ` kernel test robot
                     ` (2 more replies)
  5 siblings, 3 replies; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
To: linux-hyperv, linux-kernel, linux-arch
Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

Enable build of the new files introduced in the earlier commits and add
a call to do the setup during boot.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 arch/x86/hyperv/Makefile       | 6 ++++++
 arch/x86/hyperv/hv_init.c      | 1 +
 include/asm-generic/mshyperv.h | 9 +++++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index d55f494f471d..6f5d97cddd80 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -5,4 +5,10 @@ obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
 
 ifdef CONFIG_X86_64
 obj-$(CONFIG_PARAVIRT_SPINLOCKS) += hv_spinlock.o
+
+ ifdef CONFIG_MSHV_ROOT
+  CFLAGS_REMOVE_hv_trampoline.o += -pg
+  CFLAGS_hv_trampoline.o += -fno-stack-protector
+  obj-$(CONFIG_CRASH_DUMP) += hv_crash.o hv_trampoline.o
+ endif
 endif
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index afdbda2dd7b7..577bbd143527 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -510,6 +510,7 @@ void __init hyperv_init(void)
 		memunmap(src);
 
 		hv_remap_tsc_clocksource();
+		hv_root_crash_init();
 	} else {
 		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
 		wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index dbd4c2f3aee3..952c221765f5 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -367,6 +367,15 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
 int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
 int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
 
+#if CONFIG_CRASH_DUMP
+void hv_root_crash_init(void);
+void hv_crash_asm32(void);
+void hv_crash_asm64_lbl(void);
+void hv_crash_asm_end(void);
+#else /* CONFIG_CRASH_DUMP */
+static inline void hv_root_crash_init(void) {}
+#endif /* CONFIG_CRASH_DUMP */
+
 #else /* CONFIG_MSHV_ROOT */
 static inline bool hv_root_partition(void) { return false; }
 static inline bool hv_l1vh_partition(void) { return false; }
-- 
2.36.1.vfs.0.0
* Re: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
@ 2025-09-13  4:53   ` kernel test robot
  2025-09-13  5:57   ` kernel test robot
  2025-09-15 17:56   ` Michael Kelley
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2025-09-13  4:53 UTC (permalink / raw)
To: Mukesh Rathor, linux-hyperv, linux-kernel, linux-arch
Cc: oe-kbuild-all, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, arnd

Hi Mukesh,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20250909]
[also build test WARNING on v6.17-rc5]
[cannot apply to tip/x86/core tip/master linus/master arnd-asm-generic/master tip/auto-latest v6.17-rc5 v6.17-rc4 v6.17-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Mukesh-Rathor/x86-hyperv-Rename-guest-crash-shutdown-function/20250910-081309
base:   next-20250909
patch link:    https://lore.kernel.org/r/20250910001009.2651481-7-mrathor%40linux.microsoft.com
patch subject: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
config: x86_64-randconfig-073-20250913 (https://download.01.org/0day-ci/archive/20250913/202509131228.naboUNkE-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250913/202509131228.naboUNkE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509131228.naboUNkE-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from arch/x86/include/asm/mshyperv.h:272,
                    from arch/x86/hyperv/hv_apic.c:29:
>> include/asm-generic/mshyperv.h:370:5: warning: "CONFIG_CRASH_DUMP" is not defined, evaluates to 0 [-Wundef]
     370 | #if CONFIG_CRASH_DUMP
         |     ^~~~~~~~~~~~~~~~~

vim +/CONFIG_CRASH_DUMP +370 include/asm-generic/mshyperv.h

   369	
 > 370	#if CONFIG_CRASH_DUMP
   371	void hv_root_crash_init(void);
   372	void hv_crash_asm32(void);
   373	void hv_crash_asm64_lbl(void);
   374	void hv_crash_asm_end(void);
   375	#else /* CONFIG_CRASH_DUMP */
   376	static inline void hv_root_crash_init(void) {}
   377	#endif /* CONFIG_CRASH_DUMP */
   378	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
  2025-09-13  4:53   ` kernel test robot
@ 2025-09-13  5:57   ` kernel test robot
  2025-09-15 17:56   ` Michael Kelley
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2025-09-13  5:57 UTC (permalink / raw)
To: Mukesh Rathor, linux-hyperv, linux-kernel, linux-arch
Cc: oe-kbuild-all, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, arnd

Hi Mukesh,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20250909]
[also build test ERROR on v6.17-rc5]
[cannot apply to tip/x86/core tip/master linus/master arnd-asm-generic/master tip/auto-latest v6.17-rc5 v6.17-rc4 v6.17-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Mukesh-Rathor/x86-hyperv-Rename-guest-crash-shutdown-function/20250910-081309
base:   next-20250909
patch link:    https://lore.kernel.org/r/20250910001009.2651481-7-mrathor%40linux.microsoft.com
patch subject: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20250913/202509131304.WGYf1Sx7-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250913/202509131304.WGYf1Sx7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509131304.WGYf1Sx7-lkp@intel.com/

All errors (new ones prefixed by >>):

   arch/x86/hyperv/hv_init.c: In function 'hyperv_init':
>> arch/x86/hyperv/hv_init.c:550:17: error: implicit declaration of function 'hv_root_crash_init' [-Wimplicit-function-declaration]
     550 |                 hv_root_crash_init();
         |                 ^~~~~~~~~~~~~~~~~~

vim +/hv_root_crash_init +550 arch/x86/hyperv/hv_init.c

   431	
   432	/*
   433	 * This function is to be invoked early in the boot sequence after the
   434	 * hypervisor has been detected.
   435	 *
   436	 * 1. Setup the hypercall page.
   437	 * 2. Register Hyper-V specific clocksource.
   438	 * 3. Setup Hyper-V specific APIC entry points.
   439	 */
   440	void __init hyperv_init(void)
   441	{
   442		u64 guest_id;
   443		union hv_x64_msr_hypercall_contents hypercall_msr;
   444		int cpuhp;
   445	
   446		if (x86_hyper_type != X86_HYPER_MS_HYPERV)
   447			return;
   448	
   449		if (hv_common_init())
   450			return;
   451	
   452		/*
   453		 * The VP assist page is useless to a TDX guest: the only use we
   454		 * would have for it is lazy EOI, which can not be used with TDX.
   455		 */
   456		if (hv_isolation_type_tdx())
   457			hv_vp_assist_page = NULL;
   458		else
   459			hv_vp_assist_page = kcalloc(nr_cpu_ids,
   460						    sizeof(*hv_vp_assist_page),
   461						    GFP_KERNEL);
   462		if (!hv_vp_assist_page) {
   463			ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
   464	
   465			if (!hv_isolation_type_tdx())
   466				goto common_free;
   467		}
   468	
   469		if (ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
   470			/* Negotiate GHCB Version. */
   471			if (!hv_ghcb_negotiate_protocol())
   472				hv_ghcb_terminate(SEV_TERM_SET_GEN,
   473						  GHCB_SEV_ES_PROT_UNSUPPORTED);
   474	
   475			hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
   476			if (!hv_ghcb_pg)
   477				goto free_vp_assist_page;
   478		}
   479	
   480		cpuhp = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "x86/hyperv_init:online",
   481					  hv_cpu_init, hv_cpu_die);
   482		if (cpuhp < 0)
   483			goto free_ghcb_page;
   484	
   485		/*
   486		 * Setup the hypercall page and enable hypercalls.
   487		 * 1. Register the guest ID
   488		 * 2. Enable the hypercall and register the hypercall page
   489		 *
   490		 * A TDX VM with no paravisor only uses TDX GHCI rather than hv_hypercall_pg:
   491		 * when the hypercall input is a page, such a VM must pass a decrypted
   492		 * page to Hyper-V, e.g. hv_post_message() uses the per-CPU page
   493		 * hyperv_pcpu_input_arg, which is decrypted if no paravisor is present.
   494		 *
   495		 * A TDX VM with the paravisor uses hv_hypercall_pg for most hypercalls,
   496		 * which are handled by the paravisor and the VM must use an encrypted
   497		 * input page: in such a VM, the hyperv_pcpu_input_arg is encrypted and
   498		 * used in the hypercalls, e.g. see hv_mark_gpa_visibility() and
   499		 * hv_arch_irq_unmask(). Such a VM uses TDX GHCI for two hypercalls:
   500		 * 1. HVCALL_SIGNAL_EVENT: see vmbus_set_event() and _hv_do_fast_hypercall8().
   501		 * 2. HVCALL_POST_MESSAGE: the input page must be a decrypted page, i.e.
   502		 * hv_post_message() in such a VM can't use the encrypted hyperv_pcpu_input_arg;
   503		 * instead, hv_post_message() uses the post_msg_page, which is decrypted
   504		 * in such a VM and is only used in such a VM.
   505		 */
   506		guest_id = hv_generate_guest_id(LINUX_VERSION_CODE);
   507		wrmsrq(HV_X64_MSR_GUEST_OS_ID, guest_id);
   508	
   509		/* With the paravisor, the VM must also write the ID via GHCB/GHCI */
   510		hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);
   511	
   512		/* A TDX VM with no paravisor only uses TDX GHCI rather than hv_hypercall_pg */
   513		if (hv_isolation_type_tdx() && !ms_hyperv.paravisor_present)
   514			goto skip_hypercall_pg_init;
   515	
   516		hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, MODULES_VADDR,
   517				MODULES_END, GFP_KERNEL, PAGE_KERNEL_ROX,
   518				VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
   519				__builtin_return_address(0));
   520		if (hv_hypercall_pg == NULL)
   521			goto clean_guest_os_id;
   522	
   523		rdmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
   524		hypercall_msr.enable = 1;
   525	
   526		if (hv_root_partition()) {
   527			struct page *pg;
   528			void *src;
   529	
   530			/*
   531			 * For the root partition, the hypervisor will set up its
   532			 * hypercall page. The hypervisor guarantees it will not show
   533			 * up in the root's address space. The root can't change the
   534			 * location of the hypercall page.
   535			 *
   536			 * Order is important here. We must enable the hypercall page
   537			 * so it is populated with code, then copy the code to an
   538			 * executable page.
   539			 */
   540			wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
   541	
   542			pg = vmalloc_to_page(hv_hypercall_pg);
   543			src = memremap(hypercall_msr.guest_physical_address << PAGE_SHIFT, PAGE_SIZE,
   544				       MEMREMAP_WB);
   545			BUG_ON(!src);
   546			memcpy_to_page(pg, 0, src, HV_HYP_PAGE_SIZE);
   547			memunmap(src);
   548	
   549			hv_remap_tsc_clocksource();
 > 550			hv_root_crash_init();
   551		} else {
   552			hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
   553			wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
   554		}
   555	
   556		hv_set_hypercall_pg(hv_hypercall_pg);
   557	
   558	skip_hypercall_pg_init:
   559		/*
   560		 * hyperv_init() is called before LAPIC is initialized: see
   561		 * apic_intr_mode_init() -> x86_platform.apic_post_init() and
   562		 * apic_bsp_setup() -> setup_local_APIC(). The direct-mode STIMER
   563		 * depends on LAPIC, so hv_stimer_alloc() should be called from
   564		 * x86_init.timers.setup_percpu_clockev.
   565		 */
   566		old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev;
   567		x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev;
   568	
   569		hv_apic_init();
   570	
   571		x86_init.pci.arch_init = hv_pci_init;
   572	
   573		register_syscore_ops(&hv_syscore_ops);
   574	
   575		if (ms_hyperv.priv_high & HV_ACCESS_PARTITION_ID)
   576			hv_get_partition_id();
   577	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* RE: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
  2025-09-13  4:53   ` kernel test robot
  2025-09-13  5:57   ` kernel test robot
@ 2025-09-15 17:56   ` Michael Kelley
  2025-09-17  1:15     ` Mukesh R
  2 siblings, 1 reply; 29+ messages in thread
From: Michael Kelley @ 2025-09-15 17:56 UTC (permalink / raw)
To: Mukesh Rathor, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>
> Enable build of the new files introduced in the earlier commits and add
> call to do the setup during boot.
>
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/Makefile       | 6 ++++++
>  arch/x86/hyperv/hv_init.c      | 1 +
>  include/asm-generic/mshyperv.h | 9 +++++++++
>  3 files changed, 16 insertions(+)
>
> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
> index d55f494f471d..6f5d97cddd80 100644
> --- a/arch/x86/hyperv/Makefile
> +++ b/arch/x86/hyperv/Makefile
> @@ -5,4 +5,10 @@ obj-$(CONFIG_HYPERV_VTL_MODE) += hv_vtl.o
>
>  ifdef CONFIG_X86_64
>  obj-$(CONFIG_PARAVIRT_SPINLOCKS) += hv_spinlock.o
> +
> + ifdef CONFIG_MSHV_ROOT
> +  CFLAGS_REMOVE_hv_trampoline.o += -pg
> +  CFLAGS_hv_trampoline.o += -fno-stack-protector
> +  obj-$(CONFIG_CRASH_DUMP) += hv_crash.o hv_trampoline.o
> + endif
>  endif
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index afdbda2dd7b7..577bbd143527 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -510,6 +510,7 @@ void __init hyperv_init(void)
>  		memunmap(src);
>
>  		hv_remap_tsc_clocksource();
> +		hv_root_crash_init();
>  	} else {
>  		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
>  		wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index dbd4c2f3aee3..952c221765f5 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -367,6 +367,15 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
>  int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
>
> +#if CONFIG_CRASH_DUMP
> +void hv_root_crash_init(void);
> +void hv_crash_asm32(void);
> +void hv_crash_asm64_lbl(void);
> +void hv_crash_asm_end(void);
> +#else /* CONFIG_CRASH_DUMP */
> +static inline void hv_root_crash_init(void) {}
> +#endif /* CONFIG_CRASH_DUMP */
> +

The hv_crash_asm* functions are x86 specific. Seems like their
declarations should go in arch/x86/include/asm/mshyperv.h, not in
the architecture-neutral include/asm-generic/mshyperv.h.

> #else /* CONFIG_MSHV_ROOT */
> static inline bool hv_root_partition(void) { return false; }
> static inline bool hv_l1vh_partition(void) { return false; }
> --
> 2.36.1.vfs.0.0
>
* Re: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-15 17:56   ` Michael Kelley
@ 2025-09-17  1:15     ` Mukesh R
  2025-09-18 23:53       ` Michael Kelley
  0 siblings, 1 reply; 29+ messages in thread
From: Mukesh R @ 2025-09-17  1:15 UTC (permalink / raw)
To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

On 9/15/25 10:56, Michael Kelley wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>
>> Enable build of the new files introduced in the earlier commits and add
>> call to do the setup during boot.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>>  arch/x86/hyperv/Makefile       | 6 ++++++
>>  arch/x86/hyperv/hv_init.c      | 1 +
>>  include/asm-generic/mshyperv.h | 9 +++++++++
>>  3 files changed, 16 insertions(+)
>>
>> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
>> index d55f494f471d..6f5d97cddd80 100644
>> --- a/arch/x86/hyperv/Makefile
>> +++ b/arch/x86/hyperv/Makefile
>> @@ -5,4 +5,10 @@ obj-$(CONFIG_HYPERV_VTL_MODE) += hv_vtl.o
>>
>>  ifdef CONFIG_X86_64
>>  obj-$(CONFIG_PARAVIRT_SPINLOCKS) += hv_spinlock.o
>> +
>> + ifdef CONFIG_MSHV_ROOT
>> +  CFLAGS_REMOVE_hv_trampoline.o += -pg
>> +  CFLAGS_hv_trampoline.o += -fno-stack-protector
>> +  obj-$(CONFIG_CRASH_DUMP) += hv_crash.o hv_trampoline.o
>> + endif
>>  endif
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index afdbda2dd7b7..577bbd143527 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -510,6 +510,7 @@ void __init hyperv_init(void)
>>  		memunmap(src);
>>
>>  		hv_remap_tsc_clocksource();
>> +		hv_root_crash_init();
>>  	} else {
>>  		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
>>  		wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index dbd4c2f3aee3..952c221765f5 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -367,6 +367,15 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
>>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
>>  int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
>>
>> +#if CONFIG_CRASH_DUMP
>> +void hv_root_crash_init(void);
>> +void hv_crash_asm32(void);
>> +void hv_crash_asm64_lbl(void);
>> +void hv_crash_asm_end(void);
>> +#else /* CONFIG_CRASH_DUMP */
>> +static inline void hv_root_crash_init(void) {}
>> +#endif /* CONFIG_CRASH_DUMP */
>> +
>
> The hv_crash_asm* functions are x86 specific. Seems like their
> declarations should go in arch/x86/include/asm/mshyperv.h, not in
> the architecture-neutral include/asm-generic/mshyperv.h.

well, arm port is going on. i suppose i could move it to x86 and
they can move it back here in their patch submissions. hopefully
they will remember or someone will catch it.

>> #else /* CONFIG_MSHV_ROOT */
>> static inline bool hv_root_partition(void) { return false; }
>> static inline bool hv_l1vh_partition(void) { return false; }
>> --
>> 2.36.1.vfs.0.0
>>
* RE: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-17  1:15     ` Mukesh R
@ 2025-09-18 23:53       ` Michael Kelley
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Kelley @ 2025-09-18 23:53 UTC (permalink / raw)
To: Mukesh R, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:16 PM
>
> On 9/15/25 10:56, Michael Kelley wrote:
> > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> >>
> >> Enable build of the new files introduced in the earlier commits and add
> >> call to do the setup during boot.
> >>
> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> ---
> >>  arch/x86/hyperv/Makefile       | 6 ++++++
> >>  arch/x86/hyperv/hv_init.c      | 1 +
> >>  include/asm-generic/mshyperv.h | 9 +++++++++
> >>  3 files changed, 16 insertions(+)
> >>
> >> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
> >> index d55f494f471d..6f5d97cddd80 100644
> >> --- a/arch/x86/hyperv/Makefile
> >> +++ b/arch/x86/hyperv/Makefile
> >> @@ -5,4 +5,10 @@ obj-$(CONFIG_HYPERV_VTL_MODE) += hv_vtl.o
> >>
> >>  ifdef CONFIG_X86_64
> >>  obj-$(CONFIG_PARAVIRT_SPINLOCKS) += hv_spinlock.o
> >> +
> >> + ifdef CONFIG_MSHV_ROOT
> >> +  CFLAGS_REMOVE_hv_trampoline.o += -pg
> >> +  CFLAGS_hv_trampoline.o += -fno-stack-protector
> >> +  obj-$(CONFIG_CRASH_DUMP) += hv_crash.o hv_trampoline.o
> >> + endif
> >>  endif
> >> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> >> index afdbda2dd7b7..577bbd143527 100644
> >> --- a/arch/x86/hyperv/hv_init.c
> >> +++ b/arch/x86/hyperv/hv_init.c
> >> @@ -510,6 +510,7 @@ void __init hyperv_init(void)
> >>  		memunmap(src);
> >>
> >>  		hv_remap_tsc_clocksource();
> >> +		hv_root_crash_init();
> >>  	} else {
> >>  		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
> >>  		wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> >> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> >> index dbd4c2f3aee3..952c221765f5 100644
> >> --- a/include/asm-generic/mshyperv.h
> >> +++ b/include/asm-generic/mshyperv.h
> >> @@ -367,6 +367,15 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
> >>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
> >>  int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
> >>
> >> +#if CONFIG_CRASH_DUMP
> >> +void hv_root_crash_init(void);
> >> +void hv_crash_asm32(void);
> >> +void hv_crash_asm64_lbl(void);
> >> +void hv_crash_asm_end(void);
> >> +#else /* CONFIG_CRASH_DUMP */
> >> +static inline void hv_root_crash_init(void) {}
> >> +#endif /* CONFIG_CRASH_DUMP */
> >> +
> >
> > The hv_crash_asm* functions are x86 specific. Seems like their
> > declarations should go in arch/x86/include/asm/mshyperv.h, not in
> > the architecture-neutral include/asm-generic/mshyperv.h.
>
> well, arm port is going on. i suppose i could move it to x86 and
> they can move it back here in their patch submissions. hopefully
> they will remember or someone will catch it.

I could see the ARM64 implementation implementing its own version
of hv_root_crash_init() since that's a generic name. But sharing the
"asm" function names across architectures seems more questionable.
I doubt there would be hv_crash_asm32() on ARM64. :-)

>
> >> #else /* CONFIG_MSHV_ROOT */
> >> static inline bool hv_root_partition(void) { return false; }
> >> static inline bool hv_l1vh_partition(void) { return false; }
> >> --
> >> 2.36.1.vfs.0.0
> >>
end of thread, other threads: [~2025-09-23  1:35 UTC | newest]

Thread overview: 29+ messages
2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
2025-09-10  0:10 ` [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function Mukesh Rathor
2025-09-10  0:10 ` [PATCH v1 2/6] hyperv: Add two new hypercall numbers to guest ABI public header Mukesh Rathor
2025-09-10  0:10 ` [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support Mukesh Rathor
2025-09-15 17:54   ` Michael Kelley
2025-09-16  1:15     ` Mukesh R
2025-09-18 23:52       ` Michael Kelley
2025-09-10  0:10 ` [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor Mukesh Rathor
2025-09-15 17:55   ` Michael Kelley
2025-09-16 21:30     ` Mukesh R
2025-09-18 23:52       ` Michael Kelley
2025-09-19  9:06         ` Borislav Petkov
2025-09-19 19:09           ` Mukesh R
2025-09-10  0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor
2025-09-15 17:55   ` Michael Kelley
2025-09-17  1:13     ` Mukesh R
2025-09-17 20:37       ` Mukesh R
2025-09-18 23:53         ` Michael Kelley
2025-09-19  2:32           ` Mukesh R
2025-09-19 19:48             ` Michael Kelley
2025-09-20  1:42               ` Mukesh R
2025-09-23  1:35                 ` Michael Kelley
2025-09-18 17:11   ` Stanislav Kinsburskii
2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
2025-09-13  4:53   ` kernel test robot
2025-09-13  5:57   ` kernel test robot
2025-09-15 17:56   ` Michael Kelley
2025-09-17  1:15     ` Mukesh R
2025-09-18 23:53       ` Michael Kelley