linux-arch.vger.kernel.org archive mirror
* [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection
@ 2025-09-10  0:10 Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function Mukesh Rathor
                   ` (5 more replies)
  0 siblings, 6 replies; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

This patch series implements hypervisor core collection when running
under linux as root (aka dom0). By default, the initial hypervisor ram is
already mapped into linux as reserved. Further, any ram deposited later
comes from the linux memory heap. The hypervisor locks all that ram to
protect it from dom0 or any other domains. At a high level, the
methodology involves devirtualizing the system on the fly upon either a
linux crash or a hypervisor crash, then collecting ram as usual. This
means hypervisor ram is automatically collected into the vmcore.

Hypervisor pages are then accessible via the crash command (using raw
mem dump) or via windbg, which is able to read the hypervisor pdb
symbol file.

V1:
 o Describe changes in imperative mood. Remove "This commit"
 o Remove pr_emerg: causing unnecessary review noise
 o Add missing kexec_crash_loaded()
 o Remove leftover unnecessary memcpy in hv_crash_setup_trampdata
 o Address objtool warnings via annotations

Mukesh Rathor (6):
  x86/hyperv: Rename guest crash shutdown function
  hyperv: Add two new hypercall numbers to guest ABI public header
  hyperv: Add definitions for hypervisor crash dump support
  x86/hyperv: Add trampoline asm code to transition from hypervisor
  x86/hyperv: Implement hypervisor ram collection into vmcore
  x86/hyperv: Enable build of hypervisor crashdump collection files

 arch/x86/hyperv/Makefile        |   6 +
 arch/x86/hyperv/hv_crash.c      | 622 ++++++++++++++++++++++++++++++++
 arch/x86/hyperv/hv_init.c       |   1 +
 arch/x86/hyperv/hv_trampoline.S | 105 ++++++
 arch/x86/kernel/cpu/mshyperv.c  |   5 +-
 include/asm-generic/mshyperv.h  |   9 +
 include/hyperv/hvgdk_mini.h     |   2 +
 include/hyperv/hvhdk_mini.h     |  55 +++
 8 files changed, 803 insertions(+), 2 deletions(-)
 create mode 100644 arch/x86/hyperv/hv_crash.c
 create mode 100644 arch/x86/hyperv/hv_trampoline.S

-- 
2.36.1.vfs.0.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 2/6] hyperv: Add two new hypercall numbers to guest ABI public header Mukesh Rathor
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

Rename hv_machine_crash_shutdown to the more appropriate
hv_guest_crash_shutdown and make it applicable to guests only. This is
in preparation for the subsequent hypervisor root/dom0 crash support
patches.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 arch/x86/kernel/cpu/mshyperv.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 25773af116bc..1c6ec9b6107f 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -219,7 +219,7 @@ static void hv_machine_shutdown(void)
 #endif /* CONFIG_KEXEC_CORE */
 
 #ifdef CONFIG_CRASH_DUMP
-static void hv_machine_crash_shutdown(struct pt_regs *regs)
+static void hv_guest_crash_shutdown(struct pt_regs *regs)
 {
 	if (hv_crash_handler)
 		hv_crash_handler(regs);
@@ -562,7 +562,8 @@ static void __init ms_hyperv_init_platform(void)
 	machine_ops.shutdown = hv_machine_shutdown;
 #endif
 #if defined(CONFIG_CRASH_DUMP)
-	machine_ops.crash_shutdown = hv_machine_crash_shutdown;
+	if (!hv_root_partition())
+		machine_ops.crash_shutdown = hv_guest_crash_shutdown;
 #endif
 #endif
 	/*
-- 
2.36.1.vfs.0.0



* [PATCH v1 2/6] hyperv: Add two new hypercall numbers to guest ABI public header
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support Mukesh Rathor
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

In preparation for the subsequent crashdump patches, copy two hypercall
numbers to the guest ABI header published by Hyper-V: one to notify the
hypervisor of an event occurring in the root partition, and the other to
ask the hypervisor to disable itself.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 include/hyperv/hvgdk_mini.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/hyperv/hvgdk_mini.h b/include/hyperv/hvgdk_mini.h
index 1be7f6a02304..5441bf47059a 100644
--- a/include/hyperv/hvgdk_mini.h
+++ b/include/hyperv/hvgdk_mini.h
@@ -469,6 +469,7 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_MAP_DEVICE_INTERRUPT			0x007c
 #define HVCALL_UNMAP_DEVICE_INTERRUPT			0x007d
 #define HVCALL_RETARGET_INTERRUPT			0x007e
+#define HVCALL_NOTIFY_PARTITION_EVENT                   0x0087
 #define HVCALL_NOTIFY_PORT_RING_EMPTY			0x008b
 #define HVCALL_REGISTER_INTERCEPT_RESULT		0x0091
 #define HVCALL_ASSERT_VIRTUAL_INTERRUPT			0x0094
@@ -492,6 +493,7 @@ union hv_vp_assist_msr_contents {	 /* HV_REGISTER_VP_ASSIST_PAGE */
 #define HVCALL_GET_VP_CPUID_VALUES			0x00f4
 #define HVCALL_MMIO_READ				0x0106
 #define HVCALL_MMIO_WRITE				0x0107
+#define HVCALL_DISABLE_HYP_EX                           0x010f
 
 /* HV_HYPERCALL_INPUT */
 #define HV_HYPERCALL_RESULT_MASK	GENMASK_ULL(15, 0)
-- 
2.36.1.vfs.0.0



* [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 2/6] hyperv: Add two new hypercall numbers to guest ABI public header Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-15 17:54   ` Michael Kelley
  2025-09-10  0:10 ` [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor Mukesh Rathor
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

Add data structures for hypervisor crash dump support to the hypervisor
host ABI header file. Details of their usage are in subsequent commits.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 include/hyperv/hvhdk_mini.h | 55 +++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
index 858f6a3925b3..ad9a8048fb4e 100644
--- a/include/hyperv/hvhdk_mini.h
+++ b/include/hyperv/hvhdk_mini.h
@@ -116,6 +116,17 @@ enum hv_system_property {
 	/* Add more values when needed */
 	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
 	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
+	HV_SYSTEM_PROPERTY_CRASHDUMPAREA = 47,
+};
+
+#define HV_PFN_RANGE_PGBITS 24  /* HV_SPA_PAGE_RANGE_ADDITIONAL_PAGES_BITS */
+union hv_pfn_range {            /* HV_SPA_PAGE_RANGE */
+	u64 as_uint64;
+	struct {
+		/* 39:0: base pfn.  63:40: additional pages */
+		u64 base_pfn : 64 - HV_PFN_RANGE_PGBITS;
+		u64 add_pfns : HV_PFN_RANGE_PGBITS;
+	} __packed;
 };
 
 enum hv_dynamic_processor_feature_property {
@@ -142,6 +153,8 @@ struct hv_output_get_system_property {
 #if IS_ENABLED(CONFIG_X86)
 		u64 hv_processor_feature_value;
 #endif
+		union hv_pfn_range hv_cda_info; /* CrashdumpAreaAddress */
+		u64 hv_tramp_pa;                /* CrashdumpTrampolineAddress */
 	};
 } __packed;
 
@@ -234,6 +247,48 @@ union hv_gpa_page_access_state {
 	u8 as_uint8;
 } __packed;
 
+enum hv_crashdump_action {
+	HV_CRASHDUMP_NONE = 0,
+	HV_CRASHDUMP_SUSPEND_ALL_VPS,
+	HV_CRASHDUMP_PREPARE_FOR_STATE_SAVE,
+	HV_CRASHDUMP_STATE_SAVED,
+	HV_CRASHDUMP_ENTRY,
+};
+
+struct hv_partition_event_root_crashdump_input {
+	u32 crashdump_action; /* enum hv_crashdump_action */
+} __packed;
+
+struct hv_input_disable_hyp_ex {   /* HV_X64_INPUT_DISABLE_HYPERVISOR_EX */
+	u64 rip;
+	u64 arg;
+} __packed;
+
+struct hv_crashdump_area {	   /* HV_CRASHDUMP_AREA */
+	u32 version;
+	union {
+		u32 flags_as_uint32;
+		struct {
+			u32 cda_valid : 1;
+			u32 cda_unused : 31;
+		} __packed;
+	};
+	/* more unused fields */
+} __packed;
+
+union hv_partition_event_input {
+	struct hv_partition_event_root_crashdump_input crashdump_input;
+};
+
+enum hv_partition_event {
+	HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
+};
+
+struct hv_input_notify_partition_event {
+	u32 event;      /* enum hv_partition_event */
+	union hv_partition_event_input input;
+} __packed;
+
 struct hv_lp_startup_status {
 	u64 hv_status;
 	u64 substatus1;
-- 
2.36.1.vfs.0.0



* [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
                   ` (2 preceding siblings ...)
  2025-09-10  0:10 ` [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-15 17:55   ` Michael Kelley
  2025-09-10  0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor
  2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
  5 siblings, 1 reply; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

Introduce a small asm stub to transition from the hypervisor to linux
upon devirtualization.

At a high level, during a panic of either the hypervisor or the dom0
(aka root), the nmi handler asks the hypervisor to devirtualize. The
arguments to that hypercall include an entry point at which to return
back to linux. This asm stub implements that entry point.

The stub is entered in protected mode. It uses a temporary gdt and page
table to enable long mode and reach the kernel entry point, which then
restores the full kernel context and resumes execution to crash kexec.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++
 1 file changed, 105 insertions(+)
 create mode 100644 arch/x86/hyperv/hv_trampoline.S

diff --git a/arch/x86/hyperv/hv_trampoline.S b/arch/x86/hyperv/hv_trampoline.S
new file mode 100644
index 000000000000..27a755401a42
--- /dev/null
+++ b/arch/x86/hyperv/hv_trampoline.S
@@ -0,0 +1,105 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * X86 specific Hyper-V kdump/crash related code.
+ *
+ * Copyright (C) 2025, Microsoft, Inc.
+ *
+ */
+#include <linux/linkage.h>
+#include <asm/alternative.h>
+#include <asm/msr.h>
+#include <asm/processor-flags.h>
+#include <asm/nospec-branch.h>
+
+/*
+ * void noreturn hv_crash_asm32(arg1)
+ *    arg1 == edi == 32bit PA of struct hv_crash_trdata
+ *
+ * The hypervisor jumps here upon devirtualization in protected mode. This
+ * code gets copied to a page in the low 4G ie, 32bit space so it can run
+ * in the protected mode. Hence we cannot use any compile/link time offsets or
+ * addresses. It restores long mode via temporary gdt and page tables and
+ * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry.
+ *
+ * PreCondition (ie, Hypervisor call back ABI):
+ *  o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled
+ *  o CR4 is set to 0x0
+ *  o IA32_EFER is set to 0x901 (SCE and NXE are set)
+ *  o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX.
+ *  o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF
+ *  o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF
+ *  o LDTR is initialized as invalid (limit of 0)
+ *  o MSR PAT is power on default.
+ *  o Other state/registers are cleared. All TLBs flushed.
+ *
+ * See Intel SDM 10.8.5
+ */
+
+#define HV_CRASHDATA_OFFS_TRAMPCR3    0x0    /*	 0 */
+#define HV_CRASHDATA_OFFS_KERNCR3     0x8    /*	 8 */
+#define HV_CRASHDATA_OFFS_GDTRLIMIT  0x12    /* 18 */
+#define HV_CRASHDATA_OFFS_CS_JMPTGT  0x28    /* 40 */
+#define HV_CRASHDATA_OFFS_C_entry    0x30    /* 48 */
+#define HV_CRASHDATA_TRAMPOLINE_CS    0x8
+
+	.text
+	.code32
+
+SYM_CODE_START(hv_crash_asm32)
+	UNWIND_HINT_UNDEFINED
+	ANNOTATE_NOENDBR
+	movl	$X86_CR4_PAE, %ecx
+	movl	%ecx, %cr4
+
+	movl %edi, %ebx
+	add $HV_CRASHDATA_OFFS_TRAMPCR3, %ebx
+	movl %cs:(%ebx), %eax
+	movl %eax, %cr3
+
+	# Setup EFER for long mode now.
+	movl	$MSR_EFER, %ecx
+	rdmsr
+	btsl	$_EFER_LME, %eax
+	wrmsr
+
+	# Turn paging on using the temp 32bit trampoline page table.
+	movl %cr0, %eax
+	orl $(X86_CR0_PG), %eax
+	movl %eax, %cr0
+
+	/* since kernel cr3 could be above 4G, we need to be in the long mode
+	 * before we can load 64bits of the kernel cr3. We use a temp gdt for
+	 * that with CS.L=1 and CS.D=0 */
+	mov %edi, %eax
+	add $HV_CRASHDATA_OFFS_GDTRLIMIT, %eax
+	lgdtl %cs:(%eax)
+
+	/* not done yet, restore CS now to switch to CS.L=1 */
+	mov %edi, %eax
+	add $HV_CRASHDATA_OFFS_CS_JMPTGT, %eax
+	ljmp %cs:*(%eax)
+SYM_CODE_END(hv_crash_asm32)
+
+	/* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */
+	.code64
+	.balign 8
+SYM_CODE_START(hv_crash_asm64)
+	UNWIND_HINT_UNDEFINED
+	ANNOTATE_NOENDBR
+SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL)
+	/* restore kernel page tables so we can jump to kernel code */
+	mov %edi, %eax
+	add $HV_CRASHDATA_OFFS_KERNCR3, %eax
+	movq %cs:(%eax), %rbx
+	movq %rbx, %cr3
+
+	mov %edi, %eax
+	add $HV_CRASHDATA_OFFS_C_entry, %eax
+	movq %cs:(%eax), %rbx
+	ANNOTATE_RETPOLINE_SAFE
+	jmp *%rbx
+
+	int $3
+
+SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL)
+SYM_CODE_END(hv_crash_asm64)
-- 
2.36.1.vfs.0.0



* [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
                   ` (3 preceding siblings ...)
  2025-09-10  0:10 ` [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-15 17:55   ` Michael Kelley
  2025-09-18 17:11   ` Stanislav Kinsburskii
  2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
  5 siblings, 2 replies; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

Introduce a new file implementing collection of hypervisor ram into the
vmcore collected by linux. By default, hypervisor ram is locked, ie,
protected via hw page tables. Hyper-V implements a disable hypercall
which essentially devirtualizes the system on the fly; this mechanism
makes the hypervisor ram accessible to linux. Because the hypervisor ram
is already mapped into the linux address space (as reserved ram), it is
automatically collected into the vmcore without extra work. More details
of the implementation are available in the file prologue.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++
 1 file changed, 622 insertions(+)
 create mode 100644 arch/x86/hyperv/hv_crash.c

diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
new file mode 100644
index 000000000000..531bac79d598
--- /dev/null
+++ b/arch/x86/hyperv/hv_crash.c
@@ -0,0 +1,622 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * X86 specific Hyper-V kdump/crash support module
+ *
+ * Copyright (C) 2025, Microsoft, Inc.
+ *
+ * This module implements hypervisor ram collection into vmcore for both
+ * cases of the hypervisor crash and linux dom0/root crash. Hyper-V implements
+ * a devirtualization hypercall with a 32bit protected mode ABI callback. This
+ * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram
+ * is already mapped in linux, it is automatically collected into linux vmcore,
+ * and can be examined by the crash command (raw ram dump) or windbg.
+ *
+ * At a high level:
+ *
+ *  Hypervisor Crash:
+ *    Upon crash, the hypervisor goes into an emergency minimal dispatch
+ *    loop, a restrictive mode with very limited hypercall and msr support.
+ *    Each cpu then injects an NMI into the dom0/root vcpus. In the nmi
+ *    handler, linux checks a shared page to see whether the hypervisor has
+ *    crashed. This shared page is set up in hv_root_crash_init during boot.
+ *
+ *  Linux Crash:
+ *    In case of linux crash, the callback hv_crash_stop_other_cpus will send
+ *    NMIs to all cpus, then proceed to the crash_nmi_callback where it waits
+ *    for all cpus to be in NMI.
+ *
+ *  NMI Handler (upon quorum):
+ *    Eventually, in both cases, all cpus will end up in the nmi handler.
+ *    Hyper-V requires that disabling the hypervisor be done from the bsp.
+ *    So the bsp nmi handler saves the current context, does some fixups,
+ *    and makes the hypercall to disable the hypervisor, ie, devirtualize.
+ *    At that point the hypervisor suspends all vcpus (except the bsp),
+ *    unlocks all its ram, and returns to linux at the 32bit mode entry RIP.
+ *
+ *  Linux 32bit entry trampoline will then restore long mode and call C
+ *  function here to restore context and continue execution to crash kexec.
+ */
+
+#include <linux/delay.h>
+#include <linux/kexec.h>
+#include <linux/crash_dump.h>
+#include <linux/panic.h>
+#include <asm/apic.h>
+#include <asm/desc.h>
+#include <asm/page.h>
+#include <asm/pgalloc.h>
+#include <asm/mshyperv.h>
+#include <asm/nmi.h>
+#include <asm/idtentry.h>
+#include <asm/reboot.h>
+#include <asm/intel_pt.h>
+
+int hv_crash_enabled;
+EXPORT_SYMBOL_GPL(hv_crash_enabled);
+
+struct hv_crash_ctxt {
+	ulong rsp;
+	ulong cr0;
+	ulong cr2;
+	ulong cr4;
+	ulong cr8;
+
+	u16 cs;
+	u16 ss;
+	u16 ds;
+	u16 es;
+	u16 fs;
+	u16 gs;
+
+	u16 gdt_fill;
+	struct desc_ptr gdtr;
+	char idt_fill[6];
+	struct desc_ptr idtr;
+
+	u64 gsbase;
+	u64 efer;
+	u64 pat;
+};
+static struct hv_crash_ctxt hv_crash_ctxt;
+
+/* Shared hypervisor page that contains crash dump area we peek into.
+ * NB: windbg looks for "hv_cda" symbol so don't change it.
+ */
+static struct hv_crashdump_area *hv_cda;
+
+static u32 trampoline_pa, devirt_cr3arg;
+static atomic_t crash_cpus_wait;
+static void *hv_crash_ptpgs[4];
+static int hv_has_crashed, lx_has_crashed;
+
+/* This cannot be inlined as it needs stack */
+static noinline __noclone void hv_crash_restore_tss(void)
+{
+	load_TR_desc();
+}
+
+/* This cannot be inlined as it needs stack */
+static noinline void hv_crash_clear_kernpt(void)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+
+	/* Clear entry so it's not confusing to someone looking at the core */
+	pgd = pgd_offset_k(trampoline_pa);
+	p4d = p4d_offset(pgd, trampoline_pa);
+	native_p4d_clear(p4d);
+}
+
+/*
+ * This is the C entry point from the asm glue code after the devirt hypercall.
+ * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
+ * page tables with our below 4G page identity mapped, but using a temporary
+ * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not
+ * available. We restore kernel GDT, and rest of the context, and continue
+ * to kexec.
+ */
+static asmlinkage void __noreturn hv_crash_c_entry(void)
+{
+	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
+
+	/* first thing, restore kernel gdt */
+	native_load_gdt(&ctxt->gdtr);
+
+	asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
+	asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
+
+	asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
+	asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
+	asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
+	asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
+
+	native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
+	asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
+
+	asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
+	asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
+	asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr2));
+
+	native_load_idt(&ctxt->idtr);
+	native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
+	native_wrmsrq(MSR_EFER, ctxt->efer);
+
+	/* restore the original kernel CS now via far return */
+	asm volatile("movzwq %0, %%rax\n\t"
+		     "pushq %%rax\n\t"
+		     "pushq $1f\n\t"
+		     "lretq\n\t"
+		     "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
+
+	/* We are in asmlinkage without a stack frame, hence make C function
+	 * calls, which get stack frames, to restore the tss and clear PT entry.
+	 */
+	hv_crash_restore_tss();
+	hv_crash_clear_kernpt();
+
+	/* we are now fully in devirtualized normal kernel mode */
+	__crash_kexec(NULL);
+
+	for (;;)
+		cpu_relax();
+}
+/* Tell gcc we are using lretq long jump in the above function intentionally */
+STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
+
+static void hv_mark_tss_not_busy(void)
+{
+	struct desc_struct *desc = get_current_gdt_rw();
+	tss_desc tss;
+
+	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
+	tss.type = 0x9;        /* available 64-bit TSS. 0xB is busy TSS */
+	write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);
+}
+
+/* Save essential context */
+static void hv_hvcrash_ctxt_save(void)
+{
+	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
+
+	asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp));
+
+	ctxt->cr0 = native_read_cr0();
+	ctxt->cr4 = native_read_cr4();
+
+	asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2));
+	asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8));
+
+	asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs));
+	asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss));
+	asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds));
+	asm volatile("movl %%es, %%eax" : "=a"(ctxt->es));
+	asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs));
+	asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs));
+
+	native_store_gdt(&ctxt->gdtr);
+	store_idt(&ctxt->idtr);
+
+	ctxt->gsbase = __rdmsr(MSR_GS_BASE);
+	ctxt->efer = __rdmsr(MSR_EFER);
+	ctxt->pat = __rdmsr(MSR_IA32_CR_PAT);
+}
+
+/* Add trampoline page to the kernel pagetable for transition to kernel PT */
+static void hv_crash_fixup_kernpt(void)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+
+	pgd = pgd_offset_k(trampoline_pa);
+	p4d = p4d_offset(pgd, trampoline_pa);
+
+	/* trampoline_pa is below 4G, so no pre-existing entry to clobber */
+	p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]);
+	p4d->p4d = p4d->p4d & ~(_PAGE_NX);    /* enable execute */
+}
+
+/*
+ * Now that all cpus are in nmi and spinning, we notify the hyp that linux has
+ * crashed and will collect core. This will cause the hyp to quiesce and
+ * suspend all VPs except the bsp. Called if linux crashed and not the hyp.
+ */
+static void hv_notify_prepare_hyp(void)
+{
+	u64 status;
+	struct hv_input_notify_partition_event *input;
+	struct hv_partition_event_root_crashdump_input *cda;
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	cda = &input->input.crashdump_input;
+	memset(input, 0, sizeof(*input));
+	input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP;
+
+	cda->crashdump_action = HV_CRASHDUMP_ENTRY;
+	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
+	if (!hv_result_success(status))
+		return;
+
+	cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS;
+	hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
+}
+
+/*
+ * Common function for all cpus before devirtualization.
+ *
+ * Hypervisor crash: all cpus get here in nmi context.
+ * Linux crash: the panicking cpu gets here at base level, all others in nmi
+ *		context. Note, the panicking cpu may not be the bsp.
+ *
+ * The function is not inlined so it will show on the stack. It is named so
+ * because the crash cmd looks for certain well known function names on the
+ * stack before looking into the cpu saved note in the elf section, and
+ * that work is currently incomplete.
+ *
+ * Notes:
+ *  Hypervisor crash:
+ *    - the hypervisor is in a very restrictive mode at this point and any
+ *	vmexit it cannot handle would result in reboot. For example, console
+ *	output from here would result in synic ipi hcall, which would result
+ *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
+ *
+ *  Devirtualization is supported from the bsp only.
+ */
+static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
+{
+	struct hv_input_disable_hyp_ex *input;
+	u64 status;
+	int msecs = 1000, ccpu = smp_processor_id();
+
+	if (ccpu == 0) {
+		/* crash_save_cpu() will be done in the kexec path */
+		cpu_emergency_stop_pt();	/* disable performance trace */
+		atomic_inc(&crash_cpus_wait);
+	} else {
+		crash_save_cpu(regs, ccpu);
+		cpu_emergency_stop_pt();	/* disable performance trace */
+		atomic_inc(&crash_cpus_wait);
+		for (;;);			/* cause no vmexits */
+	}
+
+	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
+		mdelay(1);
+
+	stop_nmi();
+	if (!hv_has_crashed)
+		hv_notify_prepare_hyp();
+
+	if (crashing_cpu == -1)
+		crashing_cpu = ccpu;		/* crash cmd uses this */
+
+	hv_hvcrash_ctxt_save();
+	hv_mark_tss_not_busy();
+	hv_crash_fixup_kernpt();
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	memset(input, 0, sizeof(*input));
+	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
+	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
+
+	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
+
+	/* Devirt failed, just reboot as things are in very bad state now */
+	native_wrmsrq(HV_X64_MSR_RESET, 1);    /* get hv to reboot */
+}
+
+/*
+ * Generic nmi callback handler: could be called without any crash also.
+ *   hv crash: hypervisor injects nmi's into all cpus
+ *   lx crash: panicking cpu sends nmi to all but self via crash_stop_other_cpus
+ */
+static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
+{
+	int ccpu = smp_processor_id();
+
+	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
+		hv_has_crashed = 1;
+
+	if (!hv_has_crashed && !lx_has_crashed)
+		return NMI_DONE;	/* ignore the nmi */
+
+	if (hv_has_crashed) {
+		if (!kexec_crash_loaded() || !hv_crash_enabled) {
+			if (ccpu == 0) {
+				native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */
+			} else
+				for (;;);	/* cause no vmexits */
+		}
+	}
+
+	crash_nmi_callback(regs);
+
+	return NMI_DONE;
+}
+
+/*
+ * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus
+ *
+ * On normal linux panic, this is called twice: first from panic and then again
+ * from native_machine_crash_shutdown.
+ *
+ * In case of mshv, 3 ways to get here:
+ *  1. hv crash (only bsp will get here):
+ *	BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry
+ *		  -> __crash_kexec -> native_machine_crash_shutdown
+ *		  -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus
+ *  linux panic:
+ *	2. panic cpu x: panic() -> crash_smp_send_stop
+ *				     -> smp_ops.crash_stop_other_cpus
+ *	3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop
+ *
+ * NB: noclone and non standard stack because of call to crash_setup_regs().
+ */
+static void __noclone hv_crash_stop_other_cpus(void)
+{
+	static int crash_stop_done;
+	struct pt_regs lregs;
+	int ccpu = smp_processor_id();
+
+	if (hv_has_crashed)
+		return;		/* all cpus already in nmi handler path */
+
+	if (!kexec_crash_loaded())
+		return;
+
+	if (crash_stop_done)
+		return;
+	crash_stop_done = 1;
+
+	/* linux has crashed: hv is healthy, we can ipi safely */
+	lx_has_crashed = 1;
+	wmb();			/* nmi handlers look at lx_has_crashed */
+
+	apic->send_IPI_allbutself(NMI_VECTOR);
+
+	if (crashing_cpu == -1)
+		crashing_cpu = ccpu;		/* crash cmd uses this */
+
+	/* crash_setup_regs() happens in kexec also, but for the kexec cpu which
+	 * is the bsp. We could be here on non-bsp cpu, collect regs if so.
+	 */
+	if (ccpu)
+		crash_setup_regs(&lregs, NULL);
+
+	crash_nmi_callback(&lregs);
+}
+STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
+
+/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
+struct hv_gdtreg_32 {
+	u16 fill;
+	u16 limit;
+	u32 address;
+} __packed;
+
+/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
+struct hv_crash_tramp_gdt {
+	u64 null;	/* index 0, selector 0, null selector */
+	u64 cs64;	/* index 1, selector 8, cs64 selector */
+} __packed;
+
+/* No stack, so jump via far ptr in memory to load the 64bit CS */
+struct hv_cs_jmptgt {
+	u32 address;
+	u16 csval;
+	u16 fill;
+} __packed;
+
+/* This trampoline data is copied onto the trampoline page after the asm code */
+struct hv_crash_tramp_data {
+	u64 tramp32_cr3;
+	u64 kernel_cr3;
+	struct hv_gdtreg_32 gdtr32;
+	struct hv_crash_tramp_gdt tramp_gdt;
+	struct hv_cs_jmptgt cs_jmptgt;
+	u64 c_entry_addr;
+} __packed;
+
+/*
+ * Setup a temporary gdt to allow the asm code to switch to the long mode.
+ * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
+ * relative addressing, hence we must use trampoline_pa here. Also, save other
+ * info like jmp and C entry targets for same reasons.
+ *
+ * Returns: 0 on success, -1 on error
+ */
+static int hv_crash_setup_trampdata(u64 trampoline_va)
+{
+	int size, offs;
+	void *dest;
+	struct hv_crash_tramp_data *tramp;
+
+	/* These must match exactly the ones in the corresponding asm file */
+	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
+	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
+	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
+	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
+						     cs_jmptgt.address) != 40);
+
+	/* hv_crash_asm_end is beyond last byte by 1 */
+	size = &hv_crash_asm_end - &hv_crash_asm32;
+	if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
+		pr_err("%s: trampoline page overflow\n", __func__);
+		return -1;
+	}
+
+	dest = (void *)trampoline_va;
+	memcpy(dest, &hv_crash_asm32, size);
+
+	dest += size;
+	dest = (void *)round_up((ulong)dest, 16);
+	tramp = (struct hv_crash_tramp_data *)dest;
+
+	/* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
+	 * non-PCID-aware users". Build cr3 with pcid 0
+	 */
+	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
+
+	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
+	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
+
+	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
+	tramp->gdtr32.address = trampoline_pa +
+				   (ulong)&tramp->tramp_gdt - trampoline_va;
+
+	 /* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
+	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
+
+	tramp->cs_jmptgt.csval = 0x8;
+	offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
+	tramp->cs_jmptgt.address = trampoline_pa + offs;
+
+	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
+
+	devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
+
+	return 0;
+}
+
+/*
+ * Build a 32-bit trampoline page table for the transition from non-paging
+ * protected mode to long-mode paging. This transition needs page tables
+ * below 4G.
+ */
+static void hv_crash_build_tramp_pt(void)
+{
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+	u64 pa, addr = trampoline_pa;
+
+	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
+	pa = virt_to_phys(hv_crash_ptpgs[1]);
+	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
+	p4d->p4d &= ~(_PAGE_NX);	/* disable no execute */
+
+	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
+	pa = virt_to_phys(hv_crash_ptpgs[2]);
+	set_pud(pud, __pud(_PAGE_TABLE | pa));
+
+	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
+	pa = virt_to_phys(hv_crash_ptpgs[3]);
+	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
+
+	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
+	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
+}
+
+/*
+ * Setup trampoline for devirtualization:
+ *  - a page below 4G (i.e., at a 32-bit address) containing asm glue code
+ *    that mshv jumps to in protected mode
+ *  - 4 pages for a temporary page table that the asm code uses to turn
+ *    paging on
+ *  - a temporary gdt to use in compatibility mode
+ *
+ *  Returns: 0 on success
+ */
+static int hv_crash_trampoline_setup(void)
+{
+	int i, rc, order;
+	struct page *page;
+	u64 trampoline_va;
+	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
+
+	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
+	page = alloc_page(flags32);
+	if (page == NULL) {
+		pr_err("%s: failed to alloc asm stub page\n", __func__);
+		return -1;
+	}
+
+	trampoline_va = (u64)page_to_virt(page);
+	trampoline_pa = (u32)page_to_phys(page);
+
+	order = 2;	   /* alloc 2^2 pages */
+	page = alloc_pages(flags32, order);
+	if (page == NULL) {
+		pr_err("%s: failed to alloc pt pages\n", __func__);
+		free_page(trampoline_va);
+		return -1;
+	}
+
+	for (i = 0; i < 4; i++, page++)
+		hv_crash_ptpgs[i] = page_to_virt(page);
+
+	hv_crash_build_tramp_pt();
+
+	rc = hv_crash_setup_trampdata(trampoline_va);
+	if (rc)
+		goto errout;
+
+	return 0;
+
+errout:
+	free_page(trampoline_va);
+	free_pages((ulong)hv_crash_ptpgs[0], order);
+
+	return rc;
+}
+
+/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
+void hv_root_crash_init(void)
+{
+	int rc;
+	struct hv_input_get_system_property *input;
+	struct hv_output_get_system_property *output;
+	unsigned long flags;
+	u64 status;
+	union hv_pfn_range cda_info;
+
+	if (pgtable_l5_enabled()) {
+		pr_err("Hyper-V: crash dump not yet supported with 5-level page tables\n");
+		return;
+	}
+
+	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
+				  "hv_crash_nmi");
+	if (rc) {
+		pr_err("Hyper-V: failed to register crash nmi handler\n");
+		return;
+	}
+
+	local_irq_save(flags);
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
+
+	memset(input, 0, sizeof(*input));
+	memset(output, 0, sizeof(*output));
+	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
+
+	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
+	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
+	local_irq_restore(flags);
+
+	if (!hv_result_success(status)) {
+		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
+		       input->property_id, hv_result_to_string(status));
+		goto err_out;
+	}
+
+	if (cda_info.base_pfn == 0) {
+		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
+		goto err_out;
+	}
+
+	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
+
+	rc = hv_crash_trampoline_setup();
+	if (rc)
+		goto err_out;
+
+	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
+
+	crash_kexec_post_notifiers = true;
+	hv_crash_enabled = 1;
+	pr_info("Hyper-V: linux and hv kdump support enabled\n");
+
+	return;
+
+err_out:
+	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
+	pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
+}
-- 
2.36.1.vfs.0.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
                   ` (4 preceding siblings ...)
  2025-09-10  0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor
@ 2025-09-10  0:10 ` Mukesh Rathor
  2025-09-13  4:53   ` kernel test robot
                     ` (2 more replies)
  5 siblings, 3 replies; 29+ messages in thread
From: Mukesh Rathor @ 2025-09-10  0:10 UTC (permalink / raw)
  To: linux-hyperv, linux-kernel, linux-arch
  Cc: kys, haiyangz, wei.liu, decui, tglx, mingo, bp, dave.hansen, x86,
	hpa, arnd

Enable build of the new files introduced in the earlier commits, and add a
call to do the setup during boot.

Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
 arch/x86/hyperv/Makefile       | 6 ++++++
 arch/x86/hyperv/hv_init.c      | 1 +
 include/asm-generic/mshyperv.h | 9 +++++++++
 3 files changed, 16 insertions(+)

diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
index d55f494f471d..6f5d97cddd80 100644
--- a/arch/x86/hyperv/Makefile
+++ b/arch/x86/hyperv/Makefile
@@ -5,4 +5,10 @@ obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
 
 ifdef CONFIG_X86_64
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)	+= hv_spinlock.o
+
+ ifdef CONFIG_MSHV_ROOT
+  CFLAGS_REMOVE_hv_trampoline.o += -pg
+  CFLAGS_hv_trampoline.o        += -fno-stack-protector
+  obj-$(CONFIG_CRASH_DUMP)      += hv_crash.o hv_trampoline.o
+ endif
 endif
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index afdbda2dd7b7..577bbd143527 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -510,6 +510,7 @@ void __init hyperv_init(void)
 		memunmap(src);
 
 		hv_remap_tsc_clocksource();
+		hv_root_crash_init();
 	} else {
 		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
 		wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index dbd4c2f3aee3..952c221765f5 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -367,6 +367,15 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 num_pages);
 int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
 int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
 
+#if CONFIG_CRASH_DUMP
+void hv_root_crash_init(void);
+void hv_crash_asm32(void);
+void hv_crash_asm64_lbl(void);
+void hv_crash_asm_end(void);
+#else   /* CONFIG_CRASH_DUMP */
+static inline void hv_root_crash_init(void) {}
+#endif  /* CONFIG_CRASH_DUMP */
+
 #else /* CONFIG_MSHV_ROOT */
 static inline bool hv_root_partition(void) { return false; }
 static inline bool hv_l1vh_partition(void) { return false; }
-- 
2.36.1.vfs.0.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
@ 2025-09-13  4:53   ` kernel test robot
  2025-09-13  5:57   ` kernel test robot
  2025-09-15 17:56   ` Michael Kelley
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2025-09-13  4:53 UTC (permalink / raw)
  To: Mukesh Rathor, linux-hyperv, linux-kernel, linux-arch
  Cc: oe-kbuild-all, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, arnd

Hi Mukesh,

kernel test robot noticed the following build warnings:

[auto build test WARNING on next-20250909]
[also build test WARNING on v6.17-rc5]
[cannot apply to tip/x86/core tip/master linus/master arnd-asm-generic/master tip/auto-latest v6.17-rc5 v6.17-rc4 v6.17-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Mukesh-Rathor/x86-hyperv-Rename-guest-crash-shutdown-function/20250910-081309
base:   next-20250909
patch link:    https://lore.kernel.org/r/20250910001009.2651481-7-mrathor%40linux.microsoft.com
patch subject: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
config: x86_64-randconfig-073-20250913 (https://download.01.org/0day-ci/archive/20250913/202509131228.naboUNkE-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250913/202509131228.naboUNkE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509131228.naboUNkE-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from arch/x86/include/asm/mshyperv.h:272,
                    from arch/x86/hyperv/hv_apic.c:29:
>> include/asm-generic/mshyperv.h:370:5: warning: "CONFIG_CRASH_DUMP" is not defined, evaluates to 0 [-Wundef]
     370 | #if CONFIG_CRASH_DUMP
         |     ^~~~~~~~~~~~~~~~~


vim +/CONFIG_CRASH_DUMP +370 include/asm-generic/mshyperv.h

   369	
 > 370	#if CONFIG_CRASH_DUMP
   371	void hv_root_crash_init(void);
   372	void hv_crash_asm32(void);
   373	void hv_crash_asm64_lbl(void);
   374	void hv_crash_asm_end(void);
   375	#else   /* CONFIG_CRASH_DUMP */
   376	static inline void hv_root_crash_init(void) {}
   377	#endif  /* CONFIG_CRASH_DUMP */
   378	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
  2025-09-13  4:53   ` kernel test robot
@ 2025-09-13  5:57   ` kernel test robot
  2025-09-15 17:56   ` Michael Kelley
  2 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2025-09-13  5:57 UTC (permalink / raw)
  To: Mukesh Rathor, linux-hyperv, linux-kernel, linux-arch
  Cc: oe-kbuild-all, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, arnd

Hi Mukesh,

kernel test robot noticed the following build errors:

[auto build test ERROR on next-20250909]
[also build test ERROR on v6.17-rc5]
[cannot apply to tip/x86/core tip/master linus/master arnd-asm-generic/master tip/auto-latest v6.17-rc5 v6.17-rc4 v6.17-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Mukesh-Rathor/x86-hyperv-Rename-guest-crash-shutdown-function/20250910-081309
base:   next-20250909
patch link:    https://lore.kernel.org/r/20250910001009.2651481-7-mrathor%40linux.microsoft.com
patch subject: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
config: x86_64-rhel-9.4 (https://download.01.org/0day-ci/archive/20250913/202509131304.WGYf1Sx7-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250913/202509131304.WGYf1Sx7-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509131304.WGYf1Sx7-lkp@intel.com/

All errors (new ones prefixed by >>):

   arch/x86/hyperv/hv_init.c: In function 'hyperv_init':
>> arch/x86/hyperv/hv_init.c:550:17: error: implicit declaration of function 'hv_root_crash_init' [-Wimplicit-function-declaration]
     550 |                 hv_root_crash_init();
         |                 ^~~~~~~~~~~~~~~~~~


vim +/hv_root_crash_init +550 arch/x86/hyperv/hv_init.c

   431	
   432	/*
   433	 * This function is to be invoked early in the boot sequence after the
   434	 * hypervisor has been detected.
   435	 *
   436	 * 1. Setup the hypercall page.
   437	 * 2. Register Hyper-V specific clocksource.
   438	 * 3. Setup Hyper-V specific APIC entry points.
   439	 */
   440	void __init hyperv_init(void)
   441	{
   442		u64 guest_id;
   443		union hv_x64_msr_hypercall_contents hypercall_msr;
   444		int cpuhp;
   445	
   446		if (x86_hyper_type != X86_HYPER_MS_HYPERV)
   447			return;
   448	
   449		if (hv_common_init())
   450			return;
   451	
   452		/*
   453		 * The VP assist page is useless to a TDX guest: the only use we
   454		 * would have for it is lazy EOI, which can not be used with TDX.
   455		 */
   456		if (hv_isolation_type_tdx())
   457			hv_vp_assist_page = NULL;
   458		else
   459			hv_vp_assist_page = kcalloc(nr_cpu_ids,
   460						    sizeof(*hv_vp_assist_page),
   461						    GFP_KERNEL);
   462		if (!hv_vp_assist_page) {
   463			ms_hyperv.hints &= ~HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
   464	
   465			if (!hv_isolation_type_tdx())
   466				goto common_free;
   467		}
   468	
   469		if (ms_hyperv.paravisor_present && hv_isolation_type_snp()) {
   470			/* Negotiate GHCB Version. */
   471			if (!hv_ghcb_negotiate_protocol())
   472				hv_ghcb_terminate(SEV_TERM_SET_GEN,
   473						  GHCB_SEV_ES_PROT_UNSUPPORTED);
   474	
   475			hv_ghcb_pg = alloc_percpu(union hv_ghcb *);
   476			if (!hv_ghcb_pg)
   477				goto free_vp_assist_page;
   478		}
   479	
   480		cpuhp = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "x86/hyperv_init:online",
   481					  hv_cpu_init, hv_cpu_die);
   482		if (cpuhp < 0)
   483			goto free_ghcb_page;
   484	
   485		/*
   486		 * Setup the hypercall page and enable hypercalls.
   487		 * 1. Register the guest ID
   488		 * 2. Enable the hypercall and register the hypercall page
   489		 *
   490		 * A TDX VM with no paravisor only uses TDX GHCI rather than hv_hypercall_pg:
   491		 * when the hypercall input is a page, such a VM must pass a decrypted
   492		 * page to Hyper-V, e.g. hv_post_message() uses the per-CPU page
   493		 * hyperv_pcpu_input_arg, which is decrypted if no paravisor is present.
   494		 *
   495		 * A TDX VM with the paravisor uses hv_hypercall_pg for most hypercalls,
   496		 * which are handled by the paravisor and the VM must use an encrypted
   497		 * input page: in such a VM, the hyperv_pcpu_input_arg is encrypted and
   498		 * used in the hypercalls, e.g. see hv_mark_gpa_visibility() and
   499		 * hv_arch_irq_unmask(). Such a VM uses TDX GHCI for two hypercalls:
   500		 * 1. HVCALL_SIGNAL_EVENT: see vmbus_set_event() and _hv_do_fast_hypercall8().
   501		 * 2. HVCALL_POST_MESSAGE: the input page must be a decrypted page, i.e.
   502		 * hv_post_message() in such a VM can't use the encrypted hyperv_pcpu_input_arg;
   503		 * instead, hv_post_message() uses the post_msg_page, which is decrypted
   504		 * in such a VM and is only used in such a VM.
   505		 */
   506		guest_id = hv_generate_guest_id(LINUX_VERSION_CODE);
   507		wrmsrq(HV_X64_MSR_GUEST_OS_ID, guest_id);
   508	
   509		/* With the paravisor, the VM must also write the ID via GHCB/GHCI */
   510		hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, guest_id);
   511	
   512		/* A TDX VM with no paravisor only uses TDX GHCI rather than hv_hypercall_pg */
   513		if (hv_isolation_type_tdx() && !ms_hyperv.paravisor_present)
   514			goto skip_hypercall_pg_init;
   515	
   516		hv_hypercall_pg = __vmalloc_node_range(PAGE_SIZE, 1, MODULES_VADDR,
   517				MODULES_END, GFP_KERNEL, PAGE_KERNEL_ROX,
   518				VM_FLUSH_RESET_PERMS, NUMA_NO_NODE,
   519				__builtin_return_address(0));
   520		if (hv_hypercall_pg == NULL)
   521			goto clean_guest_os_id;
   522	
   523		rdmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
   524		hypercall_msr.enable = 1;
   525	
   526		if (hv_root_partition()) {
   527			struct page *pg;
   528			void *src;
   529	
   530			/*
   531			 * For the root partition, the hypervisor will set up its
   532			 * hypercall page. The hypervisor guarantees it will not show
   533			 * up in the root's address space. The root can't change the
   534			 * location of the hypercall page.
   535			 *
   536			 * Order is important here. We must enable the hypercall page
   537			 * so it is populated with code, then copy the code to an
   538			 * executable page.
   539			 */
   540			wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
   541	
   542			pg = vmalloc_to_page(hv_hypercall_pg);
   543			src = memremap(hypercall_msr.guest_physical_address << PAGE_SHIFT, PAGE_SIZE,
   544					MEMREMAP_WB);
   545			BUG_ON(!src);
   546			memcpy_to_page(pg, 0, src, HV_HYP_PAGE_SIZE);
   547			memunmap(src);
   548	
   549			hv_remap_tsc_clocksource();
 > 550			hv_root_crash_init();
   551		} else {
   552			hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
   553			wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
   554		}
   555	
   556		hv_set_hypercall_pg(hv_hypercall_pg);
   557	
   558	skip_hypercall_pg_init:
   559		/*
   560		 * hyperv_init() is called before LAPIC is initialized: see
   561		 * apic_intr_mode_init() -> x86_platform.apic_post_init() and
   562		 * apic_bsp_setup() -> setup_local_APIC(). The direct-mode STIMER
   563		 * depends on LAPIC, so hv_stimer_alloc() should be called from
   564		 * x86_init.timers.setup_percpu_clockev.
   565		 */
   566		old_setup_percpu_clockev = x86_init.timers.setup_percpu_clockev;
   567		x86_init.timers.setup_percpu_clockev = hv_stimer_setup_percpu_clockev;
   568	
   569		hv_apic_init();
   570	
   571		x86_init.pci.arch_init = hv_pci_init;
   572	
   573		register_syscore_ops(&hv_syscore_ops);
   574	
   575		if (ms_hyperv.priv_high & HV_ACCESS_PARTITION_ID)
   576			hv_get_partition_id();
   577	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support
  2025-09-10  0:10 ` [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support Mukesh Rathor
@ 2025-09-15 17:54   ` Michael Kelley
  2025-09-16  1:15     ` Mukesh R
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Kelley @ 2025-09-15 17:54 UTC (permalink / raw)
  To: Mukesh Rathor, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> 
> Add data structures for hypervisor crash dump support to the hypervisor
> host ABI header file. Details of their usages are in subsequent commits.
> 
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
>  include/hyperv/hvhdk_mini.h | 55 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> index 858f6a3925b3..ad9a8048fb4e 100644
> --- a/include/hyperv/hvhdk_mini.h
> +++ b/include/hyperv/hvhdk_mini.h
> @@ -116,6 +116,17 @@ enum hv_system_property {
>  	/* Add more values when needed */
>  	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
>  	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
> +	HV_SYSTEM_PROPERTY_CRASHDUMPAREA = 47,
> +};
> +
> +#define HV_PFN_RANGE_PGBITS 24  /* HV_SPA_PAGE_RANGE_ADDITIONAL_PAGES_BITS */
> +union hv_pfn_range {            /* HV_SPA_PAGE_RANGE */
> +	u64 as_uint64;
> +	struct {
> +		/* 39:0: base pfn.  63:40: additional pages */
> +		u64 base_pfn : 64 - HV_PFN_RANGE_PGBITS;
> +		u64 add_pfns : HV_PFN_RANGE_PGBITS;
> +	} __packed;
>  };
> 
>  enum hv_dynamic_processor_feature_property {
> @@ -142,6 +153,8 @@ struct hv_output_get_system_property {
>  #if IS_ENABLED(CONFIG_X86)
>  		u64 hv_processor_feature_value;
>  #endif
> +		union hv_pfn_range hv_cda_info; /* CrashdumpAreaAddress */
> +		u64 hv_tramp_pa;                /* CrashdumpTrampolineAddress */
>  	};
>  } __packed;
> 
> @@ -234,6 +247,48 @@ union hv_gpa_page_access_state {
>  	u8 as_uint8;
>  } __packed;
> 
> +enum hv_crashdump_action {
> +	HV_CRASHDUMP_NONE = 0,
> +	HV_CRASHDUMP_SUSPEND_ALL_VPS,
> +	HV_CRASHDUMP_PREPARE_FOR_STATE_SAVE,
> +	HV_CRASHDUMP_STATE_SAVED,
> +	HV_CRASHDUMP_ENTRY,
> +};

Nit: Since these values are part of the ABI, it's probably better
to assign explicit values to each enum member in order to
ward off any mistaken reordering or additions in the middle
of the list.

> +
> +struct hv_partition_event_root_crashdump_input {
> +	u32 crashdump_action; /* enum hv_crashdump_action */
> +} __packed;
> +
> +struct hv_input_disable_hyp_ex {   /* HV_X64_INPUT_DISABLE_HYPERVISOR_EX */
> +	u64 rip;
> +	u64 arg;
> +} __packed;
> +
> +struct hv_crashdump_area {	   /* HV_CRASHDUMP_AREA */
> +	u32 version;
> +	union {
> +		u32 flags_as_uint32;
> +		struct {
> +			u32 cda_valid : 1;
> +			u32 cda_unused : 31;
> +		} __packed;
> +	};
> +	/* more unused fields */
> +} __packed;
> +
> +union hv_partition_event_input {
> +	struct hv_partition_event_root_crashdump_input crashdump_input;
> +};
> +
> +enum hv_partition_event {
> +	HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
> +};
> +
> +struct hv_input_notify_partition_event {
> +	u32 event;      /* enum hv_partition_event */
> +	union hv_partition_event_input input;
> +} __packed;
> +
>  struct hv_lp_startup_status {
>  	u64 hv_status;
>  	u64 substatus1;
> --
> 2.36.1.vfs.0.0
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor
  2025-09-10  0:10 ` [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor Mukesh Rathor
@ 2025-09-15 17:55   ` Michael Kelley
  2025-09-16 21:30     ` Mukesh R
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Kelley @ 2025-09-15 17:55 UTC (permalink / raw)
  To: Mukesh Rathor, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> 
> Introduce a small asm stub to transition from the hypervisor to linux

I'd argue for capitalizing "Linux" here and in other places in commit
text and code comments throughout this patch set.

> upon devirtualization.

In this patch and subsequent patches, you've used the phrase "upon
devirtualization", which seems a little vague to me. Does this mean
"when devirtualization is complete" or perhaps "when the hypervisor
completes devirtualization"? Since there's no spec on any of this,
being as precise as possible will help future readers.

> 
> At a high level, during panic of either the hypervisor or the dom0 (aka
> root), the nmi handler asks hypervisor to devirtualize.

Suggest:

At a high level, during panic of either the hypervisor or Linux running
in dom0 (a.k.a. the root partition), the Linux NMI handler asks the
hypervisor to devirtualize.

> As part of that,
> the arguments include an entry point to return back to linux. This asm
> stub implements that entry point.
> 
> The stub is entered in protected mode, uses temporary gdt and page table
> to enable long mode and get to kernel entry point which then restores full
> kernel context to resume execution to kexec.
> 
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++
>  1 file changed, 105 insertions(+)
>  create mode 100644 arch/x86/hyperv/hv_trampoline.S
> 
> diff --git a/arch/x86/hyperv/hv_trampoline.S b/arch/x86/hyperv/hv_trampoline.S
> new file mode 100644
> index 000000000000..27a755401a42
> --- /dev/null
> +++ b/arch/x86/hyperv/hv_trampoline.S
> @@ -0,0 +1,105 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * X86 specific Hyper-V kdump/crash related code.

Add a qualification that this is for root partition only, and not for
general guests?

> + *
> + * Copyright (C) 2025, Microsoft, Inc.
> + *
> + */
> +#include <linux/linkage.h>
> +#include <asm/alternative.h>
> +#include <asm/msr.h>
> +#include <asm/processor-flags.h>
> +#include <asm/nospec-branch.h>
> +
> +/*
> + * void noreturn hv_crash_asm32(arg1)
> + *    arg1 == edi == 32bit PA of struct hv_crash_trdata

I think this is "struct hv_crash_tramp_data".

> + *
> + * The hypervisor jumps here upon devirtualization in protected mode. This
> + * code gets copied to a page in the low 4G ie, 32bit space so it can run
> + * in the protected mode. Hence we cannot use any compile/link time offsets or
> + * addresses. It restores long mode via temporary gdt and page tables and
> + * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry.
> + *
> + * PreCondition (ie, Hypervisor call back ABI):
> + *  o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled
> + *  o CR4 is set to 0x0
> + *  o IA32_EFER is set to 0x901 (SCE and NXE are set)
> + *  o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX.
> + *  o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF
> + *  o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF
> + *  o LDTR is initialized as invalid (limit of 0)
> + *  o MSR PAT is power on default.
> + *  o Other state/registers are cleared. All TLBs flushed.

Clarification about "Other state/registers are cleared":  What about
processor features that Linux may have enabled or disabled during its
initial boot? Are those still in the states Linux set? Or are they reset to
power-on defaults? For example, if Linux enabled x2apic, is x2apic
still enabled when the stub is entered?

> + *
> + * See Intel SDM 10.8.5

Hmmm. I downloaded the latest combined SDM, and section 10.8.5
in Volume 3A is about Microcode Update Resources, which doesn't
seem applicable here. Other volumes don't have a section 10.8.5.

> + */
> +
> +#define HV_CRASHDATA_OFFS_TRAMPCR3    0x0    /*	 0 */
> +#define HV_CRASHDATA_OFFS_KERNCR3     0x8    /*	 8 */
> +#define HV_CRASHDATA_OFFS_GDTRLIMIT  0x12    /* 18 */
> +#define HV_CRASHDATA_OFFS_CS_JMPTGT  0x28    /* 40 */
> +#define HV_CRASHDATA_OFFS_C_entry    0x30    /* 48 */

It seems like these offsets should go in a #include file along
with the definition of struct hv_crash_tramp_data. Then the
BUILD_BUG_ON() calls in hv_crash_setup_trampdata() could
check against these symbolic names instead of hardcoding
numbers that must match these.

> +#define HV_CRASHDATA_TRAMPOLINE_CS    0x8

This #define isn't used anywhere.

> +
> +	.text
> +	.code32
> +
> +SYM_CODE_START(hv_crash_asm32)
> +	UNWIND_HINT_UNDEFINED
> +	ANNOTATE_NOENDBR

No ENDBR here, presumably because this function is entered via other
than an indirect CALL or JMP instruction. Right?

> +	movl	$X86_CR4_PAE, %ecx
> +	movl	%ecx, %cr4
> +
> +	movl %edi, %ebx
> +	add $HV_CRASHDATA_OFFS_TRAMPCR3, %ebx
> +	movl %cs:(%ebx), %eax
> +	movl %eax, %cr3
> +
> +	# Setup EFER for long mode now.
> +	movl	$MSR_EFER, %ecx
> +	rdmsr
> +	btsl	$_EFER_LME, %eax
> +	wrmsr
> +
> +	# Turn paging on using the temp 32bit trampoline page table.
> +	movl %cr0, %eax
> +	orl $(X86_CR0_PG), %eax
> +	movl %eax, %cr0
> +
> +	/* since kernel cr3 could be above 4G, we need to be in the long mode
> +	 * before we can load 64bits of the kernel cr3. We use a temp gdt for
> +	 * that with CS.L=1 and CS.D=0 */
> +	mov %edi, %eax
> +	add $HV_CRASHDATA_OFFS_GDTRLIMIT, %eax
> +	lgdtl %cs:(%eax)
> +
> +	/* not done yet, restore CS now to switch to CS.L=1 */
> +	mov %edi, %eax
> +	add $HV_CRASHDATA_OFFS_CS_JMPTGT, %eax
> +	ljmp %cs:*(%eax)
> +SYM_CODE_END(hv_crash_asm32)
> +
> +	/* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */
> +	.code64
> +	.balign 8
> +SYM_CODE_START(hv_crash_asm64)
> +	UNWIND_HINT_UNDEFINED
> +	ANNOTATE_NOENDBR

But this *is* entered via an indirect JMP, right? So back to my
earlier question about the state of processor feature enablement.
If Linux enabled IBT, is it still enabled after devirtualization and
the hypervisor invokes this entry point? Linux guests on Hyper-V
have historically not enabled IBT, but patches that enable it are
now in linux-next, and will go into the 6.18 kernel. So maybe
this needs an ENDBR64.

> +SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL)
> +	/* restore kernel page tables so we can jump to kernel code */
> +	mov %edi, %eax
> +	add $HV_CRASHDATA_OFFS_KERNCR3, %eax
> +	movq %cs:(%eax), %rbx
> +	movq %rbx, %cr3
> +
> +	mov %edi, %eax
> +	add $HV_CRASHDATA_OFFS_C_entry, %eax
> +	movq %cs:(%eax), %rbx
> +	ANNOTATE_RETPOLINE_SAFE
> +	jmp *%rbx
> +
> +	int $3
> +
> +SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL)
> +SYM_CODE_END(hv_crash_asm64)
> --
> 2.36.1.vfs.0.0
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-10  0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor
@ 2025-09-15 17:55   ` Michael Kelley
  2025-09-17  1:13     ` Mukesh R
  2025-09-18 17:11   ` Stanislav Kinsburskii
  1 sibling, 1 reply; 29+ messages in thread
From: Michael Kelley @ 2025-09-15 17:55 UTC (permalink / raw)
  To: Mukesh Rathor, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> 
> Introduce a new file to implement collection of hypervisor ram into the

s/ram/RAM/ (multiple places)

> vmcore collected by linux. By default, the hypervisor ram is locked, ie,
> protected via hw page table. Hyper-V implements a disable hypercall which

The terminology here is a bit confusing since you have two names for
the same thing: "disable" hypervisor, and "devirtualize". Is it possible to
just use "devirtualize" everywhere, and drop the "disable" terminology?

> essentially devirtualizes the system on the fly. This mechanism makes the
> hypervisor ram accessible to linux. Because the hypervisor ram is already
> mapped into linux address space (as reserved ram), 

Is the hypervisor RAM mapped into the VMM process user address space,
or somewhere in the kernel address space? If the latter, where in the kernel
code, or what mechanism, does that? Just curious, as I wasn't aware that
this is happening ....

> it is automatically
> collected into the vmcore without extra work. More details of the
> implementation are available in the file prologue.
> 
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++
>  1 file changed, 622 insertions(+)
>  create mode 100644 arch/x86/hyperv/hv_crash.c
> 
> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
> new file mode 100644
> index 000000000000..531bac79d598
> --- /dev/null
> +++ b/arch/x86/hyperv/hv_crash.c
> @@ -0,0 +1,622 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * X86 specific Hyper-V kdump/crash support module
> + *
> + * Copyright (C) 2025, Microsoft, Inc.
> + *
> + * This module implements hypervisor ram collection into vmcore for both
> + * cases of the hypervisor crash and linux dom0/root crash. 

For a hypervisor crash, does any of this apply to general guest VMs? I'm
thinking it does not. Hypervisor RAM is collected only into the vmcore
for the root partition, right? Maybe some additional clarification could be
added so there's no confusion in this regard.

And what *does* happen to guest VMs after a hypervisor crash?

> + * Hyper-V implements
> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This
> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram
> + * is already mapped in linux, it is automatically collected into linux vmcore,
> + * and can be examined by the crash command (raw ram dump) or windbg.
> + *
> + * At a high level:
> + *
> + *  Hypervisor Crash:
> + *    Upon crash, hypervisor goes into an emergency minimal dispatch loop, a
> + *    restrictive mode with very limited hypercall and msr support.

s/msr/MSR/

> + *    Each cpu then injects NMIs into dom0/root vcpus. 

The "Each cpu" part of this sentence is confusing to me -- which CPUs does
this refer to? Maybe it would be better to say "It then injects an NMI into
each dom0/root partition vCPU." without being specific as to which CPUs do
the injecting since that seems more like a hypervisor implementation detail
that's not relevant here.

> + *    A shared page is used to check
> + *    by linux in the nmi handler if the hypervisor has crashed. This shared

s/nmi/NMI/  (multiple places)

> + *    page is setup in hv_root_crash_init during boot.
> + *
> + *  Linux Crash:
> + *    In case of linux crash, the callback hv_crash_stop_other_cpus will send
> + *    NMIs to all cpus, then proceed to the crash_nmi_callback where it waits
> + *    for all cpus to be in NMI.
> + *
> + *  NMI Handler (upon quorum):
> + *    Eventually, in both cases, all cpus wil end up in the nmi hanlder.

s/hanlder/handler/

And maybe just drop the word "wil" (which is misspelled).

> + *    Hyper-V requires the disable hypervisor must be done from the bsp. So

s/bsp/BSP/  (multiple places)

> + *    the bsp nmi handler saves current context, does some fixups and makes
> + *    the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor
> + *    at that point will suspend all vcpus (except the bsp), unlock all its
> + *    ram, and return to linux at the 32bit mode entry RIP.
> + *
> + *  Linux 32bit entry trampoline will then restore long mode and call C
> + *  function here to restore context and continue execution to crash kexec.
> + */
> +
> +#include <linux/delay.h>
> +#include <linux/kexec.h>
> +#include <linux/crash_dump.h>
> +#include <linux/panic.h>
> +#include <asm/apic.h>
> +#include <asm/desc.h>
> +#include <asm/page.h>
> +#include <asm/pgalloc.h>
> +#include <asm/mshyperv.h>
> +#include <asm/nmi.h>
> +#include <asm/idtentry.h>
> +#include <asm/reboot.h>
> +#include <asm/intel_pt.h>
> +
> +int hv_crash_enabled;

Seems like this is conceptually a "bool", not an "int".

> +EXPORT_SYMBOL_GPL(hv_crash_enabled);
> +
> +struct hv_crash_ctxt {
> +	ulong rsp;
> +	ulong cr0;
> +	ulong cr2;
> +	ulong cr4;
> +	ulong cr8;
> +
> +	u16 cs;
> +	u16 ss;
> +	u16 ds;
> +	u16 es;
> +	u16 fs;
> +	u16 gs;
> +
> +	u16 gdt_fill;
> +	struct desc_ptr gdtr;
> +	char idt_fill[6];
> +	struct desc_ptr idtr;
> +
> +	u64 gsbase;
> +	u64 efer;
> +	u64 pat;
> +};
> +static struct hv_crash_ctxt hv_crash_ctxt;
> +
> +/* Shared hypervisor page that contains crash dump area we peek into.
> + * NB: windbg looks for "hv_cda" symbol so don't change it.
> + */
> +static struct hv_crashdump_area *hv_cda;
> +
> +static u32 trampoline_pa, devirt_cr3arg;
> +static atomic_t crash_cpus_wait;
> +static void *hv_crash_ptpgs[4];
> +static int hv_has_crashed, lx_has_crashed;

These are conceptually "bool" as well.

> +
> +/* This cannot be inlined as it needs stack */
> +static noinline __noclone void hv_crash_restore_tss(void)
> +{
> +	load_TR_desc();
> +}
> +
> +/* This cannot be inlined as it needs stack */
> +static noinline void hv_crash_clear_kernpt(void)
> +{
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +
> +	/* Clear entry so it's not confusing to someone looking at the core */
> +	pgd = pgd_offset_k(trampoline_pa);
> +	p4d = p4d_offset(pgd, trampoline_pa);
> +	native_p4d_clear(p4d);
> +}
> +
> +/*
> + * This is the C entry point from the asm glue code after the devirt hypercall.
> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
> + * page tables with our below 4G page identity mapped, but using a temporary
> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not
> + * available. We restore kernel GDT, and rest of the context, and continue
> + * to kexec.
> + */
> +static asmlinkage void __noreturn hv_crash_c_entry(void)
> +{
> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
> +
> +	/* first thing, restore kernel gdt */
> +	native_load_gdt(&ctxt->gdtr);
> +
> +	asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
> +	asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
> +
> +	asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
> +	asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
> +	asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
> +	asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
> +
> +	native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
> +	asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
> +
> +	asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
> +	asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
> +	asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
> +
> +	native_load_idt(&ctxt->idtr);
> +	native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
> +	native_wrmsrq(MSR_EFER, ctxt->efer);
> +
> +	/* restore the original kernel CS now via far return */
> +	asm volatile("movzwq %0, %%rax\n\t"
> +		     "pushq %%rax\n\t"
> +		     "pushq $1f\n\t"
> +		     "lretq\n\t"
> +		     "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
> +
> +	/* We are in asmlinkage without stack frame, hence make a C function
> +	 * call which will buy stack frame to restore the tss or clear PT entry.
> +	 */
> +	hv_crash_restore_tss();
> +	hv_crash_clear_kernpt();
> +
> +	/* we are now fully in devirtualized normal kernel mode */
> +	__crash_kexec(NULL);

The comments for __crash_kexec() say that "panic_cpu" should be set to
the current CPU. I don't see that such is the case here.

> +
> +	for (;;)
> +		cpu_relax();

Is the intent that __crash_kexec() should never return, on any of the vCPUs,
because devirtualization isn't done unless there's a valid kdump image loaded?
I wonder if

	native_wrmsrq(HV_X64_MSR_RESET, 1);

would be better than looping forever in case __crash_kexec() fails
somewhere along the way even if there's a kdump image loaded.

> +}
> +/* Tell gcc we are using lretq long jump in the above function intentionally */
> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
> +
> +static void hv_mark_tss_not_busy(void)
> +{
> +	struct desc_struct *desc = get_current_gdt_rw();
> +	tss_desc tss;
> +
> +	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
> +	tss.type = 0x9;        /* available 64-bit TSS. 0xB is busy TSS */
> +	write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);
> +}
> +
> +/* Save essential context */
> +static void hv_hvcrash_ctxt_save(void)
> +{
> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
> +
> +	asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp));
> +
> +	ctxt->cr0 = native_read_cr0();
> +	ctxt->cr4 = native_read_cr4();
> +
> +	asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2));
> +	asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8));
> +
> +	asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs));
> +	asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss));
> +	asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds));
> +	asm volatile("movl %%es, %%eax" : "=a"(ctxt->es));
> +	asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs));
> +	asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs));
> +
> +	native_store_gdt(&ctxt->gdtr);
> +	store_idt(&ctxt->idtr);
> +
> +	ctxt->gsbase = __rdmsr(MSR_GS_BASE);
> +	ctxt->efer = __rdmsr(MSR_EFER);
> +	ctxt->pat = __rdmsr(MSR_IA32_CR_PAT);
> +}
> +
> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */
> +static void hv_crash_fixup_kernpt(void)
> +{
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +
> +	pgd = pgd_offset_k(trampoline_pa);
> +	p4d = p4d_offset(pgd, trampoline_pa);
> +
> +	/* trampoline_pa is below 4G, so no pre-existing entry to clobber */
> +	p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]);
> +	p4d->p4d = p4d->p4d & ~(_PAGE_NX);    /* enable execute */
> +}
> +
> +/*
> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has
> + * crashed and will collect core. This will cause the hyp to quiesce and
> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp.
> + */
> +static void hv_notify_prepare_hyp(void)
> +{
> +	u64 status;
> +	struct hv_input_notify_partition_event *input;
> +	struct hv_partition_event_root_crashdump_input *cda;
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	cda = &input->input.crashdump_input;

The code ordering here is a bit weird. I'd expect this line to be grouped
with cda->crashdump_action being set.

> +	memset(input, 0, sizeof(*input));
> +	input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP;
> +
> +	cda->crashdump_action = HV_CRASHDUMP_ENTRY;
> +	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
> +	if (!hv_result_success(status))
> +		return;
> +
> +	cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS;
> +	hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
> +}
> +
> +/*
> + * Common function for all cpus before devirtualization.
> + *
> + * Hypervisor crash: all cpus get here in nmi context.
> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
> + *		context. Note, panicing cpu may not be the bsp.
> + *
> + * The function is not inlined so it will show on the stack. It is named so
> + * because the crash cmd looks for certain well known function names on the
> + * stack before looking into the cpu saved note in the elf section, and
> + * that work is currently incomplete.
> + *
> + * Notes:
> + *  Hypervisor crash:
> + *    - the hypervisor is in a very restrictive mode at this point and any
> + *	vmexit it cannot handle would result in reboot. For example, console
> + *	output from here would result in synic ipi hcall, which would result
> + *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
> + *
> + *  Devirtualization is supported from the bsp only.
> + */
> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
> +{
> +	struct hv_input_disable_hyp_ex *input;
> +	u64 status;
> +	int msecs = 1000, ccpu = smp_processor_id();
> +
> +	if (ccpu == 0) {
> +		/* crash_save_cpu() will be done in the kexec path */
> +		cpu_emergency_stop_pt();	/* disable performance trace */
> +		atomic_inc(&crash_cpus_wait);
> +	} else {
> +		crash_save_cpu(regs, ccpu);
> +		cpu_emergency_stop_pt();	/* disable performance trace */
> +		atomic_inc(&crash_cpus_wait);
> +		for (;;);			/* cause no vmexits */
> +	}
> +
> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
> +		mdelay(1);
> +
> +	stop_nmi();
> +	if (!hv_has_crashed)
> +		hv_notify_prepare_hyp();
> +
> +	if (crashing_cpu == -1)
> +		crashing_cpu = ccpu;		/* crash cmd uses this */

Could just be "crashing_cpu = 0" since only the BSP gets here.

> +
> +	hv_hvcrash_ctxt_save();
> +	hv_mark_tss_not_busy();
> +	hv_crash_fixup_kernpt();
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */

Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
And just for clarification, Hyper-V treats this "arg" value as opaque and does
not access it. It only provides it in EDI when it invokes the trampoline
function, right?

> +
> +	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
> +
> +	/* Devirt failed, just reboot as things are in very bad state now */
> +	native_wrmsrq(HV_X64_MSR_RESET, 1);    /* get hv to reboot */
> +}
> +
> +/*
> + * Generic nmi callback handler: could be called without any crash also.
> + *   hv crash: hypervisor injects nmi's into all cpus
> + *   lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
> + */
> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
> +{
> +	int ccpu = smp_processor_id();
> +
> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
> +		hv_has_crashed = 1;
> +
> +	if (!hv_has_crashed && !lx_has_crashed)
> +		return NMI_DONE;	/* ignore the nmi */
> +
> +	if (hv_has_crashed) {
> +		if (!kexec_crash_loaded() || !hv_crash_enabled) {
> +			if (ccpu == 0) {
> +				native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */
> +			} else
> +				for (;;);	/* cause no vmexits */
> +		}
> +	}
> +
> +	crash_nmi_callback(regs);
> +
> +	return NMI_DONE;

crash_nmi_callback() should never return, right? Normally one would
expect to return NMI_HANDLED here, but I guess it doesn't matter
if the return is never executed.

> +}
> +
> +/*
> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus
> + *
> + * On normal linux panic, this is called twice: first from panic and then again
> + * from native_machine_crash_shutdown.
> + *
> + * In case of mshv, 3 ways to get here:
> + *  1. hv crash (only bsp will get here):
> + *	BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry
> + *		  -> __crash_kexec -> native_machine_crash_shutdown
> + *		  -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus
> + *  linux panic:
> + *	2. panic cpu x: panic() -> crash_smp_send_stop
> + *				     -> smp_ops.crash_stop_other_cpus
> + *	3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop
> + *
> + * NB: noclone and non standard stack because of call to crash_setup_regs().
> + */
> +static void __noclone hv_crash_stop_other_cpus(void)
> +{
> +	static int crash_stop_done;
> +	struct pt_regs lregs;
> +	int ccpu = smp_processor_id();
> +
> +	if (hv_has_crashed)
> +		return;		/* all cpus already in nmi handler path */
> +
> +	if (!kexec_crash_loaded())
> +		return;

If we're in a normal panic path (your Case #2 above) with no kdump kernel
loaded, why leave the other vCPUs running? Seems like that could violate
expectations in vpanic(), where it calls panic_other_cpus_shutdown() and
thereafter assumes other vCPUs are not running.

> +
> +	if (crash_stop_done)
> +		return;
> +	crash_stop_done = 1;

Is crash_stop_done necessary?  hv_crash_stop_other_cpus() is called
from crash_smp_send_stop(), which has its own static variable 
"cpus_stopped" that does the same thing.

> +
> +	/* linux has crashed: hv is healthy, we can ipi safely */
> +	lx_has_crashed = 1;
> +	wmb();			/* nmi handlers look at lx_has_crashed */
> +
> +	apic->send_IPI_allbutself(NMI_VECTOR);

The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus().
In addition to sending the NMI IPI, it does disable_local_APIC(). I don't know, but
should disable_local_APIC() be done somewhere here as well?

> +
> +	if (crashing_cpu == -1)
> +		crashing_cpu = ccpu;		/* crash cmd uses this */
> +
> +	/* crash_setup_regs() happens in kexec also, but for the kexec cpu which
> +	 * is the bsp. We could be here on non-bsp cpu, collect regs if so.
> +	 */
> +	if (ccpu)
> +		crash_setup_regs(&lregs, NULL);
> +
> +	crash_nmi_callback(&lregs);
> +}
> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
> +
> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
> +struct hv_gdtreg_32 {
> +	u16 fill;
> +	u16 limit;
> +	u32 address;
> +} __packed;
> +
> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
> +struct hv_crash_tramp_gdt {
> +	u64 null;	/* index 0, selector 0, null selector */
> +	u64 cs64;	/* index 1, selector 8, cs64 selector */
> +} __packed;
> +
> +/* No stack, so jump via far ptr in memory to load the 64bit CS */
> +struct hv_cs_jmptgt {
> +	u32 address;
> +	u16 csval;
> +	u16 fill;
> +} __packed;
> +
> +/* This trampoline data is copied onto the trampoline page after the asm code */
> +struct hv_crash_tramp_data {
> +	u64 tramp32_cr3;
> +	u64 kernel_cr3;
> +	struct hv_gdtreg_32 gdtr32;
> +	struct hv_crash_tramp_gdt tramp_gdt;
> +	struct hv_cs_jmptgt cs_jmptgt;
> +	u64 c_entry_addr;
> +} __packed;
> +
> +/*
> + * Setup a temporary gdt to allow the asm code to switch to the long mode.
> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
> + * relative addressing, hence we must use trampoline_pa here. Also, save other
> + * info like jmp and C entry targets for same reasons.
> + *
> + * Returns: 0 on success, -1 on error
> + */
> +static int hv_crash_setup_trampdata(u64 trampoline_va)
> +{
> +	int size, offs;
> +	void *dest;
> +	struct hv_crash_tramp_data *tramp;
> +
> +	/* These must match exactly the ones in the corresponding asm file */
> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
> +						     cs_jmptgt.address) != 40);

It would be nice to pick up the constants from a #include file that is
shared with the asm code in Patch 4 of the series.

> +
> +	/* hv_crash_asm_end is beyond last byte by 1 */
> +	size = &hv_crash_asm_end - &hv_crash_asm32;
> +	if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
> +		pr_err("%s: trampoline page overflow\n", __func__);
> +		return -1;
> +	}
> +
> +	dest = (void *)trampoline_va;
> +	memcpy(dest, &hv_crash_asm32, size);
> +
> +	dest += size;
> +	dest = (void *)round_up((ulong)dest, 16);
> +	tramp = (struct hv_crash_tramp_data *)dest;
> +
> +	/* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
> +	 * non-PCID-aware users". Build cr3 with pcid 0
> +	 */
> +	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
> +
> +	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
> +	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
> +
> +	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
> +	tramp->gdtr32.address = trampoline_pa +
> +				   (ulong)&tramp->tramp_gdt - trampoline_va;
> +
> +	 /* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
> +	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
> +
> +	tramp->cs_jmptgt.csval = 0x8;
> +	offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
> +	tramp->cs_jmptgt.address = trampoline_pa + offs;
> +
> +	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
> +
> +	devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
> +
> +	return 0;
> +}
> +
> +/*
> + * Build 32bit trampoline page table for transition from protected mode
> + * non-paging to long-mode paging. This transition needs pagetables below 4G.
> + */
> +static void hv_crash_build_tramp_pt(void)
> +{
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +	pte_t *pte;
> +	u64 pa, addr = trampoline_pa;
> +
> +	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
> +	pa = virt_to_phys(hv_crash_ptpgs[1]);
> +	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
> +	p4d->p4d &= ~(_PAGE_NX);	/* disable no execute */
> +
> +	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
> +	pa = virt_to_phys(hv_crash_ptpgs[2]);
> +	set_pud(pud, __pud(_PAGE_TABLE | pa));
> +
> +	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
> +	pa = virt_to_phys(hv_crash_ptpgs[3]);
> +	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
> +
> +	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
> +	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
> +}
> +
> +/*
> + * Setup trampoline for devirtualization:
> + *  - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to
> + *    in protected mode.
> + *  - 4 pages for a temporary page table that asm code uses to turn paging on
> + *  - a temporary gdt to use in the compat mode.
> + *
> + *  Returns: 0 on success
> + */
> +static int hv_crash_trampoline_setup(void)
> +{
> +	int i, rc, order;
> +	struct page *page;
> +	u64 trampoline_va;
> +	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
> +
> +	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
> +	page = alloc_page(flags32);
> +	if (page == NULL) {
> +		pr_err("%s: failed to alloc asm stub page\n", __func__);
> +		return -1;
> +	}
> +
> +	trampoline_va = (u64)page_to_virt(page);
> +	trampoline_pa = (u32)page_to_phys(page);
> +
> +	order = 2;	   /* alloc 2^2 pages */
> +	page = alloc_pages(flags32, order);
> +	if (page == NULL) {
> +		pr_err("%s: failed to alloc pt pages\n", __func__);
> +		free_page(trampoline_va);
> +		return -1;
> +	}
> +
> +	for (i = 0; i < 4; i++, page++)
> +		hv_crash_ptpgs[i] = page_to_virt(page);
> +
> +	hv_crash_build_tramp_pt();
> +
> +	rc = hv_crash_setup_trampdata(trampoline_va);
> +	if (rc)
> +		goto errout;
> +
> +	return 0;
> +
> +errout:
> +	free_page(trampoline_va);
> +	free_pages((ulong)hv_crash_ptpgs[0], order);
> +
> +	return rc;
> +}
> +
> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
> +void hv_root_crash_init(void)
> +{
> +	int rc;
> +	struct hv_input_get_system_property *input;
> +	struct hv_output_get_system_property *output;
> +	unsigned long flags;
> +	u64 status;
> +	union hv_pfn_range cda_info;
> +
> +	if (pgtable_l5_enabled()) {
> +		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
> +		return;
> +	}
> +
> +	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
> +				  "hv_crash_nmi");
> +	if (rc) {
> +		pr_err("Hyper-V: failed to register crash nmi handler\n");
> +		return;
> +	}
> +
> +	local_irq_save(flags);
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	memset(input, 0, sizeof(*input));
> +	memset(output, 0, sizeof(*output));

Why zero the output area? This is one of those hypercall things that we're
inconsistent about. A few hypercall call sites zero the output area, and it's
not clear why they do. Hyper-V should be responsible for properly filling in
the output area. Linux should not need to do this zero'ing, unless there's some
known bug in Hyper-V for certain hypercalls, in which case there should be
a code comment stating "why".

> +	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
> +
> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
> +	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
> +	local_irq_restore(flags);
> +
> +	if (!hv_result_success(status)) {
> +		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
> +		       input->property_id, hv_result_to_string(status));
> +		goto err_out;
> +	}
> +
> +	if (cda_info.base_pfn == 0) {
> +		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
> +		goto err_out;
> +	}
> +
> +	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);

Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in
terms of the Hyper-V page size, which isn't necessarily the guest page size. 
Yes, on x86 there's no difference, but for future robustness ....

> +
> +	rc = hv_crash_trampoline_setup();
> +	if (rc)
> +		goto err_out;
> +
> +	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
> +
> +	crash_kexec_post_notifiers = true;
> +	hv_crash_enabled = 1;
> +	pr_info("Hyper-V: linux and hv kdump support enabled\n");

This message and the message below aren't consistent. One refers
to "hv kdump" and the other to "hyp kdump".

> +
> +	return;
> +
> +err_out:
> +	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
> +	pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
> +}
> --
> 2.36.1.vfs.0.0
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
  2025-09-13  4:53   ` kernel test robot
  2025-09-13  5:57   ` kernel test robot
@ 2025-09-15 17:56   ` Michael Kelley
  2025-09-17  1:15     ` Mukesh R
  2 siblings, 1 reply; 29+ messages in thread
From: Michael Kelley @ 2025-09-15 17:56 UTC (permalink / raw)
  To: Mukesh Rathor, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> 
> Enable build of the new files introduced in the earlier commits and add
> call to do the setup during boot.
> 
> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> ---
>  arch/x86/hyperv/Makefile       | 6 ++++++
>  arch/x86/hyperv/hv_init.c      | 1 +
>  include/asm-generic/mshyperv.h | 9 +++++++++
>  3 files changed, 16 insertions(+)
> 
> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
> index d55f494f471d..6f5d97cddd80 100644
> --- a/arch/x86/hyperv/Makefile
> +++ b/arch/x86/hyperv/Makefile
> @@ -5,4 +5,10 @@ obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
> 
>  ifdef CONFIG_X86_64
>  obj-$(CONFIG_PARAVIRT_SPINLOCKS)	+= hv_spinlock.o
> +
> + ifdef CONFIG_MSHV_ROOT
> +  CFLAGS_REMOVE_hv_trampoline.o += -pg
> +  CFLAGS_hv_trampoline.o        += -fno-stack-protector
> +  obj-$(CONFIG_CRASH_DUMP)      += hv_crash.o hv_trampoline.o
> + endif
>  endif
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index afdbda2dd7b7..577bbd143527 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -510,6 +510,7 @@ void __init hyperv_init(void)
>  		memunmap(src);
> 
>  		hv_remap_tsc_clocksource();
> +		hv_root_crash_init();
>  	} else {
>  		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
>  		wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> index dbd4c2f3aee3..952c221765f5 100644
> --- a/include/asm-generic/mshyperv.h
> +++ b/include/asm-generic/mshyperv.h
> @@ -367,6 +367,15 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32
> num_pages);
>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
>  int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
> 
> +#if CONFIG_CRASH_DUMP
> +void hv_root_crash_init(void);
> +void hv_crash_asm32(void);
> +void hv_crash_asm64_lbl(void);
> +void hv_crash_asm_end(void);
> +#else   /* CONFIG_CRASH_DUMP */
> +static inline void hv_root_crash_init(void) {}
> +#endif  /* CONFIG_CRASH_DUMP */
> +

The hv_crash_asm* functions are x86 specific. Seems like their
declarations should go in arch/x86/include/asm/mshyperv.h, not in
the architecture-neutral include/asm-generic/mshyperv.h.

>  #else /* CONFIG_MSHV_ROOT */
>  static inline bool hv_root_partition(void) { return false; }
>  static inline bool hv_l1vh_partition(void) { return false; }
> --
> 2.36.1.vfs.0.0
> 



* Re: [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support
  2025-09-15 17:54   ` Michael Kelley
@ 2025-09-16  1:15     ` Mukesh R
  2025-09-18 23:52       ` Michael Kelley
  0 siblings, 1 reply; 29+ messages in thread
From: Mukesh R @ 2025-09-16  1:15 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

On 9/15/25 10:54, Michael Kelley wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>
>> Add data structures for hypervisor crash dump support to the hypervisor
>> host ABI header file. Details of their usages are in subsequent commits.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>>  include/hyperv/hvhdk_mini.h | 55 +++++++++++++++++++++++++++++++++++++
>>  1 file changed, 55 insertions(+)
>>
>> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
>> index 858f6a3925b3..ad9a8048fb4e 100644
>> --- a/include/hyperv/hvhdk_mini.h
>> +++ b/include/hyperv/hvhdk_mini.h
>> @@ -116,6 +116,17 @@ enum hv_system_property {
>>  	/* Add more values when needed */
>>  	HV_SYSTEM_PROPERTY_SCHEDULER_TYPE = 15,
>>  	HV_DYNAMIC_PROCESSOR_FEATURE_PROPERTY = 21,
>> +	HV_SYSTEM_PROPERTY_CRASHDUMPAREA = 47,
>> +};
>> +
>> +#define HV_PFN_RANGE_PGBITS 24  /* HV_SPA_PAGE_RANGE_ADDITIONAL_PAGES_BITS */
>> +union hv_pfn_range {            /* HV_SPA_PAGE_RANGE */
>> +	u64 as_uint64;
>> +	struct {
>> +		/* 39:0: base pfn.  63:40: additional pages */
>> +		u64 base_pfn : 64 - HV_PFN_RANGE_PGBITS;
>> +		u64 add_pfns : HV_PFN_RANGE_PGBITS;
>> +	} __packed;
>>  };
>>
>>  enum hv_dynamic_processor_feature_property {
>> @@ -142,6 +153,8 @@ struct hv_output_get_system_property {
>>  #if IS_ENABLED(CONFIG_X86)
>>  		u64 hv_processor_feature_value;
>>  #endif
>> +		union hv_pfn_range hv_cda_info; /* CrashdumpAreaAddress */
>> +		u64 hv_tramp_pa;                /* CrashdumpTrampolineAddress */
>>  	};
>>  } __packed;
>>
>> @@ -234,6 +247,48 @@ union hv_gpa_page_access_state {
>>  	u8 as_uint8;
>>  } __packed;
>>
>> +enum hv_crashdump_action {
>> +	HV_CRASHDUMP_NONE = 0,
>> +	HV_CRASHDUMP_SUSPEND_ALL_VPS,
>> +	HV_CRASHDUMP_PREPARE_FOR_STATE_SAVE,
>> +	HV_CRASHDUMP_STATE_SAVED,
>> +	HV_CRASHDUMP_ENTRY,
>> +};
> 
> Nit: Since these values are part of the ABI, it's probably better
> to assign explicit values to each enum member in order to
> ward off any mistaken reordering or additions in the middle
> of the list.

No, as I have mentioned in the past, we are mirroring hyp headers
with the eventual goal of just consuming from there directly.
Each change in the ABI header is very carefully examined; we now
have a process for it.
 
>> +
>> +struct hv_partition_event_root_crashdump_input {
>> +	u32 crashdump_action; /* enum hv_crashdump_action */
>> +} __packed;
>> +
>> +struct hv_input_disable_hyp_ex {   /* HV_X64_INPUT_DISABLE_HYPERVISOR_EX */
>> +	u64 rip;
>> +	u64 arg;
>> +} __packed;
>> +
>> +struct hv_crashdump_area {	   /* HV_CRASHDUMP_AREA */
>> +	u32 version;
>> +	union {
>> +		u32 flags_as_uint32;
>> +		struct {
>> +			u32 cda_valid : 1;
>> +			u32 cda_unused : 31;
>> +		} __packed;
>> +	};
>> +	/* more unused fields */
>> +} __packed;
>> +
>> +union hv_partition_event_input {
>> +	struct hv_partition_event_root_crashdump_input crashdump_input;
>> +};
>> +
>> +enum hv_partition_event {
>> +	HV_PARTITION_EVENT_ROOT_CRASHDUMP = 2,
>> +};
>> +
>> +struct hv_input_notify_partition_event {
>> +	u32 event;      /* enum hv_partition_event */
>> +	union hv_partition_event_input input;
>> +} __packed;
>> +
>>  struct hv_lp_startup_status {
>>  	u64 hv_status;
>>  	u64 substatus1;
>> --
>> 2.36.1.vfs.0.0
>>
> 



* Re: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor
  2025-09-15 17:55   ` Michael Kelley
@ 2025-09-16 21:30     ` Mukesh R
  2025-09-18 23:52       ` Michael Kelley
  0 siblings, 1 reply; 29+ messages in thread
From: Mukesh R @ 2025-09-16 21:30 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

On 9/15/25 10:55, Michael Kelley wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>
>> Introduce a small asm stub to transition from the hypervisor to linux
> 
> I'd argue for capitalizing "Linux" here and in other places in commit
> text and code comments throughout this patch set.

I'd argue against it. A quick grep indicates it is common practice,
and in the code world it is easier on the eyes :).

>> upon devirtualization.
> 
> In this patch and subsequent patches, you've used the phrase "upon
> devirtualization", which seems a little vague to me. Does this mean
> "when devirtualization is complete" or perhaps "when the hypervisor
> completes devirtualization"? Since there's no spec on any of this,
> being as precise as possible will help future readers.

Since control comes back to linux at the callback here, I fail to
understand what is vague about it. When the hyp completes devirt,
devirt is complete.

>>
>> At a high level, during panic of either the hypervisor or the dom0 (aka
>> root), the nmi handler asks hypervisor to devirtualize.
> 
> Suggest:
> 
> At a high level, during panic of either the hypervisor or Linux running
> in dom0 (a.k.a. the root partition), the Linux NMI handler asks the
> hypervisor to devirtualize.
> 
>> As part of that,
>> the arguments include an entry point to return back to linux. This asm
>> stub implements that entry point.
>>
>> The stub is entered in protected mode, uses temporary gdt and page table
>> to enable long mode and get to kernel entry point which then restores full
>> kernel context to resume execution to kexec.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>>  arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++
>>  1 file changed, 105 insertions(+)
>>  create mode 100644 arch/x86/hyperv/hv_trampoline.S
>>
>> diff --git a/arch/x86/hyperv/hv_trampoline.S b/arch/x86/hyperv/hv_trampoline.S
>> new file mode 100644
>> index 000000000000..27a755401a42
>> --- /dev/null
>> +++ b/arch/x86/hyperv/hv_trampoline.S
>> @@ -0,0 +1,105 @@
>> +/* SPDX-License-Identifier: GPL-2.0-only */
>> +/*
>> + * X86 specific Hyper-V kdump/crash related code.
> 
> Add a qualification that this is for root partition only, and not for
> general guests?

I don't think it is needed; it would be odd for guests to collect the
hyp core. Besides, the Makefile/Kconfig shows this is root VM only.

>> + *
>> + * Copyright (C) 2025, Microsoft, Inc.
>> + *
>> + */
>> +#include <linux/linkage.h>
>> +#include <asm/alternative.h>
>> +#include <asm/msr.h>
>> +#include <asm/processor-flags.h>
>> +#include <asm/nospec-branch.h>
>> +
>> +/*
>> + * void noreturn hv_crash_asm32(arg1)
>> + *    arg1 == edi == 32bit PA of struct hv_crash_trdata
> 
> I think this is "struct hv_crash_tramp_data".

correct

>> + *
>> + * The hypervisor jumps here upon devirtualization in protected mode. This
>> + * code gets copied to a page in the low 4G ie, 32bit space so it can run
>> + * in the protected mode. Hence we cannot use any compile/link time offsets or
>> + * addresses. It restores long mode via temporary gdt and page tables and
>> + * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry.
>> + *
>> + * PreCondition (ie, Hypervisor call back ABI):
>> + *  o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled
>> + *  o CR4 is set to 0x0
>> + *  o IA32_EFER is set to 0x901 (SCE and NXE are set)
>> + *  o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX.
>> + *  o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF
>> + *  o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF
>> + *  o LDTR is initialized as invalid (limit of 0)
>> + *  o MSR PAT is power on default.
>> + *  o Other state/registers are cleared. All TLBs flushed.
> 
> Clarification about "Other state/registers are cleared":  What about
> processor features that Linux may have enabled or disabled during its
> initial boot? Are those still in the states Linux set? Or are they reset to
> power-on defaults? For example, if Linux enabled x2apic, is x2apic
> still enabled when the stub is entered?

Correct: if linux enabled x2apic, x2apic would still be enabled.

>> + *
>> + * See Intel SDM 10.8.5
> 
> Hmmm. I downloaded the latest combined SDM, and section 10.8.5
> in Volume 3A is about Microcode Update Resources, which doesn't
> seem applicable here. Other volumes don't have a section 10.8.5.

Google AI found it right away upon searching: intel sdm 10.8.5 ia-32e

>> + */
>> +
>> +#define HV_CRASHDATA_OFFS_TRAMPCR3    0x0    /*	 0 */
>> +#define HV_CRASHDATA_OFFS_KERNCR3     0x8    /*	 8 */
>> +#define HV_CRASHDATA_OFFS_GDTRLIMIT  0x12    /* 18 */
>> +#define HV_CRASHDATA_OFFS_CS_JMPTGT  0x28    /* 40 */
>> +#define HV_CRASHDATA_OFFS_C_entry    0x30    /* 48 */
> 
> It seems like these offsets should go in a #include file along
> with the definition of struct hv_crash_tramp_data. Then the
> BUILD_BUG_ON() calls in hv_crash_setup_trampdata() could
> check against these symbolic names instead of hardcoding
> numbers that must match these.

Yeah, that works too and was the first cut. But given the small
number of these, that they are not used/needed anywhere else, and
that they will almost never change, creating another tiny header in a
non-driver directory didn't seem worth it... but I could go either way.
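
To make the trade-off concrete, here is a minimal standalone sketch of
the shared-header alternative being discussed, compiled as userspace C
for illustration (the header name and the use of `_Static_assert` in
place of the kernel's BUILD_BUG_ON() are assumptions, not the patch's
actual code; the offsets and struct layout are taken from the patch):

```c
/* Sketch of a hypothetical shared header ("hv_crash_tramp.h") that both
 * the asm stub and the C code could include, so the build-time checks
 * test the same symbolic constants the asm uses. */
#include <stddef.h>
#include <stdint.h>

/* Offsets shared with the asm stub (values from the patch). */
#define HV_CRASHDATA_OFFS_TRAMPCR3	0x00
#define HV_CRASHDATA_OFFS_KERNCR3	0x08
#define HV_CRASHDATA_OFFS_GDTRLIMIT	0x12
#define HV_CRASHDATA_OFFS_CS_JMPTGT	0x28
#define HV_CRASHDATA_OFFS_C_ENTRY	0x30

struct hv_gdtreg_32 {
	uint16_t fill;
	uint16_t limit;
	uint32_t address;
} __attribute__((packed));

struct hv_crash_tramp_gdt {
	uint64_t null;		/* index 0, null selector */
	uint64_t cs64;		/* index 1, selector 8, cs64 selector */
} __attribute__((packed));

struct hv_cs_jmptgt {
	uint32_t address;
	uint16_t csval;
	uint16_t fill;
} __attribute__((packed));

struct hv_crash_tramp_data {
	uint64_t tramp32_cr3;
	uint64_t kernel_cr3;
	struct hv_gdtreg_32 gdtr32;
	struct hv_crash_tramp_gdt tramp_gdt;
	struct hv_cs_jmptgt cs_jmptgt;
	uint64_t c_entry_addr;
} __attribute__((packed));

/* With a shared header, the C side asserts against the macros instead of
 * hardcoded numbers; any drift between asm and C fails the build. */
_Static_assert(offsetof(struct hv_crash_tramp_data, tramp32_cr3) ==
	       HV_CRASHDATA_OFFS_TRAMPCR3, "tramp32_cr3 offset");
_Static_assert(offsetof(struct hv_crash_tramp_data, kernel_cr3) ==
	       HV_CRASHDATA_OFFS_KERNCR3, "kernel_cr3 offset");
_Static_assert(offsetof(struct hv_crash_tramp_data, gdtr32.limit) ==
	       HV_CRASHDATA_OFFS_GDTRLIMIT, "gdtr32.limit offset");
_Static_assert(offsetof(struct hv_crash_tramp_data, cs_jmptgt.address) ==
	       HV_CRASHDATA_OFFS_CS_JMPTGT, "cs_jmptgt.address offset");
_Static_assert(offsetof(struct hv_crash_tramp_data, c_entry_addr) ==
	       HV_CRASHDATA_OFFS_C_ENTRY, "c_entry_addr offset");
```

The asm file would then use the same macros for its memory references,
so the numeric comments there could go away as well.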

>> +#define HV_CRASHDATA_TRAMPOLINE_CS    0x8
> 
> This #define isn't used anywhere.

removed

>> +
>> +	.text
>> +	.code32
>> +
>> +SYM_CODE_START(hv_crash_asm32)
>> +	UNWIND_HINT_UNDEFINED
>> +	ANNOTATE_NOENDBR
> 
> No ENDBR here, presumably because this function is entered via other
> than an indirect CALL or JMP instruction. Right?
> 
>> +	movl	$X86_CR4_PAE, %ecx
>> +	movl	%ecx, %cr4
>> +
>> +	movl %edi, %ebx
>> +	add $HV_CRASHDATA_OFFS_TRAMPCR3, %ebx
>> +	movl %cs:(%ebx), %eax
>> +	movl %eax, %cr3
>> +
>> +	# Setup EFER for long mode now.
>> +	movl	$MSR_EFER, %ecx
>> +	rdmsr
>> +	btsl	$_EFER_LME, %eax
>> +	wrmsr
>> +
>> +	# Turn paging on using the temp 32bit trampoline page table.
>> +	movl %cr0, %eax
>> +	orl $(X86_CR0_PG), %eax
>> +	movl %eax, %cr0
>> +
>> +	/* since kernel cr3 could be above 4G, we need to be in the long mode
>> +	 * before we can load 64bits of the kernel cr3. We use a temp gdt for
>> +	 * that with CS.L=1 and CS.D=0 */
>> +	mov %edi, %eax
>> +	add $HV_CRASHDATA_OFFS_GDTRLIMIT, %eax
>> +	lgdtl %cs:(%eax)
>> +
>> +	/* not done yet, restore CS now to switch to CS.L=1 */
>> +	mov %edi, %eax
>> +	add $HV_CRASHDATA_OFFS_CS_JMPTGT, %eax
>> +	ljmp %cs:*(%eax)
>> +SYM_CODE_END(hv_crash_asm32)
>> +
>> +	/* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */
>> +	.code64
>> +	.balign 8
>> +SYM_CODE_START(hv_crash_asm64)
>> +	UNWIND_HINT_UNDEFINED
>> +	ANNOTATE_NOENDBR
> 
> But this *is* entered via an indirect JMP, right? So back to my
> earlier question about the state of processor feature enablement.
> If Linux enabled IBT, is it still enabled after devirtualization and
> the hypervisor invokes this entry point? Linux guests on Hyper-V
> have historically not enabled IBT, but patches that enable it are
> now in linux-next, and will go into the 6.18 kernel. So maybe
> this needs an ENDBR64.

IBT would be disabled in the transition here, so it doesn't really
matter. ENDBR is OK too.

>> +SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL)
>> +	/* restore kernel page tables so we can jump to kernel code */
>> +	mov %edi, %eax
>> +	add $HV_CRASHDATA_OFFS_KERNCR3, %eax
>> +	movq %cs:(%eax), %rbx
>> +	movq %rbx, %cr3
>> +
>> +	mov %edi, %eax
>> +	add $HV_CRASHDATA_OFFS_C_entry, %eax
>> +	movq %cs:(%eax), %rbx
>> +	ANNOTATE_RETPOLINE_SAFE
>> +	jmp *%rbx
>> +
>> +	int $3
>> +
>> +SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL)
>> +SYM_CODE_END(hv_crash_asm64)
>> --
>> 2.36.1.vfs.0.0
>>
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-15 17:55   ` Michael Kelley
@ 2025-09-17  1:13     ` Mukesh R
  2025-09-17 20:37       ` Mukesh R
  2025-09-18 23:53       ` Michael Kelley
  0 siblings, 2 replies; 29+ messages in thread
From: Mukesh R @ 2025-09-17  1:13 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

On 9/15/25 10:55, Michael Kelley wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>
>> Introduce a new file to implement collection of hypervisor ram into the
> 
> s/ram/RAM/ (multiple places)

A quick grep indicates that saying ram is common; I like ram over RAM.

>> vmcore collected by linux. By default, the hypervisor ram is locked, ie,
>> protected via hw page table. Hyper-V implements a disable hypercall which
> 
> The terminology here is a bit confusing since you have two names for
> the same thing: "disable" hypervisor, and "devirtualize". Is it possible to
> just use "devirtualize" everywhere, and drop the "disable" terminology?

The concept is devirtualization, and the actual hypercall was originally
named disable, so intermixing is natural IMO.

>> essentially devirtualizes the system on the fly. This mechanism makes the
>> hypervisor ram accessible to linux. Because the hypervisor ram is already
>> mapped into linux address space (as reserved ram), 
> 
> Is the hypervisor RAM mapped into the VMM process user address space,
> or somewhere in the kernel address space? If the latter, where in the kernel
> code, or what mechanism, does that? Just curious, as I wasn't aware that
> this is happening ....

Mapped in the kernel as normal ram; we reserve it very early in boot.
I see that patch has not made it here yet; it should be coming very soon.

>> it is automatically
>> collected into the vmcore without extra work. More details of the
>> implementation are available in the file prologue.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>>  arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++
>>  1 file changed, 622 insertions(+)
>>  create mode 100644 arch/x86/hyperv/hv_crash.c
>>
>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
>> new file mode 100644
>> index 000000000000..531bac79d598
>> --- /dev/null
>> +++ b/arch/x86/hyperv/hv_crash.c
>> @@ -0,0 +1,622 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * X86 specific Hyper-V kdump/crash support module
>> + *
>> + * Copyright (C) 2025, Microsoft, Inc.
>> + *
>> + * This module implements hypervisor ram collection into vmcore for both
>> + * cases of the hypervisor crash and linux dom0/root crash. 
> 
> For a hypervisor crash, does any of this apply to general guest VMs? I'm
> thinking it does not. Hypervisor RAM is collected only into the vmcore
> for the root partition, right? Maybe some additional clarification could be
> added so there's no confusion in this regard.

It would be odd for guests to collect the hyp core, and the target
audience is assumed to be somewhat familiar with the basic concepts
before getting here.

> And what *does* happen to guest VMs after a hypervisor crash?

They are gone... what else could we do?

>> + * Hyper-V implements
>> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This
>> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram
>> + * is already mapped in linux, it is automatically collected into linux vmcore,
>> + * and can be examined by the crash command (raw ram dump) or windbg.
>> + *
>> + * At a high level:
>> + *
>> + *  Hypervisor Crash:
>> + *    Upon crash, hypervisor goes into an emergency minimal dispatch loop, a
>> + *    restrictive mode with very limited hypercall and msr support.
> 
> s/msr/MSR/

msr is used all over; it seems acceptable.

>> + *    Each cpu then injects NMIs into dom0/root vcpus. 
> 
> The "Each cpu" part of this sentence is confusing to me -- which CPUs does
> this refer to? Maybe it would be better to say "It then injects an NMI into
> each dom0/root partition vCPU." without being specific as to which CPUs do
> the injecting since that seems more like a hypervisor implementation detail
> that's not relevant here.

All cpus in the system. There is a dedicated/pinned dom0 vcpu for each cpu.

>> + *    A shared page is used to check
>> + *    by linux in the nmi handler if the hypervisor has crashed. This shared
> 
> s/nmi/NMI/  (multiple places)

>> + *    page is setup in hv_root_crash_init during boot.
>> + *
>> + *  Linux Crash:
>> + *    In case of linux crash, the callback hv_crash_stop_other_cpus will send
>> + *    NMIs to all cpus, then proceed to the crash_nmi_callback where it waits
>> + *    for all cpus to be in NMI.
>> + *
>> + *  NMI Handler (upon quorum):
>> + *    Eventually, in both cases, all cpus wil end up in the nmi hanlder.
> 
> s/hanlder/handler/
> 
> And maybe just drop the word "wil" (which is misspelled).
> 
>> + *    Hyper-V requires the disable hypervisor must be done from the bsp. So
> 
> s/bsp/BSP  (multiple places)
> 
>> + *    the bsp nmi handler saves current context, does some fixups and makes
>> + *    the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor
>> + *    at that point will suspend all vcpus (except the bsp), unlock all its
>> + *    ram, and return to linux at the 32bit mode entry RIP.
>> + *
>> + *  Linux 32bit entry trampoline will then restore long mode and call C
>> + *  function here to restore context and continue execution to crash kexec.
>> + */
>> +
>> +#include <linux/delay.h>
>> +#include <linux/kexec.h>
>> +#include <linux/crash_dump.h>
>> +#include <linux/panic.h>
>> +#include <asm/apic.h>
>> +#include <asm/desc.h>
>> +#include <asm/page.h>
>> +#include <asm/pgalloc.h>
>> +#include <asm/mshyperv.h>
>> +#include <asm/nmi.h>
>> +#include <asm/idtentry.h>
>> +#include <asm/reboot.h>
>> +#include <asm/intel_pt.h>
>> +
>> +int hv_crash_enabled;
> 
> Seems like this is conceptually a "bool", not an "int".

Yeah, I can change it to bool if I do another iteration.

>> +EXPORT_SYMBOL_GPL(hv_crash_enabled);
>> +
>> +struct hv_crash_ctxt {
>> +	ulong rsp;
>> +	ulong cr0;
>> +	ulong cr2;
>> +	ulong cr4;
>> +	ulong cr8;
>> +
>> +	u16 cs;
>> +	u16 ss;
>> +	u16 ds;
>> +	u16 es;
>> +	u16 fs;
>> +	u16 gs;
>> +
>> +	u16 gdt_fill;
>> +	struct desc_ptr gdtr;
>> +	char idt_fill[6];
>> +	struct desc_ptr idtr;
>> +
>> +	u64 gsbase;
>> +	u64 efer;
>> +	u64 pat;
>> +};
>> +static struct hv_crash_ctxt hv_crash_ctxt;
>> +
>> +/* Shared hypervisor page that contains crash dump area we peek into.
>> + * NB: windbg looks for "hv_cda" symbol so don't change it.
>> + */
>> +static struct hv_crashdump_area *hv_cda;
>> +
>> +static u32 trampoline_pa, devirt_cr3arg;
>> +static atomic_t crash_cpus_wait;
>> +static void *hv_crash_ptpgs[4];
>> +static int hv_has_crashed, lx_has_crashed;
> 
> These are conceptually "bool" as well.
> 
>> +
>> +/* This cannot be inlined as it needs stack */
>> +static noinline __noclone void hv_crash_restore_tss(void)
>> +{
>> +	load_TR_desc();
>> +}
>> +
>> +/* This cannot be inlined as it needs stack */
>> +static noinline void hv_crash_clear_kernpt(void)
>> +{
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +
>> +	/* Clear entry so it's not confusing to someone looking at the core */
>> +	pgd = pgd_offset_k(trampoline_pa);
>> +	p4d = p4d_offset(pgd, trampoline_pa);
>> +	native_p4d_clear(p4d);
>> +}
>> +
>> +/*
>> + * This is the C entry point from the asm glue code after the devirt hypercall.
>> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
>> + * page tables with our below 4G page identity mapped, but using a temporary
>> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not
>> + * available. We restore kernel GDT, and rest of the context, and continue
>> + * to kexec.
>> + */
>> +static asmlinkage void __noreturn hv_crash_c_entry(void)
>> +{
>> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>> +
>> +	/* first thing, restore kernel gdt */
>> +	native_load_gdt(&ctxt->gdtr);
>> +
>> +	asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>> +	asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>> +
>> +	asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
>> +	asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
>> +	asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
>> +	asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
>> +
>> +	native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
>> +	asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
>> +
>> +	asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
>> +	asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
>> +	asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
>> +
>> +	native_load_idt(&ctxt->idtr);
>> +	native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
>> +	native_wrmsrq(MSR_EFER, ctxt->efer);
>> +
>> +	/* restore the original kernel CS now via far return */
>> +	asm volatile("movzwq %0, %%rax\n\t"
>> +		     "pushq %%rax\n\t"
>> +		     "pushq $1f\n\t"
>> +		     "lretq\n\t"
>> +		     "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
>> +
>> +	/* We are in asmlinkage without stack frame, hence make a C function
>> +	 * call which will buy stack frame to restore the tss or clear PT entry.
>> +	 */
>> +	hv_crash_restore_tss();
>> +	hv_crash_clear_kernpt();
>> +
>> +	/* we are now fully in devirtualized normal kernel mode */
>> +	__crash_kexec(NULL);
> 
> The comments for __crash_kexec() say that "panic_cpu" should be set to
> the current CPU. I don't see that such is the case here.

If linux panics, it would be set by vpanic; if the hyp crashes, that
is irrelevant.

>> +
>> +	for (;;)
>> +		cpu_relax();
> 
> Is the intent that __crash_kexec() should never return, on any of the vCPUs,
> because devirtualization isn't done unless there's a valid kdump image loaded?
> I wonder if
> 
> 	native_wrmsrq(HV_X64_MSR_RESET, 1);
> 
> would be better than looping forever in case __crash_kexec() fails
> somewhere along the way even if there's a kdump image loaded.

Yeah, I've gone thru all 3 possibilities here:
  o loop forever
  o reset
  o BUG() : this was in V0

Reset is just bad because the system would reboot without any
indication if the hyp crashes. With the loop there is at least a hang,
and one could make note of it, and if internal, attach a debugger.

BUG() is best IMO because with the hyp gone linux will try to redo the
panic, and we would print something extra to help. I think I'll just go
back to my V0: BUG()

>> +}
>> +/* Tell gcc we are using lretq long jump in the above function intentionally */
>> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
>> +
>> +static void hv_mark_tss_not_busy(void)
>> +{
>> +	struct desc_struct *desc = get_current_gdt_rw();
>> +	tss_desc tss;
>> +
>> +	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
>> +	tss.type = 0x9;        /* available 64-bit TSS. 0xB is busy TSS */
>> +	write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);
>> +}
>> +
>> +/* Save essential context */
>> +static void hv_hvcrash_ctxt_save(void)
>> +{
>> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>> +
>> +	asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp));
>> +
>> +	ctxt->cr0 = native_read_cr0();
>> +	ctxt->cr4 = native_read_cr4();
>> +
>> +	asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2));
>> +	asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8));
>> +
>> +	asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs));
>> +	asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss));
>> +	asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds));
>> +	asm volatile("movl %%es, %%eax" : "=a"(ctxt->es));
>> +	asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs));
>> +	asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs));
>> +
>> +	native_store_gdt(&ctxt->gdtr);
>> +	store_idt(&ctxt->idtr);
>> +
>> +	ctxt->gsbase = __rdmsr(MSR_GS_BASE);
>> +	ctxt->efer = __rdmsr(MSR_EFER);
>> +	ctxt->pat = __rdmsr(MSR_IA32_CR_PAT);
>> +}
>> +
>> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */
>> +static void hv_crash_fixup_kernpt(void)
>> +{
>> +	pgd_t *pgd;
>> +	p4d_t *p4d;
>> +
>> +	pgd = pgd_offset_k(trampoline_pa);
>> +	p4d = p4d_offset(pgd, trampoline_pa);
>> +
>> +	/* trampoline_pa is below 4G, so no pre-existing entry to clobber */
>> +	p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]);
>> +	p4d->p4d = p4d->p4d & ~(_PAGE_NX);    /* enable execute */
>> +}
>> +
>> +/*
>> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has
>> + * crashed and will collect core. This will cause the hyp to quiesce and
>> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp.
>> + */
>> +static void hv_notify_prepare_hyp(void)
>> +{
>> +	u64 status;
>> +	struct hv_input_notify_partition_event *input;
>> +	struct hv_partition_event_root_crashdump_input *cda;
>> +
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	cda = &input->input.crashdump_input;
> 
> The code ordering here is a bit weird. I'd expect this line to be grouped
> with cda->crashdump_action being set.

We are setting two pointers and using them later. Setting pointers
up front is pretty normal.

>> +	memset(input, 0, sizeof(*input));
>> +	input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP;
>> +
>> +	cda->crashdump_action = HV_CRASHDUMP_ENTRY;
>> +	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>> +	if (!hv_result_success(status))
>> +		return;
>> +
>> +	cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS;
>> +	hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>> +}
>> +
>> +/*
>> + * Common function for all cpus before devirtualization.
>> + *
>> + * Hypervisor crash: all cpus get here in nmi context.
>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
>> + *		context. Note, panicing cpu may not be the bsp.
>> + *
>> + * The function is not inlined so it will show on the stack. It is named so
>> + * because the crash cmd looks for certain well known function names on the
>> + * stack before looking into the cpu saved note in the elf section, and
>> + * that work is currently incomplete.
>> + *
>> + * Notes:
>> + *  Hypervisor crash:
>> + *    - the hypervisor is in a very restrictive mode at this point and any
>> + *	vmexit it cannot handle would result in reboot. For example, console
>> + *	output from here would result in synic ipi hcall, which would result
>> + *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
>> + *
>> + *  Devirtualization is supported from the bsp only.
>> + */
>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
>> +{
>> +	struct hv_input_disable_hyp_ex *input;
>> +	u64 status;
>> +	int msecs = 1000, ccpu = smp_processor_id();
>> +
>> +	if (ccpu == 0) {
>> +		/* crash_save_cpu() will be done in the kexec path */
>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>> +		atomic_inc(&crash_cpus_wait);
>> +	} else {
>> +		crash_save_cpu(regs, ccpu);
>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>> +		atomic_inc(&crash_cpus_wait);
>> +		for (;;);			/* cause no vmexits */
>> +	}
>> +
>> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
>> +		mdelay(1);
>> +
>> +	stop_nmi();
>> +	if (!hv_has_crashed)
>> +		hv_notify_prepare_hyp();
>> +
>> +	if (crashing_cpu == -1)
>> +		crashing_cpu = ccpu;		/* crash cmd uses this */
> 
> Could just be "crashing_cpu = 0" since only the BSP gets here.

A code change request has been open for a while to remove the BSP
requirement.

>> +
>> +	hv_hvcrash_ctxt_save();
>> +	hv_mark_tss_not_busy();
>> +	hv_crash_fixup_kernpt();
>> +
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	memset(input, 0, sizeof(*input));
>> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
>> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
> 
> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
> And just for clarification, Hyper-V treats this "arg" value as opaque and does
> not access it. It only provides it in EDI when it invokes the trampoline
> function, right?

The comment is correct: cr3 always points to the L4 (or L5 with
5-level page tables).

Right, it comes in edi; I don't know what EDI is (just kidding!)...
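
To illustrate the cr3-points-to-L4 point, here is a standalone
userspace sketch (not the patch's kernel code; `pt_index` and the
sample address are illustrative assumptions) of how a 4-level x86-64
walk indexes a below-4G address like trampoline_pa:

```c
/* CR3 holds the PA of the top-level (L4/PGD) table; each level below it
 * consumes 9 bits of the address above the 12-bit page offset. This is
 * the same decomposition hv_crash_build_tramp_pt() performs with
 * pgd_index()/pud_index()/pmd_index()/pte_index(). */
#include <stdint.h>

#define PAGE_SHIFT	12
#define PT_INDEX_BITS	9
#define PT_INDEX_MASK	((1u << PT_INDEX_BITS) - 1)

/* level 4 = the L4/PGD table (what CR3 points to), level 1 = PTE table */
unsigned int pt_index(uint64_t addr, int level)
{
	return (addr >> (PAGE_SHIFT + (level - 1) * PT_INDEX_BITS)) &
	       PT_INDEX_MASK;
}
```

For any address below 4G, the L4 index is always 0, which is why the
temporary trampoline table only needs one page per level.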

>> +
>> +	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
>> +
>> +	/* Devirt failed, just reboot as things are in very bad state now */
>> +	native_wrmsrq(HV_X64_MSR_RESET, 1);    /* get hv to reboot */
>> +}
>> +
>> +/*
>> + * Generic nmi callback handler: could be called without any crash also.
>> + *   hv crash: hypervisor injects nmi's into all cpus
>> + *   lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
>> + */
>> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
>> +{
>> +	int ccpu = smp_processor_id();
>> +
>> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
>> +		hv_has_crashed = 1;
>> +
>> +	if (!hv_has_crashed && !lx_has_crashed)
>> +		return NMI_DONE;	/* ignore the nmi */
>> +
>> +	if (hv_has_crashed) {
>> +		if (!kexec_crash_loaded() || !hv_crash_enabled) {
>> +			if (ccpu == 0) {
>> +				native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */
>> +			} else
>> +				for (;;);	/* cause no vmexits */
>> +		}
>> +	}
>> +
>> +	crash_nmi_callback(regs);
>> +
>> +	return NMI_DONE;
> 
> crash_nmi_callback() should never return, right? Normally one would
> expect to return NMI_HANDLED here, but I guess it doesn't matter
> if the return is never executed.

correct. 

>> +}
>> +
>> +/*
>> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus
>> + *
>> + * On normal linux panic, this is called twice: first from panic and then again
>> + * from native_machine_crash_shutdown.
>> + *
>> + * In case of mshv, 3 ways to get here:
>> + *  1. hv crash (only bsp will get here):
>> + *	BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry
>> + *		  -> __crash_kexec -> native_machine_crash_shutdown
>> + *		  -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus
>> + *  linux panic:
>> + *	2. panic cpu x: panic() -> crash_smp_send_stop
>> + *				     -> smp_ops.crash_stop_other_cpus
>> + *	3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop
>> + *
>> + * NB: noclone and non standard stack because of call to crash_setup_regs().
>> + */
>> +static void __noclone hv_crash_stop_other_cpus(void)
>> +{
>> +	static int crash_stop_done;
>> +	struct pt_regs lregs;
>> +	int ccpu = smp_processor_id();
>> +
>> +	if (hv_has_crashed)
>> +		return;		/* all cpus already in nmi handler path */
>> +
>> +	if (!kexec_crash_loaded())
>> +		return;
> 
> If we're in a normal panic path (your Case #2 above) with no kdump kernel
> loaded, why leave the other vCPUs running? Seems like that could violate
> expectations in vpanic(), where it calls panic_other_cpus_shutdown() and
> thereafter assumes other vCPUs are not running.

No, there is a lot of complexity here!

If we hang the vcpus here, the hyp will notice and may trigger its own
watchdog. Also, machine_crash_shutdown() does another IPI.

I think the best thing to do here is go back to my V0, which did not
have the check for kexec_crash_loaded(), but had this in
hv_crash_c_entry:

+       /* we are now fully in devirtualized normal kernel mode */
+       __crash_kexec(NULL);
+
+       BUG();


This way the hyp would be disabled, ie, the system devirtualized, and
__crash_kexec() will return, resulting in the BUG() that will cause it
to go thru panic and honor the panic= parameter with either hang or
reset. Instead of BUG(), I could also just call panic().

>> +
>> +	if (crash_stop_done)
>> +		return;
>> +	crash_stop_done = 1;
> 
> Is crash_stop_done necessary?  hv_crash_stop_other_cpus() is called
> from crash_smp_send_stop(), which has its own static variable 
> "cpus_stopped" that does the same thing.

Yes, for error paths.

>> +
>> +	/* linux has crashed: hv is healthy, we can ipi safely */
>> +	lx_has_crashed = 1;
>> +	wmb();			/* nmi handlers look at lx_has_crashed */
>> +
>> +	apic->send_IPI_allbutself(NMI_VECTOR);
> 
> The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus().
> In addition to sending the NMI IPI, it does disable_local_APIC(). I don't know, but
> should disable_local_APIC() be done somewhere here as well?

No, the hyp does that.

>> +
>> +	if (crashing_cpu == -1)
>> +		crashing_cpu = ccpu;		/* crash cmd uses this */
>> +
>> +	/* crash_setup_regs() happens in kexec also, but for the kexec cpu which
>> +	 * is the bsp. We could be here on non-bsp cpu, collect regs if so.
>> +	 */
>> +	if (ccpu)
>> +		crash_setup_regs(&lregs, NULL);
>> +
>> +	crash_nmi_callback(&lregs);
>> +}
>> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
>> +
>> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
>> +struct hv_gdtreg_32 {
>> +	u16 fill;
>> +	u16 limit;
>> +	u32 address;
>> +} __packed;
>> +
>> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
>> +struct hv_crash_tramp_gdt {
>> +	u64 null;	/* index 0, selector 0, null selector */
>> +	u64 cs64;	/* index 1, selector 8, cs64 selector */
>> +} __packed;
>> +
>> +/* No stack, so jump via far ptr in memory to load the 64bit CS */
>> +struct hv_cs_jmptgt {
>> +	u32 address;
>> +	u16 csval;
>> +	u16 fill;
>> +} __packed;
>> +
>> +/* This trampoline data is copied onto the trampoline page after the asm code */
>> +struct hv_crash_tramp_data {
>> +	u64 tramp32_cr3;
>> +	u64 kernel_cr3;
>> +	struct hv_gdtreg_32 gdtr32;
>> +	struct hv_crash_tramp_gdt tramp_gdt;
>> +	struct hv_cs_jmptgt cs_jmptgt;
>> +	u64 c_entry_addr;
>> +} __packed;
>> +
>> +/*
>> + * Setup a temporary gdt to allow the asm code to switch to the long mode.
>> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
>> + * relative addressing, hence we must use trampoline_pa here. Also, save other
>> + * info like jmp and C entry targets for same reasons.
>> + *
>> + * Returns: 0 on success, -1 on error
>> + */
>> +static int hv_crash_setup_trampdata(u64 trampoline_va)
>> +{
>> +	int size, offs;
>> +	void *dest;
>> +	struct hv_crash_tramp_data *tramp;
>> +
>> +	/* These must match exactly the ones in the corresponding asm file */
>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
>> +						     cs_jmptgt.address) != 40);
> 
> It would be nice to pick up the constants from a #include file that is
> shared with the asm code in Patch 4 of the series.

Yeah, I could go either way; some don't like tiny headers... if there
are no objections to a new header for this, I could go that way too.

>> +
>> +	/* hv_crash_asm_end is beyond last byte by 1 */
>> +	size = &hv_crash_asm_end - &hv_crash_asm32;
>> +	if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
>> +		pr_err("%s: trampoline page overflow\n", __func__);
>> +		return -1;
>> +	}
>> +
>> +	dest = (void *)trampoline_va;
>> +	memcpy(dest, &hv_crash_asm32, size);
>> +
>> +	dest += size;
>> +	dest = (void *)round_up((ulong)dest, 16);
>> +	tramp = (struct hv_crash_tramp_data *)dest;
>> +
>> +	/* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
>> +	 * non-PCID-aware users". Build cr3 with pcid 0
>> +	 */
>> +	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
>> +
>> +	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
>> +	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
>> +
>> +	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
>> +	tramp->gdtr32.address = trampoline_pa +
>> +				   (ulong)&tramp->tramp_gdt - trampoline_va;
>> +
>> +	 /* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
>> +	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
>> +
>> +	tramp->cs_jmptgt.csval = 0x8;
>> +	offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
>> +	tramp->cs_jmptgt.address = trampoline_pa + offs;
>> +
>> +	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
>> +
>> +	devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
>> +
>> +	return 0;
>> +}
>> +
>> +/*
>> + * Build 32bit trampoline page table for transition from protected mode
>> + * non-paging to long-mode paging. This transition needs pagetables below 4G.
>> + */
>> +static void hv_crash_build_tramp_pt(void)
>> +{
>> +	p4d_t *p4d;
>> +	pud_t *pud;
>> +	pmd_t *pmd;
>> +	pte_t *pte;
>> +	u64 pa, addr = trampoline_pa;
>> +
>> +	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
>> +	pa = virt_to_phys(hv_crash_ptpgs[1]);
>> +	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
>> +	p4d->p4d &= ~(_PAGE_NX);	/* disable no execute */
>> +
>> +	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
>> +	pa = virt_to_phys(hv_crash_ptpgs[2]);
>> +	set_pud(pud, __pud(_PAGE_TABLE | pa));
>> +
>> +	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
>> +	pa = virt_to_phys(hv_crash_ptpgs[3]);
>> +	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
>> +
>> +	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
>> +	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
>> +}
>> +
>> +/*
>> + * Setup trampoline for devirtualization:
>> + *  - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to
>> + *    in protected mode.
>> + *  - 4 pages for a temporary page table that asm code uses to turn paging on
>> + *  - a temporary gdt to use in the compat mode.
>> + *
>> + *  Returns: 0 on success
>> + */
>> +static int hv_crash_trampoline_setup(void)
>> +{
>> +	int i, rc, order;
>> +	struct page *page;
>> +	u64 trampoline_va;
>> +	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
>> +
>> +	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
>> +	page = alloc_page(flags32);
>> +	if (page == NULL) {
>> +		pr_err("%s: failed to alloc asm stub page\n", __func__);
>> +		return -1;
>> +	}
>> +
>> +	trampoline_va = (u64)page_to_virt(page);
>> +	trampoline_pa = (u32)page_to_phys(page);
>> +
>> +	order = 2;	   /* alloc 2^2 pages */
>> +	page = alloc_pages(flags32, order);
>> +	if (page == NULL) {
>> +		pr_err("%s: failed to alloc pt pages\n", __func__);
>> +		free_page(trampoline_va);
>> +		return -1;
>> +	}
>> +
>> +	for (i = 0; i < 4; i++, page++)
>> +		hv_crash_ptpgs[i] = page_to_virt(page);
>> +
>> +	hv_crash_build_tramp_pt();
>> +
>> +	rc = hv_crash_setup_trampdata(trampoline_va);
>> +	if (rc)
>> +		goto errout;
>> +
>> +	return 0;
>> +
>> +errout:
>> +	free_page(trampoline_va);
>> +	free_pages((ulong)hv_crash_ptpgs[0], order);
>> +
>> +	return rc;
>> +}
>> +
>> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
>> +void hv_root_crash_init(void)
>> +{
>> +	int rc;
>> +	struct hv_input_get_system_property *input;
>> +	struct hv_output_get_system_property *output;
>> +	unsigned long flags;
>> +	u64 status;
>> +	union hv_pfn_range cda_info;
>> +
>> +	if (pgtable_l5_enabled()) {
>> +		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
>> +		return;
>> +	}
>> +
>> +	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
>> +				  "hv_crash_nmi");
>> +	if (rc) {
>> +		pr_err("Hyper-V: failed to register crash nmi handler\n");
>> +		return;
>> +	}
>> +
>> +	local_irq_save(flags);
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
>> +
>> +	memset(input, 0, sizeof(*input));
>> +	memset(output, 0, sizeof(*output));
> 
> Why zero the output area? This is one of those hypercall things that we're
> inconsistent about. A few hypercall call sites zero the output area, and it's
> not clear why they do. Hyper-V should be responsible for properly filling in
> the output area. Linux should not need to do this zero'ing, unless there's some
> known bug in Hyper-V for certain hypercalls, in which case there should be
> a code comment stating "why".

for the same reason you sometimes see char *p = NULL: either leftover
code, or someone was debugging, or just copy and paste. this is just copy
paste. i agree in general that we don't need to clear it at all; in fact,
i'd like to remove them all! but i also understand people with different
skills and junior members find it easier to debug, and also we were in
early product development. for that reason, it doesn't have to be
consistent either; if some complex hypercalls are failing repeatedly,
just for ease of debug, one might leave it there temporarily. but
now that things are stable, i think we should just remove them all and
get used to a bit more inconvenient debugging...

>> +	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
>> +
>> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
>> +	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
>> +	local_irq_restore(flags);
>> +
>> +	if (!hv_result_success(status)) {
>> +		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
>> +		       input->property_id, hv_result_to_string(status));
>> +		goto err_out;
>> +	}
>> +
>> +	if (cda_info.base_pfn == 0) {
>> +		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
>> +		goto err_out;
>> +	}
>> +
>> +	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
> 
> Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in
> terms of the Hyper-V page size, which isn't necessarily the guest page size. 
> Yes, on x86 there's no difference, but for future robustness ....

i don't know about guests, but we won't even boot if the dom0 page size
didn't match... but easier to change than to make the case...

>> +
>> +	rc = hv_crash_trampoline_setup();
>> +	if (rc)
>> +		goto err_out;
>> +
>> +	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
>> +
>> +	crash_kexec_post_notifiers = true;
>> +	hv_crash_enabled = 1;
>> +	pr_info("Hyper-V: linux and hv kdump support enabled\n");
> 
> This message and the message below aren't consistent. One refers
> to "hv kdump" and the other to "hyp kdump".

>> +
>> +	return;
>> +
>> +err_out:
>> +	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
>> +	pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
>> +}
>> --
>> 2.36.1.vfs.0.0
>>
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-15 17:56   ` Michael Kelley
@ 2025-09-17  1:15     ` Mukesh R
  2025-09-18 23:53       ` Michael Kelley
  0 siblings, 1 reply; 29+ messages in thread
From: Mukesh R @ 2025-09-17  1:15 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

On 9/15/25 10:56, Michael Kelley wrote:
> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>
>> Enable build of the new files introduced in the earlier commits and add
>> call to do the setup during boot.
>>
>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>> ---
>>  arch/x86/hyperv/Makefile       | 6 ++++++
>>  arch/x86/hyperv/hv_init.c      | 1 +
>>  include/asm-generic/mshyperv.h | 9 +++++++++
>>  3 files changed, 16 insertions(+)
>>
>> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
>> index d55f494f471d..6f5d97cddd80 100644
>> --- a/arch/x86/hyperv/Makefile
>> +++ b/arch/x86/hyperv/Makefile
>> @@ -5,4 +5,10 @@ obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
>>
>>  ifdef CONFIG_X86_64
>>  obj-$(CONFIG_PARAVIRT_SPINLOCKS)	+= hv_spinlock.o
>> +
>> + ifdef CONFIG_MSHV_ROOT
>> +  CFLAGS_REMOVE_hv_trampoline.o += -pg
>> +  CFLAGS_hv_trampoline.o        += -fno-stack-protector
>> +  obj-$(CONFIG_CRASH_DUMP)      += hv_crash.o hv_trampoline.o
>> + endif
>>  endif
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index afdbda2dd7b7..577bbd143527 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -510,6 +510,7 @@ void __init hyperv_init(void)
>>  		memunmap(src);
>>
>>  		hv_remap_tsc_clocksource();
>> +		hv_root_crash_init();
>>  	} else {
>>  		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
>>  		wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
>> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
>> index dbd4c2f3aee3..952c221765f5 100644
>> --- a/include/asm-generic/mshyperv.h
>> +++ b/include/asm-generic/mshyperv.h
>> @@ -367,6 +367,15 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32
>> num_pages);
>>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
>>  int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
>>
>> +#if CONFIG_CRASH_DUMP
>> +void hv_root_crash_init(void);
>> +void hv_crash_asm32(void);
>> +void hv_crash_asm64_lbl(void);
>> +void hv_crash_asm_end(void);
>> +#else   /* CONFIG_CRASH_DUMP */
>> +static inline void hv_root_crash_init(void) {}
>> +#endif  /* CONFIG_CRASH_DUMP */
>> +
> 
> The hv_crash_asm* functions are x86 specific. Seems like their
> declarations should go in arch/x86/include/asm/mshyperv.h, not in
> the architecture-neutral include/asm-generic/mshyperv.h.

well, the arm port is in progress. i suppose i could move it to x86 and
they can move it back here in their patch submissions. hopefully
they will remember or someone will catch it.

>>  #else /* CONFIG_MSHV_ROOT */
>>  static inline bool hv_root_partition(void) { return false; }
>>  static inline bool hv_l1vh_partition(void) { return false; }
>> --
>> 2.36.1.vfs.0.0
>>



* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-17  1:13     ` Mukesh R
@ 2025-09-17 20:37       ` Mukesh R
  2025-09-18 23:53       ` Michael Kelley
  1 sibling, 0 replies; 29+ messages in thread
From: Mukesh R @ 2025-09-17 20:37 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

On 9/16/25 18:13, Mukesh R wrote:
> On 9/15/25 10:55, Michael Kelley wrote:
>> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>>
>>> Introduce a new file to implement collection of hypervisor ram into the
>>
>> s/ram/RAM/ (multiple places)
> 
> a quick grep indicates that saying ram is common; i like ram over RAM
> 
>>> vmcore collected by linux. By default, the hypervisor ram is locked, ie,
>>> protected via hw page table. Hyper-V implements a disable hypercall which
>>
>> The terminology here is a bit confusing since you have two names for
>> the same thing: "disable" hypervisor, and "devirtualize". Is it possible to
>> just use "devirtualize" everywhere, and drop the "disable" terminology?
> 
> The concept is devirtualization, and the actual hypercall was originally
> named disable, so intermixing is natural imo.

[snip]

>>> +
>>> +/*
>>> + * Setup a temporary gdt to allow the asm code to switch to the long mode.
>>> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
>>> + * relative addressing, hence we must use trampoline_pa here. Also, save other
>>> + * info like jmp and C entry targets for same reasons.
>>> + *
>>> + * Returns: 0 on success, -1 on error
>>> + */
>>> +static int hv_crash_setup_trampdata(u64 trampoline_va)
>>> +{
>>> +	int size, offs;
>>> +	void *dest;
>>> +	struct hv_crash_tramp_data *tramp;
>>> +
>>> +	/* These must match exactly the ones in the corresponding asm file */
>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
>>> +						     cs_jmptgt.address) != 40);
>>
>> It would be nice to pick up the constants from a #include file that is
>> shared with the asm code in Patch 4 of the series.
> 
> yeah, could go either way, some don't like tiny headers...  if there are
> no objections to new header for this, i could go that way too.


yeah, i experimented with creating a new header and with adding to an
existing one. a new header doesn't make sense for just 5 #defines, and
adding the C struct there is not a great idea given its scope is limited
to the specific function in the c file. adding to another header results
in ifdefs for ASM/KERNEL, so not really worth it. I think for now it is
ok, we can live with it. If arm ends up adding more declarations, we can
look into it.


Thanks,
-Mukesh

[ .. deleted.. ]


* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-10  0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor
  2025-09-15 17:55   ` Michael Kelley
@ 2025-09-18 17:11   ` Stanislav Kinsburskii
  1 sibling, 0 replies; 29+ messages in thread
From: Stanislav Kinsburskii @ 2025-09-18 17:11 UTC (permalink / raw)
  To: Mukesh Rathor
  Cc: linux-hyperv, linux-kernel, linux-arch, kys, haiyangz, wei.liu,
	decui, tglx, mingo, bp, dave.hansen, x86, hpa, arnd

On Tue, Sep 09, 2025 at 05:10:08PM -0700, Mukesh Rathor wrote:

<snip>

> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
> +{
> +	struct hv_input_disable_hyp_ex *input;
> +	u64 status;
> +	int msecs = 1000, ccpu = smp_processor_id();
> +
> +	if (ccpu == 0) {
> +		/* crash_save_cpu() will be done in the kexec path */
> +		cpu_emergency_stop_pt();	/* disable performance trace */
> +		atomic_inc(&crash_cpus_wait);
> +	} else {
> +		crash_save_cpu(regs, ccpu);
> +		cpu_emergency_stop_pt();	/* disable performance trace */
> +		atomic_inc(&crash_cpus_wait);
> +		for (;;);			/* cause no vmexits */
> +	}
> +
> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
> +		mdelay(1);
> +
> +	stop_nmi();
> +	if (!hv_has_crashed)
> +		hv_notify_prepare_hyp();
> +
> +	if (crashing_cpu == -1)
> +		crashing_cpu = ccpu;		/* crash cmd uses this */
> +
> +	hv_hvcrash_ctxt_save();
> +	hv_mark_tss_not_busy();
> +	hv_crash_fixup_kernpt();
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	memset(input, 0, sizeof(*input));
> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
> +
> +	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
> +
> +	/* Devirt failed, just reboot as things are in very bad state now */
> +	native_wrmsrq(HV_X64_MSR_RESET, 1);    /* get hv to reboot */

AFAIU here ...

> +}
> +
> +/*
> + * Generic nmi callback handler: could be called without any crash also.
> + *   hv crash: hypervisor injects nmi's into all cpus
> + *   lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
> + */
> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
> +{
> +	int ccpu = smp_processor_id();
> +
> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
> +		hv_has_crashed = 1;
> +
> +	if (!hv_has_crashed && !lx_has_crashed)
> +		return NMI_DONE;	/* ignore the nmi */
> +
> +	if (hv_has_crashed) {
> +		if (!kexec_crash_loaded() || !hv_crash_enabled) {
> +			if (ccpu == 0) {
> +				native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */

and here the machine will be reset, which in both cases won't allow to
collect the VMRS file, thus not allowing to debug nested hypervisor
failures.

Perhaps it worth keeping the state for any case (not just nested), but
the nested state should be preserved.

Thanks,
Stanislav

> -- 
> 2.36.1.vfs.0.0
> 


* RE: [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support
  2025-09-16  1:15     ` Mukesh R
@ 2025-09-18 23:52       ` Michael Kelley
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Kelley @ 2025-09-18 23:52 UTC (permalink / raw)
  To: Mukesh R, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh R <mrathor@linux.microsoft.com> Sent: Monday, September 15, 2025 6:15 PM
> 
> On 9/15/25 10:54, Michael Kelley wrote:
> > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> >>
> >> Add data structures for hypervisor crash dump support to the hypervisor
> >> host ABI header file. Details of their usages are in subsequent commits.
> >>
> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> ---
> >>  include/hyperv/hvhdk_mini.h | 55 +++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 55 insertions(+)
> >>
> >> diff --git a/include/hyperv/hvhdk_mini.h b/include/hyperv/hvhdk_mini.h
> >> index 858f6a3925b3..ad9a8048fb4e 100644
> >> --- a/include/hyperv/hvhdk_mini.h
> >> +++ b/include/hyperv/hvhdk_mini.h
> >>

[snip]

> >> +enum hv_crashdump_action {
> >> +	HV_CRASHDUMP_NONE = 0,
> >> +	HV_CRASHDUMP_SUSPEND_ALL_VPS,
> >> +	HV_CRASHDUMP_PREPARE_FOR_STATE_SAVE,
> >> +	HV_CRASHDUMP_STATE_SAVED,
> >> +	HV_CRASHDUMP_ENTRY,
> >> +};
> >
> > Nit: Since these values are part of the ABI, it's probably better
> > to assign explicit values to each enum member in order to
> > ward off any mistaken reordering or additions in the middle
> > of the list.
> 
> No, like I have mentioned in the past, we are mirroring the hyp headers
> with the eventual goal of just consuming from there directly.
> Each change in the ABI header is very carefully examined; we now have
> a process for it.
> 

Acknowledged. I keep wanting to tighten up the ABI specification,
and sometimes forget that there are constraints on doing so.

Michael


* RE: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor
  2025-09-16 21:30     ` Mukesh R
@ 2025-09-18 23:52       ` Michael Kelley
  2025-09-19  9:06         ` Borislav Petkov
  0 siblings, 1 reply; 29+ messages in thread
From: Michael Kelley @ 2025-09-18 23:52 UTC (permalink / raw)
  To: Mukesh R, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 2:31 PM
> 
> On 9/15/25 10:55, Michael Kelley wrote:
> > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> >>
> >> Introduce a small asm stub to transition from the hypervisor to linux
> >
> > I'd argue for capitalizing "Linux" here and in other places in commit
> > text and code comments throughout this patch set.
> 
> I'd argue against it. A quick grep indicates it is a common practice,
> and in the code world it goes easy on the eyes :).

I'll offer a final comment on this topic, and then let it be. There's
a history of Greg K-H, Marc Zyngier, Boris Petkov, Sean Christopherson,
and other maintainers giving comments to use the capitalized form
of "Linux", "MSR", "RAM", etc. See:

https://lore.kernel.org/lkml/Y+4WHGNdWTZ5Hc6Y@kroah.com/
https://lore.kernel.org/lkml/86o7u0dqzj.wl-maz@kernel.org/
https://lore.kernel.org/lkml/408e68d0-1ae1-6d56-d008-61de14214326@linaro.org/
https://lore.kernel.org/lkml/20250819215304.GMaKTyQBWi6YzqZ0bW@fat_crate.local/
https://lore.kernel.org/lkml/Y0CAHch5UR2Lp0tU@google.com/
https://lore.kernel.org/lkml/20240126214336.GA453589@bhelgaas/
https://lore.kernel.org/lkml/20161117155543.vg3domfqm3bhp4f7@pd.tnic/

> 
> >> upon devirtualization.
> >
> > In this patch and subsequent patches, you've used the phrase "upon
> > devirtualization", which seems a little vague to me. Does this mean
> > "when devirtualization is complete" or perhaps "when the hypervisor
> > completes devirtualization"? Since there's no spec on any of this,
> > being as precise as possible will help future readers.
> 
> since control comes back to linux at the callback here, i fail to
> understand what is vague about it. when hyp completes devirt,
> devirt is complete.

To me, the word "upon" is less precise than just "after".  In temporal
contexts, "upon" might mean "at the same time as" or it might mean
"immediately after". I wrote this comment as I was trying to figure out
how the entire devirtualization process works. Eventually it became clear
and the ambiguity was resolved, but initially I was uncertain. See some
broader thoughts in my reply on Patch 5 of the series.

> 
> >>
> >> At a high level, during panic of either the hypervisor or the dom0 (aka
> >> root), the nmi handler asks hypervisor to devirtualize.
> >
> > Suggest:
> >
> > At a high level, during panic of either the hypervisor or Linux running
> > in dom0 (a.k.a. the root partition), the Linux NMI handler asks the
> > hypervisor to devirtualize.
> >
> >> As part of that,
> >> the arguments include an entry point to return back to linux. This asm
> >> stub implements that entry point.
> >>
> >> The stub is entered in protected mode, uses temporary gdt and page table
> >> to enable long mode and get to kernel entry point which then restores full
> >> kernel context to resume execution to kexec.
> >>
> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> ---
> >>  arch/x86/hyperv/hv_trampoline.S | 105 ++++++++++++++++++++++++++++++++
> >>  1 file changed, 105 insertions(+)
> >>  create mode 100644 arch/x86/hyperv/hv_trampoline.S
> >>
> >> diff --git a/arch/x86/hyperv/hv_trampoline.S b/arch/x86/hyperv/hv_trampoline.S
> >> new file mode 100644
> >> index 000000000000..27a755401a42
> >> --- /dev/null
> >> +++ b/arch/x86/hyperv/hv_trampoline.S
> >> @@ -0,0 +1,105 @@
> >> +/* SPDX-License-Identifier: GPL-2.0-only */
> >> +/*
> >> + * X86 specific Hyper-V kdump/crash related code.
> >
> > Add a qualification that this is for root partition only, and not for
> > general guests?
> 
> i don't think it is needed; it would be odd for guests to collect a hyp
> core. besides, the makefile/kconfig shows this is root vm only
> 
> >> + *
> >> + * Copyright (C) 2025, Microsoft, Inc.
> >> + *
> >> + */
> >> +#include <linux/linkage.h>
> >> +#include <asm/alternative.h>
> >> +#include <asm/msr.h>
> >> +#include <asm/processor-flags.h>
> >> +#include <asm/nospec-branch.h>
> >> +
> >> +/*
> >> + * void noreturn hv_crash_asm32(arg1)
> >> + *    arg1 == edi == 32bit PA of struct hv_crash_trdata
> >
> > I think this is "struct hv_crash_tramp_data".
> 
> correct
> 
> >> + *
> >> + * The hypervisor jumps here upon devirtualization in protected mode. This
> >> + * code gets copied to a page in the low 4G ie, 32bit space so it can run
> >> + * in the protected mode. Hence we cannot use any compile/link time offsets or
> >> + * addresses. It restores long mode via temporary gdt and page tables and
> >> + * eventually jumps to kernel code entry at HV_CRASHDATA_OFFS_C_entry.
> >> + *
> >> + * PreCondition (ie, Hypervisor call back ABI):
> >> + *  o CR0 is set to 0x0021: PE(prot mode) and NE are set, paging is disabled
> >> + *  o CR4 is set to 0x0
> >> + *  o IA32_EFER is set to 0x901 (SCE and NXE are set)
> >> + *  o EDI is set to the Arg passed to HVCALL_DISABLE_HYP_EX.
> >> + *  o CS, DS, ES, FS, GS are all initialized with a base of 0 and limit 0xFFFF
> >> + *  o IDTR, TR and GDTR are initialized with a base of 0 and limit of 0xFFFF
> >> + *  o LDTR is initialized as invalid (limit of 0)
> >> + *  o MSR PAT is power on default.
> >> + *  o Other state/registers are cleared. All TLBs flushed.
> >
> > Clarification about "Other state/registers are cleared":  What about
> > processor features that Linux may have enabled or disabled during its
> > initial boot? Are those still in the states Linux set? Or are they reset to
> > power-on defaults? For example, if Linux enabled x2apic, is x2apic
> > still enabled when the stub is entered?
> 
> correct, if linux set x2apic, x2apic would still be enabled.
> 
> >> + *
> >> + * See Intel SDM 10.8.5
> >
> > Hmmm. I downloaded the latest combined SDM, and section 10.8.5
> > in Volume 3A is about Microcode Update Resources, which doesn't
> > seem applicable here. Other volumes don't have a section 10.8.5.
> 
> google ai found it right away upon searching: intel sdm 10.8.5 ia-32e

Unfortunately, Intel doesn't necessarily maintain the section numbering
across revisions of the SDM. This web page:

https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

has a link to download the "Combined Volume Set", and currently provides
the version dated June 2025. The section "Initializing IA-32e Mode" is
numbered 11.8.5. The December 2024 version has the same 11.8.5
numbering. Are you finding an older version?

Presumably the section title is less likely to change unless Intel does a
major rewrite. So something like this would be more durable:

* See Intel SDM Volume 3A section "Initializing IA-32e Mode" (numbered
11.8.5 in the June 2025 version)

> 
> >> + */
> >> +
> >> +#define HV_CRASHDATA_OFFS_TRAMPCR3    0x0    /*	 0 */
> >> +#define HV_CRASHDATA_OFFS_KERNCR3     0x8    /*	 8 */
> >> +#define HV_CRASHDATA_OFFS_GDTRLIMIT  0x12    /* 18 */
> >> +#define HV_CRASHDATA_OFFS_CS_JMPTGT  0x28    /* 40 */
> >> +#define HV_CRASHDATA_OFFS_C_entry    0x30    /* 48 */
> >
> > It seems like these offsets should go in a #include file along
> > with the definition of struct hv_crash_tramp_data. Then the
> > BUILD_BUG_ON() calls in hv_crash_setup_trampdata() could
> > check against these symbolic names instead of hardcoding
> > numbers that must match these.
> 
> yeah, that works too and was the first cut. but given the small
> number of these, and that they are not used/needed anywhere else,
> and that they will almost never change, creating another tiny header
> in a non-driver directory didn't seem worth it.. but i could go
> either way.
> 
> >> +#define HV_CRASHDATA_TRAMPOLINE_CS    0x8
> >
> > This #define isn't used anywhere.
> 
> removed
> 
> >> +
> >> +	.text
> >> +	.code32
> >> +
> >> +SYM_CODE_START(hv_crash_asm32)
> >> +	UNWIND_HINT_UNDEFINED
> >> +	ANNOTATE_NOENDBR
> >
> > No ENDBR here, presumably because this function is entered via other
> > than an indirect CALL or JMP instruction. Right?
> >
> >> +	movl	$X86_CR4_PAE, %ecx
> >> +	movl	%ecx, %cr4
> >> +
> >> +	movl %edi, %ebx
> >> +	add $HV_CRASHDATA_OFFS_TRAMPCR3, %ebx
> >> +	movl %cs:(%ebx), %eax
> >> +	movl %eax, %cr3
> >> +
> >> +	# Setup EFER for long mode now.
> >> +	movl	$MSR_EFER, %ecx
> >> +	rdmsr
> >> +	btsl	$_EFER_LME, %eax
> >> +	wrmsr
> >> +
> >> +	# Turn paging on using the temp 32bit trampoline page table.
> >> +	movl %cr0, %eax
> >> +	orl $(X86_CR0_PG), %eax
> >> +	movl %eax, %cr0
> >> +
> >> +	/* since kernel cr3 could be above 4G, we need to be in the long mode
> >> +	 * before we can load 64bits of the kernel cr3. We use a temp gdt for
> >> +	 * that with CS.L=1 and CS.D=0 */
> >> +	mov %edi, %eax
> >> +	add $HV_CRASHDATA_OFFS_GDTRLIMIT, %eax
> >> +	lgdtl %cs:(%eax)
> >> +
> >> +	/* not done yet, restore CS now to switch to CS.L=1 */
> >> +	mov %edi, %eax
> >> +	add $HV_CRASHDATA_OFFS_CS_JMPTGT, %eax
> >> +	ljmp %cs:*(%eax)
> >> +SYM_CODE_END(hv_crash_asm32)
> >> +
> >> +	/* we now run in full 64bit IA32-e long mode, CS.L=1 and CS.D=0 */
> >> +	.code64
> >> +	.balign 8
> >> +SYM_CODE_START(hv_crash_asm64)
> >> +	UNWIND_HINT_UNDEFINED
> >> +	ANNOTATE_NOENDBR
> >
> > But this *is* entered via an indirect JMP, right? So back to my
> > earlier question about the state of processor feature enablement.
> > If Linux enabled IBT, is it still enabled after devirtualization and
> > the hypervisor invokes this entry point? Linux guests on Hyper-V
> > have historically not enabled IBT, but patches that enable it are
> > now in linux-next, and will go into the 6.18 kernel. So maybe
> > this needs an ENDBR64.
> 
> IBT would be disabled in the transition here... so it doesn't really
> matter. ENDBR is ok too...

So does Hyper-V explicitly disable IBT before making the callback?
Or is the IBT disabling somehow a processor side effect of going back
to protected mode? I don't see anything in the SDM about the latter.
Not having a Hyper-V spec for all this is frustrating ...

Doing the ENDBR64 here might be safer in the long run in case
we ever do end up here with IBT enabled.

> 
> >> +SYM_INNER_LABEL(hv_crash_asm64_lbl, SYM_L_GLOBAL)
> >> +	/* restore kernel page tables so we can jump to kernel code */
> >> +	mov %edi, %eax
> >> +	add $HV_CRASHDATA_OFFS_KERNCR3, %eax
> >> +	movq %cs:(%eax), %rbx
> >> +	movq %rbx, %cr3
> >> +
> >> +	mov %edi, %eax
> >> +	add $HV_CRASHDATA_OFFS_C_entry, %eax
> >> +	movq %cs:(%eax), %rbx
> >> +	ANNOTATE_RETPOLINE_SAFE
> >> +	jmp *%rbx
> >> +
> >> +	int $3
> >> +
> >> +SYM_INNER_LABEL(hv_crash_asm_end, SYM_L_GLOBAL)
> >> +SYM_CODE_END(hv_crash_asm64)
> >> --
> >> 2.36.1.vfs.0.0
> >>
> >



* RE: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-17  1:13     ` Mukesh R
  2025-09-17 20:37       ` Mukesh R
@ 2025-09-18 23:53       ` Michael Kelley
  2025-09-19  2:32         ` Mukesh R
  1 sibling, 1 reply; 29+ messages in thread
From: Michael Kelley @ 2025-09-18 23:53 UTC (permalink / raw)
  To: Mukesh R, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM
> 
> On 9/15/25 10:55, Michael Kelley wrote:
> > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> >>
> >> Introduce a new file to implement collection of hypervisor ram into the
> >
> > s/ram/RAM/ (multiple places)
> 
> a quick grep indicates that saying ram is common; i like ram over RAM
> 
> >> vmcore collected by linux. By default, the hypervisor ram is locked, ie,
> >> protected via hw page table. Hyper-V implements a disable hypercall which
> >
> > The terminology here is a bit confusing since you have two names for
> > the same thing: "disable" hypervisor, and "devirtualize". Is it possible to
> > just use "devirtualize" everywhere, and drop the "disable" terminology?
> 
> The concept is devirtualization, and the actual hypercall was originally
> named disable, so intermixing is natural imo.
> 
> >> essentially devirtualizes the system on the fly. This mechanism makes the
> >> hypervisor ram accessible to linux. Because the hypervisor ram is already
> >> mapped into linux address space (as reserved ram),
> >
> > Is the hypervisor RAM mapped into the VMM process user address space,
> > or somewhere in the kernel address space? If the latter, where in the kernel
> > code, or what mechanism, does that? Just curious, as I wasn't aware that
> > this is happening ....
> 
> mapped in kernel as normal ram and we reserve it very early in boot. i
> see that patch has not made it here yet, should be coming very soon.

OK, that's fine. The answer to my question is coming soon ....

> 
> >> it is automatically
> >> collected into the vmcore without extra work. More details of the
> >> implementation are available in the file prologue.
> >>
> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> ---
> >>  arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 622 insertions(+)
> >>  create mode 100644 arch/x86/hyperv/hv_crash.c
> >>
> >> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
> >> new file mode 100644
> >> index 000000000000..531bac79d598
> >> --- /dev/null
> >> +++ b/arch/x86/hyperv/hv_crash.c
> >> @@ -0,0 +1,622 @@
> >> +// SPDX-License-Identifier: GPL-2.0-only
> >> +/*
> >> + * X86 specific Hyper-V kdump/crash support module
> >> + *
> >> + * Copyright (C) 2025, Microsoft, Inc.
> >> + *
> >> + * This module implements hypervisor ram collection into vmcore for both
> >> + * cases of the hypervisor crash and linux dom0/root crash.
> >
> > For a hypervisor crash, does any of this apply to general guest VMs? I'm
> > thinking it does not. Hypervisor RAM is collected only into the vmcore
> > for the root partition, right? Maybe some additional clarification could be
> > added so there's no confusion in this regard.
> 
> it would be odd for guests to collect hyp core, and target audience is
> assumed to be those who are somewhat familiar with basic concepts before
> getting here.

I was unsure because I had not seen any code that adds the hypervisor memory
to the Linux memory map. Thought maybe something was going on I hadn’t
heard about, so I didn't know the scope of it.

Of course, I'm one of those people who was *not* familiar with the basic concepts
before getting here. And given that there's no spec available from Hyper-V,
the comments in this patch set are all there is for anyone outside of Microsoft.
In that vein, I think it's reasonable to provide some description of how this
all works in the code comments. And you've done that, which is very
helpful. But I encountered a few places where I was confused or unclear, and
my suggestions here and in Patch 4 are just about making things as precise as
possible without adding a huge amount of additional verbiage. For someone
new, English text descriptions that the code can be checked against are
helpful, and drawing hard boundaries ("this is only applicable to the root
partition") is helpful.

If you don't want to deal with it now, I could provide a follow-on patch later
that tweaks or augments the wording a bit to clarify some of these places. 
You can review, like with any patch. I've done wording work over the years
to many places in the VMBus code, and more broadly in providing most of
the documentation in Documentation/virt/hyperv.

> 
> > And what *does* happen to guest VMs after a hypervisor crash?
> 
> they are gone... what else could we do?
> 
> >> + * Hyper-V implements
> >> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This
> >> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram
> >> + * is already mapped in linux, it is automatically collected into linux vmcore,
> >> + * and can be examined by the crash command (raw ram dump) or windbg.
> >> + *
> >> + * At a high level:
> >> + *
> >> + *  Hypervisor Crash:
> >> + *    Upon crash, hypervisor goes into an emergency minimal dispatch loop, a
> >> + *    restrictive mode with very limited hypercall and msr support.
> >
> > s/msr/MSR/
> 
> msr is used all over, seems acceptable.
> 
> >> + *    Each cpu then injects NMIs into dom0/root vcpus.
> >
> > The "Each cpu" part of this sentence is confusing to me -- which CPUs does
> > this refer to? Maybe it would be better to say "It then injects an NMI into
> > each dom0/root partition vCPU." without being specific as to which CPUs do
> > the injecting since that seems more like a hypervisor implementation detail
> > that's not relevant here.
> 
> all cpus in the system. there is a dedicated/pinned dom0 vcpu for each cpu.

OK, that makes sense now that I think about it. Each physical CPU in the host
has a corresponding vCPU in the dom0/root partition. And each of the vCPUs
gets an NMI that sends it to the Linux-in-dom0 NMI handler, even if it was off
running a vCPU in some guest VM.

> 
> >> + *    A shared page is used to check
> >> + *    by linux in the nmi handler if the hypervisor has crashed. This shared
> >
> > s/nmi/NMI/  (multiple places)
> 
> >> + *    page is setup in hv_root_crash_init during boot.
> >> + *
> >> + *  Linux Crash:
> >> + *    In case of linux crash, the callback hv_crash_stop_other_cpus will send
> >> + *    NMIs to all cpus, then proceed to the crash_nmi_callback where it waits
> >> + *    for all cpus to be in NMI.
> >> + *
> >> + *  NMI Handler (upon quorum):
> >> + *    Eventually, in both cases, all cpus wil end up in the nmi hanlder.
> >
> > s/hanlder/handler/
> >
> > And maybe just drop the word "wil" (which is misspelled).
> >
> >> + *    Hyper-V requires the disable hypervisor must be done from the bsp. So
> >
> > s/bsp/BSP  (multiple places)
> >
> >> + *    the bsp nmi handler saves current context, does some fixups and makes
> >> + *    the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor
> >> + *    at that point will suspend all vcpus (except the bsp), unlock all its
> >> + *    ram, and return to linux at the 32bit mode entry RIP.
> >> + *
> >> + *  Linux 32bit entry trampoline will then restore long mode and call C
> >> + *  function here to restore context and continue execution to crash kexec.
> >> + */
> >> +
> >> +#include <linux/delay.h>
> >> +#include <linux/kexec.h>
> >> +#include <linux/crash_dump.h>
> >> +#include <linux/panic.h>
> >> +#include <asm/apic.h>
> >> +#include <asm/desc.h>
> >> +#include <asm/page.h>
> >> +#include <asm/pgalloc.h>
> >> +#include <asm/mshyperv.h>
> >> +#include <asm/nmi.h>
> >> +#include <asm/idtentry.h>
> >> +#include <asm/reboot.h>
> >> +#include <asm/intel_pt.h>
> >> +
> >> +int hv_crash_enabled;
> >
> > Seems like this is conceptually a "bool", not an "int".
> 
> yeah, can change it to bool if i do another iteration.
> 
> >> +EXPORT_SYMBOL_GPL(hv_crash_enabled);
> >> +
> >> +struct hv_crash_ctxt {
> >> +	ulong rsp;
> >> +	ulong cr0;
> >> +	ulong cr2;
> >> +	ulong cr4;
> >> +	ulong cr8;
> >> +
> >> +	u16 cs;
> >> +	u16 ss;
> >> +	u16 ds;
> >> +	u16 es;
> >> +	u16 fs;
> >> +	u16 gs;
> >> +
> >> +	u16 gdt_fill;
> >> +	struct desc_ptr gdtr;
> >> +	char idt_fill[6];
> >> +	struct desc_ptr idtr;
> >> +
> >> +	u64 gsbase;
> >> +	u64 efer;
> >> +	u64 pat;
> >> +};
> >> +static struct hv_crash_ctxt hv_crash_ctxt;
> >> +
> >> +/* Shared hypervisor page that contains crash dump area we peek into.
> >> + * NB: windbg looks for "hv_cda" symbol so don't change it.
> >> + */
> >> +static struct hv_crashdump_area *hv_cda;
> >> +
> >> +static u32 trampoline_pa, devirt_cr3arg;
> >> +static atomic_t crash_cpus_wait;
> >> +static void *hv_crash_ptpgs[4];
> >> +static int hv_has_crashed, lx_has_crashed;
> >
> > These are conceptually "bool" as well.
> >
> >> +
> >> +/* This cannot be inlined as it needs stack */
> >> +static noinline __noclone void hv_crash_restore_tss(void)
> >> +{
> >> +	load_TR_desc();
> >> +}
> >> +
> >> +/* This cannot be inlined as it needs stack */
> >> +static noinline void hv_crash_clear_kernpt(void)
> >> +{
> >> +	pgd_t *pgd;
> >> +	p4d_t *p4d;
> >> +
> >> +	/* Clear entry so it's not confusing to someone looking at the core */
> >> +	pgd = pgd_offset_k(trampoline_pa);
> >> +	p4d = p4d_offset(pgd, trampoline_pa);
> >> +	native_p4d_clear(p4d);
> >> +}
> >> +
> >> +/*
> >> + * This is the C entry point from the asm glue code after the devirt hypercall.
> >> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
> >> + * page tables with our below 4G page identity mapped, but using a temporary
> >> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not
> >> + * available. We restore kernel GDT, and rest of the context, and continue
> >> + * to kexec.
> >> + */
> >> +static asmlinkage void __noreturn hv_crash_c_entry(void)
> >> +{
> >> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
> >> +
> >> +	/* first thing, restore kernel gdt */
> >> +	native_load_gdt(&ctxt->gdtr);
> >> +
> >> +	asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
> >> +	asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
> >> +
> >> +	asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
> >> +	asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
> >> +	asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
> >> +	asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
> >> +
> >> +	native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
> >> +	asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
> >> +
> >> +	asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
> >> +	asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
> >> +	asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr2));
> >> +
> >> +	native_load_idt(&ctxt->idtr);
> >> +	native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
> >> +	native_wrmsrq(MSR_EFER, ctxt->efer);
> >> +
> >> +	/* restore the original kernel CS now via far return */
> >> +	asm volatile("movzwq %0, %%rax\n\t"
> >> +		     "pushq %%rax\n\t"
> >> +		     "pushq $1f\n\t"
> >> +		     "lretq\n\t"
> >> +		     "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
> >> +
> >> +	/* We are in asmlinkage without stack frame, hence make a C function
> >> +	 * call which will buy stack frame to restore the tss or clear PT entry.
> >> +	 */
> >> +	hv_crash_restore_tss();
> >> +	hv_crash_clear_kernpt();
> >> +
> >> +	/* we are now fully in devirtualized normal kernel mode */
> >> +	__crash_kexec(NULL);
> >
> > The comments for __crash_kexec() say that "panic_cpu" should be set to
> > the current CPU. I don't see that such is the case here.
> 
> if linux panics, it would be set by vpanic; if hyp crashes, that is
> irrelevant.
> 
> >> +
> >> +	for (;;)
> >> +		cpu_relax();
> >
> > Is the intent that __crash_kexec() should never return, on any of the vCPUs,
> > because devirtualization isn't done unless there's a valid kdump image loaded?
> > I wonder if
> >
> > 	native_wrmsrq(HV_X64_MSR_RESET, 1);
> >
> > would be better than looping forever in case __crash_kexec() fails
> > somewhere along the way even if there's a kdump image loaded.
> 
> yeah, i've gone thru all 3 possibilities here:
>   o loop forever
>   o reset
>   o BUG() : this was in V0
> 
> reset is just bad because system would just reboot without any indication
> if hyp crashes. with loop at least there is a hang, and one could make
> note of it, and if internal, attach debugger.
> 
> BUG is best imo because with hyp gone linux will try to redo panic
> and we would print something extra to help. I think i'll just go
> back to my V0: BUG()
> 
> >> +}
> >> +/* Tell gcc we are using lretq long jump in the above function intentionally */
> >> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
> >> +
> >> +static void hv_mark_tss_not_busy(void)
> >> +{
> >> +	struct desc_struct *desc = get_current_gdt_rw();
> >> +	tss_desc tss;
> >> +
> >> +	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
> >> +	tss.type = 0x9;        /* available 64-bit TSS. 0xB is busy TSS */
> >> +	write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);
> >> +}
> >> +
> >> +/* Save essential context */
> >> +static void hv_hvcrash_ctxt_save(void)
> >> +{
> >> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
> >> +
> >> +	asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp));
> >> +
> >> +	ctxt->cr0 = native_read_cr0();
> >> +	ctxt->cr4 = native_read_cr4();
> >> +
> >> +	asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2));
> >> +	asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8));
> >> +
> >> +	asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs));
> >> +	asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss));
> >> +	asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds));
> >> +	asm volatile("movl %%es, %%eax" : "=a"(ctxt->es));
> >> +	asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs));
> >> +	asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs));
> >> +
> >> +	native_store_gdt(&ctxt->gdtr);
> >> +	store_idt(&ctxt->idtr);
> >> +
> >> +	ctxt->gsbase = __rdmsr(MSR_GS_BASE);
> >> +	ctxt->efer = __rdmsr(MSR_EFER);
> >> +	ctxt->pat = __rdmsr(MSR_IA32_CR_PAT);
> >> +}
> >> +
> >> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */
> >> +static void hv_crash_fixup_kernpt(void)
> >> +{
> >> +	pgd_t *pgd;
> >> +	p4d_t *p4d;
> >> +
> >> +	pgd = pgd_offset_k(trampoline_pa);
> >> +	p4d = p4d_offset(pgd, trampoline_pa);
> >> +
> >> +	/* trampoline_pa is below 4G, so no pre-existing entry to clobber */
> >> +	p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]);
> >> +	p4d->p4d = p4d->p4d & ~(_PAGE_NX);    /* enable execute */
> >> +}
> >> +
> >> +/*
> >> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has
> >> + * crashed and will collect core. This will cause the hyp to quiesce and
> >> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp.
> >> + */
> >> +static void hv_notify_prepare_hyp(void)
> >> +{
> >> +	u64 status;
> >> +	struct hv_input_notify_partition_event *input;
> >> +	struct hv_partition_event_root_crashdump_input *cda;
> >> +
> >> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	cda = &input->input.crashdump_input;
> >
> > The code ordering here is a bit weird. I'd expect this line to be grouped
> > with cda->crashdump_action being set.
> 
> we are setting two pointers, and using them later. setting pointers
> up front is pretty normal.
> 
> >> +	memset(input, 0, sizeof(*input));
> >> +	input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP;
> >> +
> >> +	cda->crashdump_action = HV_CRASHDUMP_ENTRY;
> >> +	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
> >> +	if (!hv_result_success(status))
> >> +		return;
> >> +
> >> +	cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS;
> >> +	hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
> >> +}
> >> +
> >> +/*
> >> + * Common function for all cpus before devirtualization.
> >> + *
> >> + * Hypervisor crash: all cpus get here in nmi context.
> >> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
> >> + *		context. Note, panicing cpu may not be the bsp.
> >> + *
> >> + * The function is not inlined so it will show on the stack. It is named so
> >> + * because the crash cmd looks for certain well known function names on the
> >> + * stack before looking into the cpu saved note in the elf section, and
> >> + * that work is currently incomplete.
> >> + *
> >> + * Notes:
> >> + *  Hypervisor crash:
> >> + *    - the hypervisor is in a very restrictive mode at this point and any
> >> + *	vmexit it cannot handle would result in reboot. For example, console
> >> + *	output from here would result in synic ipi hcall, which would result
> >> + *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
> >> + *
> >> + *  Devirtualization is supported from the bsp only.
> >> + */
> >> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
> >> +{
> >> +	struct hv_input_disable_hyp_ex *input;
> >> +	u64 status;
> >> +	int msecs = 1000, ccpu = smp_processor_id();
> >> +
> >> +	if (ccpu == 0) {
> >> +		/* crash_save_cpu() will be done in the kexec path */
> >> +		cpu_emergency_stop_pt();	/* disable performance trace */
> >> +		atomic_inc(&crash_cpus_wait);
> >> +	} else {
> >> +		crash_save_cpu(regs, ccpu);
> >> +		cpu_emergency_stop_pt();	/* disable performance trace */
> >> +		atomic_inc(&crash_cpus_wait);
> >> +		for (;;);			/* cause no vmexits */
> >> +	}
> >> +
> >> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
> >> +		mdelay(1);
> >> +
> >> +	stop_nmi();
> >> +	if (!hv_has_crashed)
> >> +		hv_notify_prepare_hyp();
> >> +
> >> +	if (crashing_cpu == -1)
> >> +		crashing_cpu = ccpu;		/* crash cmd uses this */
> >
> > Could just be "crashing_cpu = 0" since only the BSP gets here.
> 
> a code change request has been open for a while to remove the requirement
> of bsp..
> 
> >> +
> >> +	hv_hvcrash_ctxt_save();
> >> +	hv_mark_tss_not_busy();
> >> +	hv_crash_fixup_kernpt();
> >> +
> >> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	memset(input, 0, sizeof(*input));
> >> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
> >> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
> >
> > Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
> > And just for clarification, Hyper-V treats this "arg" value as opaque and does
> > not access it. It only provides it in EDI when it invokes the trampoline
> > function, right?
> 
> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables).

Yes, the comment matches the name of the "devirt_cr3arg" variable.
Unfortunately my previous comment was incomplete because the value
stored in the static variable "devirt_cr3arg" isn’t the address of an L4 page
table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the
PA of struct hv_crash_tramp_data. The CR3 value is stored in the
tramp32_cr3 field (at offset 0) of that structure, so there's an additional level
of indirection. The (corrected) comment in the header to hv_crash_asm32()
describes EDI as containing "PA of struct hv_crash_tramp_data", which
ought to match what is described here. I'd say that "devirt_cr3arg" ought
to be renamed to "tramp_data_pa" or something else parallel to
"trampoline_pa".

> 
> right, comes in edi, i don't know what EDI is (just kidding!)...
> 
> >> +
> >> +	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
> >> +
> >> +	/* Devirt failed, just reboot as things are in very bad state now */
> >> +	native_wrmsrq(HV_X64_MSR_RESET, 1);    /* get hv to reboot */
> >> +}
> >> +
> >> +/*
> >> + * Generic nmi callback handler: could be called without any crash also.
> >> + *   hv crash: hypervisor injects nmi's into all cpus
> >> + *   lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
> >> + */
> >> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
> >> +{
> >> +	int ccpu = smp_processor_id();
> >> +
> >> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
> >> +		hv_has_crashed = 1;
> >> +
> >> +	if (!hv_has_crashed && !lx_has_crashed)
> >> +		return NMI_DONE;	/* ignore the nmi */
> >> +
> >> +	if (hv_has_crashed) {
> >> +		if (!kexec_crash_loaded() || !hv_crash_enabled) {
> >> +			if (ccpu == 0) {
> >> +				native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */
> >> +			} else
> >> +				for (;;);	/* cause no vmexits */
> >> +		}
> >> +	}
> >> +
> >> +	crash_nmi_callback(regs);
> >> +
> >> +	return NMI_DONE;
> >
> > crash_nmi_callback() should never return, right? Normally one would
> > expect to return NMI_HANDLED here, but I guess it doesn't matter
> > if the return is never executed.
> 
> correct.
> 
> >> +}
> >> +
> >> +/*
> >> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus
> >> + *
> >> + * On normal linux panic, this is called twice: first from panic and then again
> >> + * from native_machine_crash_shutdown.
> >> + *
> >> + * In case of mshv, 3 ways to get here:
> >> + *  1. hv crash (only bsp will get here):
> >> + *	BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry
> >> + *		  -> __crash_kexec -> native_machine_crash_shutdown
> >> + *		  -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus
> >> + *  linux panic:
> >> + *	2. panic cpu x: panic() -> crash_smp_send_stop
> >> + *				     -> smp_ops.crash_stop_other_cpus
> >> + *	3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop
> >> + *
> >> + * NB: noclone and non standard stack because of call to crash_setup_regs().
> >> + */
> >> +static void __noclone hv_crash_stop_other_cpus(void)
> >> +{
> >> +	static int crash_stop_done;
> >> +	struct pt_regs lregs;
> >> +	int ccpu = smp_processor_id();
> >> +
> >> +	if (hv_has_crashed)
> >> +		return;		/* all cpus already in nmi handler path */
> >> +
> >> +	if (!kexec_crash_loaded())
> >> +		return;
> >
> > If we're in a normal panic path (your Case #2 above) with no kdump kernel
> > loaded, why leave the other vCPUs running? Seems like that could violate
> > expectations in vpanic(), where it calls panic_other_cpus_shutdown() and
> > thereafter assumes other vCPUs are not running.
> 
> no, there is lots of complexity here!
> 
> if we hang vcpus here, hyp will note and may trigger its own watchdog.
> also, machine_crash_shutdown() does another ipi.
> 
> I think the best thing to do here is go back to my V0 which did not
> have check for kexec_crash_loaded(), but had this in hv_crash_c_entry:
> 
> +       /* we are now fully in devirtualized normal kernel mode */
> +       __crash_kexec(NULL);
> +
> +       BUG();
> 
> 
> this way hyp would be disabled, ie, system devirtualized, and
> __crash_kexec() will return, resulting in BUG() that will cause
> it to go thru panic and honor panic= parameter with either hang
> or reset. instead of bug, i could just call panic() also.
> 
> >> +
> >> +	if (crash_stop_done)
> >> +		return;
> >> +	crash_stop_done = 1;
> >
> > Is crash_stop_done necessary?  hv_crash_stop_other_cpus() is called
> > from crash_smp_send_stop(), which has its own static variable
> > "cpus_stopped" that does the same thing.
> 
> yes. for error paths.
> 
> >> +
> >> +	/* linux has crashed: hv is healthy, we can ipi safely */
> >> +	lx_has_crashed = 1;
> >> +	wmb();			/* nmi handlers look at lx_has_crashed */
> >> +
> >> +	apic->send_IPI_allbutself(NMI_VECTOR);
> >
> > The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus().
> > In addition to sending the NMI IPI, it does disable_local_APIC(). I don't know, but
> > should disable_local_APIC() be done somewhere here as well?
> 
> no, hyp does that.

As part of the devirt operation initiated by the HVCALL_DISABLE_HYP_EX
hypercall in crash_nmi_callback()? This gets back to an earlier question/comment
where I was trying to figure out if the APIC is still enabled, and in what mode,
when hv_crash_asm32() is invoked.

> 
> >> +
> >> +	if (crashing_cpu == -1)
> >> +		crashing_cpu = ccpu;		/* crash cmd uses this */
> >> +
> >> +	/* crash_setup_regs() happens in kexec also, but for the kexec cpu which
> >> +	 * is the bsp. We could be here on non-bsp cpu, collect regs if so.
> >> +	 */
> >> +	if (ccpu)
> >> +		crash_setup_regs(&lregs, NULL);
> >> +
> >> +	crash_nmi_callback(&lregs);
> >> +}
> >> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
> >> +
> >> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
> >> +struct hv_gdtreg_32 {
> >> +	u16 fill;
> >> +	u16 limit;
> >> +	u32 address;
> >> +} __packed;
> >> +
> >> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
> >> +struct hv_crash_tramp_gdt {
> >> +	u64 null;	/* index 0, selector 0, null selector */
> >> +	u64 cs64;	/* index 1, selector 8, cs64 selector */
> >> +} __packed;
> >> +
> >> +/* No stack, so jump via far ptr in memory to load the 64bit CS */
> >> +struct hv_cs_jmptgt {
> >> +	u32 address;
> >> +	u16 csval;
> >> +	u16 fill;
> >> +} __packed;
> >> +
> >> +/* This trampoline data is copied onto the trampoline page after the asm code */
> >> +struct hv_crash_tramp_data {
> >> +	u64 tramp32_cr3;
> >> +	u64 kernel_cr3;
> >> +	struct hv_gdtreg_32 gdtr32;
> >> +	struct hv_crash_tramp_gdt tramp_gdt;
> >> +	struct hv_cs_jmptgt cs_jmptgt;
> >> +	u64 c_entry_addr;
> >> +} __packed;
> >> +
> >> +/*
> >> + * Setup a temporary gdt to allow the asm code to switch to the long mode.
> >> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
> >> + * relative addressing, hence we must use trampoline_pa here. Also, save other
> >> + * info like jmp and C entry targets for same reasons.
> >> + *
> >> + * Returns: 0 on success, -1 on error
> >> + */
> >> +static int hv_crash_setup_trampdata(u64 trampoline_va)
> >> +{
> >> +	int size, offs;
> >> +	void *dest;
> >> +	struct hv_crash_tramp_data *tramp;
> >> +
> >> +	/* These must match exactly the ones in the corresponding asm file */
> >> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
> >> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
> >> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
> >> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
> >> +						     cs_jmptgt.address) != 40);
> >
> > It would be nice to pick up the constants from a #include file that is
> > shared with the asm code in Patch 4 of the series.
> 
> yeah, could go either way, some don't like tiny headers...  if there are
> no objections to a new header for this, i could go that way too.

Saw your follow-on comments about this as well. The tiny header
is ugly. It's a judgment call that can go either way, so go with your
preference.

> 
> >> +
> >> +	/* hv_crash_asm_end is beyond last byte by 1 */
> >> +	size = &hv_crash_asm_end - &hv_crash_asm32;
> >> +	if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
> >> +		pr_err("%s: trampoline page overflow\n", __func__);
> >> +		return -1;
> >> +	}
> >> +
> >> +	dest = (void *)trampoline_va;
> >> +	memcpy(dest, &hv_crash_asm32, size);
> >> +
> >> +	dest += size;
> >> +	dest = (void *)round_up((ulong)dest, 16);
> >> +	tramp = (struct hv_crash_tramp_data *)dest;
> >> +
> >> +	/* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
> >> +	 * non-PCID-aware users". Build cr3 with pcid 0
> >> +	 */
> >> +	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
> >> +
> >> +	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
> >> +	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
> >> +
> >> +	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
> >> +	tramp->gdtr32.address = trampoline_pa +
> >> +				   (ulong)&tramp->tramp_gdt - trampoline_va;
> >> +
> >> +	 /* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
> >> +	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
> >> +
> >> +	tramp->cs_jmptgt.csval = 0x8;
> >> +	offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
> >> +	tramp->cs_jmptgt.address = trampoline_pa + offs;
> >> +
> >> +	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
> >> +
> >> +	devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +/*
> >> + * Build 32bit trampoline page table for transition from protected mode
> >> + * non-paging to long-mode paging. This transition needs pagetables below 4G.
> >> + */
> >> +static void hv_crash_build_tramp_pt(void)
> >> +{
> >> +	p4d_t *p4d;
> >> +	pud_t *pud;
> >> +	pmd_t *pmd;
> >> +	pte_t *pte;
> >> +	u64 pa, addr = trampoline_pa;
> >> +
> >> +	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
> >> +	pa = virt_to_phys(hv_crash_ptpgs[1]);
> >> +	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
> >> +	p4d->p4d &= ~(_PAGE_NX);	/* disable no execute */
> >> +
> >> +	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
> >> +	pa = virt_to_phys(hv_crash_ptpgs[2]);
> >> +	set_pud(pud, __pud(_PAGE_TABLE | pa));
> >> +
> >> +	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
> >> +	pa = virt_to_phys(hv_crash_ptpgs[3]);
> >> +	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
> >> +
> >> +	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
> >> +	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
> >> +}
> >> +
> >> +/*
> >> + * Setup trampoline for devirtualization:
> >> + *  - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to
> >> + *    in protected mode.
> >> + *  - 4 pages for a temporary page table that asm code uses to turn paging on
> >> + *  - a temporary gdt to use in the compat mode.
> >> + *
> >> + *  Returns: 0 on success
> >> + */
> >> +static int hv_crash_trampoline_setup(void)
> >> +{
> >> +	int i, rc, order;
> >> +	struct page *page;
> >> +	u64 trampoline_va;
> >> +	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
> >> +
> >> +	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
> >> +	page = alloc_page(flags32);
> >> +	if (page == NULL) {
> >> +		pr_err("%s: failed to alloc asm stub page\n", __func__);
> >> +		return -1;
> >> +	}
> >> +
> >> +	trampoline_va = (u64)page_to_virt(page);
> >> +	trampoline_pa = (u32)page_to_phys(page);
> >> +
> >> +	order = 2;	   /* alloc 2^2 pages */
> >> +	page = alloc_pages(flags32, order);
> >> +	if (page == NULL) {
> >> +		pr_err("%s: failed to alloc pt pages\n", __func__);
> >> +		free_page(trampoline_va);
> >> +		return -1;
> >> +	}
> >> +
> >> +	for (i = 0; i < 4; i++, page++)
> >> +		hv_crash_ptpgs[i] = page_to_virt(page);
> >> +
> >> +	hv_crash_build_tramp_pt();
> >> +
> >> +	rc = hv_crash_setup_trampdata(trampoline_va);
> >> +	if (rc)
> >> +		goto errout;
> >> +
> >> +	return 0;
> >> +
> >> +errout:
> >> +	free_page(trampoline_va);
> >> +	free_pages((ulong)hv_crash_ptpgs[0], order);
> >> +
> >> +	return rc;
> >> +}
> >> +
> >> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
> >> +void hv_root_crash_init(void)
> >> +{
> >> +	int rc;
> >> +	struct hv_input_get_system_property *input;
> >> +	struct hv_output_get_system_property *output;
> >> +	unsigned long flags;
> >> +	u64 status;
> >> +	union hv_pfn_range cda_info;
> >> +
> >> +	if (pgtable_l5_enabled()) {
> >> +		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
> >> +		return;
> >> +	}
> >> +
> >> +	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
> >> +				  "hv_crash_nmi");
> >> +	if (rc) {
> >> +		pr_err("Hyper-V: failed to register crash nmi handler\n");
> >> +		return;
> >> +	}
> >> +
> >> +	local_irq_save(flags);
> >> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
> >> +
> >> +	memset(input, 0, sizeof(*input));
> >> +	memset(output, 0, sizeof(*output));
> >
> > Why zero the output area? This is one of those hypercall things that we're
> > inconsistent about. A few hypercall call sites zero the output area, and it's
> > not clear why they do. Hyper-V should be responsible for properly filling in
> > the output area. Linux should not need to do this zero'ing, unless there's some
> > known bug in Hyper-V for certain hypercalls, in which case there should be
> > a code comment stating "why".
> 
> for the same reason sometimes you see char *p = NULL, either leftover
> code or someone was debugging or just copy and paste. this is just copy
> paste. i agree in general that we don't need to clear it at all, in fact,
> i'd like to remove them all! but i also understand people with different
> skills and junior members find it easier to debug, and also we were in
> early product development. for that reason, it doesn't have to be
> consistent either, if some complex hypercalls are failing repeatedly,
> just for ease of debug, one might leave it there temporarily.  but
> now that things are stable, i think we should just remove them all and
> get used to a bit more inconvenient debugging...

I see your point about debugging, but on balance I agree that they
should all be removed. If there's some debug case, add it back
temporarily to debug, but leave upstream without it. The zero'ing is
also unnecessary code in the interrupt disabled window, which you
have expressed concern about in a different thread.

> 
> >> +	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
> >> +
> >> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
> >> +	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
> >> +	local_irq_restore(flags);
> >> +
> >> +	if (!hv_result_success(status)) {
> >> +		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
> >> +		       input->property_id, hv_result_to_string(status));
> >> +		goto err_out;
> >> +	}
> >> +
> >> +	if (cda_info.base_pfn == 0) {
> >> +		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
> >> +		goto err_out;
> >> +	}
> >> +
> >> +	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
> >
> > Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in
> > terms of the Hyper-V page size, which isn't necessarily the guest page size.
> > Yes, on x86 there's no difference, but for future robustness ....
> 
> i don't know about guests, but we won't even boot if dom0 pg size
> didn't match.. but easier to change than to make the case..

FWIW, a normal Linux guest on ARM64 works just fine with a page
size of 16K or 64K, even though the underlying Hyper-V page size
is only 4K. That's why we have HV_HYP_PAGE_SHIFT and related in
the first place. Using it properly really matters for normal guests.
(Having the guest page size smaller than the Hyper-V page size
does *not* work, but there are no such use cases.)

Even on ARM64, I know the root partition page size is required to
match the Hyper-V page size. But using HV_HYP_PAGE_SIZE is
still appropriate, so as not to leave code that will go wrong if
that matching requirement should ever change.

> 
> >> +
> >> +	rc = hv_crash_trampoline_setup();
> >> +	if (rc)
> >> +		goto err_out;
> >> +
> >> +	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
> >> +
> >> +	crash_kexec_post_notifiers = true;
> >> +	hv_crash_enabled = 1;
> >> +	pr_info("Hyper-V: linux and hv kdump support enabled\n");
> >
> > This message and the message below aren't consistent. One refers
> > to "hv kdump" and the other to "hyp kdump".
> 
> >> +
> >> +	return;
> >> +
> >> +err_out:
> >> +	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
> >> +	pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
> >> +}
> >> --
> >> 2.36.1.vfs.0.0
> >>
> >


^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files
  2025-09-17  1:15     ` Mukesh R
@ 2025-09-18 23:53       ` Michael Kelley
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Kelley @ 2025-09-18 23:53 UTC (permalink / raw)
  To: Mukesh R, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:16 PM
> 
> On 9/15/25 10:56, Michael Kelley wrote:
> > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> >>
> >> Enable build of the new files introduced in the earlier commits and add
> >> call to do the setup during boot.
> >>
> >> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> >> ---
> >>  arch/x86/hyperv/Makefile       | 6 ++++++
> >>  arch/x86/hyperv/hv_init.c      | 1 +
> >>  include/asm-generic/mshyperv.h | 9 +++++++++
> >>  3 files changed, 16 insertions(+)
> >>
> >> diff --git a/arch/x86/hyperv/Makefile b/arch/x86/hyperv/Makefile
> >> index d55f494f471d..6f5d97cddd80 100644
> >> --- a/arch/x86/hyperv/Makefile
> >> +++ b/arch/x86/hyperv/Makefile
> >> @@ -5,4 +5,10 @@ obj-$(CONFIG_HYPERV_VTL_MODE)	+= hv_vtl.o
> >>
> >>  ifdef CONFIG_X86_64
> >>  obj-$(CONFIG_PARAVIRT_SPINLOCKS)	+= hv_spinlock.o
> >> +
> >> + ifdef CONFIG_MSHV_ROOT
> >> +  CFLAGS_REMOVE_hv_trampoline.o += -pg
> >> +  CFLAGS_hv_trampoline.o        += -fno-stack-protector
> >> +  obj-$(CONFIG_CRASH_DUMP)      += hv_crash.o hv_trampoline.o
> >> + endif
> >>  endif
> >> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> >> index afdbda2dd7b7..577bbd143527 100644
> >> --- a/arch/x86/hyperv/hv_init.c
> >> +++ b/arch/x86/hyperv/hv_init.c
> >> @@ -510,6 +510,7 @@ void __init hyperv_init(void)
> >>  		memunmap(src);
> >>
> >>  		hv_remap_tsc_clocksource();
> >> +		hv_root_crash_init();
> >>  	} else {
> >>  		hypercall_msr.guest_physical_address = vmalloc_to_pfn(hv_hypercall_pg);
> >>  		wrmsrq(HV_X64_MSR_HYPERCALL, hypercall_msr.as_uint64);
> >> diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
> >> index dbd4c2f3aee3..952c221765f5 100644
> >> --- a/include/asm-generic/mshyperv.h
> >> +++ b/include/asm-generic/mshyperv.h
> >> @@ -367,6 +367,15 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32
> >> num_pages);
> >>  int hv_call_add_logical_proc(int node, u32 lp_index, u32 acpi_id);
> >>  int hv_call_create_vp(int node, u64 partition_id, u32 vp_index, u32 flags);
> >>
> >> +#if CONFIG_CRASH_DUMP
> >> +void hv_root_crash_init(void);
> >> +void hv_crash_asm32(void);
> >> +void hv_crash_asm64_lbl(void);
> >> +void hv_crash_asm_end(void);
> >> +#else   /* CONFIG_CRASH_DUMP */
> >> +static inline void hv_root_crash_init(void) {}
> >> +#endif  /* CONFIG_CRASH_DUMP */
> >> +
> >
> > The hv_crash_asm* functions are x86 specific. Seems like their
> > declarations should go in arch/x86/include/asm/mshyperv.h, not in
> > the architecture-neutral include/asm-generic/mshyperv.h.
> 
> well, arm port is going on. i suppose i could move it to x86 and
> they can move it back  here in their patch submissions. hopefully
> they will remember or someone will catch it.

I could see the ARM64 port providing its own version of
hv_root_crash_init() since that's a generic name. But sharing the
"asm" function names across architectures seems more questionable.
I doubt there would be an hv_crash_asm32() on ARM64. :-)
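A hypothetical sketch of the split being suggested (file placement and the CONFIG_CRASH_DUMP stub pattern mirror the quoted patch; the exact layout is an assumption):

```c
/* arch/x86/include/asm/mshyperv.h -- x86-only asm entry points */
void hv_crash_asm32(void);
void hv_crash_asm64_lbl(void);
void hv_crash_asm_end(void);

/* include/asm-generic/mshyperv.h -- arch-neutral interface */
#ifdef CONFIG_CRASH_DUMP
void hv_root_crash_init(void);
#else	/* CONFIG_CRASH_DUMP */
static inline void hv_root_crash_init(void) {}
#endif	/* CONFIG_CRASH_DUMP */
```

This way an ARM64 port only has to implement hv_root_crash_init() and keeps its own asm labels in its own arch header.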

> 
> >>  #else /* CONFIG_MSHV_ROOT */
> >>  static inline bool hv_root_partition(void) { return false; }
> >>  static inline bool hv_l1vh_partition(void) { return false; }
> >> --
> >> 2.36.1.vfs.0.0
> >>


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-18 23:53       ` Michael Kelley
@ 2025-09-19  2:32         ` Mukesh R
  2025-09-19 19:48           ` Michael Kelley
  2025-09-20  1:42           ` Mukesh R
  0 siblings, 2 replies; 29+ messages in thread
From: Mukesh R @ 2025-09-19  2:32 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

On 9/18/25 16:53, Michael Kelley wrote:
> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM
>>
>> On 9/15/25 10:55, Michael Kelley wrote:
>>> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>>>
>>>> Introduce a new file to implement collection of hypervisor ram into the
>>>
>>> s/ram/RAM/ (multiple places)
>>
>> a quick grep indicates using saying ram is common, i like ram over RAM
>>
>>>> vmcore collected by linux. By default, the hypervisor ram is locked, ie,
>>>> protected via hw page table. Hyper-V implements a disable hypercall which
>>>
>>> The terminology here is a bit confusing since you have two names for
>>> the same thing: "disable" hypervisor, and "devirtualize". Is it possible to
>>> just use "devirtualize" everywhere, and drop the "disable" terminology?
>>
>> The concept is devirtualize and the actual hypercall was originally named
>> disable. so intermixing is natural imo.
>>
>>>> essentially devirtualizes the system on the fly. This mechanism makes the
>>>> hypervisor ram accessible to linux. Because the hypervisor ram is already
>>>> mapped into linux address space (as reserved ram),
>>>
>>> Is the hypervisor RAM mapped into the VMM process user address space,
>>> or somewhere in the kernel address space? If the latter, where in the kernel
>>> code, or what mechanism, does that? Just curious, as I wasn't aware that
>>> this is happening ....
>>
>> mapped in kernel as normal ram and we reserve it very early in boot. i
>> see that patch has not made it here yet, should be coming very soon.
> 
> OK, that's fine. The answer to my question is coming soon ....
> 
>>
>>>> it is automatically
>>>> collected into the vmcore without extra work. More details of the
>>>> implementation are available in the file prologue.
>>>>
>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>> ---
>>>>  arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++
>>>>  1 file changed, 622 insertions(+)
>>>>  create mode 100644 arch/x86/hyperv/hv_crash.c
>>>>
>>>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
>>>> new file mode 100644
>>>> index 000000000000..531bac79d598
>>>> --- /dev/null
>>>> +++ b/arch/x86/hyperv/hv_crash.c
>>>> @@ -0,0 +1,622 @@
>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>> +/*
>>>> + * X86 specific Hyper-V kdump/crash support module
>>>> + *
>>>> + * Copyright (C) 2025, Microsoft, Inc.
>>>> + *
>>>> + * This module implements hypervisor ram collection into vmcore for both
>>>> + * cases of the hypervisor crash and linux dom0/root crash.
>>>
>>> For a hypervisor crash, does any of this apply to general guest VMs? I'm
>>> thinking it does not. Hypervisor RAM is collected only into the vmcore
>>> for the root partition, right? Maybe some additional clarification could be
>>> added so there's no confusion in this regard.
>>
>> it would be odd for guests to collect hyp core, and target audience is
>> assumed to be those who are somewhat familiar with basic concepts before
>> getting here.
> 
> I was unsure because I had not seen any code that adds the hypervisor memory
> to the Linux memory map. Thought maybe something was going on I hadn?t
> heard about, so I didn't know the scope of it.
> 
> Of course, I'm one of those people who was *not* familiar with the basic concepts
> before getting here. And given that there's no spec available from Hyper-V,
> the comments in this patch set are all there is for anyone outside of Microsoft.
> In that vein, I think it's reasonable to provide some description of how this
> all works in the code comments. And you've done that, which is very
> helpful. But I encountered a few places where I was confused or unclear, and
> my suggestions here and in Patch 4 are just about making things as precise as
> possible without adding a huge amount of additional verbiage. For someone
> new, English text descriptions that the code can be checked against are
> helpful, and drawing hard boundaries ("this is only applicable to the root
> partition") is helpful.
> 
> If you don't want to deal with it now, I could provide a follow-on patch later
> that tweaks or augments the wording a bit to clarify some of these places. 
> You can review, like with any patch. I've done wording work over the years
> to many places in the VMBus code, and more broadly in providing most of
> the documentation in Documentation/virt/hyperv.

with time, things will start making sense... i find the comment pretty
clear that it collects the core for both cases of hv crash and dom0
crash, and no mention of guests implies it has nothing to do with guests.

>>
>>> And what *does* happen to guest VMs after a hypervisor crash?
>>
>> they are gone... what else could we do?
>>
>>>> + * Hyper-V implements
>>>> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This
>>>> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram
>>>> + * is already mapped in linux, it is automatically collected into linux vmcore,
>>>> + * and can be examined by the crash command (raw ram dump) or windbg.
>>>> + *
>>>> + * At a high level:
>>>> + *
>>>> + *  Hypervisor Crash:
>>>> + *    Upon crash, hypervisor goes into an emergency minimal dispatch loop, a
>>>> + *    restrictive mode with very limited hypercall and msr support.
>>>
>>> s/msr/MSR/
>>
>> msr is used all over, seems acceptable.
>>
>>>> + *    Each cpu then injects NMIs into dom0/root vcpus.
>>>
>>> The "Each cpu" part of this sentence is confusing to me -- which CPUs does
>>> this refer to? Maybe it would be better to say "It then injects an NMI into
>>> each dom0/root partition vCPU." without being specific as to which CPUs do
>>> the injecting since that seems more like a hypervisor implementation detail
>>> that's not relevant here.
>>
>> all cpus in the system. there is a dedicated/pinned dom0 vcpu for each cpu.
> 
> OK, that makes sense now that I think about it. Each physical CPU in the host
> has a corresponding vCPU in the dom0/root partition. And each of the vCPUs
> gets an NMI that sends it to the Linux-in-dom0 NMI handler, even if it was off
> running a vCPU in some guest VM.
> 
>>
>>>> + *    A shared page is used to check
>>>> + *    by linux in the nmi handler if the hypervisor has crashed. This shared
>>>
>>> s/nmi/NMI/  (multiple places)
>>
>>>> + *    page is setup in hv_root_crash_init during boot.
>>>> + *
>>>> + *  Linux Crash:
>>>> + *    In case of linux crash, the callback hv_crash_stop_other_cpus will send
>>>> + *    NMIs to all cpus, then proceed to the crash_nmi_callback where it waits
>>>> + *    for all cpus to be in NMI.
>>>> + *
>>>> + *  NMI Handler (upon quorum):
>>>> + *    Eventually, in both cases, all cpus wil end up in the nmi hanlder.
>>>
>>> s/hanlder/handler/
>>>
>>> And maybe just drop the word "wil" (which is misspelled).
>>>
>>>> + *    Hyper-V requires the disable hypervisor must be done from the bsp. So
>>>
>>> s/bsp/BSP  (multiple places)
>>>
>>>> + *    the bsp nmi handler saves current context, does some fixups and makes
>>>> + *    the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor
>>>> + *    at that point will suspend all vcpus (except the bsp), unlock all its
>>>> + *    ram, and return to linux at the 32bit mode entry RIP.
>>>> + *
>>>> + *  Linux 32bit entry trampoline will then restore long mode and call C
>>>> + *  function here to restore context and continue execution to crash kexec.
>>>> + */
>>>> +
>>>> +#include <linux/delay.h>
>>>> +#include <linux/kexec.h>
>>>> +#include <linux/crash_dump.h>
>>>> +#include <linux/panic.h>
>>>> +#include <asm/apic.h>
>>>> +#include <asm/desc.h>
>>>> +#include <asm/page.h>
>>>> +#include <asm/pgalloc.h>
>>>> +#include <asm/mshyperv.h>
>>>> +#include <asm/nmi.h>
>>>> +#include <asm/idtentry.h>
>>>> +#include <asm/reboot.h>
>>>> +#include <asm/intel_pt.h>
>>>> +
>>>> +int hv_crash_enabled;
>>>
>>> Seems like this is conceptually a "bool", not an "int".
>>
>> yeah, can change it to bool if i do another iteration.
>>
>>>> +EXPORT_SYMBOL_GPL(hv_crash_enabled);
>>>> +
>>>> +struct hv_crash_ctxt {
>>>> +	ulong rsp;
>>>> +	ulong cr0;
>>>> +	ulong cr2;
>>>> +	ulong cr4;
>>>> +	ulong cr8;
>>>> +
>>>> +	u16 cs;
>>>> +	u16 ss;
>>>> +	u16 ds;
>>>> +	u16 es;
>>>> +	u16 fs;
>>>> +	u16 gs;
>>>> +
>>>> +	u16 gdt_fill;
>>>> +	struct desc_ptr gdtr;
>>>> +	char idt_fill[6];
>>>> +	struct desc_ptr idtr;
>>>> +
>>>> +	u64 gsbase;
>>>> +	u64 efer;
>>>> +	u64 pat;
>>>> +};
>>>> +static struct hv_crash_ctxt hv_crash_ctxt;
>>>> +
>>>> +/* Shared hypervisor page that contains crash dump area we peek into.
>>>> + * NB: windbg looks for "hv_cda" symbol so don't change it.
>>>> + */
>>>> +static struct hv_crashdump_area *hv_cda;
>>>> +
>>>> +static u32 trampoline_pa, devirt_cr3arg;
>>>> +static atomic_t crash_cpus_wait;
>>>> +static void *hv_crash_ptpgs[4];
>>>> +static int hv_has_crashed, lx_has_crashed;
>>>
>>> These are conceptually "bool" as well.
>>>
>>>> +
>>>> +/* This cannot be inlined as it needs stack */
>>>> +static noinline __noclone void hv_crash_restore_tss(void)
>>>> +{
>>>> +	load_TR_desc();
>>>> +}
>>>> +
>>>> +/* This cannot be inlined as it needs stack */
>>>> +static noinline void hv_crash_clear_kernpt(void)
>>>> +{
>>>> +	pgd_t *pgd;
>>>> +	p4d_t *p4d;
>>>> +
>>>> +	/* Clear entry so it's not confusing to someone looking at the core */
>>>> +	pgd = pgd_offset_k(trampoline_pa);
>>>> +	p4d = p4d_offset(pgd, trampoline_pa);
>>>> +	native_p4d_clear(p4d);
>>>> +}
>>>> +
>>>> +/*
>>>> + * This is the C entry point from the asm glue code after the devirt hypercall.
>>>> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
>>>> + * page tables with our below 4G page identity mapped, but using a temporary
>>>> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not
>>>> + * available. We restore kernel GDT, and rest of the context, and continue
>>>> + * to kexec.
>>>> + */
>>>> +static asmlinkage void __noreturn hv_crash_c_entry(void)
>>>> +{
>>>> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>>>> +
>>>> +	/* first thing, restore kernel gdt */
>>>> +	native_load_gdt(&ctxt->gdtr);
>>>> +
>>>> +	asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>>>> +	asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>>>> +
>>>> +	asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
>>>> +	asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
>>>> +	asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
>>>> +	asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
>>>> +
>>>> +	native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
>>>> +	asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
>>>> +
>>>> +	asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
>>>> +	asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
>>>> +	asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr4));
>>>> +
>>>> +	native_load_idt(&ctxt->idtr);
>>>> +	native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
>>>> +	native_wrmsrq(MSR_EFER, ctxt->efer);
>>>> +
>>>> +	/* restore the original kernel CS now via far return */
>>>> +	asm volatile("movzwq %0, %%rax\n\t"
>>>> +		     "pushq %%rax\n\t"
>>>> +		     "pushq $1f\n\t"
>>>> +		     "lretq\n\t"
>>>> +		     "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
>>>> +
>>>> +	/* We are in asmlinkage without stack frame, hence make a C function
>>>> +	 * call which will buy stack frame to restore the tss or clear PT entry.
>>>> +	 */
>>>> +	hv_crash_restore_tss();
>>>> +	hv_crash_clear_kernpt();
>>>> +
>>>> +	/* we are now fully in devirtualized normal kernel mode */
>>>> +	__crash_kexec(NULL);
>>>
>>> The comments for __crash_kexec() say that "panic_cpu" should be set to
>>> the current CPU. I don't see that such is the case here.
>>
>> if linux panic, it would be set by vpanic, if hyp crash, that is
>> irrelevant.
>>
>>>> +
>>>> +	for (;;)
>>>> +		cpu_relax();
>>>
>>> Is the intent that __crash_kexec() should never return, on any of the vCPUs,
>>> because devirtualization isn't done unless there's a valid kdump image loaded?
>>> I wonder if
>>>
>>> 	native_wrmsrq(HV_X64_MSR_RESET, 1);
>>>
>>> would be better than looping forever in case __crash_kexec() fails
>>> somewhere along the way even if there's a kdump image loaded.
>>
>> yeah, i've gone thru all 3 possibilities here:
>>   o loop forever
>>   o reset
>>   o BUG() : this was in V0
>>
>> reset is just bad because system would just reboot without any indication
>> if hyp crashes. with loop at least there is a hang, and one could make
>> note of it, and if internal, attach debugger.
>>
>> BUG is best imo because with hyp gone linux will try to redo panic
>> and we would print something extra to help. I think i'll just go
>> back to my V0: BUG()
>>
>>>> +}
>>>> +/* Tell gcc we are using lretq long jump in the above function intentionally */
>>>> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
>>>> +
>>>> +static void hv_mark_tss_not_busy(void)
>>>> +{
>>>> +	struct desc_struct *desc = get_current_gdt_rw();
>>>> +	tss_desc tss;
>>>> +
>>>> +	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
>>>> +	tss.type = 0x9;        /* available 64-bit TSS. 0xB is busy TSS */
>>>> +	write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);
>>>> +}
>>>> +
>>>> +/* Save essential context */
>>>> +static void hv_hvcrash_ctxt_save(void)
>>>> +{
>>>> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>>>> +
>>>> +	asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp));
>>>> +
>>>> +	ctxt->cr0 = native_read_cr0();
>>>> +	ctxt->cr4 = native_read_cr4();
>>>> +
>>>> +	asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2));
>>>> +	asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8));
>>>> +
>>>> +	asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs));
>>>> +	asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss));
>>>> +	asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds));
>>>> +	asm volatile("movl %%es, %%eax" : "=a"(ctxt->es));
>>>> +	asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs));
>>>> +	asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs));
>>>> +
>>>> +	native_store_gdt(&ctxt->gdtr);
>>>> +	store_idt(&ctxt->idtr);
>>>> +
>>>> +	ctxt->gsbase = __rdmsr(MSR_GS_BASE);
>>>> +	ctxt->efer = __rdmsr(MSR_EFER);
>>>> +	ctxt->pat = __rdmsr(MSR_IA32_CR_PAT);
>>>> +}
>>>> +
>>>> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */
>>>> +static void hv_crash_fixup_kernpt(void)
>>>> +{
>>>> +	pgd_t *pgd;
>>>> +	p4d_t *p4d;
>>>> +
>>>> +	pgd = pgd_offset_k(trampoline_pa);
>>>> +	p4d = p4d_offset(pgd, trampoline_pa);
>>>> +
>>>> +	/* trampoline_pa is below 4G, so no pre-existing entry to clobber */
>>>> +	p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]);
>>>> +	p4d->p4d = p4d->p4d & ~(_PAGE_NX);    /* enable execute */
>>>> +}
>>>> +
>>>> +/*
>>>> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has
>>>> + * crashed and will collect core. This will cause the hyp to quiesce and
>>>> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp.
>>>> + */
>>>> +static void hv_notify_prepare_hyp(void)
>>>> +{
>>>> +	u64 status;
>>>> +	struct hv_input_notify_partition_event *input;
>>>> +	struct hv_partition_event_root_crashdump_input *cda;
>>>> +
>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +	cda = &input->input.crashdump_input;
>>>
>>> The code ordering here is a bit weird. I'd expect this line to be grouped
>>> with cda->crashdump_action being set.
>>
>> we are setting two pointers, and using them later. setting pointers
>> up front is pretty normal.
>>
>>>> +	memset(input, 0, sizeof(*input));
>>>> +	input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP;
>>>> +
>>>> +	cda->crashdump_action = HV_CRASHDUMP_ENTRY;
>>>> +	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>>>> +	if (!hv_result_success(status))
>>>> +		return;
>>>> +
>>>> +	cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS;
>>>> +	hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Common function for all cpus before devirtualization.
>>>> + *
>>>> + * Hypervisor crash: all cpus get here in nmi context.
>>>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
>>>> + *		context. Note, panicing cpu may not be the bsp.
>>>> + *
>>>> + * The function is not inlined so it will show on the stack. It is named so
>>>> + * because the crash cmd looks for certain well known function names on the
>>>> + * stack before looking into the cpu saved note in the elf section, and
>>>> + * that work is currently incomplete.
>>>> + *
>>>> + * Notes:
>>>> + *  Hypervisor crash:
>>>> + *    - the hypervisor is in a very restrictive mode at this point and any
>>>> + *	vmexit it cannot handle would result in reboot. For example, console
>>>> + *	output from here would result in synic ipi hcall, which would result
>>>> + *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
>>>> + *
>>>> + *  Devirtualization is supported from the bsp only.
>>>> + */
>>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
>>>> +{
>>>> +	struct hv_input_disable_hyp_ex *input;
>>>> +	u64 status;
>>>> +	int msecs = 1000, ccpu = smp_processor_id();
>>>> +
>>>> +	if (ccpu == 0) {
>>>> +		/* crash_save_cpu() will be done in the kexec path */
>>>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>>>> +		atomic_inc(&crash_cpus_wait);
>>>> +	} else {
>>>> +		crash_save_cpu(regs, ccpu);
>>>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>>>> +		atomic_inc(&crash_cpus_wait);
>>>> +		for (;;);			/* cause no vmexits */
>>>> +	}
>>>> +
>>>> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
>>>> +		mdelay(1);
>>>> +
>>>> +	stop_nmi();
>>>> +	if (!hv_has_crashed)
>>>> +		hv_notify_prepare_hyp();
>>>> +
>>>> +	if (crashing_cpu == -1)
>>>> +		crashing_cpu = ccpu;		/* crash cmd uses this */
>>>
>>> Could just be "crashing_cpu = 0" since only the BSP gets here.
>>
>> a code change request has been open for while to remove the requirement
>> of bsp..
>>
>>>> +
>>>> +	hv_hvcrash_ctxt_save();
>>>> +	hv_mark_tss_not_busy();
>>>> +	hv_crash_fixup_kernpt();
>>>> +
>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +	memset(input, 0, sizeof(*input));
>>>> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
>>>> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
>>>
>>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
>>> And just for clarification, Hyper-V treats this "arg" value as opaque and does
>>> not access it. It only provides it in EDI when it invokes the trampoline
>>> function, right?
>>
>> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables).
> 
> Yes, the comment matches the name of the "devirt_cr3arg" variable.
> Unfortunately my previous comment was incomplete because the value
> stored in the static variable "devirt_cr3arg" isn?t the address of an L4 page
> table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the
> PA of struct hv_crash_tramp_data. The CR3 value is stored in the
> tramp32_cr3 field (at offset 0) of that structure, so there's an additional level
> of indirection. The (corrected) comment in the header to hv_crash_asm32()
> describes EDI as containing "PA of struct hv_crash_tramp_data", which
> ought to match what is described here. I'd say that "devirt_cr3arg" ought
> to be renamed to "tramp_data_pa" or something else parallel to
> "trampoline_pa".

hyp needs the trampoline cr3 for the transition, we pass it as an arg.
we piggyback extra information for ourselves needed in trampoline.S.
so it's all good.

>>
>> right, comes in edi, i don't know what EDI is (just kidding!)...
>>
>>>> +
>>>> +	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
>>>> +
>>>> +	/* Devirt failed, just reboot as things are in very bad state now */
>>>> +	native_wrmsrq(HV_X64_MSR_RESET, 1);    /* get hv to reboot */
>>>> +}
>>>> +
>>>> +/*
>>>> + * Generic nmi callback handler: could be called without any crash also.
>>>> + *   hv crash: hypervisor injects nmi's into all cpus
>>>> + *   lx crash: panicing cpu sends nmi to all but self via crash_stop_other_cpus
>>>> + */
>>>> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
>>>> +{
>>>> +	int ccpu = smp_processor_id();
>>>> +
>>>> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
>>>> +		hv_has_crashed = 1;
>>>> +
>>>> +	if (!hv_has_crashed && !lx_has_crashed)
>>>> +		return NMI_DONE;	/* ignore the nmi */
>>>> +
>>>> +	if (hv_has_crashed) {
>>>> +		if (!kexec_crash_loaded() || !hv_crash_enabled) {
>>>> +			if (ccpu == 0) {
>>>> +				native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */
>>>> +			} else
>>>> +				for (;;);	/* cause no vmexits */
>>>> +		}
>>>> +	}
>>>> +
>>>> +	crash_nmi_callback(regs);
>>>> +
>>>> +	return NMI_DONE;
>>>
>>> crash_nmi_callback() should never return, right? Normally one would
>>> expect to return NMI_HANDLED here, but I guess it doesn't matter
>>> if the return is never executed.
>>
>> correct.
>>
>>>> +}
>>>> +
>>>> +/*
>>>> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus
>>>> + *
>>>> + * On normal linux panic, this is called twice: first from panic and then again
>>>> + * from native_machine_crash_shutdown.
>>>> + *
>>>> + * In case of mshv, 3 ways to get here:
>>>> + *  1. hv crash (only bsp will get here):
>>>> + *	BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry
>>>> + *		  -> __crash_kexec -> native_machine_crash_shutdown
>>>> + *		  -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus
>>>> + *  linux panic:
>>>> + *	2. panic cpu x: panic() -> crash_smp_send_stop
>>>> + *				     -> smp_ops.crash_stop_other_cpus
>>>> + *	3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop
>>>> + *
>>>> + * NB: noclone and non standard stack because of call to crash_setup_regs().
>>>> + */
>>>> +static void __noclone hv_crash_stop_other_cpus(void)
>>>> +{
>>>> +	static int crash_stop_done;
>>>> +	struct pt_regs lregs;
>>>> +	int ccpu = smp_processor_id();
>>>> +
>>>> +	if (hv_has_crashed)
>>>> +		return;		/* all cpus already in nmi handler path */
>>>> +
>>>> +	if (!kexec_crash_loaded())
>>>> +		return;
>>>
>>> If we're in a normal panic path (your Case #2 above) with no kdump kernel
>>> loaded, why leave the other vCPUs running? Seems like that could violate
>>> expectations in vpanic(), where it calls panic_other_cpus_shutdown() and
>>> thereafter assumes other vCPUs are not running.
>>
>> no, there is lots of complexity here!
>>
>> if we hang vcpus here, hyp will note and may trigger its own watchdog.
>> also, machine_crash_shutdown() does another ipi.
>>
>> I think the best thing to do here is go back to my V0 which did not
>> have check for kexec_crash_loaded(), but had this in hv_crash_c_entry:
>>
>> +       /* we are now fully in devirtualized normal kernel mode */
>> +       __crash_kexec(NULL);
>> +
>> +       BUG();
>>
>>
>> this way hyp would be disabled, ie, system devirtualized, and
>> __crash_kernel() will return, resulting in BUG() that will cause
>> it to go thru panic and honor panic= parameter with either hang
>> or reset. instead of bug, i could just call panic() also.
>>
>>>> +
>>>> +	if (crash_stop_done)
>>>> +		return;
>>>> +	crash_stop_done = 1;
>>>
>>> Is crash_stop_done necessary?  hv_crash_stop_other_cpus() is called
>>> from crash_smp_send_stop(), which has its own static variable
>>> "cpus_stopped" that does the same thing.
>>
>> yes. for error paths.
>>
>>>> +
>>>> +	/* linux has crashed: hv is healthy, we can ipi safely */
>>>> +	lx_has_crashed = 1;
>>>> +	wmb();			/* nmi handlers look at lx_has_crashed */
>>>> +
>>>> +	apic->send_IPI_allbutself(NMI_VECTOR);
>>>
>>> The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus().
>>> In addition to sending the NMI IPI, it does disable_local_APIC(). I don't know, but
>>> should disable_local_APIC() be done somewhere here as well?
>>
>> no, hyp does that.
> 
> As part of the devirt operation initiated by the HVCALL_DISABLE_HYP_EX
> hypercall in crash_nmi_callback()? This gets back to an earlier question/comment
> where I was trying to figure out if the APIC is still enabled, and in what mode,
> when hv_crash_asm32() is invoked.

>>
>>>> +
>>>> +	if (crashing_cpu == -1)
>>>> +		crashing_cpu = ccpu;		/* crash cmd uses this */
>>>> +
>>>> +	/* crash_setup_regs() happens in kexec also, but for the kexec cpu which
>>>> +	 * is the bsp. We could be here on non-bsp cpu, collect regs if so.
>>>> +	 */
>>>> +	if (ccpu)
>>>> +		crash_setup_regs(&lregs, NULL);
>>>> +
>>>> +	crash_nmi_callback(&lregs);
>>>> +}
>>>> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
>>>> +
>>>> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
>>>> +struct hv_gdtreg_32 {
>>>> +	u16 fill;
>>>> +	u16 limit;
>>>> +	u32 address;
>>>> +} __packed;
>>>> +
>>>> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
>>>> +struct hv_crash_tramp_gdt {
>>>> +	u64 null;	/* index 0, selector 0, null selector */
>>>> +	u64 cs64;	/* index 1, selector 8, cs64 selector */
>>>> +} __packed;
>>>> +
>>>> +/* No stack, so jump via far ptr in memory to load the 64bit CS */
>>>> +struct hv_cs_jmptgt {
>>>> +	u32 address;
>>>> +	u16 csval;
>>>> +	u16 fill;
>>>> +} __packed;
>>>> +
>>>> +/* This trampoline data is copied onto the trampoline page after the asm code */
>>>> +struct hv_crash_tramp_data {
>>>> +	u64 tramp32_cr3;
>>>> +	u64 kernel_cr3;
>>>> +	struct hv_gdtreg_32 gdtr32;
>>>> +	struct hv_crash_tramp_gdt tramp_gdt;
>>>> +	struct hv_cs_jmptgt cs_jmptgt;
>>>> +	u64 c_entry_addr;
>>>> +} __packed;
>>>> +
>>>> +/*
>>>> + * Setup a temporary gdt to allow the asm code to switch to the long mode.
>>>> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
>>>> + * relative addressing, hence we must use trampoline_pa here. Also, save other
>>>> + * info like jmp and C entry targets for same reasons.
>>>> + *
>>>> + * Returns: 0 on success, -1 on error
>>>> + */
>>>> +static int hv_crash_setup_trampdata(u64 trampoline_va)
>>>> +{
>>>> +	int size, offs;
>>>> +	void *dest;
>>>> +	struct hv_crash_tramp_data *tramp;
>>>> +
>>>> +	/* These must match exactly the ones in the corresponding asm file */
>>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
>>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
>>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
>>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
>>>> +						     cs_jmptgt.address) != 40);
>>>
>>> It would be nice to pick up the constants from a #include file that is
>>> shared with the asm code in Patch 4 of the series.
>>
>> yeah, could go either way, some don't like tiny headers...  if there are
>> no objections to new header for this, i could go that way too.
> 
> Saw your follow-on comments about this as well. The tiny header
> is ugly. It's a judgment call that can go either way, so go with your
> preference.
> 
>>
>>>> +
>>>> +	/* hv_crash_asm_end is beyond last byte by 1 */
>>>> +	size = &hv_crash_asm_end - &hv_crash_asm32;
>>>> +	if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
>>>> +		pr_err("%s: trampoline page overflow\n", __func__);
>>>> +		return -1;
>>>> +	}
>>>> +
>>>> +	dest = (void *)trampoline_va;
>>>> +	memcpy(dest, &hv_crash_asm32, size);
>>>> +
>>>> +	dest += size;
>>>> +	dest = (void *)round_up((ulong)dest, 16);
>>>> +	tramp = (struct hv_crash_tramp_data *)dest;
>>>> +
>>>> +	/* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
>>>> +	 * non-PCID-aware users". Build cr3 with pcid 0
>>>> +	 */
>>>> +	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
>>>> +
>>>> +	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
>>>> +	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
>>>> +
>>>> +	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
>>>> +	tramp->gdtr32.address = trampoline_pa +
>>>> +				   (ulong)&tramp->tramp_gdt - trampoline_va;
>>>> +
>>>> +	 /* base:0 limit:0xfffff type:b dpl:0 P:1 L:1 D:0 avl:0 G:1 */
>>>> +	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
>>>> +
>>>> +	tramp->cs_jmptgt.csval = 0x8;
>>>> +	offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
>>>> +	tramp->cs_jmptgt.address = trampoline_pa + offs;
>>>> +
>>>> +	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
>>>> +
>>>> +	devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
>>>> +
>>>> +	return 0;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Build 32bit trampoline page table for transition from protected mode
>>>> + * non-paging to long-mode paging. This transition needs pagetables below 4G.
>>>> + */
>>>> +static void hv_crash_build_tramp_pt(void)
>>>> +{
>>>> +	p4d_t *p4d;
>>>> +	pud_t *pud;
>>>> +	pmd_t *pmd;
>>>> +	pte_t *pte;
>>>> +	u64 pa, addr = trampoline_pa;
>>>> +
>>>> +	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
>>>> +	pa = virt_to_phys(hv_crash_ptpgs[1]);
>>>> +	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
>>>> +	p4d->p4d &= ~(_PAGE_NX);	/* disable no execute */
>>>> +
>>>> +	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
>>>> +	pa = virt_to_phys(hv_crash_ptpgs[2]);
>>>> +	set_pud(pud, __pud(_PAGE_TABLE | pa));
>>>> +
>>>> +	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
>>>> +	pa = virt_to_phys(hv_crash_ptpgs[3]);
>>>> +	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
>>>> +
>>>> +	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
>>>> +	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
>>>> +}
>>>> +
>>>> +/*
>>>> + * Setup trampoline for devirtualization:
>>>> + *  - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to
>>>> + *    in protected mode.
>>>> + *  - 4 pages for a temporary page table that asm code uses to turn paging on
>>>> + *  - a temporary gdt to use in the compat mode.
>>>> + *
>>>> + *  Returns: 0 on success
>>>> + */
>>>> +static int hv_crash_trampoline_setup(void)
>>>> +{
>>>> +	int i, rc, order;
>>>> +	struct page *page;
>>>> +	u64 trampoline_va;
>>>> +	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
>>>> +
>>>> +	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
>>>> +	page = alloc_page(flags32);
>>>> +	if (page == NULL) {
>>>> +		pr_err("%s: failed to alloc asm stub page\n", __func__);
>>>> +		return -1;
>>>> +	}
>>>> +
>>>> +	trampoline_va = (u64)page_to_virt(page);
>>>> +	trampoline_pa = (u32)page_to_phys(page);
>>>> +
>>>> +	order = 2;	   /* alloc 2^2 pages */
>>>> +	page = alloc_pages(flags32, order);
>>>> +	if (page == NULL) {
>>>> +		pr_err("%s: failed to alloc pt pages\n", __func__);
>>>> +		free_page(trampoline_va);
>>>> +		return -1;
>>>> +	}
>>>> +
>>>> +	for (i = 0; i < 4; i++, page++)
>>>> +		hv_crash_ptpgs[i] = page_to_virt(page);
>>>> +
>>>> +	hv_crash_build_tramp_pt();
>>>> +
>>>> +	rc = hv_crash_setup_trampdata(trampoline_va);
>>>> +	if (rc)
>>>> +		goto errout;
>>>> +
>>>> +	return 0;
>>>> +
>>>> +errout:
>>>> +	free_page(trampoline_va);
>>>> +	free_pages((ulong)hv_crash_ptpgs[0], order);
>>>> +
>>>> +	return rc;
>>>> +}
>>>> +
>>>> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
>>>> +void hv_root_crash_init(void)
>>>> +{
>>>> +	int rc;
>>>> +	struct hv_input_get_system_property *input;
>>>> +	struct hv_output_get_system_property *output;
>>>> +	unsigned long flags;
>>>> +	u64 status;
>>>> +	union hv_pfn_range cda_info;
>>>> +
>>>> +	if (pgtable_l5_enabled()) {
>>>> +		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
>>>> +				  "hv_crash_nmi");
>>>> +	if (rc) {
>>>> +		pr_err("Hyper-V: failed to register crash nmi handler\n");
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	local_irq_save(flags);
>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
>>>> +
>>>> +	memset(input, 0, sizeof(*input));
>>>> +	memset(output, 0, sizeof(*output));
>>>
>>> Why zero the output area? This is one of those hypercall things that we're
>>> inconsistent about. A few hypercall call sites zero the output area, and it's
>>> not clear why they do. Hyper-V should be responsible for properly filling in
>>> the output area. Linux should not need to do this zero'ing, unless there's some
>>> known bug in Hyper-V for certain hypercalls, in which case there should be
>>> a code comment stating "why".
>>
>> for the same reason sometimes you see char *p = NULL, either leftover
>> code or someone was debugging or just copy and paste. this is just copy
>> paste. i agree in general that we don't need to clear it at all, in fact,
>> i'd like to remove them all! but i also understand people with different
>> skills and junior members find it easier to debug, and also we were in
>> early product development. for that reason, it doesn't have to be
>> consistent either, if some complex hypercalls are failing repeatedly,
>> just for ease of debug, one might leave it there temporarily.  but
>> now that things are stable, i think we should just remove them all and
>> get used to a bit more inconvenient debugging...
> 
> I see your point about debugging, but on balance I agree that they
> should all be removed. If there's some debug case, add it back
> temporarily to debug, but leave upstream without it. The zero'ing is
> also unnecessary code in the interrupt disabled window, which you
> have expressed concern about in a different thread.

yeah, i've been extremely busy so not able to pay much attention to
upstreaming, but imo they should have been removed before upstreaming.
a simple patch that just removes memset of output would be welcome.

>>
>>>> +	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
>>>> +
>>>> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
>>>> +	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
>>>> +	local_irq_restore(flags);
>>>> +
>>>> +	if (!hv_result_success(status)) {
>>>> +		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
>>>> +		       input->property_id, hv_result_to_string(status));
>>>> +		goto err_out;
>>>> +	}
>>>> +
>>>> +	if (cda_info.base_pfn == 0) {
>>>> +		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
>>>> +		goto err_out;
>>>> +	}
>>>> +
>>>> +	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
>>>
>>> Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in
>>> terms of the Hyper-V page size, which isn't necessarily the guest page size.
>>> Yes, on x86 there's no difference, but for future robustness ....
>>
>> i don't know about guests, but we won't even boot if dom0 pg size
>> didn't match.. but easier to change than to make the case..
> 
> FWIW, a normal Linux guest on ARM64 works just fine with a page
> size of 16K or 64K, even though the underlying Hyper-V page size
> is only 4K. That's why we have HV_HYP_PAGE_SHIFT and related in
> the first place. Using it properly really matters for normal guests.
> (Having the guest page size smaller than the Hyper-V page size
> does *not* work, but there are no such use cases.)
> 
> Even on ARM64, I know the root partition page size is required to
> match the Hyper-V page size. But using HV_HYP_PAGE_SIZE is
> still appropriate just to not leave code that will go wrong if the
> match requirement should ever change.
> 
>>
>>>> +
>>>> +	rc = hv_crash_trampoline_setup();
>>>> +	if (rc)
>>>> +		goto err_out;
>>>> +
>>>> +	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
>>>> +
>>>> +	crash_kexec_post_notifiers = true;
>>>> +	hv_crash_enabled = 1;
>>>> +	pr_info("Hyper-V: linux and hv kdump support enabled\n");
>>>
>>> This message and the message below aren't consistent. One refers
>>> to "hv kdump" and the other to "hyp kdump".
>>
>>>> +
>>>> +	return;
>>>> +
>>>> +err_out:
>>>> +	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
>>>> +	pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
>>>> +}
>>>> --
>>>> 2.36.1.vfs.0.0
>>>>
>>>
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor
  2025-09-18 23:52       ` Michael Kelley
@ 2025-09-19  9:06         ` Borislav Petkov
  2025-09-19 19:09           ` Mukesh R
  0 siblings, 1 reply; 29+ messages in thread
From: Borislav Petkov @ 2025-09-19  9:06 UTC (permalink / raw)
  To: Michael Kelley, Mukesh R
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	tglx@linutronix.de, mingo@redhat.com, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, arnd@arndb.de

On Thu, Sep 18, 2025 at 11:52:35PM +0000, Michael Kelley wrote:
> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 2:31 PM
> > 
> > On 9/15/25 10:55, Michael Kelley wrote:
> > > From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
> > >>
> > >> Introduce a small asm stub to transition from the hypervisor to linux
> > >
> > > I'd argue for capitalizing "Linux" here and in other places in commit
> > > text and code comments throughout this patch set.
> > 
> > I'd argue against it. A quick grep indicates it is a common practice,
> > and in the code world goes easy on the eyes :).

But not in commit messages.

Commit messages should be maximally readable and things should start in
capital letters if that is their common spelling.

When it comes to "Linux", yeah, that's so widespread so you have both. If I'm
referring to what Linux does as a policy or in general or so on, I'd spell it
capitalized but I don't think we've enforced that too strictly...

> I'll offer a final comment on this topic, and then let it be. There's
> a history of Greg K-H, Marc Zyngier, Boris Petkov, Sean Christopherson,
> and other maintainers giving comments to use the capitalized form
> of "Linux", "MSR", "RAM", etc. See:

MSR, RAM and other abbreviations are capitalized and that's the only correct
way to spell them.

> > >> upon devirtualization.

What is "devirtualization"?

> > since control comes back to linux at the callback here, i fail to
> > understand what is vague about it. when hyp completes devirt,
> > devirt is complete.

This "speak" is what gets on my nerves. You're writing here as if everyone is
in your head and everyone knows what "hyp" and "devirt" is.

Commit messages are not code and they should be maximally readable and
accessible to the widest audience, not only to the three people who develop
the feature.

If this patch were aimed at the things I maintain, it'd need a serious commit
message scrubbing and sanitizing first.

HTH.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor
  2025-09-19  9:06         ` Borislav Petkov
@ 2025-09-19 19:09           ` Mukesh R
  0 siblings, 0 replies; 29+ messages in thread
From: Mukesh R @ 2025-09-19 19:09 UTC (permalink / raw)
  To: Borislav Petkov, Michael Kelley
  Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-arch@vger.kernel.org, kys@microsoft.com,
	haiyangz@microsoft.com, wei.liu@kernel.org, decui@microsoft.com,
	tglx@linutronix.de, mingo@redhat.com, dave.hansen@linux.intel.com,
	x86@kernel.org, hpa@zytor.com, arnd@arndb.de

On 9/19/25 02:06, Borislav Petkov wrote:
> On Thu, Sep 18, 2025 at 11:52:35PM +0000, Michael Kelley wrote:
>> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 2:31 PM
>>>
>>> On 9/15/25 10:55, Michael Kelley wrote:
>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>>>>
>>>>> Introduce a small asm stub to transition from the hypervisor to linux
>>>>
>>>> I'd argue for capitalizing "Linux" here and in other places in commit
>>>> text and code comments throughout this patch set.
>>>
>>> I'd argue against it. A quick grep indicates it is a common practice,
>>> and in the code world goes easy on the eyes :).
> 
> But not in commit messages.
> 
> Commit messages should be maximally readable and things should start in
> capital letters if that is their common spelling.
> 
> When it comes to "Linux", yeah, that's so widespread so you have both. If I'm
> referring to what Linux does as a policy or in general or so on, I'd spell it
> capitalized but I don't think we've enforced that too strictly...
> 
>> I'll offer a final comment on this topic, and then let it be. There's
>> a history of Greg K-H, Marc Zyngier, Boris Petkov, Sean Christopherson,
>> and other maintainers giving comments to use the capitalized form
>> of "Linux", "MSR", "RAM", etc. See:
> 
> MSR, RAM and other abbreviations are capitalized and that's the only correct
> way to spell them.
> 
>>>>> upon devirtualization.
> 
> What is "devirtualization"?

Hypervisor is disabled, and it transfers control to the root/dom0
partition, so essentially hypervisor is gone when control comes back
to root/dom0 Linux.

>>> since control comes back to linux at the callback here, i fail to
>>> understand what is vague about it. when hyp completes devirt,
>>> devirt is complete.
> 
> This "speak" is what gets on my nerves. You're writing here as if everyone is
> in your head and everyone knows what "hyp" and "devirt" is.

that's just follow-up conversation, commit comment says "hypervisor" and
"devirtualization".

> Commit messages are not code and they should be maximally readable and
> accessible to the widest audience, not only to the three people who develop
> the feature.
>
> If this patch were aimed at the things I maintain, it'll need a serious commit
> message scrubbing and sanitizing first.
> 
> HTH.
> 



* RE: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-19  2:32         ` Mukesh R
@ 2025-09-19 19:48           ` Michael Kelley
  2025-09-20  1:42           ` Mukesh R
  1 sibling, 0 replies; 29+ messages in thread
From: Michael Kelley @ 2025-09-19 19:48 UTC (permalink / raw)
  To: Mukesh R, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh R <mrathor@linux.microsoft.com> Sent: Thursday, September 18, 2025 7:32 PM
> 
> On 9/18/25 16:53, Michael Kelley wrote:
> > From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM
> >>
> >> On 9/15/25 10:55, Michael Kelley wrote:
> >>> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM

[snip]

> >>>> +
> >>>> +/*
> >>>> + * Common function for all cpus before devirtualization.
> >>>> + *
> >>>> + * Hypervisor crash: all cpus get here in nmi context.
> >>>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
> >>>> + *		context. Note, panicing cpu may not be the bsp.
> >>>> + *
> >>>> + * The function is not inlined so it will show on the stack. It is named so
> >>>> + * because the crash cmd looks for certain well known function names on the
> >>>> + * stack before looking into the cpu saved note in the elf section, and
> >>>> + * that work is currently incomplete.
> >>>> + *
> >>>> + * Notes:
> >>>> + *  Hypervisor crash:
> >>>> + *    - the hypervisor is in a very restrictive mode at this point and any
> >>>> + *	vmexit it cannot handle would result in reboot. For example, console
> >>>> + *	output from here would result in synic ipi hcall, which would result
> >>>> + *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
> >>>> + *
> >>>> + *  Devirtualization is supported from the bsp only.
> >>>> + */
> >>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
> >>>> +{
> >>>> +	struct hv_input_disable_hyp_ex *input;
> >>>> +	u64 status;
> >>>> +	int msecs = 1000, ccpu = smp_processor_id();
> >>>> +
> >>>> +	if (ccpu == 0) {
> >>>> +		/* crash_save_cpu() will be done in the kexec path */
> >>>> +		cpu_emergency_stop_pt();	/* disable performance trace */
> >>>> +		atomic_inc(&crash_cpus_wait);
> >>>> +	} else {
> >>>> +		crash_save_cpu(regs, ccpu);
> >>>> +		cpu_emergency_stop_pt();	/* disable performance trace */
> >>>> +		atomic_inc(&crash_cpus_wait);
> >>>> +		for (;;);			/* cause no vmexits */
> >>>> +	}
> >>>> +
> >>>> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
> >>>> +		mdelay(1);
> >>>> +
> >>>> +	stop_nmi();
> >>>> +	if (!hv_has_crashed)
> >>>> +		hv_notify_prepare_hyp();
> >>>> +
> >>>> +	if (crashing_cpu == -1)
> >>>> +		crashing_cpu = ccpu;		/* crash cmd uses this */
> >>>
> >>> Could just be "crashing_cpu = 0" since only the BSP gets here.
> >>
> >> a code change request has been open for while to remove the requirement
> >> of bsp..
> >>
> >>>> +
> >>>> +	hv_hvcrash_ctxt_save();
> >>>> +	hv_mark_tss_not_busy();
> >>>> +	hv_crash_fixup_kernpt();
> >>>> +
> >>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >>>> +	memset(input, 0, sizeof(*input));
> >>>> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
> >>>> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
> >>>
> >>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
> >>> And just for clarification, Hyper-V treats this "arg" value as opaque and does
> >>> not access it. It only provides it in EDI when it invokes the trampoline
> >>> function, right?
> >>
> >> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables).
> >
> > Yes, the comment matches the name of the "devirt_cr3arg" variable.
> > Unfortunately my previous comment was incomplete because the value
> > stored in the static variable "devirt_cr3arg" isn?t the address of an L4 page
> > table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the
> > PA of struct hv_crash_tramp_data. The CR3 value is stored in the
> > tramp32_cr3 field (at offset 0) of that structure, so there's an additional level
> > of indirection. The (corrected) comment in the header to hv_crash_asm32()
> > describes EDI as containing "PA of struct hv_crash_tramp_data", which
> > ought to match what is described here. I'd say that "devirt_cr3arg" ought
> > to be renamed to "tramp_data_pa" or something else parallel to
> > "trampoline_pa".
> 
> hyp needs trampoline cr3 for transition, we pass it as an arg. we piggy
> back extra information for ourselves needed in trampoline.S. so it's
> all good.
> 

That's a pretty important "detail" that hasn't heretofore been mentioned.
It means the layout of struct hv_crash_tramp_data is not entirely at Linux's
discretion. The tramp32_cr3 field must be first so the hypervisor finds it
where it expects it. Please add code comments describing that the
hypervisor uses the tramp32_cr3 field.

With this new information, I agree the code works. But the devirt_cr3arg
variable is still named incorrectly, and the "PA of trampoline page table L4"
comment is still incorrect. The value in "devirt_cr3arg" is the PA of a memory
location in the trampoline page that contains the devirt CR3 (which itself is
the PA of trampoline page table L4). The CR3 value is in the tramp32_cr3 field
of struct hv_crash_tramp_data in the trampoline page. The CR3 value is
not in static variable devirt_cr3arg, which is why I object to the naming of that
variable.

So rename devirt_cr3arg to devirt_cr3arg_pa. And the comment
becomes "PA of PA of trampoline page table L4", which is rather unwieldy, so
could be shortened to "PA of devirt CR3 value" or something similar. You could
also use "PA of struct hv_crash_tramp_data" as the comment, as I suggested
previously.
 
Michael


* Re: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-19  2:32         ` Mukesh R
  2025-09-19 19:48           ` Michael Kelley
@ 2025-09-20  1:42           ` Mukesh R
  2025-09-23  1:35             ` Michael Kelley
  1 sibling, 1 reply; 29+ messages in thread
From: Mukesh R @ 2025-09-20  1:42 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

On 9/18/25 19:32, Mukesh R wrote:
> On 9/18/25 16:53, Michael Kelley wrote:
>> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM
>>>
>>> On 9/15/25 10:55, Michael Kelley wrote:
>>>> From: Mukesh Rathor <mrathor@linux.microsoft.com> Sent: Tuesday, September 9, 2025 5:10 PM
>>>>>
>>>>> Introduce a new file to implement collection of hypervisor ram into the
>>>>
>>>> s/ram/RAM/ (multiple places)
>>>
>>> a quick grep indicates saying ram is common, i like ram over RAM
>>>
>>>>> vmcore collected by linux. By default, the hypervisor ram is locked, ie,
>>>>> protected via hw page table. Hyper-V implements a disable hypercall which
>>>>
>>>> The terminology here is a bit confusing since you have two names for
>>>> the same thing: "disable" hypervisor, and "devirtualize". Is it possible to
>>>> just use "devirtualize" everywhere, and drop the "disable" terminology?
>>>
>>> The concept is devirtualize and the actual hypercall was originally named
>>> disable. so intermixing is natural imo.
>>>
>>>>> essentially devirtualizes the system on the fly. This mechanism makes the
>>>>> hypervisor ram accessible to linux. Because the hypervisor ram is already
>>>>> mapped into linux address space (as reserved ram),
>>>>
>>>> Is the hypervisor RAM mapped into the VMM process user address space,
>>>> or somewhere in the kernel address space? If the latter, where in the kernel
>>>> code, or what mechanism, does that? Just curious, as I wasn't aware that
>>>> this is happening ....
>>>
>>> mapped in kernel as normal ram and we reserve it very early in boot. i
>>> see that patch has not made it here yet, should be coming very soon.
>>
>> OK, that's fine. The answer to my question is coming soon ....
>>
>>>
>>>>> it is automatically
>>>>> collected into the vmcore without extra work. More details of the
>>>>> implementation are available in the file prologue.
>>>>>
>>>>> Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
>>>>> ---
>>>>>  arch/x86/hyperv/hv_crash.c | 622 +++++++++++++++++++++++++++++++++++++
>>>>>  1 file changed, 622 insertions(+)
>>>>>  create mode 100644 arch/x86/hyperv/hv_crash.c
>>>>>
>>>>> diff --git a/arch/x86/hyperv/hv_crash.c b/arch/x86/hyperv/hv_crash.c
>>>>> new file mode 100644
>>>>> index 000000000000..531bac79d598
>>>>> --- /dev/null
>>>>> +++ b/arch/x86/hyperv/hv_crash.c
>>>>> @@ -0,0 +1,622 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0-only
>>>>> +/*
>>>>> + * X86 specific Hyper-V kdump/crash support module
>>>>> + *
>>>>> + * Copyright (C) 2025, Microsoft, Inc.
>>>>> + *
>>>>> + * This module implements hypervisor ram collection into vmcore for both
>>>>> + * cases of the hypervisor crash and linux dom0/root crash.
>>>>
>>>> For a hypervisor crash, does any of this apply to general guest VMs? I'm
>>>> thinking it does not. Hypervisor RAM is collected only into the vmcore
>>>> for the root partition, right? Maybe some additional clarification could be
>>>> added so there's no confusion in this regard.
>>>
>>> it would be odd for guests to collect hyp core, and target audience is
>>> assumed to be those who are somewhat familiar with basic concepts before
>>> getting here.
>>
>> I was unsure because I had not seen any code that adds the hypervisor memory
> to the Linux memory map. Thought maybe something was going on I hadn't
>> heard about, so I didn't know the scope of it.
>>
>> Of course, I'm one of those people who was *not* familiar with the basic concepts
>> before getting here. And given that there's no spec available from Hyper-V,
>> the comments in this patch set are all there is for anyone outside of Microsoft.
>> In that vein, I think it's reasonable to provide some description of how this
>> all works in the code comments. And you've done that, which is very
>> helpful. But I encountered a few places where I was confused or unclear, and
>> my suggestions here and in Patch 4 are just about making things as precise as
>> possible without adding a huge amount of additional verbiage. For someone
>> new, English text descriptions that the code can be checked against are
>> helpful, and drawing hard boundaries ("this is only applicable to the root
>> partition") is helpful.
>>
>> If you don't want to deal with it now, I could provide a follow-on patch later
>> that tweaks or augments the wording a bit to clarify some of these places. 
>> You can review, like with any patch. I've done wording work over the years
>> to many places in the VMBus code, and more broadly in providing most of
>> the documentation in Documentation/virt/hyperv.
> 
> with time, things will start making sense... i find comment pretty clear
> that it collects core for both cases of hv crash and dom0 crash, and no
> mention of guests implies it has nothing to do with guests. 
> 
>>>
>>>> And what *does* happen to guest VMs after a hypervisor crash?
>>>
>>> they are gone... what else could we do?
>>>
>>>>> + * Hyper-V implements
>>>>> + * a devirtualization hypercall with a 32bit protected mode ABI callback. This
>>>>> + * mechanism must be used to unlock hypervisor ram. Since the hypervisor ram
>>>>> + * is already mapped in linux, it is automatically collected into linux vmcore,
>>>>> + * and can be examined by the crash command (raw ram dump) or windbg.
>>>>> + *
>>>>> + * At a high level:
>>>>> + *
>>>>> + *  Hypervisor Crash:
>>>>> + *    Upon crash, hypervisor goes into an emergency minimal dispatch loop, a
>>>>> + *    restrictive mode with very limited hypercall and msr support.
>>>>
>>>> s/msr/MSR/
>>>
>>> msr is used all over, seems acceptable.
>>>
>>>>> + *    Each cpu then injects NMIs into dom0/root vcpus.
>>>>
>>>> The "Each cpu" part of this sentence is confusing to me -- which CPUs does
>>>> this refer to? Maybe it would be better to say "It then injects an NMI into
>>>> each dom0/root partition vCPU." without being specific as to which CPUs do
>>>> the injecting since that seems more like a hypervisor implementation detail
>>>> that's not relevant here.
>>>
>>> all cpus in the system. there is a dedicated/pinned dom0 vcpu for each cpu.
>>
>> OK, that makes sense now that I think about it. Each physical CPU in the host
>> has a corresponding vCPU in the dom0/root partition. And each of the vCPUs
>> gets an NMI that sends it to the Linux-in-dom0 NMI handler, even if it was off
>> running a vCPU in some guest VM.
>>
>>>
>>>>> + *    A shared page is used to check
>>>>> + *    by linux in the nmi handler if the hypervisor has crashed. This shared
>>>>
>>>> s/nmi/NMI/  (multiple places)
>>>
>>>>> + *    page is setup in hv_root_crash_init during boot.
>>>>> + *
>>>>> + *  Linux Crash:
>>>>> + *    In case of linux crash, the callback hv_crash_stop_other_cpus will send
>>>>> + *    NMIs to all cpus, then proceed to the crash_nmi_callback where it waits
>>>>> + *    for all cpus to be in NMI.
>>>>> + *
>>>>> + *  NMI Handler (upon quorum):
>>>>> + *    Eventually, in both cases, all cpus wil end up in the nmi hanlder.
>>>>
>>>> s/hanlder/handler/
>>>>
>>>> And maybe just drop the word "wil" (which is misspelled).
>>>>
>>>>> + *    Hyper-V requires the disable hypervisor must be done from the bsp. So
>>>>
>>>> s/bsp/BSP  (multiple places)
>>>>
>>>>> + *    the bsp nmi handler saves current context, does some fixups and makes
>>>>> + *    the hypercall to disable the hypervisor, ie, devirtualize. Hypervisor
>>>>> + *    at that point will suspend all vcpus (except the bsp), unlock all its
>>>>> + *    ram, and return to linux at the 32bit mode entry RIP.
>>>>> + *
>>>>> + *  Linux 32bit entry trampoline will then restore long mode and call C
>>>>> + *  function here to restore context and continue execution to crash kexec.
>>>>> + */
>>>>> +
>>>>> +#include <linux/delay.h>
>>>>> +#include <linux/kexec.h>
>>>>> +#include <linux/crash_dump.h>
>>>>> +#include <linux/panic.h>
>>>>> +#include <asm/apic.h>
>>>>> +#include <asm/desc.h>
>>>>> +#include <asm/page.h>
>>>>> +#include <asm/pgalloc.h>
>>>>> +#include <asm/mshyperv.h>
>>>>> +#include <asm/nmi.h>
>>>>> +#include <asm/idtentry.h>
>>>>> +#include <asm/reboot.h>
>>>>> +#include <asm/intel_pt.h>
>>>>> +
>>>>> +int hv_crash_enabled;
>>>>
>>>> Seems like this is conceptually a "bool", not an "int".
>>>
>>> yeah, can change it to bool if i do another iteration.
>>>
>>>>> +EXPORT_SYMBOL_GPL(hv_crash_enabled);
>>>>> +
>>>>> +struct hv_crash_ctxt {
>>>>> +	ulong rsp;
>>>>> +	ulong cr0;
>>>>> +	ulong cr2;
>>>>> +	ulong cr4;
>>>>> +	ulong cr8;
>>>>> +
>>>>> +	u16 cs;
>>>>> +	u16 ss;
>>>>> +	u16 ds;
>>>>> +	u16 es;
>>>>> +	u16 fs;
>>>>> +	u16 gs;
>>>>> +
>>>>> +	u16 gdt_fill;
>>>>> +	struct desc_ptr gdtr;
>>>>> +	char idt_fill[6];
>>>>> +	struct desc_ptr idtr;
>>>>> +
>>>>> +	u64 gsbase;
>>>>> +	u64 efer;
>>>>> +	u64 pat;
>>>>> +};
>>>>> +static struct hv_crash_ctxt hv_crash_ctxt;
>>>>> +
>>>>> +/* Shared hypervisor page that contains crash dump area we peek into.
>>>>> + * NB: windbg looks for "hv_cda" symbol so don't change it.
>>>>> + */
>>>>> +static struct hv_crashdump_area *hv_cda;
>>>>> +
>>>>> +static u32 trampoline_pa, devirt_cr3arg;
>>>>> +static atomic_t crash_cpus_wait;
>>>>> +static void *hv_crash_ptpgs[4];
>>>>> +static int hv_has_crashed, lx_has_crashed;
>>>>
>>>> These are conceptually "bool" as well.
>>>>
>>>>> +
>>>>> +/* This cannot be inlined as it needs stack */
>>>>> +static noinline __noclone void hv_crash_restore_tss(void)
>>>>> +{
>>>>> +	load_TR_desc();
>>>>> +}
>>>>> +
>>>>> +/* This cannot be inlined as it needs stack */
>>>>> +static noinline void hv_crash_clear_kernpt(void)
>>>>> +{
>>>>> +	pgd_t *pgd;
>>>>> +	p4d_t *p4d;
>>>>> +
>>>>> +	/* Clear entry so it's not confusing to someone looking at the core */
>>>>> +	pgd = pgd_offset_k(trampoline_pa);
>>>>> +	p4d = p4d_offset(pgd, trampoline_pa);
>>>>> +	native_p4d_clear(p4d);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * This is the C entry point from the asm glue code after the devirt hypercall.
>>>>> + * We enter here in IA32-e long mode, ie, full 64bit mode running on kernel
>>>>> + * page tables with our below 4G page identity mapped, but using a temporary
>>>>> + * GDT. ds/fs/gs/es are null. ss is not usable. bp is null. stack is not
>>>>> + * available. We restore kernel GDT, and rest of the context, and continue
>>>>> + * to kexec.
>>>>> + */
>>>>> +static asmlinkage void __noreturn hv_crash_c_entry(void)
>>>>> +{
>>>>> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>>>>> +
>>>>> +	/* first thing, restore kernel gdt */
>>>>> +	native_load_gdt(&ctxt->gdtr);
>>>>> +
>>>>> +	asm volatile("movw %%ax, %%ss" : : "a"(ctxt->ss));
>>>>> +	asm volatile("movq %0, %%rsp" : : "m"(ctxt->rsp));
>>>>> +
>>>>> +	asm volatile("movw %%ax, %%ds" : : "a"(ctxt->ds));
>>>>> +	asm volatile("movw %%ax, %%es" : : "a"(ctxt->es));
>>>>> +	asm volatile("movw %%ax, %%fs" : : "a"(ctxt->fs));
>>>>> +	asm volatile("movw %%ax, %%gs" : : "a"(ctxt->gs));
>>>>> +
>>>>> +	native_wrmsrq(MSR_IA32_CR_PAT, ctxt->pat);
>>>>> +	asm volatile("movq %0, %%cr0" : : "r"(ctxt->cr0));
>>>>> +
>>>>> +	asm volatile("movq %0, %%cr8" : : "r"(ctxt->cr8));
>>>>> +	asm volatile("movq %0, %%cr4" : : "r"(ctxt->cr4));
>>>>> +	asm volatile("movq %0, %%cr2" : : "r"(ctxt->cr2));
>>>>> +
>>>>> +	native_load_idt(&ctxt->idtr);
>>>>> +	native_wrmsrq(MSR_GS_BASE, ctxt->gsbase);
>>>>> +	native_wrmsrq(MSR_EFER, ctxt->efer);
>>>>> +
>>>>> +	/* restore the original kernel CS now via far return */
>>>>> +	asm volatile("movzwq %0, %%rax\n\t"
>>>>> +		     "pushq %%rax\n\t"
>>>>> +		     "pushq $1f\n\t"
>>>>> +		     "lretq\n\t"
>>>>> +		     "1:nop\n\t" : : "m"(ctxt->cs) : "rax");
>>>>> +
>>>>> +	/* We are in asmlinkage without stack frame, hence make a C function
>>>>> +	 * call which will buy stack frame to restore the tss or clear PT entry.
>>>>> +	 */
>>>>> +	hv_crash_restore_tss();
>>>>> +	hv_crash_clear_kernpt();
>>>>> +
>>>>> +	/* we are now fully in devirtualized normal kernel mode */
>>>>> +	__crash_kexec(NULL);
>>>>
>>>> The comments for __crash_kexec() say that "panic_cpu" should be set to
>>>> the current CPU. I don't see that such is the case here.
>>>
>>> if linux panics, it would be set by vpanic; if the hyp crashed, that is
>>> irrelevant.
>>>
>>>>> +
>>>>> +	for (;;)
>>>>> +		cpu_relax();
>>>>
>>>> Is the intent that __crash_kexec() should never return, on any of the vCPUs,
>>>> because devirtualization isn't done unless there's a valid kdump image loaded?
>>>> I wonder if
>>>>
>>>> 	native_wrmsrq(HV_X64_MSR_RESET, 1);
>>>>
>>>> would be better than looping forever in case __crash_kexec() fails
>>>> somewhere along the way even if there's a kdump image loaded.
>>>
>>> yeah, i've gone thru all 3 possibilities here:
>>>   o loop forever
>>>   o reset
>>>   o BUG() : this was in V0
>>>
>>> reset is just bad because system would just reboot without any indication
>>> if hyp crashes. with loop at least there is a hang, and one could make
>>> note of it, and if internal, attach debugger.
>>>
>>> BUG is best imo because with hyp gone linux will try to redo panic
>>> and we would print something extra to help. I think i'll just go
>>> back to my V0: BUG()
>>>
>>>>> +}
>>>>> +/* Tell gcc we are using lretq long jump in the above function intentionally */
>>>>> +STACK_FRAME_NON_STANDARD(hv_crash_c_entry);
>>>>> +
>>>>> +static void hv_mark_tss_not_busy(void)
>>>>> +{
>>>>> +	struct desc_struct *desc = get_current_gdt_rw();
>>>>> +	tss_desc tss;
>>>>> +
>>>>> +	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
>>>>> +	tss.type = 0x9;        /* available 64-bit TSS. 0xB is busy TSS */
>>>>> +	write_gdt_entry(desc, GDT_ENTRY_TSS, &tss, DESC_TSS);
>>>>> +}
>>>>> +
>>>>> +/* Save essential context */
>>>>> +static void hv_hvcrash_ctxt_save(void)
>>>>> +{
>>>>> +	struct hv_crash_ctxt *ctxt = &hv_crash_ctxt;
>>>>> +
>>>>> +	asm volatile("movq %%rsp,%0" : "=m"(ctxt->rsp));
>>>>> +
>>>>> +	ctxt->cr0 = native_read_cr0();
>>>>> +	ctxt->cr4 = native_read_cr4();
>>>>> +
>>>>> +	asm volatile("movq %%cr2, %0" : "=a"(ctxt->cr2));
>>>>> +	asm volatile("movq %%cr8, %0" : "=a"(ctxt->cr8));
>>>>> +
>>>>> +	asm volatile("movl %%cs, %%eax" : "=a"(ctxt->cs));
>>>>> +	asm volatile("movl %%ss, %%eax" : "=a"(ctxt->ss));
>>>>> +	asm volatile("movl %%ds, %%eax" : "=a"(ctxt->ds));
>>>>> +	asm volatile("movl %%es, %%eax" : "=a"(ctxt->es));
>>>>> +	asm volatile("movl %%fs, %%eax" : "=a"(ctxt->fs));
>>>>> +	asm volatile("movl %%gs, %%eax" : "=a"(ctxt->gs));
>>>>> +
>>>>> +	native_store_gdt(&ctxt->gdtr);
>>>>> +	store_idt(&ctxt->idtr);
>>>>> +
>>>>> +	ctxt->gsbase = __rdmsr(MSR_GS_BASE);
>>>>> +	ctxt->efer = __rdmsr(MSR_EFER);
>>>>> +	ctxt->pat = __rdmsr(MSR_IA32_CR_PAT);
>>>>> +}
>>>>> +
>>>>> +/* Add trampoline page to the kernel pagetable for transition to kernel PT */
>>>>> +static void hv_crash_fixup_kernpt(void)
>>>>> +{
>>>>> +	pgd_t *pgd;
>>>>> +	p4d_t *p4d;
>>>>> +
>>>>> +	pgd = pgd_offset_k(trampoline_pa);
>>>>> +	p4d = p4d_offset(pgd, trampoline_pa);
>>>>> +
>>>>> +	/* trampoline_pa is below 4G, so no pre-existing entry to clobber */
>>>>> +	p4d_populate(&init_mm, p4d, (pud_t *)hv_crash_ptpgs[1]);
>>>>> +	p4d->p4d = p4d->p4d & ~(_PAGE_NX);    /* enable execute */
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Now that all cpus are in nmi and spinning, we notify the hyp that linux has
>>>>> + * crashed and will collect core. This will cause the hyp to quiesce and
>>>>> + * suspend all VPs except the bsp. Called if linux crashed and not the hyp.
>>>>> + */
>>>>> +static void hv_notify_prepare_hyp(void)
>>>>> +{
>>>>> +	u64 status;
>>>>> +	struct hv_input_notify_partition_event *input;
>>>>> +	struct hv_partition_event_root_crashdump_input *cda;
>>>>> +
>>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>>> +	cda = &input->input.crashdump_input;
>>>>
>>>> The code ordering here is a bit weird. I'd expect this line to be grouped
>>>> with cda->crashdump_action being set.
>>>
>>> we are setting two pointers, and using them later. setting pointers
>>> up front is pretty normal.
>>>
>>>>> +	memset(input, 0, sizeof(*input));
>>>>> +	input->event = HV_PARTITION_EVENT_ROOT_CRASHDUMP;
>>>>> +
>>>>> +	cda->crashdump_action = HV_CRASHDUMP_ENTRY;
>>>>> +	status = hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>>>>> +	if (!hv_result_success(status))
>>>>> +		return;
>>>>> +
>>>>> +	cda->crashdump_action = HV_CRASHDUMP_SUSPEND_ALL_VPS;
>>>>> +	hv_do_hypercall(HVCALL_NOTIFY_PARTITION_EVENT, input, NULL);
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Common function for all cpus before devirtualization.
>>>>> + *
>>>>> + * Hypervisor crash: all cpus get here in nmi context.
>>>>> + * Linux crash: the panicking cpu gets here at base level, all others in nmi
>>>>> + *		context. Note, the panicking cpu may not be the bsp.
>>>>> + *
>>>>> + * The function is not inlined so it will show on the stack. It is named so
>>>>> + * because the crash cmd looks for certain well known function names on the
>>>>> + * stack before looking into the cpu saved note in the elf section, and
>>>>> + * that work is currently incomplete.
>>>>> + *
>>>>> + * Notes:
>>>>> + *  Hypervisor crash:
>>>>> + *    - the hypervisor is in a very restrictive mode at this point and any
>>>>> + *	vmexit it cannot handle would result in reboot. For example, console
>>>>> + *	output from here would result in synic ipi hcall, which would result
>>>>> + *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
>>>>> + *
>>>>> + *  Devirtualization is supported from the bsp only.
>>>>> + */
>>>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
>>>>> +{
>>>>> +	struct hv_input_disable_hyp_ex *input;
>>>>> +	u64 status;
>>>>> +	int msecs = 1000, ccpu = smp_processor_id();
>>>>> +
>>>>> +	if (ccpu == 0) {
>>>>> +		/* crash_save_cpu() will be done in the kexec path */
>>>>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>>>>> +		atomic_inc(&crash_cpus_wait);
>>>>> +	} else {
>>>>> +		crash_save_cpu(regs, ccpu);
>>>>> +		cpu_emergency_stop_pt();	/* disable performance trace */
>>>>> +		atomic_inc(&crash_cpus_wait);
>>>>> +		for (;;);			/* cause no vmexits */
>>>>> +	}
>>>>> +
>>>>> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
>>>>> +		mdelay(1);
>>>>> +
>>>>> +	stop_nmi();
>>>>> +	if (!hv_has_crashed)
>>>>> +		hv_notify_prepare_hyp();
>>>>> +
>>>>> +	if (crashing_cpu == -1)
>>>>> +		crashing_cpu = ccpu;		/* crash cmd uses this */
>>>>
>>>> Could just be "crashing_cpu = 0" since only the BSP gets here.
>>>
>>> a code change request has been open for while to remove the requirement
>>> of bsp..
>>>
>>>>> +
>>>>> +	hv_hvcrash_ctxt_save();
>>>>> +	hv_mark_tss_not_busy();
>>>>> +	hv_crash_fixup_kernpt();
>>>>> +
>>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>>> +	memset(input, 0, sizeof(*input));
>>>>> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
>>>>> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
>>>>
>>>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
>>>> And just for clarification, Hyper-V treats this "arg" value as opaque and does
>>>> not access it. It only provides it in EDI when it invokes the trampoline
>>>> function, right?
>>>
>>> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables).
>>
>> Yes, the comment matches the name of the "devirt_cr3arg" variable.
>> Unfortunately my previous comment was incomplete because the value
>> stored in the static variable "devirt_cr3arg" isn't the address of an L4 page
>> table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the
>> PA of struct hv_crash_tramp_data. The CR3 value is stored in the
>> tramp32_cr3 field (at offset 0) of that structure, so there's an additional level
>> of indirection. The (corrected) comment in the header to hv_crash_asm32()
>> describes EDI as containing "PA of struct hv_crash_tramp_data", which
>> ought to match what is described here. I'd say that "devirt_cr3arg" ought
>> to be renamed to "tramp_data_pa" or something else parallel to
>> "trampoline_pa".
> 
> hyp needs trampoline cr3 for transition, we pass it as an arg. we piggy 
> back extra information for ourselves needed in trampoline.S. so it's 
> all good.

actually, what i said earlier was true, not above. that the arg is
opaque and hyp does not use it (we are transitioning paging off after
all!). i did this all almost two years ago, so had vague recollections
but finally had time today to go back to square one and old notes,
and remember things now. so final answer:

the hypercall calls it TrampolineCr3, i guess this is how windows uses it
(they have customized kernel code for core collection). doing that was
becoming too intrusive on linux, so i decided to use the arg to pass the
info i needed in the trampoline code. Since the hypercall calls the arg
TrampolineCr3, i must have just used that name for the arg to match it,
probably falsely assuming hypervisor somehow looked at it. (actually,
the windows hypercall wrapper does look at it to make sure it is a
ram address).

since the hypercall doesn't use the arg, it could just call it
devirtArg, but maybe in the past they used it somehow. in my latest
version, i just call it devirt_arg.


>>> right, comes in edi, i don't know what EDI is (just kidding!)...
>>>
>>>>> +
>>>>> +	status = hv_do_hypercall(HVCALL_DISABLE_HYP_EX, input, NULL);
>>>>> +
>>>>> +	/* Devirt failed, just reboot as things are in a very bad state now */
>>>>> +	native_wrmsrq(HV_X64_MSR_RESET, 1);    /* get hv to reboot */
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Generic nmi callback handler: could be called without any crash also.
>>>>> + *   hv crash: hypervisor injects nmi's into all cpus
>>>>> + *   lx crash: panicking cpu sends nmi to all but self via crash_stop_other_cpus
>>>>> + */
>>>>> +static int hv_crash_nmi_local(unsigned int cmd, struct pt_regs *regs)
>>>>> +{
>>>>> +	int ccpu = smp_processor_id();
>>>>> +
>>>>> +	if (!hv_has_crashed && hv_cda && hv_cda->cda_valid)
>>>>> +		hv_has_crashed = 1;
>>>>> +
>>>>> +	if (!hv_has_crashed && !lx_has_crashed)
>>>>> +		return NMI_DONE;	/* ignore the nmi */
>>>>> +
>>>>> +	if (hv_has_crashed) {
>>>>> +		if (!kexec_crash_loaded() || !hv_crash_enabled) {
>>>>> +			if (ccpu == 0) {
>>>>> +				native_wrmsrq(HV_X64_MSR_RESET, 1); /* reboot */
>>>>> +			} else
>>>>> +				for (;;);	/* cause no vmexits */
>>>>> +		}
>>>>> +	}
>>>>> +
>>>>> +	crash_nmi_callback(regs);
>>>>> +
>>>>> +	return NMI_DONE;
>>>>
>>>> crash_nmi_callback() should never return, right? Normally one would
>>>> expect to return NMI_HANDLED here, but I guess it doesn't matter
>>>> if the return is never executed.
>>>
>>> correct.
>>>
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * hv_crash_stop_other_cpus() == smp_ops.crash_stop_other_cpus
>>>>> + *
>>>>> + * On normal linux panic, this is called twice: first from panic and then again
>>>>> + * from native_machine_crash_shutdown.
>>>>> + *
>>>>> + * In case of mshv, 3 ways to get here:
>>>>> + *  1. hv crash (only bsp will get here):
>>>>> + *	BSP : nmi callback -> DisableHv -> hv_crash_asm32 -> hv_crash_c_entry
>>>>> + *		  -> __crash_kexec -> native_machine_crash_shutdown
>>>>> + *		  -> crash_smp_send_stop -> smp_ops.crash_stop_other_cpus
>>>>> + *  linux panic:
>>>>> + *	2. panic cpu x: panic() -> crash_smp_send_stop
>>>>> + *				     -> smp_ops.crash_stop_other_cpus
>>>>> + *	3. bsp: native_machine_crash_shutdown -> crash_smp_send_stop
>>>>> + *
>>>>> + * NB: noclone and non standard stack because of call to crash_setup_regs().
>>>>> + */
>>>>> +static void __noclone hv_crash_stop_other_cpus(void)
>>>>> +{
>>>>> +	static int crash_stop_done;
>>>>> +	struct pt_regs lregs;
>>>>> +	int ccpu = smp_processor_id();
>>>>> +
>>>>> +	if (hv_has_crashed)
>>>>> +		return;		/* all cpus already in nmi handler path */
>>>>> +
>>>>> +	if (!kexec_crash_loaded())
>>>>> +		return;
>>>>
>>>> If we're in a normal panic path (your Case #2 above) with no kdump kernel
>>>> loaded, why leave the other vCPUs running? Seems like that could violate
>>>> expectations in vpanic(), where it calls panic_other_cpus_shutdown() and
>>>> thereafter assumes other vCPUs are not running.
>>>
>>> no, there is lots of complexity here!
>>>
>>> if we hang vcpus here, hyp will note and may trigger its own watchdog.
>>> also, machine_crash_shutdown() does another ipi.
>>>
>>> I think the best thing to do here is go back to my V0 which did not
>>> have check for kexec_crash_loaded(), but had this in hv_crash_c_entry:
>>>
>>> +       /* we are now fully in devirtualized normal kernel mode */
>>> +       __crash_kexec(NULL);
>>> +
>>> +       BUG();
>>>
>>>
>>> this way hyp would be disabled, ie, system devirtualized, and
>>> __crash_kernel() will return, resulting in BUG() that will cause
>>> it to go thru panic and honor panic= parameter with either hang
>>> or reset. instead of bug, i could just call panic() also.
>>>
>>>>> +
>>>>> +	if (crash_stop_done)
>>>>> +		return;
>>>>> +	crash_stop_done = 1;
>>>>
>>>> Is crash_stop_done necessary?  hv_crash_stop_other_cpus() is called
>>>> from crash_smp_send_stop(), which has its own static variable
>>>> "cpus_stopped" that does the same thing.
>>>
>>> yes. for error paths.
>>>
>>>>> +
>>>>> +	/* linux has crashed: hv is healthy, we can ipi safely */
>>>>> +	lx_has_crashed = 1;
>>>>> +	wmb();			/* nmi handlers look at lx_has_crashed */
>>>>> +
>>>>> +	apic->send_IPI_allbutself(NMI_VECTOR);
>>>>
>>>> The default .crash_stop_other_cpus function is kdump_nmi_shootdown_cpus().
>>>> In addition to sending the NMI IPI, it does disable_local_APIC(). I don't know, but
>>>> should disable_local_APIC() be done somewhere here as well?
>>>
>>> no, hyp does that.
>>
>> As part of the devirt operation initiated by the HVCALL_DISABLE_HYP_EX
>> hypercall in crash_nmi_callback()? This gets back to an earlier question/comment
>> where I was trying to figure out if the APIC is still enabled, and in what mode,
>> when hv_crash_asm32() is invoked.
> 
>>>
>>>>> +
>>>>> +	if (crashing_cpu == -1)
>>>>> +		crashing_cpu = ccpu;		/* crash cmd uses this */
>>>>> +
>>>>> +	/* crash_setup_regs() happens in kexec also, but for the kexec cpu which
>>>>> +	 * is the bsp. We could be here on non-bsp cpu, collect regs if so.
>>>>> +	 */
>>>>> +	if (ccpu)
>>>>> +		crash_setup_regs(&lregs, NULL);
>>>>> +
>>>>> +	crash_nmi_callback(&lregs);
>>>>> +}
>>>>> +STACK_FRAME_NON_STANDARD(hv_crash_stop_other_cpus);
>>>>> +
>>>>> +/* This GDT is accessed in IA32-e compat mode which uses 32bits addresses */
>>>>> +struct hv_gdtreg_32 {
>>>>> +	u16 fill;
>>>>> +	u16 limit;
>>>>> +	u32 address;
>>>>> +} __packed;
>>>>> +
>>>>> +/* We need a CS with L bit to goto IA32-e long mode from 32bit compat mode */
>>>>> +struct hv_crash_tramp_gdt {
>>>>> +	u64 null;	/* index 0, selector 0, null selector */
>>>>> +	u64 cs64;	/* index 1, selector 8, cs64 selector */
>>>>> +} __packed;
>>>>> +
>>>>> +/* No stack, so jump via far ptr in memory to load the 64bit CS */
>>>>> +struct hv_cs_jmptgt {
>>>>> +	u32 address;
>>>>> +	u16 csval;
>>>>> +	u16 fill;
>>>>> +} __packed;
>>>>> +
>>>>> +/* This trampoline data is copied onto the trampoline page after the asm code */
>>>>> +struct hv_crash_tramp_data {
>>>>> +	u64 tramp32_cr3;
>>>>> +	u64 kernel_cr3;
>>>>> +	struct hv_gdtreg_32 gdtr32;
>>>>> +	struct hv_crash_tramp_gdt tramp_gdt;
>>>>> +	struct hv_cs_jmptgt cs_jmptgt;
>>>>> +	u64 c_entry_addr;
>>>>> +} __packed;
>>>>> +
>>>>> +/*
>>>>> + * Setup a temporary gdt to allow the asm code to switch to the long mode.
>>>>> + * Since the asm code is relocated/copied to a below 4G page, it cannot use rip
>>>>> + * relative addressing, hence we must use trampoline_pa here. Also, save other
>>>>> + * info like jmp and C entry targets for same reasons.
>>>>> + *
>>>>> + * Returns: 0 on success, -1 on error
>>>>> + */
>>>>> +static int hv_crash_setup_trampdata(u64 trampoline_va)
>>>>> +{
>>>>> +	int size, offs;
>>>>> +	void *dest;
>>>>> +	struct hv_crash_tramp_data *tramp;
>>>>> +
>>>>> +	/* These must match exactly the ones in the corresponding asm file */
>>>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, tramp32_cr3) != 0);
>>>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, kernel_cr3) != 8);
>>>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data, gdtr32.limit) != 18);
>>>>> +	BUILD_BUG_ON(offsetof(struct hv_crash_tramp_data,
>>>>> +						     cs_jmptgt.address) != 40);
>>>>
>>>> It would be nice to pick up the constants from a #include file that is
>>>> shared with the asm code in Patch 4 of the series.
>>>
>>> yeah, could go either way, some don't like tiny headers...  if there are
>>> no objections to new header for this, i could go that way too.
>>
>> Saw your follow-on comments about this as well. The tiny header
>> is ugly. It's a judgment call that can go either way, so go with your
>> preference.
>>
>>>
>>>>> +
>>>>> +	/* hv_crash_asm_end is beyond last byte by 1 */
>>>>> +	size = &hv_crash_asm_end - &hv_crash_asm32;
>>>>> +	if (size + sizeof(struct hv_crash_tramp_data) > PAGE_SIZE) {
>>>>> +		pr_err("%s: trampoline page overflow\n", __func__);
>>>>> +		return -1;
>>>>> +	}
>>>>> +
>>>>> +	dest = (void *)trampoline_va;
>>>>> +	memcpy(dest, &hv_crash_asm32, size);
>>>>> +
>>>>> +	dest += size;
>>>>> +	dest = (void *)round_up((ulong)dest, 16);
>>>>> +	tramp = (struct hv_crash_tramp_data *)dest;
>>>>> +
>>>>> +	/* see MAX_ASID_AVAILABLE in tlb.c: "PCID 0 is reserved for use by
>>>>> +	 * non-PCID-aware users". Build cr3 with pcid 0
>>>>> +	 */
>>>>> +	tramp->tramp32_cr3 = __sme_pa(hv_crash_ptpgs[0]);
>>>>> +
>>>>> +	/* Note, when restoring X86_CR4_PCIDE, cr3[11:0] must be zero */
>>>>> +	tramp->kernel_cr3 = __sme_pa(init_mm.pgd);
>>>>> +
>>>>> +	tramp->gdtr32.limit = sizeof(struct hv_crash_tramp_gdt);
>>>>> +	tramp->gdtr32.address = trampoline_pa +
>>>>> +				   (ulong)&tramp->tramp_gdt - trampoline_va;
>>>>> +
>>>>> +	 /* base:0 limit:0xfffff type:a dpl:0 P:1 L:1 D:0 avl:0 G:1 */
>>>>> +	tramp->tramp_gdt.cs64 = 0x00af9a000000ffff;
>>>>> +
>>>>> +	tramp->cs_jmptgt.csval = 0x8;
>>>>> +	offs = (ulong)&hv_crash_asm64_lbl - (ulong)&hv_crash_asm32;
>>>>> +	tramp->cs_jmptgt.address = trampoline_pa + offs;
>>>>> +
>>>>> +	tramp->c_entry_addr = (u64)&hv_crash_c_entry;
>>>>> +
>>>>> +	devirt_cr3arg = trampoline_pa + (ulong)dest - trampoline_va;
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Build 32bit trampoline page table for transition from protected mode
>>>>> + * non-paging to long-mode paging. This transition needs pagetables below 4G.
>>>>> + */
>>>>> +static void hv_crash_build_tramp_pt(void)
>>>>> +{
>>>>> +	p4d_t *p4d;
>>>>> +	pud_t *pud;
>>>>> +	pmd_t *pmd;
>>>>> +	pte_t *pte;
>>>>> +	u64 pa, addr = trampoline_pa;
>>>>> +
>>>>> +	p4d = hv_crash_ptpgs[0] + pgd_index(addr) * sizeof(p4d);
>>>>> +	pa = virt_to_phys(hv_crash_ptpgs[1]);
>>>>> +	set_p4d(p4d, __p4d(_PAGE_TABLE | pa));
>>>>> +	p4d->p4d &= ~(_PAGE_NX);	/* disable no execute */
>>>>> +
>>>>> +	pud = hv_crash_ptpgs[1] + pud_index(addr) * sizeof(pud);
>>>>> +	pa = virt_to_phys(hv_crash_ptpgs[2]);
>>>>> +	set_pud(pud, __pud(_PAGE_TABLE | pa));
>>>>> +
>>>>> +	pmd = hv_crash_ptpgs[2] + pmd_index(addr) * sizeof(pmd);
>>>>> +	pa = virt_to_phys(hv_crash_ptpgs[3]);
>>>>> +	set_pmd(pmd, __pmd(_PAGE_TABLE | pa));
>>>>> +
>>>>> +	pte = hv_crash_ptpgs[3] + pte_index(addr) * sizeof(pte);
>>>>> +	set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC));
>>>>> +}
>>>>> +
>>>>> +/*
>>>>> + * Setup trampoline for devirtualization:
>>>>> + *  - a page below 4G, ie 32bit addr containing asm glue code that mshv jmps to
>>>>> + *    in protected mode.
>>>>> + *  - 4 pages for a temporary page table that asm code uses to turn paging on
>>>>> + *  - a temporary gdt to use in the compat mode.
>>>>> + *
>>>>> + *  Returns: 0 on success
>>>>> + */
>>>>> +static int hv_crash_trampoline_setup(void)
>>>>> +{
>>>>> +	int i, rc, order;
>>>>> +	struct page *page;
>>>>> +	u64 trampoline_va;
>>>>> +	gfp_t flags32 = GFP_KERNEL | GFP_DMA32 | __GFP_ZERO;
>>>>> +
>>>>> +	/* page for 32bit trampoline assembly code + hv_crash_tramp_data */
>>>>> +	page = alloc_page(flags32);
>>>>> +	if (page == NULL) {
>>>>> +		pr_err("%s: failed to alloc asm stub page\n", __func__);
>>>>> +		return -1;
>>>>> +	}
>>>>> +
>>>>> +	trampoline_va = (u64)page_to_virt(page);
>>>>> +	trampoline_pa = (u32)page_to_phys(page);
>>>>> +
>>>>> +	order = 2;	   /* alloc 2^2 pages */
>>>>> +	page = alloc_pages(flags32, order);
>>>>> +	if (page == NULL) {
>>>>> +		pr_err("%s: failed to alloc pt pages\n", __func__);
>>>>> +		free_page(trampoline_va);
>>>>> +		return -1;
>>>>> +	}
>>>>> +
>>>>> +	for (i = 0; i < 4; i++, page++)
>>>>> +		hv_crash_ptpgs[i] = page_to_virt(page);
>>>>> +
>>>>> +	hv_crash_build_tramp_pt();
>>>>> +
>>>>> +	rc = hv_crash_setup_trampdata(trampoline_va);
>>>>> +	if (rc)
>>>>> +		goto errout;
>>>>> +
>>>>> +	return 0;
>>>>> +
>>>>> +errout:
>>>>> +	free_page(trampoline_va);
>>>>> +	free_pages((ulong)hv_crash_ptpgs[0], order);
>>>>> +
>>>>> +	return rc;
>>>>> +}
>>>>> +
>>>>> +/* Setup for kdump kexec to collect hypervisor ram when running as mshv root */
>>>>> +void hv_root_crash_init(void)
>>>>> +{
>>>>> +	int rc;
>>>>> +	struct hv_input_get_system_property *input;
>>>>> +	struct hv_output_get_system_property *output;
>>>>> +	unsigned long flags;
>>>>> +	u64 status;
>>>>> +	union hv_pfn_range cda_info;
>>>>> +
>>>>> +	if (pgtable_l5_enabled()) {
>>>>> +		pr_err("Hyper-V: crash dump not yet supported on 5level PTs\n");
>>>>> +		return;
>>>>> +	}
>>>>> +
>>>>> +	rc = register_nmi_handler(NMI_LOCAL, hv_crash_nmi_local, NMI_FLAG_FIRST,
>>>>> +				  "hv_crash_nmi");
>>>>> +	if (rc) {
>>>>> +		pr_err("Hyper-V: failed to register crash nmi handler\n");
>>>>> +		return;
>>>>> +	}
>>>>> +
>>>>> +	local_irq_save(flags);
>>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>>>>> +	output = *this_cpu_ptr(hyperv_pcpu_output_arg);
>>>>> +
>>>>> +	memset(input, 0, sizeof(*input));
>>>>> +	memset(output, 0, sizeof(*output));
>>>>
>>>> Why zero the output area? This is one of those hypercall things that we're
>>>> inconsistent about. A few hypercall call sites zero the output area, and it's
>>>> not clear why they do. Hyper-V should be responsible for properly filling in
>>>> the output area. Linux should not need to do this zero'ing, unless there's some
>>>> known bug in Hyper-V for certain hypercalls, in which case there should be
>>>> a code comment stating "why".
>>>
>>> for the same reason sometimes you see char *p = NULL, either leftover
>>> code or someone was debugging or just copy and paste. this is just copy
>>> paste. i agree in general that we don't need to clear it at all, in fact,
>>> i'd like to remove them all! but i also understand people with different
>>> skills and junior members find it easier to debug, and also we were in
>>> early product development. for that reason, it doesn't have to be
>>> consistent either, if some complex hypercalls are failing repeatedly,
>>> just for ease of debug, one might leave it there temporarily.  but
>>> now that things are stable, i think we should just remove them all and
>>> get used to a bit more inconvenient debugging...
>>
>> I see your point about debugging, but on balance I agree that they
>> should all be removed. If there's some debug case, add it back
>> temporarily to debug, but leave upstream without it. The zero'ing is
>> also unnecessary code in the interrupt disabled window, which you
>> have expressed concern about in a different thread.
> 
> yeah, i've been extremely busy so not able to pay much attention to
> upstreaming, but imo they should have been removed before upstreaming.
> a simple patch that just removes memset of output would be welcome.
> 
>>>
>>>>> +	input->property_id = HV_SYSTEM_PROPERTY_CRASHDUMPAREA;
>>>>> +
>>>>> +	status = hv_do_hypercall(HVCALL_GET_SYSTEM_PROPERTY, input, output);
>>>>> +	cda_info.as_uint64 = output->hv_cda_info.as_uint64;
>>>>> +	local_irq_restore(flags);
>>>>> +
>>>>> +	if (!hv_result_success(status)) {
>>>>> +		pr_err("Hyper-V: %s: property:%d %s\n", __func__,
>>>>> +		       input->property_id, hv_result_to_string(status));
>>>>> +		goto err_out;
>>>>> +	}
>>>>> +
>>>>> +	if (cda_info.base_pfn == 0) {
>>>>> +		pr_err("Hyper-V: hypervisor crash dump area pfn is 0\n");
>>>>> +		goto err_out;
>>>>> +	}
>>>>> +
>>>>> +	hv_cda = phys_to_virt(cda_info.base_pfn << PAGE_SHIFT);
>>>>
>>>> Use HV_HYP_PAGE_SHIFT, since PFNs provided by Hyper-V are always in
>>>> terms of the Hyper-V page size, which isn't necessarily the guest page size.
>>>> Yes, on x86 there's no difference, but for future robustness ....
>>>
>>> i don't know about guests, but we won't even boot if dom0 pg size
>>> didn't match.. but easier to change than to make the case..
>>
>> FWIW, a normal Linux guest on ARM64 works just fine with a page
>> size of 16K or 64K, even though the underlying Hyper-V page size
>> is only 4K. That's why we have HV_HYP_PAGE_SHIFT and related in
>> the first place. Using it properly really matters for normal guests.
>> (Having the guest page size smaller than the Hyper-V page size
>> does *not* work, but there are no such use cases.)
>>
>> Even on ARM64, I know the root partition page size is required to
>> match the Hyper-V page size. But using HV_HYP_PAGE_SIZE is
>> still appropriate just to not leave code that will go wrong if the
>> match requirement should ever change.
>>
>>>
>>>>> +
>>>>> +	rc = hv_crash_trampoline_setup();
>>>>> +	if (rc)
>>>>> +		goto err_out;
>>>>> +
>>>>> +	smp_ops.crash_stop_other_cpus = hv_crash_stop_other_cpus;
>>>>> +
>>>>> +	crash_kexec_post_notifiers = true;
>>>>> +	hv_crash_enabled = 1;
>>>>> +	pr_info("Hyper-V: linux and hv kdump support enabled\n");
>>>>
>>>> This message and the message below aren't consistent. One refers
>>>> to "hv kdump" and the other to "hyp kdump".
>>>
>>>>> +
>>>>> +	return;
>>>>> +
>>>>> +err_out:
>>>>> +	unregister_nmi_handler(NMI_LOCAL, "hv_crash_nmi");
>>>>> +	pr_err("Hyper-V: only linux (but not hyp) kdump support enabled\n");
>>>>> +}
>>>>> --
>>>>> 2.36.1.vfs.0.0

^ permalink raw reply	[flat|nested] 29+ messages in thread

* RE: [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore
  2025-09-20  1:42           ` Mukesh R
@ 2025-09-23  1:35             ` Michael Kelley
  0 siblings, 0 replies; 29+ messages in thread
From: Michael Kelley @ 2025-09-23  1:35 UTC (permalink / raw)
  To: Mukesh R, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-arch@vger.kernel.org
  Cc: kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
	decui@microsoft.com, tglx@linutronix.de, mingo@redhat.com,
	bp@alien8.de, dave.hansen@linux.intel.com, x86@kernel.org,
	hpa@zytor.com, arnd@arndb.de

From: Mukesh R <mrathor@linux.microsoft.com> Sent: Friday, September 19, 2025 6:43 PM
> 
> On 9/18/25 19:32, Mukesh R wrote:
> > On 9/18/25 16:53, Michael Kelley wrote:
> >> From: Mukesh R <mrathor@linux.microsoft.com> Sent: Tuesday, September 16, 2025 6:13 PM
> >>>
> >>> On 9/15/25 10:55, Michael Kelley wrote:

[snip]

> >>>>> +/*
> >>>>> + * Common function for all cpus before devirtualization.
> >>>>> + *
> >>>>> + * Hypervisor crash: all cpus get here in nmi context.
> >>>>> + * Linux crash: the panicing cpu gets here at base level, all others in nmi
> >>>>> + *		context. Note, panicing cpu may not be the bsp.
> >>>>> + *
> >>>>> + * The function is not inlined so it will show on the stack. It is named so
> >>>>> + * because the crash cmd looks for certain well known function names on the
> >>>>> + * stack before looking into the cpu saved note in the elf section, and
> >>>>> + * that work is currently incomplete.
> >>>>> + *
> >>>>> + * Notes:
> >>>>> + *  Hypervisor crash:
> >>>>> + *    - the hypervisor is in a very restrictive mode at this point and any
> >>>>> + *	vmexit it cannot handle would result in reboot. For example, console
> >>>>> + *	output from here would result in synic ipi hcall, which would result
> >>>>> + *	in reboot. So, no mumbo jumbo, just get to kexec as quickly as possible.
> >>>>> + *
> >>>>> + *  Devirtualization is supported from the bsp only.
> >>>>> + */
> >>>>> +static noinline __noclone void crash_nmi_callback(struct pt_regs *regs)
> >>>>> +{
> >>>>> +	struct hv_input_disable_hyp_ex *input;
> >>>>> +	u64 status;
> >>>>> +	int msecs = 1000, ccpu = smp_processor_id();
> >>>>> +
> >>>>> +	if (ccpu == 0) {
> >>>>> +		/* crash_save_cpu() will be done in the kexec path */
> >>>>> +		cpu_emergency_stop_pt();	/* disable performance trace */
> >>>>> +		atomic_inc(&crash_cpus_wait);
> >>>>> +	} else {
> >>>>> +		crash_save_cpu(regs, ccpu);
> >>>>> +		cpu_emergency_stop_pt();	/* disable performance trace */
> >>>>> +		atomic_inc(&crash_cpus_wait);
> >>>>> +		for (;;);			/* cause no vmexits */
> >>>>> +	}
> >>>>> +
> >>>>> +	while (atomic_read(&crash_cpus_wait) < num_online_cpus() && msecs--)
> >>>>> +		mdelay(1);
> >>>>> +
> >>>>> +	stop_nmi();
> >>>>> +	if (!hv_has_crashed)
> >>>>> +		hv_notify_prepare_hyp();
> >>>>> +
> >>>>> +	if (crashing_cpu == -1)
> >>>>> +		crashing_cpu = ccpu;		/* crash cmd uses this */
> >>>>
> >>>> Could just be "crashing_cpu = 0" since only the BSP gets here.
> >>>
> >>> a code change request has been open for while to remove the requirement
> >>> of bsp..
> >>>
> >>>>> +
> >>>>> +	hv_hvcrash_ctxt_save();
> >>>>> +	hv_mark_tss_not_busy();
> >>>>> +	hv_crash_fixup_kernpt();
> >>>>> +
> >>>>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> >>>>> +	memset(input, 0, sizeof(*input));
> >>>>> +	input->rip = trampoline_pa;	/* PA of hv_crash_asm32 */
> >>>>> +	input->arg = devirt_cr3arg;	/* PA of trampoline page table L4 */
> >>>>
> >>>> Is this comment correct? Isn't it the PA of struct hv_crash_tramp_data?
> >>>> And just for clarification, Hyper-V treats this "arg" value as opaque and does
> >>>> not access it. It only provides it in EDI when it invokes the trampoline
> >>>> function, right?
> >>>
> >>> comment is correct. cr3 always points to l4 (or l5 if 5 level page tables).
> >>
> >> Yes, the comment matches the name of the "devirt_cr3arg" variable.
> >> Unfortunately my previous comment was incomplete because the value
> >> stored in the static variable "devirt_cr3arg" isn't the address of an L4 page
> >> table. It's not a CR3 value. The value stored in devirt_cr3arg is actually the
> >> PA of struct hv_crash_tramp_data. The CR3 value is stored in the
> >> tramp32_cr3 field (at offset 0) of that structure, so there's an additional level
> >> of indirection. The (corrected) comment in the header to hv_crash_asm32()
> >> describes EDI as containing "PA of struct hv_crash_tramp_data", which
> >> ought to match what is described here. I'd say that "devirt_cr3arg" ought
> >> to be renamed to "tramp_data_pa" or something else parallel to
> >> "trampoline_pa".
> >
> > hyp needs the trampoline cr3 for the transition, so we pass it as an arg.
> > we piggyback extra information for ourselves needed in trampoline.S. so
> > it's all good.
> 
> actually, what i said earlier was true, not the above: that the arg is
> opaque and hyp does not use it (we are transitioning paging off after
> all!). i did this all almost two years ago, so had vague recollections
> but finally had time today to go back to square one and old notes,
> and remember things now. so final answer:
> 
> the hypercall calls it TrampolineCr3, i guess this is how windows uses it
> (they have customized kernel code for core collection). doing that was
> becoming too intrusive on linux, so i decided to use the arg to pass the
> info i needed in the trampoline code. Since the hypercall calls the arg
> TrampolineCr3, i must have just used that name for the arg to match it,
> probably falsely assuming hypervisor somehow looked at it. (actually,
> the windows hypercall wrapper does look at it to make sure it is a
> ram address).
> 
> since the hypercall doesn't use the arg, it could just call it
> devirtArg, but maybe in the past they used it somehow. in my latest
> version, i just call it devirt_arg.

OK.  Good to get this all straightened out. Please leave a code
comment to the effect that the hypercall doesn't use the arg, and
that the value is provided solely to be passed to hv_crash_asm32()
for it to use. That means that struct hv_crash_tramp_data is owned
by Linux and can be changed/updated as needed.

The assignment statement to the hypercall input could look like:

input->arg = devirt_arg;	/* PA of struct hv_crash_tramp_data */

which would align with the comment in the header of hv_crash_asm32().

Michael


Thread overview: 29+ messages
2025-09-10  0:10 [PATCH v1 0/6] Hyper-V: Implement hypervisor core collection Mukesh Rathor
2025-09-10  0:10 ` [PATCH v1 1/6] x86/hyperv: Rename guest crash shutdown function Mukesh Rathor
2025-09-10  0:10 ` [PATCH v1 2/6] hyperv: Add two new hypercall numbers to guest ABI public header Mukesh Rathor
2025-09-10  0:10 ` [PATCH v1 3/6] hyperv: Add definitions for hypervisor crash dump support Mukesh Rathor
2025-09-15 17:54   ` Michael Kelley
2025-09-16  1:15     ` Mukesh R
2025-09-18 23:52       ` Michael Kelley
2025-09-10  0:10 ` [PATCH v1 4/6] x86/hyperv: Add trampoline asm code to transition from hypervisor Mukesh Rathor
2025-09-15 17:55   ` Michael Kelley
2025-09-16 21:30     ` Mukesh R
2025-09-18 23:52       ` Michael Kelley
2025-09-19  9:06         ` Borislav Petkov
2025-09-19 19:09           ` Mukesh R
2025-09-10  0:10 ` [PATCH v1 5/6] x86/hyperv: Implement hypervisor ram collection into vmcore Mukesh Rathor
2025-09-15 17:55   ` Michael Kelley
2025-09-17  1:13     ` Mukesh R
2025-09-17 20:37       ` Mukesh R
2025-09-18 23:53       ` Michael Kelley
2025-09-19  2:32         ` Mukesh R
2025-09-19 19:48           ` Michael Kelley
2025-09-20  1:42           ` Mukesh R
2025-09-23  1:35             ` Michael Kelley
2025-09-18 17:11   ` Stanislav Kinsburskii
2025-09-10  0:10 ` [PATCH v1 6/6] x86/hyperv: Enable build of hypervisor crashdump collection files Mukesh Rathor
2025-09-13  4:53   ` kernel test robot
2025-09-13  5:57   ` kernel test robot
2025-09-15 17:56   ` Michael Kelley
2025-09-17  1:15     ` Mukesh R
2025-09-18 23:53       ` Michael Kelley
