* [PATCH v3] x86/hyperv: Fix kdump on Azure CVMs
From: Vitaly Kuznetsov @ 2025-08-21 15:16 UTC
  To: linux-hyperv, Michael Kelley
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, x86,
	linux-kernel, Nuno Das Neves, Tianyu Lan, Li Tian, Philipp Rudo

Azure CVM instance types featuring a paravisor hang upon kdump. The
investigation shows that makedumpfile causes a hang when it steps on a page
which was previously shared with the host
(HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY). The new kernel has no
knowledge of these 'special' regions (which are Vmbus connection pages,
GPADL buffers, ...). There are several ways to approach the issue:
- Convey the knowledge about these regions to the new kernel somehow.
- Unshare these regions before accessing them in the new kernel (it is
unclear if there's a way to query the status for a given GPA range).
- Unshare these regions before jumping to the new kernel (which this patch
implements).

To make the procedure as robust as possible, store PFN ranges of shared
regions in a linked list instead of storing GVAs and re-using
hv_vtom_set_host_visibility(). This also avoids memory allocation on the
kdump/kexec path.

The patch skips implementing the weird corner case in hv_list_enc_remove()
where a PFN in the middle of a region is unshared. First, it is unlikely
that such requests happen. Second, it is not a big problem if
hv_list_enc_remove() doesn't actually remove some regions as this will
only result in an extra hypercall doing nothing upon kexec/kdump; there's
no need to be perfect.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
Changes since v2 [Michael Kelley]:
 - Rebase to hyperv-next.
 - Move the hv_ivm_clear_host_access() call to hyperv_cleanup(). This also
   makes the ARM stub unneeded.
 - Implement the missing corner case in hv_list_enc_remove(). With this,
   the math should (hopefully!) always be correct so we don't rely on
   the idempotency of the HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY
   hypercall. As the case is not something we see in practice, I tested
   the code with a few synthetic tests.
 - Fix the math in hv_list_enc_remove() (count -> ent->count).
 - Typos.
---
 arch/x86/hyperv/hv_init.c       |   3 +
 arch/x86/hyperv/ivm.c           | 213 ++++++++++++++++++++++++++++++--
 arch/x86/include/asm/mshyperv.h |   2 +
 3 files changed, 210 insertions(+), 8 deletions(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 2979d15223cf..4bb1578237eb 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -596,6 +596,9 @@ void hyperv_cleanup(void)
 	union hv_x64_msr_hypercall_contents hypercall_msr;
 	union hv_reference_tsc_msr tsc_msr;
 
+	/* Retract host access to shared memory in case of isolation */
+	hv_ivm_clear_host_access();
+
 	/* Reset our OS id */
 	wrmsrq(HV_X64_MSR_GUEST_OS_ID, 0);
 	hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index 3084ae8a3eed..0d74156ad6a7 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -462,6 +462,188 @@ void hv_ivm_msr_read(u64 msr, u64 *value)
 		hv_ghcb_msr_read(msr, value);
 }
 
+/*
+ * Keep track of the PFN regions which were shared with the host. The access
+ * must be revoked upon kexec/kdump (see hv_ivm_clear_host_access()).
+ */
+struct hv_enc_pfn_region {
+	struct list_head list;
+	u64 pfn;
+	int count;
+};
+
+static LIST_HEAD(hv_list_enc);
+static DEFINE_RAW_SPINLOCK(hv_list_enc_lock);
+
+static int hv_list_enc_add(const u64 *pfn_list, int count)
+{
+	struct hv_enc_pfn_region *ent;
+	unsigned long flags;
+	u64 pfn;
+	int i;
+
+	for (i = 0; i < count; i++) {
+		pfn = pfn_list[i];
+
+		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
+		/* Check if the PFN already exists in some region first */
+		list_for_each_entry(ent, &hv_list_enc, list) {
+			if ((ent->pfn <= pfn) && (ent->pfn + ent->count - 1 >= pfn))
+				/* Nothing to do - pfn is already in the list */
+				goto unlock_done;
+		}
+
+		/*
+		 * Check if the PFN is adjacent to an existing region. Growing
+		 * a region can make it adjacent to another one but merging is
+		 * not (yet) implemented for simplicity. A PFN cannot be added
+		 * to two regions to keep the logic in hv_list_enc_remove()
+		 * correct.
+		 */
+		list_for_each_entry(ent, &hv_list_enc, list) {
+			if (ent->pfn + ent->count == pfn) {
+				/* Grow existing region up */
+				ent->count++;
+				goto unlock_done;
+			} else if (pfn + 1 == ent->pfn) {
+				/* Grow existing region down */
+				ent->pfn--;
+				ent->count++;
+				goto unlock_done;
+			}
+		}
+		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+
+		/* No adjacent region found -- create a new one */
+		ent = kzalloc(sizeof(struct hv_enc_pfn_region), GFP_KERNEL);
+		if (!ent)
+			return -ENOMEM;
+
+		ent->pfn = pfn;
+		ent->count = 1;
+
+		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
+		list_add(&ent->list, &hv_list_enc);
+
+unlock_done:
+		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+	}
+
+	return 0;
+}
+
+static void hv_list_enc_remove(const u64 *pfn_list, int count)
+{
+	struct hv_enc_pfn_region *ent, *t;
+	struct hv_enc_pfn_region new_region;
+	unsigned long flags;
+	u64 pfn;
+	int i;
+
+	for (i = 0; i < count; i++) {
+		pfn = pfn_list[i];
+
+		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
+		list_for_each_entry_safe(ent, t, &hv_list_enc, list) {
+			if (pfn == ent->pfn + ent->count - 1) {
+				/* Removing tail pfn */
+				ent->count--;
+				if (!ent->count) {
+					list_del(&ent->list);
+					kfree(ent);
+				}
+				goto unlock_done;
+			} else if (pfn == ent->pfn) {
+				/* Removing head pfn */
+				ent->count--;
+				ent->pfn++;
+				if (!ent->count) {
+					list_del(&ent->list);
+					kfree(ent);
+				}
+				goto unlock_done;
+			} else if (pfn > ent->pfn && pfn < ent->pfn + ent->count - 1) {
+				/*
+				 * Removing a pfn in the middle. Cut off the tail
+				 * of the existing region and create a template for
+				 * the new one.
+				 */
+				new_region.pfn = pfn + 1;
+				new_region.count = ent->count - (pfn - ent->pfn + 1);
+				ent->count = pfn - ent->pfn;
+				goto unlock_split;
+			}
+
+		}
+unlock_done:
+		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+		continue;
+
+unlock_split:
+		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+
+		ent = kzalloc(sizeof(struct hv_enc_pfn_region), GFP_KERNEL);
+		/*
+		 * There is no apparent good way to recover from an -ENOMEM
+		 * situation; the accounting is going to be wrong either way.
+		 * Proceed with the rest of the list to make it 'less wrong'.
+		 */
+		if (WARN_ON_ONCE(!ent))
+			continue;
+
+		ent->pfn = new_region.pfn;
+		ent->count = new_region.count;
+
+		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
+		list_add(&ent->list, &hv_list_enc);
+		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+	}
+}
+
+void hv_ivm_clear_host_access(void)
+{
+	struct hv_gpa_range_for_visibility *input;
+	struct hv_enc_pfn_region *ent;
+	unsigned long flags;
+	u64 hv_status;
+	int batch_size, cur, i;
+
+	if (!hv_is_isolation_supported())
+		return;
+
+	raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
+
+	batch_size = MIN(hv_setup_in_array(&input, sizeof(*input),
+					   sizeof(input->gpa_page_list[0])),
+			 HV_MAX_MODIFY_GPA_REP_COUNT);
+	if (unlikely(!input))
+		goto unlock;
+
+	list_for_each_entry(ent, &hv_list_enc, list) {
+		for (i = 0, cur = 0; i < ent->count; i++) {
+			input->gpa_page_list[cur] = ent->pfn + i;
+			cur++;
+
+			if (cur == batch_size || i == ent->count - 1) {
+				input->partition_id = HV_PARTITION_ID_SELF;
+				input->host_visibility = VMBUS_PAGE_NOT_VISIBLE;
+				input->reserved0 = 0;
+				input->reserved1 = 0;
+				hv_status = hv_do_rep_hypercall(
+					HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY,
+					cur, 0, input, NULL);
+				WARN_ON_ONCE(!hv_result_success(hv_status));
+				cur = 0;
+			}
+		}
+
+	}
+
+unlock:
+	raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(hv_ivm_clear_host_access);
+
 /*
  * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
  *
@@ -476,24 +658,33 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
 	u64 hv_status;
 	int batch_size;
 	unsigned long flags;
+	int ret;
 
 	/* no-op if partition isolation is not enabled */
 	if (!hv_is_isolation_supported())
 		return 0;
 
+	if (visibility == VMBUS_PAGE_NOT_VISIBLE) {
+		hv_list_enc_remove(pfn, count);
+	} else {
+		ret = hv_list_enc_add(pfn, count);
+		if (ret)
+			return ret;
+	}
+
 	local_irq_save(flags);
 	batch_size = hv_setup_in_array(&input, sizeof(*input),
 					sizeof(input->gpa_page_list[0]));
 	if (unlikely(!input)) {
-		local_irq_restore(flags);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto unlock;
 	}
 
 	if (count > batch_size) {
 		pr_err("Hyper-V: GPA count:%d exceeds supported:%u\n", count,
 		       batch_size);
-		local_irq_restore(flags);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto unlock;
 	}
 
 	input->partition_id = HV_PARTITION_ID_SELF;
@@ -502,12 +693,18 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
 	hv_status = hv_do_rep_hypercall(
 			HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY, count,
 			0, input, NULL);
-	local_irq_restore(flags);
-
 	if (hv_result_success(hv_status))
-		return 0;
+		ret = 0;
 	else
-		return -EFAULT;
+		ret = -EFAULT;
+
+unlock:
+	local_irq_restore(flags);
+
+	if (ret)
+		hv_list_enc_remove(pfn, count);
+
+	return ret;
 }
 
 /*
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index abc4659f5809..6a988001e46f 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -263,10 +263,12 @@ static inline int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip,
 void hv_vtom_init(void);
 void hv_ivm_msr_write(u64 msr, u64 value);
 void hv_ivm_msr_read(u64 msr, u64 *value);
+void hv_ivm_clear_host_access(void);
 #else
 static inline void hv_vtom_init(void) {}
 static inline void hv_ivm_msr_write(u64 msr, u64 value) {}
 static inline void hv_ivm_msr_read(u64 msr, u64 *value) {}
+static inline void hv_ivm_clear_host_access(void) {}
 #endif
 
 static inline bool hv_is_synic_msr(unsigned int reg)
-- 
2.50.1



* RE: [PATCH v3] x86/hyperv: Fix kdump on Azure CVMs
From: Michael Kelley @ 2025-08-27  0:11 UTC
  To: Vitaly Kuznetsov, linux-hyperv@vger.kernel.org
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	x86@kernel.org, linux-kernel@vger.kernel.org, Nuno Das Neves,
	Tianyu Lan, Li Tian, Philipp Rudo

From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Thursday, August 21, 2025 8:17 AM
> 
> Azure CVM instance types featuring a paravisor hang upon kdump. The
> investigation shows that makedumpfile causes a hang when it steps on a page
> which was previously shared with the host
> (HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY). The new kernel has no
> knowledge of these 'special' regions (which are Vmbus connection pages,
> GPADL buffers, ...). There are several ways to approach the issue:
> - Convey the knowledge about these regions to the new kernel somehow.
> - Unshare these regions before accessing them in the new kernel (it is
> unclear if there's a way to query the status for a given GPA range).
> - Unshare these regions before jumping to the new kernel (which this patch
> implements).
> 
> To make the procedure as robust as possible, store PFN ranges of shared
> regions in a linked list instead of storing GVAs and re-using
> hv_vtom_set_host_visibility(). This also avoids memory allocation on the
> kdump/kexec path.
> 
> The patch skips implementing the weird corner case in hv_list_enc_remove()
> where a PFN in the middle of a region is unshared. First, it is unlikely
> that such requests happen. Second, it is not a big problem if
> hv_list_enc_remove() doesn't actually remove some regions as this will
> only result in an extra hypercall doing nothing upon kexec/kdump; there's
> no need to be perfect.

This last paragraph is left over from the previous version. It's no
longer correct and should be removed.

> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
> Changes since v2 [Michael Kelley]:
>  - Rebase to hyperv-next.
>  - Move the hv_ivm_clear_host_access() call to hyperv_cleanup(). This also
>    makes the ARM stub unneeded.
>  - Implement the missing corner case in hv_list_enc_remove(). With this,
>    the math should (hopefully!) always be correct so we don't rely on
>    the idempotency of the HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY
>    hypercall. As the case is not something we see in practice, I tested
>    the code with a few synthetic tests.
>  - Fix the math in hv_list_enc_remove() (count -> ent->count).
>  - Typos.
> ---
>  arch/x86/hyperv/hv_init.c       |   3 +
>  arch/x86/hyperv/ivm.c           | 213 ++++++++++++++++++++++++++++++--
>  arch/x86/include/asm/mshyperv.h |   2 +
>  3 files changed, 210 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
> index 2979d15223cf..4bb1578237eb 100644
> --- a/arch/x86/hyperv/hv_init.c
> +++ b/arch/x86/hyperv/hv_init.c
> @@ -596,6 +596,9 @@ void hyperv_cleanup(void)
>  	union hv_x64_msr_hypercall_contents hypercall_msr;
>  	union hv_reference_tsc_msr tsc_msr;
> 
> +	/* Retract host access to shared memory in case of isolation */
> +	hv_ivm_clear_host_access();
> +
>  	/* Reset our OS id */
>  	wrmsrq(HV_X64_MSR_GUEST_OS_ID, 0);
>  	hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index 3084ae8a3eed..0d74156ad6a7 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -462,6 +462,188 @@ void hv_ivm_msr_read(u64 msr, u64 *value)
>  		hv_ghcb_msr_read(msr, value);
>  }
> 
> +/*
> + * Keep track of the PFN regions which were shared with the host. The access
> + * must be revoked upon kexec/kdump (see hv_ivm_clear_host_access()).
> + */
> +struct hv_enc_pfn_region {
> +	struct list_head list;
> +	u64 pfn;
> +	int count;
> +};
> +
> +static LIST_HEAD(hv_list_enc);
> +static DEFINE_RAW_SPINLOCK(hv_list_enc_lock);
> +
> +static int hv_list_enc_add(const u64 *pfn_list, int count)
> +{
> +	struct hv_enc_pfn_region *ent;
> +	unsigned long flags;
> +	u64 pfn;
> +	int i;
> +
> +	for (i = 0; i < count; i++) {
> +		pfn = pfn_list[i];
> +
> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
> +		/* Check if the PFN already exists in some region first */
> +		list_for_each_entry(ent, &hv_list_enc, list) {
> +			if ((ent->pfn <= pfn) && (ent->pfn + ent->count - 1 >= pfn))
> +				/* Nothing to do - pfn is already in the list */
> +				goto unlock_done;
> +		}
> +
> +		/*
> +		 * Check if the PFN is adjacent to an existing region. Growing
> +		 * a region can make it adjacent to another one but merging is
> +		 * not (yet) implemented for simplicity. A PFN cannot be added
> +		 * to two regions to keep the logic in hv_list_enc_remove()
> +		 * correct.
> +		 */
> +		list_for_each_entry(ent, &hv_list_enc, list) {
> +			if (ent->pfn + ent->count == pfn) {
> +				/* Grow existing region up */
> +				ent->count++;
> +				goto unlock_done;
> +			} else if (pfn + 1 == ent->pfn) {
> +				/* Grow existing region down */
> +				ent->pfn--;
> +				ent->count++;
> +				goto unlock_done;
> +			}
> +		}
> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +
> +		/* No adjacent region found -- create a new one */
> +		ent = kzalloc(sizeof(struct hv_enc_pfn_region), GFP_KERNEL);
> +		if (!ent)
> +			return -ENOMEM;
> +
> +		ent->pfn = pfn;
> +		ent->count = 1;
> +
> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
> +		list_add(&ent->list, &hv_list_enc);
> +
> +unlock_done:
> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +	}
> +
> +	return 0;
> +}
> +
> +static void hv_list_enc_remove(const u64 *pfn_list, int count)
> +{
> +	struct hv_enc_pfn_region *ent, *t;
> +	struct hv_enc_pfn_region new_region;
> +	unsigned long flags;
> +	u64 pfn;
> +	int i;
> +
> +	for (i = 0; i < count; i++) {
> +		pfn = pfn_list[i];
> +
> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
> +		list_for_each_entry_safe(ent, t, &hv_list_enc, list) {
> +			if (pfn == ent->pfn + ent->count - 1) {
> +				/* Removing tail pfn */
> +				ent->count--;
> +				if (!ent->count) {
> +					list_del(&ent->list);
> +					kfree(ent);
> +				}
> +				goto unlock_done;
> +			} else if (pfn == ent->pfn) {
> +				/* Removing head pfn */
> +				ent->count--;
> +				ent->pfn++;
> +				if (!ent->count) {
> +					list_del(&ent->list);
> +					kfree(ent);
> +				}
> +				goto unlock_done;
> +			} else if (pfn > ent->pfn && pfn < ent->pfn + ent->count - 1) {
> +				/*
> +				 * Removing a pfn in the middle. Cut off the tail
> +				 * of the existing region and create a template for
> +				 * the new one.
> +				 */
> +				new_region.pfn = pfn + 1;
> +				new_region.count = ent->count - (pfn - ent->pfn + 1);
> +				ent->count = pfn - ent->pfn;
> +				goto unlock_split;
> +			}
> +
> +		}
> +unlock_done:
> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +		continue;
> +
> +unlock_split:
> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +
> +		ent = kzalloc(sizeof(struct hv_enc_pfn_region), GFP_KERNEL);
> +		/*
> +		 * There is no apparent good way to recover from an -ENOMEM
> +		 * situation; the accounting is going to be wrong either way.
> +		 * Proceed with the rest of the list to make it 'less wrong'.
> +		 */
> +		if (WARN_ON_ONCE(!ent))
> +			continue;
> +
> +		ent->pfn = new_region.pfn;
> +		ent->count = new_region.count;
> +
> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
> +		list_add(&ent->list, &hv_list_enc);
> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +	}
> +}
> +
> +void hv_ivm_clear_host_access(void)
> +{
> +	struct hv_gpa_range_for_visibility *input;
> +	struct hv_enc_pfn_region *ent;
> +	unsigned long flags;
> +	u64 hv_status;
> +	int batch_size, cur, i;
> +
> +	if (!hv_is_isolation_supported())
> +		return;

I seem to recall that some separate work has been done
to support kexec/kdump for the more generic SEV-SNP and
TDX cases where there's no paravisor mediating. I haven't
gone looking for that code to see when it runs. 
hv_ivm_clear_host_access() is needed to update the
paravisor records about the page state, but if other code
has already updated the hypervisor/processor state, that
might be problematic. If this runs first, then the more
generic code will presumably find nothing to do, which
should be OK.

I'll try to go look further at this situation, unless you already
have. If necessary, this function could be gated to run
only when a paravisor is present.

> +
> +	raw_spin_lock_irqsave(&hv_list_enc_lock, flags);

Since this function is now called after other CPUs have
been stopped, the spin lock is no longer necessary, unless
you were counting on it to provide the interrupt disable
needed for accessing the per-cpu hypercall argument page.
But even then, I'd suggest just doing the interrupt disable
instead of the spin lock so there's no chance of the
panic or kexec path getting hung waiting on the spin lock.

There's also a potentially rare problem if other CPUs are
stopped while hv_list_enc_add() or hv_list_enc_remove()
is being executed. The list might be inconsistent, or not
fully reflect what the paravisor and hypervisor think about
the private/shared state of the page. But I don't think there's
anything we can do about that. Again, I'd suggest a code
comment acknowledging this case, and that there's nothing
that can be done.

> +
> +	batch_size = MIN(hv_setup_in_array(&input, sizeof(*input),
> +					   sizeof(input->gpa_page_list[0])),
> +			 HV_MAX_MODIFY_GPA_REP_COUNT);

The patches that added hv_setup_in_array() were pulled from
hyperv-next due to some renewed discussion. You'll need to revert
back to the previous handling of hyperv_pcpu_input_arg. :-(

> +	if (unlikely(!input))
> +		goto unlock;
> +
> +	list_for_each_entry(ent, &hv_list_enc, list) {
> +		for (i = 0, cur = 0; i < ent->count; i++) {
> +			input->gpa_page_list[cur] = ent->pfn + i;
> +			cur++;
> +
> +			if (cur == batch_size || i == ent->count - 1) {
> +				input->partition_id = HV_PARTITION_ID_SELF;
> +				input->host_visibility = VMBUS_PAGE_NOT_VISIBLE;
> +				input->reserved0 = 0;
> +				input->reserved1 = 0;
> +				hv_status = hv_do_rep_hypercall(
> +					HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY,
> +					cur, 0, input, NULL);
> +				WARN_ON_ONCE(!hv_result_success(hv_status));
> +				cur = 0;
> +			}
> +		}
> +
> +	}
> +
> +unlock:
> +	raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(hv_ivm_clear_host_access);
> +
>  /*
>   * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
>   *
> @@ -476,24 +658,33 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>  	u64 hv_status;
>  	int batch_size;
>  	unsigned long flags;
> +	int ret;
> 
>  	/* no-op if partition isolation is not enabled */
>  	if (!hv_is_isolation_supported())
>  		return 0;
> 
> +	if (visibility == VMBUS_PAGE_NOT_VISIBLE) {
> +		hv_list_enc_remove(pfn, count);
> +	} else {
> +		ret = hv_list_enc_add(pfn, count);
> +		if (ret)
> +			return ret;
> +	}
> +
>  	local_irq_save(flags);
>  	batch_size = hv_setup_in_array(&input, sizeof(*input),
>  					sizeof(input->gpa_page_list[0]));
>  	if (unlikely(!input)) {
> -		local_irq_restore(flags);
> -		return -EINVAL;
> +		ret = -EINVAL;
> +		goto unlock;
>  	}
> 
>  	if (count > batch_size) {
>  		pr_err("Hyper-V: GPA count:%d exceeds supported:%u\n", count,
>  		       batch_size);
> -		local_irq_restore(flags);
> -		return -EINVAL;
> +		ret = -EINVAL;
> +		goto unlock;
>  	}
> 
>  	input->partition_id = HV_PARTITION_ID_SELF;
> @@ -502,12 +693,18 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>  	hv_status = hv_do_rep_hypercall(
>  			HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY, count,
>  			0, input, NULL);
> -	local_irq_restore(flags);
> -
>  	if (hv_result_success(hv_status))
> -		return 0;
> +		ret = 0;
>  	else
> -		return -EFAULT;
> +		ret = -EFAULT;
> +
> +unlock:
> +	local_irq_restore(flags);
> +
> +	if (ret)
> +		hv_list_enc_remove(pfn, count);

If making the pages shared fails, this is an "undo". But what
about an "undo" if making the pages private fails? Maybe
your thinking is that leaving the pages on the shared list just
sets things up for a failure when hv_ivm_clear_host_access()
tries to make them private. But I wonder if it would be better
to go ahead and "undo" here in case the failure is transient
and hv_ivm_clear_host_access() might later succeed. I don't
understand the hypercall failure modes well enough to
know whether that's plausible. But if you think the "undo"
here really isn't warranted, please add a code comment to
that effect since a future reader might expect to the two
cases to be symmetrical here.

> +
> +	return ret;
>  }
> 
>  /*
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index abc4659f5809..6a988001e46f 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -263,10 +263,12 @@ static inline int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip,
>  void hv_vtom_init(void);
>  void hv_ivm_msr_write(u64 msr, u64 value);
>  void hv_ivm_msr_read(u64 msr, u64 *value);
> +void hv_ivm_clear_host_access(void);
>  #else
>  static inline void hv_vtom_init(void) {}
>  static inline void hv_ivm_msr_write(u64 msr, u64 value) {}
>  static inline void hv_ivm_msr_read(u64 msr, u64 *value) {}
> +static inline void hv_ivm_clear_host_access(void) {}
>  #endif
> 
>  static inline bool hv_is_synic_msr(unsigned int reg)
> --
> 2.50.1



* RE: [PATCH v3] x86/hyperv: Fix kdump on Azure CVMs
From: Vitaly Kuznetsov @ 2025-08-27 12:50 UTC
  To: Michael Kelley, linux-hyperv@vger.kernel.org
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	x86@kernel.org, linux-kernel@vger.kernel.org, Nuno Das Neves,
	Tianyu Lan, Li Tian, Philipp Rudo

Michael Kelley <mhklinux@outlook.com> writes:

> From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Thursday, August 21, 2025 8:17 AM
>> 
>> Azure CVM instance types featuring a paravisor hang upon kdump. The
>> investigation shows that makedumpfile causes a hang when it steps on a page
>> which was previously shared with the host
>> (HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY). The new kernel has no
>> knowledge of these 'special' regions (which are Vmbus connection pages,
>> GPADL buffers, ...). There are several ways to approach the issue:
>> - Convey the knowledge about these regions to the new kernel somehow.
>> - Unshare these regions before accessing them in the new kernel (it is
>> unclear if there's a way to query the status for a given GPA range).
>> - Unshare these regions before jumping to the new kernel (which this patch
>> implements).
>> 
>> To make the procedure as robust as possible, store PFN ranges of shared
>> regions in a linked list instead of storing GVAs and re-using
>> hv_vtom_set_host_visibility(). This also avoids memory allocation on the
>> kdump/kexec path.
>> 
>> The patch skips implementing the weird corner case in hv_list_enc_remove()
>> where a PFN in the middle of a region is unshared. First, it is unlikely
>> that such requests happen. Second, it is not a big problem if
>> hv_list_enc_remove() doesn't actually remove some regions as this will
>> only result in an extra hypercall doing nothing upon kexec/kdump; there's
>> no need to be perfect.
>
> This last paragraph is left over from the previous version. It's no
> longer correct and should be removed.
>

Indeed, will drop!

>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>> Changes since v2 [Michael Kelley]:
>>  - Rebase to hyperv-next.
>>  - Move the hv_ivm_clear_host_access() call to hyperv_cleanup(). This also
>>    makes the ARM stub unneeded.
>>  - Implement the missing corner case in hv_list_enc_remove(). With this,
>>    the math should (hopefully!) always be correct so we don't rely on
>>    the idempotency of the HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY
>>    hypercall. As the case is not something we see in practice, I tested
>>    the code with a few synthetic tests.
>>  - Fix the math in hv_list_enc_remove() (count -> ent->count).
>>  - Typos.
>> ---
>>  arch/x86/hyperv/hv_init.c       |   3 +
>>  arch/x86/hyperv/ivm.c           | 213 ++++++++++++++++++++++++++++++--
>>  arch/x86/include/asm/mshyperv.h |   2 +
>>  3 files changed, 210 insertions(+), 8 deletions(-)
>> 
>> diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
>> index 2979d15223cf..4bb1578237eb 100644
>> --- a/arch/x86/hyperv/hv_init.c
>> +++ b/arch/x86/hyperv/hv_init.c
>> @@ -596,6 +596,9 @@ void hyperv_cleanup(void)
>>  	union hv_x64_msr_hypercall_contents hypercall_msr;
>>  	union hv_reference_tsc_msr tsc_msr;
>> 
>> +	/* Retract host access to shared memory in case of isolation */
>> +	hv_ivm_clear_host_access();
>> +
>>  	/* Reset our OS id */
>>  	wrmsrq(HV_X64_MSR_GUEST_OS_ID, 0);
>>  	hv_ivm_msr_write(HV_X64_MSR_GUEST_OS_ID, 0);
>> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
>> index 3084ae8a3eed..0d74156ad6a7 100644
>> --- a/arch/x86/hyperv/ivm.c
>> +++ b/arch/x86/hyperv/ivm.c
>> @@ -462,6 +462,188 @@ void hv_ivm_msr_read(u64 msr, u64 *value)
>>  		hv_ghcb_msr_read(msr, value);
>>  }
>> 
>> +/*
>> + * Keep track of the PFN regions which were shared with the host. The access
>> + * must be revoked upon kexec/kdump (see hv_ivm_clear_host_access()).
>> + */
>> +struct hv_enc_pfn_region {
>> +	struct list_head list;
>> +	u64 pfn;
>> +	int count;
>> +};
>> +
>> +static LIST_HEAD(hv_list_enc);
>> +static DEFINE_RAW_SPINLOCK(hv_list_enc_lock);
>> +
>> +static int hv_list_enc_add(const u64 *pfn_list, int count)
>> +{
>> +	struct hv_enc_pfn_region *ent;
>> +	unsigned long flags;
>> +	u64 pfn;
>> +	int i;
>> +
>> +	for (i = 0; i < count; i++) {
>> +		pfn = pfn_list[i];
>> +
>> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
>> +		/* Check if the PFN already exists in some region first */
>> +		list_for_each_entry(ent, &hv_list_enc, list) {
>> +			if ((ent->pfn <= pfn) && (ent->pfn + ent->count - 1 >= pfn))
>> +				/* Nothing to do - pfn is already in the list */
>> +				goto unlock_done;
>> +		}
>> +
>> +		/*
>> +		 * Check if the PFN is adjacent to an existing region. Growing
>> +		 * a region can make it adjacent to another one but merging is
>> +		 * not (yet) implemented for simplicity. A PFN cannot be added
>> +		 * to two regions to keep the logic in hv_list_enc_remove()
>> +		 * correct.
>> +		 */
>> +		list_for_each_entry(ent, &hv_list_enc, list) {
>> +			if (ent->pfn + ent->count == pfn) {
>> +				/* Grow existing region up */
>> +				ent->count++;
>> +				goto unlock_done;
>> +			} else if (pfn + 1 == ent->pfn) {
>> +				/* Grow existing region down */
>> +				ent->pfn--;
>> +				ent->count++;
>> +				goto unlock_done;
>> +			}
>> +		}
>> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +
>> +		/* No adjacent region found -- create a new one */
>> +		ent = kzalloc(sizeof(struct hv_enc_pfn_region), GFP_KERNEL);
>> +		if (!ent)
>> +			return -ENOMEM;
>> +
>> +		ent->pfn = pfn;
>> +		ent->count = 1;
>> +
>> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
>> +		list_add(&ent->list, &hv_list_enc);
>> +
>> +unlock_done:
>> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void hv_list_enc_remove(const u64 *pfn_list, int count)
>> +{
>> +	struct hv_enc_pfn_region *ent, *t;
>> +	struct hv_enc_pfn_region new_region;
>> +	unsigned long flags;
>> +	u64 pfn;
>> +	int i;
>> +
>> +	for (i = 0; i < count; i++) {
>> +		pfn = pfn_list[i];
>> +
>> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
>> +		list_for_each_entry_safe(ent, t, &hv_list_enc, list) {
>> +			if (pfn == ent->pfn + ent->count - 1) {
>> +				/* Removing tail pfn */
>> +				ent->count--;
>> +				if (!ent->count) {
>> +					list_del(&ent->list);
>> +					kfree(ent);
>> +				}
>> +				goto unlock_done;
>> +			} else if (pfn == ent->pfn) {
>> +				/* Removing head pfn */
>> +				ent->count--;
>> +				ent->pfn++;
>> +				if (!ent->count) {
>> +					list_del(&ent->list);
>> +					kfree(ent);
>> +				}
>> +				goto unlock_done;
>> +			} else if (pfn > ent->pfn && pfn < ent->pfn + ent->count - 1) {
>> +				/*
>> +				 * Removing a pfn in the middle. Cut off the tail
>> +				 * of the existing region and create a template for
>> +				 * the new one.
>> +				 */
>> +				new_region.pfn = pfn + 1;
>> +				new_region.count = ent->count - (pfn - ent->pfn + 1);
>> +				ent->count = pfn - ent->pfn;
>> +				goto unlock_split;
>> +			}
>> +
>> +		}
>> +unlock_done:
>> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +		continue;
>> +
>> +unlock_split:
>> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +
>> +		ent = kzalloc(sizeof(struct hv_enc_pfn_region), GFP_KERNEL);
>> +		/*
>> +		 * There is no apparent good way to recover from an -ENOMEM
>> +		 * situation; the accounting is going to be wrong either way.
>> +		 * Proceed with the rest of the list to make it 'less wrong'.
>> +		 */
>> +		if (WARN_ON_ONCE(!ent))
>> +			continue;
>> +
>> +		ent->pfn = new_region.pfn;
>> +		ent->count = new_region.count;
>> +
>> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
>> +		list_add(&ent->list, &hv_list_enc);
>> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +	}
>> +}
>> +
>> +void hv_ivm_clear_host_access(void)
>> +{
>> +	struct hv_gpa_range_for_visibility *input;
>> +	struct hv_enc_pfn_region *ent;
>> +	unsigned long flags;
>> +	u64 hv_status;
>> +	int batch_size, cur, i;
>> +
>> +	if (!hv_is_isolation_supported())
>> +		return;
>
> I seem to recall that some separate work has been done
> to support kexec/kdump for the more generic SEV-SNP and
> TDX cases where there's no paravisor mediating. I haven't
> gone looking for that code to see when it runs. 
> hv_ivm_clear_host_access() is needed to update the
> paravisor records about the page state, but if other code
> has already updated the hypervisor/processor state, that
> might be problematic. If this runs first, then the more
> generic code will presumably find nothing to do, which
> should be OK.
>
> I'll try to go look further at this situation, unless you already
> have. If necessary, this function could be gated to run
> only when a paravisor is present.

Yes, there are SEV-SNP and TDX specific
snp_kexec_finish()/tdx_kexec_finish() which do memory unsharing, but I
convinced myself that these are not called on Azure CVMs which use a
paravisor. In particular, for SEV-SNP, 'sme_me_mask == 0' in
sme_early_init().

I have to admit I've never seen an Azure/Hyper-V CVM without a
paravisor, but I agree it may make sense to hide all this tracking and
cleanup logic under 'if (hyperv_paravisor_present)' (or even 'if
(hv_is_isolation_supported() && hyperv_paravisor_present)').
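
I.e. the guard at the top of hv_ivm_clear_host_access() would become
something like this (a sketch only, relying on the existing
hyperv_paravisor_present flag):

	/* Nothing to do unless a paravisor mediates page visibility */
	if (!hv_is_isolation_supported() || !hyperv_paravisor_present)
		return;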

>
>> +
>> +	raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
>
> Since this function is now called after other CPUs have
> been stopped, the spin lock is no longer necessary, unless
> you were counting on it to provide the interrupt disable
> needed for accessing the per-cpu hypercall argument page.
> But even then, I'd suggest just doing the interrupt disable
> instead of the spin lock so there's no chance of the
> panic or kexec path getting hung waiting on the spin lock.

Makes sense, will do.
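
Something along these lines, I suppose (a sketch; interrupts still need
to be off while the per-cpu hypercall input page is in use, but there is
no lock the panic/kexec path could get stuck on):

	local_irq_save(flags);
	/* ... build batches from hv_list_enc and issue the hypercalls ... */
	local_irq_restore(flags);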

>
> There's also a potentially rare problem if other CPUs are
> stopped while hv_list_enc_add() or hv_list_enc_remove()
> is being executed. The list might be inconsistent, or not
> fully reflect what the paravisor and hypervisor think about
> the private/shared state of the page. But I don't think there's
> anything we can do about that. Again, I'd suggest a code
> comment acknowledging this case, and that there's nothing
> that can be done.

True, will add a comment. Just like with a lot of other corner cases in
panic, it's hard to guarantee correctness in ALL cases as the system can
be in any state (e.g. if the panic is caused by memory corruption -- who
knows what's corrupted?). I'm hoping that with the newly added logic
we're covering the most common kdump case and it'll 'generally work' on
Azure CVMs.

>
>> +
>> +	batch_size = MIN(hv_setup_in_array(&input, sizeof(*input),
>> +					   sizeof(input->gpa_page_list[0])),
>> +			 HV_MAX_MODIFY_GPA_REP_COUNT);
>
> The patches that added hv_setup_in_array() were pulled from
> hyperv-next due to some renewed discussion. You'll need to revert
> back to the previous handling of hyperv_pcpu_input_arg. :-(
>

No worries, it's not a big change for this patch)
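
For the record, this means going back to something like the old
hyperv_pcpu_input_arg pattern (a sketch; the exact shape may differ
after rebasing):

	local_irq_save(flags);
	input = *(struct hv_gpa_range_for_visibility **)
			this_cpu_ptr(hyperv_pcpu_input_arg);
	batch_size = HV_MAX_MODIFY_GPA_REP_COUNT;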

>> +	if (unlikely(!input))
>> +		goto unlock;
>> +
>> +	list_for_each_entry(ent, &hv_list_enc, list) {
>> +		for (i = 0, cur = 0; i < ent->count; i++) {
>> +			input->gpa_page_list[cur] = ent->pfn + i;
>> +			cur++;
>> +
>> +			if (cur == batch_size || i == ent->count - 1) {
>> +				input->partition_id = HV_PARTITION_ID_SELF;
>> +				input->host_visibility = VMBUS_PAGE_NOT_VISIBLE;
>> +				input->reserved0 = 0;
>> +				input->reserved1 = 0;
>> +				hv_status = hv_do_rep_hypercall(
>> +					HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY,
>> +					cur, 0, input, NULL);
>> +				WARN_ON_ONCE(!hv_result_success(hv_status));
>> +				cur = 0;
>> +			}
>> +		}
>> +
>> +	}
>> +
>> +unlock:
>> +	raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(hv_ivm_clear_host_access);
>> +
>>  /*
>>   * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
>>   *
>> @@ -476,24 +658,33 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>>  	u64 hv_status;
>>  	int batch_size;
>>  	unsigned long flags;
>> +	int ret;
>> 
>>  	/* no-op if partition isolation is not enabled */
>>  	if (!hv_is_isolation_supported())
>>  		return 0;
>> 
>> +	if (visibility == VMBUS_PAGE_NOT_VISIBLE) {
>> +		hv_list_enc_remove(pfn, count);
>> +	} else {
>> +		ret = hv_list_enc_add(pfn, count);
>> +		if (ret)
>> +			return ret;
>> +	}
>> +
>>  	local_irq_save(flags);
>>  	batch_size = hv_setup_in_array(&input, sizeof(*input),
>>  					sizeof(input->gpa_page_list[0]));
>>  	if (unlikely(!input)) {
>> -		local_irq_restore(flags);
>> -		return -EINVAL;
>> +		ret = -EINVAL;
>> +		goto unlock;
>>  	}
>> 
>>  	if (count > batch_size) {
>>  		pr_err("Hyper-V: GPA count:%d exceeds supported:%u\n", count,
>>  		       batch_size);
>> -		local_irq_restore(flags);
>> -		return -EINVAL;
>> +		ret = -EINVAL;
>> +		goto unlock;
>>  	}
>> 
>>  	input->partition_id = HV_PARTITION_ID_SELF;
>> @@ -502,12 +693,18 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>>  	hv_status = hv_do_rep_hypercall(
>>  			HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY, count,
>>  			0, input, NULL);
>> -	local_irq_restore(flags);
>> -
>>  	if (hv_result_success(hv_status))
>> -		return 0;
>> +		ret = 0;
>>  	else
>> -		return -EFAULT;
>> +		ret = -EFAULT;
>> +
>> +unlock:
>> +	local_irq_restore(flags);
>> +
>> +	if (ret)
>> +		hv_list_enc_remove(pfn, count);
>
> If making the pages shared fails, this is an "undo". But what
> about an "undo" if making the pages private fails? Maybe
> your thinking is that leaving the pages on the shared list just
> sets things up for a failure when hv_ivm_clear_host_access()
> tries to make them private. But I wonder if it would be better
> to go ahead and "undo" here in case the failure is transient
> and hv_ivm_clear_host_access() might later succeed. I don't
> understand the hypercall failure modes well enough to
> know whether that's plausible. But if you think the "undo"
> here really isn't warranted, please add a code comment to
> that effect since a future reader might expect the two
> cases to be symmetrical here.

Nice catch, I missed the fact that making pages private can fail too. As
our assumption is that the hypercall is 'all or nothing', it's better to
handle the case symmetrically. Will add the logic.
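
Roughly like this in hv_mark_gpa_visibility() (a sketch which assumes
the hypercall is all-or-nothing; a possible -ENOMEM from
hv_list_enc_add() on the undo path is ignored for brevity):

	if (ret) {
		if (visibility == VMBUS_PAGE_NOT_VISIBLE)
			hv_list_enc_add(pfn, count);	/* undo the remove */
		else
			hv_list_enc_remove(pfn, count);	/* undo the add */
	}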

>
>> +
>> +	return ret;
>>  }
>> 
>>  /*
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index abc4659f5809..6a988001e46f 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -263,10 +263,12 @@ static inline int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip,
>>  void hv_vtom_init(void);
>>  void hv_ivm_msr_write(u64 msr, u64 value);
>>  void hv_ivm_msr_read(u64 msr, u64 *value);
>> +void hv_ivm_clear_host_access(void);
>>  #else
>>  static inline void hv_vtom_init(void) {}
>>  static inline void hv_ivm_msr_write(u64 msr, u64 value) {}
>>  static inline void hv_ivm_msr_read(u64 msr, u64 *value) {}
>> +static inline void hv_ivm_clear_host_access(void) {}
>>  #endif
>> 
>>  static inline bool hv_is_synic_msr(unsigned int reg)
>> --
>> 2.50.1
>

-- 
Vitaly



* RE: [PATCH v3] x86/hyperv: Fix kdump on Azure CVMs
From: Michael Kelley @ 2025-08-27 16:01 UTC
  To: Vitaly Kuznetsov, linux-hyperv@vger.kernel.org
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	x86@kernel.org, linux-kernel@vger.kernel.org, Nuno Das Neves,
	Tianyu Lan, Li Tian, Philipp Rudo

From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Wednesday, August 27, 2025 5:51 AM
> 
> Michael Kelley <mhklinux@outlook.com> writes:
> 
> > From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Thursday, August 21, 2025 8:17 AM
> >>

[snip]

> >
> > I seem to recall that some separate work has been done
> > to support kexec/kdump for the more generic SEV-SNP and
> > TDX cases where there's no paravisor mediating. I haven't
> > gone looking for that code to see when it runs.
> > hv_ivm_clear_host_access() is needed to update the
> > paravisor records about the page state, but if other code
> > has already updated the hypervisor/processor state, that
> > might be problematic. If this runs first, then the more
> > generic code will presumably find nothing to do, which
> > should be OK.
> >
> > I'll try to go look further at this situation, unless you already
> > have. If necessary, this function could be gated to run
> > only when a paravisor is present.
> 
> Yes, there are SEV-SNP and TDX specific
> snp_kexec_finish()/tdx_kexec_finish() which do memory unsharing, but I
> convinced myself that these are not called on Azure CVMs which use a
> paravisor. In particular, for SEV-SNP, 'sme_me_mask == 0' in
> sme_early_init().

Thanks for the pointer to snp_kexec_finish() and tdx_kexec_finish().
Yes, I agree they won't get called in the Hyper-V paravisor case.
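
For reference, sme_early_init() bails out before wiring up the SNP
kexec hooks when sme_me_mask is 0, which is the case with a paravisor.
Paraphrasing from arch/x86/mm/mem_encrypt_amd.c (the exact code may
differ):

	void __init sme_early_init(void)
	{
		if (!sme_me_mask)
			return;
		/* ... later wires up snp_kexec_begin()/snp_kexec_finish() ... */
	}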

> 
> I have to admit I've never seen an Azure/Hyper-V CVM without a
> paravisor, but I agree it may make sense to hide all this tracking and
> cleanup logic under 'if (hyperv_paravisor_present)' (or even 'if
> (hv_is_isolation_supported() && hyperv_paravisor_present)').

I think this patch should be plugged into the .enc_kexec_begin and
.enc_kexec_finish mechanism. By using .enc_kexec_begin similarly
to snp/tdx_kexec_begin(), synchronizing with in-progress private<->shared
conversions is done in the kexec() case, though not in the panic/kdump
case. That machinery is there, and the Hyper-V paravisor case can use it.
hv_ivm_clear_host_access() should be the .enc_kexec_finish function,
and it will get invoked at the right time without having to be
explicitly called in hyperv_cleanup().
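
Roughly like this in hv_vtom_init(), next to the other
x86_platform.guest.enc_* assignments (a sketch; the hv_vtom_kexec_begin()
helper name is hypothetical):

	x86_platform.guest.enc_kexec_begin = hv_vtom_kexec_begin;
	x86_platform.guest.enc_kexec_finish = hv_ivm_clear_host_access;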

Hyper-V *does* support running SNP and TDX in what Microsoft
internally calls the "fully enlightened" case, where the guest OS does
everything necessary to operate in SNP or TDX mode, without
depending on a paravisor. In such a case, sme_me_mask will be
non-zero in SNP mode, for example. But I'm unsure if Microsoft has kept
the Linux guest support on Hyper-V up-to-date enough for this to actually
work in practice. I thought at one point there was an internal use case
for SNP without a paravisor, but it's been nearly two years now
since I retired, and so I don't really know anymore. For running on
Hyper-V in TDX mode without a paravisor, I know there's one Linux
patch missing that is needed by the netvsc driver. That patch would
provide the TDX support for set_memory_encrypted/decrypted() to work
on a vmalloc area where the underlying physical memory is not
contiguous. SNP has the support, but it's a known gap for TDX.

If you wire up the .enc_kexec_begin and .enc_kexec_finish functions
in hv_vtom_init() along with the other x86_platform.guest.enc_*
functions, then you don't need to test for a paravisor being present
because hv_vtom_init() is called only when a paravisor is present.
The non-paravisor cases on Hyper-V will fall back to the existing
snp/tdx_kexec_begin/finish() functions, and everything *should*
just work. If it doesn't just work, that's a different problem for
the Microsoft folks to look at if they care.

I will ping my contacts on the Microsoft side to see if the "no
paravisor" case is still of interest, and if so, whether someone
can test it. My only test environment is as a normal Azure user,
so like you, I don't have a way to do such a test.

> 
> >
> >> +
> >> +	raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
> >
> > Since this function is now called after other CPUs have
> > been stopped, the spin lock is no longer necessary, unless
> > you were counting on it to provide the interrupt disable
> > needed for accessing the per-cpu hypercall argument page.
> > But even then, I'd suggest just doing the interrupt disable
> > instead of the spin lock so there's no chance of the
> > panic or kexec path getting hung waiting on the spin lock.
> 
> Makes sense, will do.
> 
> >
> > There's also a potentially rare problem if other CPUs are
> > stopped while hv_list_enc_add() or hv_list_enc_remove()
> > is being executed. The list might be inconsistent, or not
> > fully reflect what the paravisor and hypervisor think about
> > the private/shared state of the page. But I don't think there's
> > anything we can do about that. Again, I'd suggest a code
> > comment acknowledging this case, and that there's nothing
> > that can be done.
> 
> True, will add a comment. Just like with a lot of other corner cases in
> panic, it's hard to guarantee correctness in ALL cases as the system can
> be in any state (e.g. if the panic is caused by memory corruption -- who
> knows what's corrupted?). I'm hoping that with the newly added logic
> we're covering the most common kdump case and it'll 'generally work' on
> Azure CVMs.

As noted above, in a non-panic kexec(), the synchronization with
in-progress private<->shared conversions can be solved.

Michael

