linux-kernel.vger.kernel.org archive mirror
* [PATCH v2] x86/hyperv: Fix kdump on Azure CVMs
@ 2025-08-18  9:54 Vitaly Kuznetsov
  2025-08-19 17:30 ` Michael Kelley
  2025-08-19 21:19 ` kernel test robot
  0 siblings, 2 replies; 5+ messages in thread
From: Vitaly Kuznetsov @ 2025-08-18  9:54 UTC
  To: linux-hyperv
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, x86,
	linux-kernel, Nuno Das Neves, Tianyu Lan, Michael Kelley, Li Tian,
	Philipp Rudo

Azure CVM instance types featuring a paravisor hang upon kdump. The
investigation shows that makedumpfile causes a hang when it steps on a page
which was previously shared with the host
(HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY). The new kernel has no
knowledge of these 'special' regions (which are Vmbus connection pages,
GPADL buffers, ...). There are several ways to approach the issue:
- Convey the knowledge about these regions to the new kernel somehow.
- Unshare these regions before accessing them in the new kernel (it is unclear
if there's a way to query the status for a given GPA range).
- Unshare these regions before jumping to the new kernel (which this patch
implements).

To make the procedure as robust as possible, store PFN ranges of shared
regions in a linked list instead of storing GVAs and re-using
hv_vtom_set_host_visibility(). This also avoids memory allocation
on the kdump/kexec path.

The patch skips implementing the weird corner case in hv_list_enc_remove()
when a PFN in the middle of a region is unshared. First, it is unlikely
that such requests happen. Second, it is not a big problem if
hv_list_enc_remove() doesn't actually remove some regions as this will
only result in an extra hypercall doing nothing upon kexec/kdump; there's
no need to be perfect.

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
---
Changes since v1:
 - fix build on ARM [kernel test robot]
---
 arch/arm64/include/asm/mshyperv.h |   3 +
 arch/x86/hyperv/ivm.c             | 153 ++++++++++++++++++++++++++++++
 arch/x86/include/asm/mshyperv.h   |   2 +
 drivers/hv/vmbus_drv.c            |   2 +
 4 files changed, 160 insertions(+)

diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index b721d3134ab6..af11abf403b4 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -53,6 +53,9 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
 	return hv_get_msr(reg);
 }
 
+/* Isolated VMs are unsupported on ARM, no cleanup needed */
+static inline void hv_ivm_clear_host_access(void) {}
+
 /* SMCCC hypercall parameters */
 #define HV_SMCCC_FUNC_NUMBER	1
 #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
index ade6c665c97e..a6e614672836 100644
--- a/arch/x86/hyperv/ivm.c
+++ b/arch/x86/hyperv/ivm.c
@@ -462,6 +462,150 @@ void hv_ivm_msr_read(u64 msr, u64 *value)
 		hv_ghcb_msr_read(msr, value);
 }
 
+/*
+ * Keep track of the PFN regions which were shared with the host. The access
+ * must be revoked upon kexec/kdump (see hv_ivm_clear_host_access()).
+ */
+struct hv_enc_pfn_region {
+	struct list_head list;
+	u64 pfn;
+	int count;
+};
+
+static LIST_HEAD(hv_list_enc);
+static DEFINE_RAW_SPINLOCK(hv_list_enc_lock);
+
+static int hv_list_enc_add(const u64 *pfn_list, int count)
+{
+	struct hv_enc_pfn_region *ent;
+	unsigned long flags;
+	bool found = false;
+	u64 pfn;
+	int i;
+
+	for (i = 0; i < count; i++) {
+		pfn = pfn_list[i];
+
+		found = false;
+		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
+		list_for_each_entry(ent, &hv_list_enc, list) {
+			if ((ent->pfn <= pfn) && (ent->pfn + ent->count - 1 >= pfn)) {
+				/* Nothin to do - pfn is already in the list */
+				found = true;
+			} else if (ent->pfn + ent->count == pfn) {
+				/* Grow existing region up */
+				found = true;
+				ent->count++;
+			} else if (pfn + 1 == ent->pfn) {
+				/* Grow existing region down */
+				found = true;
+				ent->pfn--;
+				ent->count++;
+			}
+		};
+		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+
+		if (found)
+			continue;
+
+		/* No adajacent region found -- creating a new one */
+		ent = kzalloc(sizeof(struct hv_enc_pfn_region), GFP_KERNEL);
+		if (!ent)
+			return -ENOMEM;
+
+		ent->pfn = pfn;
+		ent->count = 1;
+
+		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
+		list_add(&ent->list, &hv_list_enc);
+		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+	}
+
+	return 0;
+}
+
+static void hv_list_enc_remove(const u64 *pfn_list, int count)
+{
+	struct hv_enc_pfn_region *ent, *t;
+	unsigned long flags;
+	u64 pfn;
+	int i;
+
+	for (i = 0; i < count; i++) {
+		pfn = pfn_list[i];
+
+		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
+		list_for_each_entry_safe(ent, t, &hv_list_enc, list) {
+			if (pfn == ent->pfn + count - 1) {
+				/* Removing tail pfn */
+				ent->count--;
+				if (!ent->count) {
+					list_del(&ent->list);
+					kfree(ent);
+				}
+			} else if (pfn == ent->pfn) {
+				/* Removing head pfn */
+				ent->count--;
+				ent->pfn++;
+				if (!ent->count) {
+					list_del(&ent->list);
+					kfree(ent);
+				}
+			}
+
+			/*
+			 * Removing PFNs in the middle of a region is not implemented; the
+			 * list is currently only used for cleanup upon kexec and there's
+			 * no harm done if we issue an extra unneeded hypercall making some
+			 * region encrypted when it already is.
+			 */
+		};
+		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+	}
+}
+
+void hv_ivm_clear_host_access(void)
+{
+	struct hv_gpa_range_for_visibility *input;
+	struct hv_enc_pfn_region *ent;
+	unsigned long flags;
+	u64 hv_status;
+	int cur, i;
+
+	if (!hv_is_isolation_supported())
+		return;
+
+	raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
+
+	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
+	if (!input)
+		goto unlock;
+
+	list_for_each_entry(ent, &hv_list_enc, list) {
+		for (i = 0, cur = 0; i < ent->count; i++) {
+			input->gpa_page_list[cur] = ent->pfn + i;
+			cur++;
+
+			if (cur == HV_MAX_MODIFY_GPA_REP_COUNT || i == ent->count - 1) {
+				input->partition_id = HV_PARTITION_ID_SELF;
+				input->host_visibility = VMBUS_PAGE_NOT_VISIBLE;
+				input->reserved0 = 0;
+				input->reserved1 = 0;
+				hv_status = hv_do_rep_hypercall(
+					HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY,
+					cur, 0, input, NULL);
+				WARN_ON_ONCE(!hv_result_success(hv_status));
+				cur = 0;
+			}
+		}
+
+	};
+
+unlock:
+	raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
+}
+EXPORT_SYMBOL_GPL(hv_ivm_clear_host_access);
+
 /*
  * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
  *
@@ -475,6 +619,7 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
 	struct hv_gpa_range_for_visibility *input;
 	u64 hv_status;
 	unsigned long flags;
+	int ret;
 
 	/* no-op if partition isolation is not enabled */
 	if (!hv_is_isolation_supported())
@@ -486,6 +631,14 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
 		return -EINVAL;
 	}
 
+	if (visibility == VMBUS_PAGE_NOT_VISIBLE) {
+		hv_list_enc_remove(pfn, count);
+	} else {
+		ret = hv_list_enc_add(pfn, count);
+		if (ret)
+			return ret;
+	}
+
 	local_irq_save(flags);
 	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
 
diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index abc4659f5809..6a988001e46f 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -263,10 +263,12 @@ static inline int hv_snp_boot_ap(u32 apic_id, unsigned long start_ip,
 void hv_vtom_init(void);
 void hv_ivm_msr_write(u64 msr, u64 value);
 void hv_ivm_msr_read(u64 msr, u64 *value);
+void hv_ivm_clear_host_access(void);
 #else
 static inline void hv_vtom_init(void) {}
 static inline void hv_ivm_msr_write(u64 msr, u64 value) {}
 static inline void hv_ivm_msr_read(u64 msr, u64 *value) {}
+static inline void hv_ivm_clear_host_access(void) {}
 #endif
 
 static inline bool hv_is_synic_msr(unsigned int reg)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index 2ed5a1e89d69..2e891e108218 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2784,6 +2784,7 @@ static void hv_kexec_handler(void)
 	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
 	mb();
 	cpuhp_remove_state(hyperv_cpuhp_online);
+	hv_ivm_clear_host_access();
 };
 
 static void hv_crash_handler(struct pt_regs *regs)
@@ -2799,6 +2800,7 @@ static void hv_crash_handler(struct pt_regs *regs)
 	cpu = smp_processor_id();
 	hv_stimer_cleanup(cpu);
 	hv_synic_disable_regs(cpu);
+	hv_ivm_clear_host_access();
 };
 
 static int hv_synic_suspend(void)
-- 
2.50.0



* RE: [PATCH v2] x86/hyperv: Fix kdump on Azure CVMs
  2025-08-18  9:54 [PATCH v2] x86/hyperv: Fix kdump on Azure CVMs Vitaly Kuznetsov
@ 2025-08-19 17:30 ` Michael Kelley
  2025-08-21  9:36   ` Vitaly Kuznetsov
  2025-08-19 21:19 ` kernel test robot
  1 sibling, 1 reply; 5+ messages in thread
From: Michael Kelley @ 2025-08-19 17:30 UTC
  To: Vitaly Kuznetsov, linux-hyperv@vger.kernel.org
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	x86@kernel.org, linux-kernel@vger.kernel.org, Nuno Das Neves,
	Tianyu Lan, Li Tian, Philipp Rudo

From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Monday, August 18, 2025 2:54 AM
> 
> Azure CVM instance types featuring a paravisor hang upon kdump. The
> investigation shows that makedumpfile causes a hang when it steps on a page
> which was previously shared with the host
> (HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY). The new kernel has no
> knowledge of these 'special' regions (which are Vmbus connection pages,
> GPADL buffers, ...). There are several ways to approach the issue:
> - Convey the knowledge about these regions to the new kernel somehow.
> - Unshare these regions before accessing them in the new kernel (it is unclear
> if there's a way to query the status for a given GPA range).
> - Unshare these regions before jumping to the new kernel (which this patch
> implements).
> 
> To make the procedure as robust as possible, store PFN ranges of shared
> regions in a linked list instead of storing GVAs and re-using
> hv_vtom_set_host_visibility(). This also avoids memory allocation
> on the kdump/kexec path.
> 
> The patch skips implementing the weird corner case in hv_list_enc_remove()
> when a PFN in the middle of a region is unshared. First, it is unlikely
> that such requests happen. Second, it is not a big problem if
> hv_list_enc_remove() doesn't actually remove some regions as this will
> only result in an extra hypercall doing nothing upon kexec/kdump; there's
> no need to be perfect.
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> ---
> Changes since v1:
>  - fix build on ARM [kernel test robot]
> ---
>  arch/arm64/include/asm/mshyperv.h |   3 +
>  arch/x86/hyperv/ivm.c             | 153 ++++++++++++++++++++++++++++++
>  arch/x86/include/asm/mshyperv.h   |   2 +
>  drivers/hv/vmbus_drv.c            |   2 +
>  4 files changed, 160 insertions(+)
> 
> diff --git a/arch/arm64/include/asm/mshyperv.h
> b/arch/arm64/include/asm/mshyperv.h
> index b721d3134ab6..af11abf403b4 100644
> --- a/arch/arm64/include/asm/mshyperv.h
> +++ b/arch/arm64/include/asm/mshyperv.h
> @@ -53,6 +53,9 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>  	return hv_get_msr(reg);
>  }
> 
> +/* Isolated VMs are unsupported on ARM, no cleanup needed */
> +static inline void hv_ivm_clear_host_access(void) {}

Stubs such as this should be handled differently. We've instead
put __weak stubs in drivers/hv/hv_common.c, and let x86 code
override. That approach avoids needing to update arch/arm64
code and to get acks from arm64 maintainers for functionality that
is (currently) x86-only. arch/arm64/include/asm/mshyperv.h is
pretty small because of this approach.

For consistency, this stub should follow that existing pattern. See
hv_is_isolation_supported() as an example.
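
A minimal sketch of what I mean (untested; exact placement per the
existing code in hv_common.c):

	/* In drivers/hv/hv_common.c -- default no-op, x86 ivm.c overrides it */
	void __weak hv_ivm_clear_host_access(void)
	{
	}
	EXPORT_SYMBOL_GPL(hv_ivm_clear_host_access);

Then arch/arm64/include/asm/mshyperv.h doesn't need to change at all.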

> +
>  /* SMCCC hypercall parameters */
>  #define HV_SMCCC_FUNC_NUMBER	1
>  #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
> index ade6c665c97e..a6e614672836 100644
> --- a/arch/x86/hyperv/ivm.c
> +++ b/arch/x86/hyperv/ivm.c
> @@ -462,6 +462,150 @@ void hv_ivm_msr_read(u64 msr, u64 *value)
>  		hv_ghcb_msr_read(msr, value);
>  }
> 
> +/*
> + * Keep track of the PFN regions which were shared with the host. The access
> + * must be revoked upon kexec/kdump (see hv_ivm_clear_host_access()).
> + */
> +struct hv_enc_pfn_region {
> +	struct list_head list;
> +	u64 pfn;
> +	int count;
> +};

I'm wondering if there's an existing kernel data structure that would handle
the requirements here. Did you look at using xarray()? It's probably not as
memory efficient since it presumably needs a separate entry for each PFN,
whereas your code below uses a single entry for a range of PFNs. But
maybe that's a worthwhile tradeoff to simplify the code and avoid some
of the messy issues I point out below.  Just a thought ....
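
Just to make the idea concrete, a per-PFN variant might look roughly
like this (untested, names made up for illustration):

	static DEFINE_XARRAY(hv_shared_pfns);
	unsigned long idx;
	void *entry;
	int ret;

	/* On decrypt: remember each PFN */
	ret = xa_err(xa_store(&hv_shared_pfns, pfn, xa_mk_value(1), GFP_KERNEL));

	/* On re-encrypt: forget it */
	xa_erase(&hv_shared_pfns, pfn);

	/* On kexec/kdump: walk everything still shared */
	xa_for_each(&hv_shared_pfns, idx, entry)
		/* revoke host access for PFN 'idx' */;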

> +
> +static LIST_HEAD(hv_list_enc);
> +static DEFINE_RAW_SPINLOCK(hv_list_enc_lock);
> +
> +static int hv_list_enc_add(const u64 *pfn_list, int count)
> +{
> +	struct hv_enc_pfn_region *ent;
> +	unsigned long flags;
> +	bool found = false;
> +	u64 pfn;
> +	int i;
> +
> +	for (i = 0; i < count; i++) {
> +		pfn = pfn_list[i];
> +
> +		found = false;
> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
> +		list_for_each_entry(ent, &hv_list_enc, list) {
> +			if ((ent->pfn <= pfn) && (ent->pfn + ent->count - 1 >= pfn)) {
> +				/* Nothin to do - pfn is already in the list */

s/Nothin/Nothing/

> +				found = true;
> +			} else if (ent->pfn + ent->count == pfn) {
> +				/* Grow existing region up */
> +				found = true;
> +				ent->count++;
> +			} else if (pfn + 1 == ent->pfn) {
> +				/* Grow existing region down */
> +				found = true;
> +				ent->pfn--;
> +				ent->count++;
> +			}

Observations that might be worth a comment here in the code:
After a region is grown up or down, there's no check to see if the
region is now adjacent to an existing region. Additionally, if a PFN
that is already in some region is added, it might get appended to
some other adjacent region that occurs earlier in the list, rather than
being recognized as a duplicate. Hence the PFN could be included
in two different regions.

> +		};
> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +
> +		if (found)
> +			continue;
> +
> +		/* No adajacent region found -- creating a new one */

s/adajacent/adjacent/

> +		ent = kzalloc(sizeof(struct hv_enc_pfn_region), GFP_KERNEL);
> +		if (!ent)
> +			return -ENOMEM;
> +
> +		ent->pfn = pfn;
> +		ent->count = 1;
> +
> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
> +		list_add(&ent->list, &hv_list_enc);
> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +	}
> +
> +	return 0;
> +}
> +
> +static void hv_list_enc_remove(const u64 *pfn_list, int count)
> +{
> +	struct hv_enc_pfn_region *ent, *t;
> +	unsigned long flags;
> +	u64 pfn;
> +	int i;
> +
> +	for (i = 0; i < count; i++) {
> +		pfn = pfn_list[i];
> +
> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
> +		list_for_each_entry_safe(ent, t, &hv_list_enc, list) {
> +			if (pfn == ent->pfn + count - 1) {

This should be:
			if (pfn == ent->pfn + ent->count - 1) {

> +				/* Removing tail pfn */
> +				ent->count--;
> +				if (!ent->count) {
> +					list_del(&ent->list);
> +					kfree(ent);
> +				}
> +			} else if (pfn == ent->pfn) {
> +				/* Removing head pfn */
> +				ent->count--;
> +				ent->pfn++;
> +				if (!ent->count) {
> +					list_del(&ent->list);
> +					kfree(ent);
> +				}

Apropos my comment on hv_list_enc_add(), if a PFN does appear in
more than one region, this code removes it from all such regions.

> +			}
> +
> +			/*
> +			 * Removing PFNs in the middle of a region is not implemented; the
> +			 * list is currently only used for cleanup upon kexec and there's
> +			 * no harm done if we issue an extra unneeded hypercall making some
> +			 * region encrypted when it already is.
> +			 */

In working with Hyper-V CVMs, I have never been entirely clear on whether the
HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY hypercall is idempotent.
Consequently, in other parts of the code, we've made sure not to re-encrypt
memory that is already encrypted. There may have been some issues back in the
early days of CVMs that led me to think that it is not idempotent, but I don't
remember for sure.

Do you have a particular basis for asserting that it is idempotent? I just ran an
experiment on a TDX and a SEV-SNP VM in Azure, and the behavior is idempotent
in both cases, so that's good. But both are configurations with a paravisor, which
intercepts the hypercall and then makes its own decision about whether to invoke
the hypervisor. I don't have the ability to run configurations with no paravisor, and
see whether the hypercall as implemented by the hypervisor is idempotent. Also,
there's the new OpenHCL paravisor that similarly intercepts the hypercall, and 
its behavior could be different.

Lacking a spec for any of this, it's hard to know what behavior can be depended
upon. Probably should get clarity from someone at Microsoft who can check.

> +		};
> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +	}
> +}
> +
> +void hv_ivm_clear_host_access(void)
> +{
> +	struct hv_gpa_range_for_visibility *input;
> +	struct hv_enc_pfn_region *ent;
> +	unsigned long flags;
> +	u64 hv_status;
> +	int cur, i;
> +
> +	if (!hv_is_isolation_supported())
> +		return;
> +
> +	raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
> +
> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	if (!input)
> +		goto unlock;

The latest hyperv-next tree has code changes in how the
per-cpu hypercall input arg is handled. Check it for examples.

> +
> +	list_for_each_entry(ent, &hv_list_enc, list) {
> +		for (i = 0, cur = 0; i < ent->count; i++) {
> +			input->gpa_page_list[cur] = ent->pfn + i;
> +			cur++;
> +
> +			if (cur == HV_MAX_MODIFY_GPA_REP_COUNT || i == ent->count - 1) {
> +				input->partition_id = HV_PARTITION_ID_SELF;
> +				input->host_visibility = VMBUS_PAGE_NOT_VISIBLE;
> +				input->reserved0 = 0;
> +				input->reserved1 = 0;
> +				hv_status = hv_do_rep_hypercall(
> +					HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY,
> +					cur, 0, input, NULL);
> +				WARN_ON_ONCE(!hv_result_success(hv_status));
> +				cur = 0;
> +			}
> +		}
> +
> +	};
> +
> +unlock:
> +	raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
> +}
> +EXPORT_SYMBOL_GPL(hv_ivm_clear_host_access);
> +
>  /*
>   * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
>   *
> @@ -475,6 +619,7 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>  	struct hv_gpa_range_for_visibility *input;
>  	u64 hv_status;
>  	unsigned long flags;
> +	int ret;
> 
>  	/* no-op if partition isolation is not enabled */
>  	if (!hv_is_isolation_supported())
> @@ -486,6 +631,14 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>  		return -EINVAL;
>  	}
> 
> +	if (visibility == VMBUS_PAGE_NOT_VISIBLE) {
> +		hv_list_enc_remove(pfn, count);
> +	} else {
> +		ret = hv_list_enc_add(pfn, count);
> +		if (ret)
> +			return ret;
> +	}

What's the strategy if there's a failure from the hypercall
further down in this function? The list could then be out-of-sync
with what the paravisor/hypervisor thinks.

> +
>  	local_irq_save(flags);
>  	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
> 
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index abc4659f5809..6a988001e46f 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -263,10 +263,12 @@ static inline int hv_snp_boot_ap(u32 apic_id, unsigned long
> start_ip,
>  void hv_vtom_init(void);
>  void hv_ivm_msr_write(u64 msr, u64 value);
>  void hv_ivm_msr_read(u64 msr, u64 *value);
> +void hv_ivm_clear_host_access(void);
>  #else
>  static inline void hv_vtom_init(void) {}
>  static inline void hv_ivm_msr_write(u64 msr, u64 value) {}
>  static inline void hv_ivm_msr_read(u64 msr, u64 *value) {}
> +static inline void hv_ivm_clear_host_access(void) {}
>  #endif
> 
>  static inline bool hv_is_synic_msr(unsigned int reg)
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 2ed5a1e89d69..2e891e108218 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2784,6 +2784,7 @@ static void hv_kexec_handler(void)
>  	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
>  	mb();
>  	cpuhp_remove_state(hyperv_cpuhp_online);

At this point, all vCPUs are still running. Changing the state of decrypted pages
to encrypted has the potential to upset code running on those other vCPUs.
They might try to access a page that has become encrypted using a PTE that
indicates decrypted. And changing a page from decrypted to encrypted changes
the memory contents of the page that would be seen by the other vCPU.
Either situation could cause a panic, and ruin the kexec().

It seems to me that it would be safer to do hv_ivm_clear_host_access()
at the beginning of hyperv_cleanup(), before clearing the guest OS ID
and the hypercall page. But maybe there's a reason that doesn't work
that I'm missing.
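
I.e., something like (sketch):

	void hyperv_cleanup(void)
	{
		/* Revoke host access while hypercalls still work */
		hv_ivm_clear_host_access();

		/* Reset our OS id */
		...
	}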

> +	hv_ivm_clear_host_access();
>  };
> 
>  static void hv_crash_handler(struct pt_regs *regs)
> @@ -2799,6 +2800,7 @@ static void hv_crash_handler(struct pt_regs *regs)
>  	cpu = smp_processor_id();
>  	hv_stimer_cleanup(cpu);
>  	hv_synic_disable_regs(cpu);

Same here about waiting until only one vCPU is running.

> +	hv_ivm_clear_host_access();
>  };
> 
>  static int hv_synic_suspend(void)
> --
> 2.50.0



* Re: [PATCH v2] x86/hyperv: Fix kdump on Azure CVMs
  2025-08-18  9:54 [PATCH v2] x86/hyperv: Fix kdump on Azure CVMs Vitaly Kuznetsov
  2025-08-19 17:30 ` Michael Kelley
@ 2025-08-19 21:19 ` kernel test robot
  1 sibling, 0 replies; 5+ messages in thread
From: kernel test robot @ 2025-08-19 21:19 UTC
  To: Vitaly Kuznetsov, linux-hyperv
  Cc: oe-kbuild-all, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, x86, linux-kernel, Nuno Das Neves, Tianyu Lan,
	Michael Kelley, Li Tian, Philipp Rudo

Hi Vitaly,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/x86/core]
[also build test WARNING on arm64/for-next/core tip/master linus/master v6.17-rc2]
[cannot apply to tip/auto-latest next-20250819]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Vitaly-Kuznetsov/x86-hyperv-Fix-kdump-on-Azure-CVMs/20250818-175830
base:   tip/x86/core
patch link:    https://lore.kernel.org/r/20250818095400.1610209-1-vkuznets%40redhat.com
patch subject: [PATCH v2] x86/hyperv: Fix kdump on Azure CVMs
config: x86_64-randconfig-101-20250819 (https://download.01.org/0day-ci/archive/20250820/202508200507.78h11riS-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202508200507.78h11riS-lkp@intel.com/

cocci warnings: (new ones prefixed by >>)
>> arch/x86/hyperv/ivm.c:601:2-3: Unneeded semicolon
   arch/x86/hyperv/ivm.c:504:3-4: Unneeded semicolon
   arch/x86/hyperv/ivm.c:561:3-4: Unneeded semicolon

vim +601 arch/x86/hyperv/ivm.c

   565	
   566	void hv_ivm_clear_host_access(void)
   567	{
   568		struct hv_gpa_range_for_visibility *input;
   569		struct hv_enc_pfn_region *ent;
   570		unsigned long flags;
   571		u64 hv_status;
   572		int cur, i;
   573	
   574		if (!hv_is_isolation_supported())
   575			return;
   576	
   577		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
   578	
   579		input = *this_cpu_ptr(hyperv_pcpu_input_arg);
   580		if (!input)
   581			goto unlock;
   582	
   583		list_for_each_entry(ent, &hv_list_enc, list) {
   584			for (i = 0, cur = 0; i < ent->count; i++) {
   585				input->gpa_page_list[cur] = ent->pfn + i;
   586				cur++;
   587	
   588				if (cur == HV_MAX_MODIFY_GPA_REP_COUNT || i == ent->count - 1) {
   589					input->partition_id = HV_PARTITION_ID_SELF;
   590					input->host_visibility = VMBUS_PAGE_NOT_VISIBLE;
   591					input->reserved0 = 0;
   592					input->reserved1 = 0;
   593					hv_status = hv_do_rep_hypercall(
   594						HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY,
   595						cur, 0, input, NULL);
   596					WARN_ON_ONCE(!hv_result_success(hv_status));
   597					cur = 0;
   598				}
   599			}
   600	
 > 601		};
   602	
   603	unlock:
   604		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
   605	}
   606	EXPORT_SYMBOL_GPL(hv_ivm_clear_host_access);
   607	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* RE: [PATCH v2] x86/hyperv: Fix kdump on Azure CVMs
  2025-08-19 17:30 ` Michael Kelley
@ 2025-08-21  9:36   ` Vitaly Kuznetsov
  2025-08-21 15:45     ` Michael Kelley
  0 siblings, 1 reply; 5+ messages in thread
From: Vitaly Kuznetsov @ 2025-08-21  9:36 UTC
  To: Michael Kelley, linux-hyperv@vger.kernel.org
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	x86@kernel.org, linux-kernel@vger.kernel.org, Nuno Das Neves,
	Tianyu Lan, Li Tian, Philipp Rudo

Michael Kelley <mhklinux@outlook.com> writes:

> From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Monday, August 18, 2025 2:54 AM
>> 
>> Azure CVM instance types featuring a paravisor hang upon kdump. The
>> investigation shows that makedumpfile causes a hang when it steps on a page
>> which was previously shared with the host
>> (HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY). The new kernel has no
>> knowledge of these 'special' regions (which are Vmbus connection pages,
>> GPADL buffers, ...). There are several ways to approach the issue:
>> - Convey the knowledge about these regions to the new kernel somehow.
>> - Unshare these regions before accessing them in the new kernel (it is unclear
>> if there's a way to query the status for a given GPA range).
>> - Unshare these regions before jumping to the new kernel (which this patch
>> implements).
>> 
>> To make the procedure as robust as possible, store PFN ranges of shared
>> regions in a linked list instead of storing GVAs and re-using
>> hv_vtom_set_host_visibility(). This also avoids memory allocation
>> on the kdump/kexec path.
>> 
>> The patch skips implementing the weird corner case in hv_list_enc_remove()
>> when a PFN in the middle of a region is unshared. First, it is unlikely
>> that such requests happen. Second, it is not a big problem if
>> hv_list_enc_remove() doesn't actually remove some regions as this will
>> only result in an extra hypercall doing nothing upon kexec/kdump; there's
>> no need to be perfect.
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> ---
>> Changes since v1:
>>  - fix build on ARM [kernel test robot]
>> ---
>>  arch/arm64/include/asm/mshyperv.h |   3 +
>>  arch/x86/hyperv/ivm.c             | 153 ++++++++++++++++++++++++++++++
>>  arch/x86/include/asm/mshyperv.h   |   2 +
>>  drivers/hv/vmbus_drv.c            |   2 +
>>  4 files changed, 160 insertions(+)
>> 
>> diff --git a/arch/arm64/include/asm/mshyperv.h
>> b/arch/arm64/include/asm/mshyperv.h
>> index b721d3134ab6..af11abf403b4 100644
>> --- a/arch/arm64/include/asm/mshyperv.h
>> +++ b/arch/arm64/include/asm/mshyperv.h
>> @@ -53,6 +53,9 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
>>  	return hv_get_msr(reg);
>>  }
>> 
>> +/* Isolated VMs are unsupported on ARM, no cleanup needed */
>> +static inline void hv_ivm_clear_host_access(void) {}
>

Michael!

Thanks a bunch for your detailed review!

> Stubs such as this should be handled differently. We've instead
> put __weak stubs in drivers/hv/hv_common.c, and let x86 code
> override. That approach avoids needing to update arch/arm64
> code and to get acks from arm64 maintainers for functionality that
> is (currently) x86-only. arch/arm64/include/asm/mshyperv.h is
> pretty small because of this approach.
>
> For consistency, this stub should follow that existing pattern. See
> hv_is_isolation_supported() as an example.
>

Sure, will switch to __weak.

>> +
>>  /* SMCCC hypercall parameters */
>>  #define HV_SMCCC_FUNC_NUMBER	1
>>  #define HV_FUNC_ID	ARM_SMCCC_CALL_VAL(			\
>> diff --git a/arch/x86/hyperv/ivm.c b/arch/x86/hyperv/ivm.c
>> index ade6c665c97e..a6e614672836 100644
>> --- a/arch/x86/hyperv/ivm.c
>> +++ b/arch/x86/hyperv/ivm.c
>> @@ -462,6 +462,150 @@ void hv_ivm_msr_read(u64 msr, u64 *value)
>>  		hv_ghcb_msr_read(msr, value);
>>  }
>> 
>> +/*
>> + * Keep track of the PFN regions which were shared with the host. The access
>> + * must be revoked upon kexec/kdump (see hv_ivm_clear_host_access()).
>> + */
>> +struct hv_enc_pfn_region {
>> +	struct list_head list;
>> +	u64 pfn;
>> +	int count;
>> +};
>
> I'm wondering if there's an existing kernel data structure that would handle
> the requirements here. Did you look at using xarray()? It's probably not as
> memory efficient since it presumably needs a separate entry for each PFN,
> whereas your code below uses a single entry for a range of PFNs. But
> maybe that's a worthwhile tradeoff to simplify the code and avoid some
> of the messy issues I point out below.  Just a thought ....

I thought about it before I looked at what these regions really look
like. Here's what I see on a DC2ads instance upon kdump (with debug
printk added):

[   37.255921] hv_ivm_clear_host_access: PFN_START: 102548 COUNT:8
[   37.256833] hv_ivm_clear_host_access: PFN_START: 10bc60 COUNT:16
[   37.257743] hv_ivm_clear_host_access: PFN_START: 10bd00 COUNT:256
[   37.259177] hv_ivm_clear_host_access: PFN_START: 10ada0 COUNT:255
[   37.260639] hv_ivm_clear_host_access: PFN_START: 1097e8 COUNT:24
[   37.261630] hv_ivm_clear_host_access: PFN_START: 103ce3 COUNT:45
[   37.262741] hv_ivm_clear_host_access: PFN_START: 103ce1 COUNT:1

... 57 more items with 1-4 PFNs ...

[   37.320659] hv_ivm_clear_host_access: PFN_START: 103c98 COUNT:1
[   37.321611] hv_ivm_clear_host_access: PFN_START: 109d00 COUNT:4199
[   37.331656] hv_ivm_clear_host_access: PFN_START: 10957f COUNT:129
[   37.332902] hv_ivm_clear_host_access: PFN_START: 103c9b COUNT:2
[   37.333811] hv_ivm_clear_host_access: PFN_START: 1000 COUNT:256
[   37.335066] hv_ivm_clear_host_access: PFN_START: 100 COUNT:256
[   37.336340] hv_ivm_clear_host_access: PFN_START: 100e00 COUNT:256
[   37.337626] hv_ivm_clear_host_access: PFN_START: 7b000 COUNT:131072

Overall, the linked list contains 72 items of 32 bytes each, so we're
consuming 2k of extra memory. Handling of such a short list should also
be pretty fast.

If we switch to handling each PFN separately, that would be 136862
items. I'm not exactly sure about xarray's memory consumption but I'm
afraid we are talking megabytes for this case. This is the price
every CVM user is going to pay. Also, the chance of getting into
(interim) memory allocation problems is going to be much higher.

>
>> +
>> +static LIST_HEAD(hv_list_enc);
>> +static DEFINE_RAW_SPINLOCK(hv_list_enc_lock);
>> +
>> +static int hv_list_enc_add(const u64 *pfn_list, int count)
>> +{
>> +	struct hv_enc_pfn_region *ent;
>> +	unsigned long flags;
>> +	bool found = false;
>> +	u64 pfn;
>> +	int i;
>> +
>> +	for (i = 0; i < count; i++) {
>> +		pfn = pfn_list[i];
>> +
>> +		found = false;
>> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
>> +		list_for_each_entry(ent, &hv_list_enc, list) {
>> +			if ((ent->pfn <= pfn) && (ent->pfn + ent->count - 1 >= pfn)) {
>> +				/* Nothin to do - pfn is already in the list */
>
> s/Nothin/Nothing/
>
>> +				found = true;
>> +			} else if (ent->pfn + ent->count == pfn) {
>> +				/* Grow existing region up */
>> +				found = true;
>> +				ent->count++;
>> +			} else if (pfn + 1 == ent->pfn) {
>> +				/* Grow existing region down */
>> +				found = true;
>> +				ent->pfn--;
>> +				ent->count++;
>> +			}
>
> Observations that might be worth a comment here in the code:
> After a region is grown up or down, there's no check to see if the
> region is now adjacent to an existing region. Additionally, if a PFN
> that is already in some region is added, it might get appended to
> some other adjacent region that occurs earlier in the list, rather than
> being recognized as a duplicate. Hence the PFN could be included
> in two different regions.

Having two regions which can be merged should not be a problem but the
fact that we can have the same PFN in two regions is. I think I can
solve this by checking for existence first and doing the addition only
if the PFN was not found.
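
Roughly (untested):

	/* Pass 1: is the PFN already tracked? */
	found = false;
	list_for_each_entry(ent, &hv_list_enc, list) {
		if (ent->pfn <= pfn && pfn <= ent->pfn + ent->count - 1) {
			found = true;
			break;
		}
	}
	if (found)
		continue;

	/* Pass 2: only now try to grow an adjacent region or allocate */
	...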

>
>> +		};
>> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +
>> +		if (found)
>> +			continue;
>> +
>> +		/* No adajacent region found -- creating a new one */
>
> s/adajacent/adjacent/
>
>> +		ent = kzalloc(sizeof(struct hv_enc_pfn_region), GFP_KERNEL);
>> +		if (!ent)
>> +			return -ENOMEM;
>> +
>> +		ent->pfn = pfn;
>> +		ent->count = 1;
>> +
>> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
>> +		list_add(&ent->list, &hv_list_enc);
>> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void hv_list_enc_remove(const u64 *pfn_list, int count)
>> +{
>> +	struct hv_enc_pfn_region *ent, *t;
>> +	unsigned long flags;
>> +	u64 pfn;
>> +	int i;
>> +
>> +	for (i = 0; i < count; i++) {
>> +		pfn = pfn_list[i];
>> +
>> +		raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
>> +		list_for_each_entry_safe(ent, t, &hv_list_enc, list) {
>> +			if (pfn == ent->pfn + count - 1) {
>
> This should be:
> 			if (pfn == ent->pfn + ent->count - 1) {
>
>> +				/* Removing tail pfn */
>> +				ent->count--;
>> +				if (!ent->count) {
>> +					list_del(&ent->list);
>> +					kfree(ent);
>> +				}
>> +			} else if (pfn == ent->pfn) {
>> +				/* Removing head pfn */
>> +				ent->count--;
>> +				ent->pfn++;
>> +				if (!ent->count) {
>> +					list_del(&ent->list);
>> +					kfree(ent);
>> +				}
>
> Apropos my comment on hv_list_enc_add(), if a PFN does appear in
> more than one region, this code removes it from all such regions.
>
>> +			}
>> +
>> +			/*
>> +			 * Removing PFNs in the middle of a region is not implemented; the
>> +			 * list is currently only used for cleanup upon kexec and there's
>> +			 * no harm done if we issue an extra unneeded hypercall making some
>> +			 * region encrypted when it already is.
>> +			 */
>
> In working with Hyper-V CVMs, I have never been entirely clear on whether the
> HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY hypercall is idempotent.
> Consequently, in other parts of the code, we've made sure not to re-encrypt
> memory that is already encrypted. There may have been some issues back in the
> early days of CVMs that led me to think that it is not idempotent, but I don't
> remember for sure.
>
> Do you have a particular basis for asserting that it is idempotent? I just ran an
> experiment on a TDX and a SEV-SNP VM in Azure, and the behavior is idempotent
> in both cases, so that's good. But both are configurations with a paravisor, which
> intercepts the hypercall and then makes its own decision about whether to invoke
> the hypervisor. I don't have the ability to run configurations with no paravisor, and
> see whether the hypercall as implemented by the hypervisor is idempotent. Also,
> there's the new OpenHCL paravisor that similarly intercepts the hypercall, and 
> its behavior could be different.
>
> Lacking a spec for any of this, it's hard to know what behavior can be depended
> upon. Probably should get clarity from someone at Microsoft who can
> check.

I'm only relying on the results of my experiments, unfortunately. It
would be great indeed to see HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY's 
documentation. At the same time, with your comments I'm leaning towards
implementing the corner case which I highlight above even though we
don't see such cases now: hypervisor's behavior may change in the
future, drivers may start doing weird things, ... Better safe than sorry.
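
For the middle-of-a-region case, hv_list_enc_remove() would need to
split the entry, something like this sketch (note the allocation would
happen under the raw spinlock, which needs extra care):

	} else if (pfn > ent->pfn && pfn < ent->pfn + ent->count - 1) {
		/* Split [ent->pfn, ent->pfn + ent->count - 1] around pfn */
		new = kzalloc(sizeof(*new), GFP_ATOMIC);
		if (!new)
			break;
		new->pfn = pfn + 1;
		new->count = ent->pfn + ent->count - 1 - pfn;
		ent->count = pfn - ent->pfn;
		list_add(&new->list, &hv_list_enc);
	}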

>
>> +		};
>> +		raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +	}
>> +}
>> +
>> +void hv_ivm_clear_host_access(void)
>> +{
>> +	struct hv_gpa_range_for_visibility *input;
>> +	struct hv_enc_pfn_region *ent;
>> +	unsigned long flags;
>> +	u64 hv_status;
>> +	int cur, i;
>> +
>> +	if (!hv_is_isolation_supported())
>> +		return;
>> +
>> +	raw_spin_lock_irqsave(&hv_list_enc_lock, flags);
>> +
>> +	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> +	if (!input)
>> +		goto unlock;
>
> The latest hyperv-next tree has code changes in how the
> per-cpu hypercall input arg is handled. Check it for examples.
>
>> +
>> +	list_for_each_entry(ent, &hv_list_enc, list) {
>> +		for (i = 0, cur = 0; i < ent->count; i++) {
>> +			input->gpa_page_list[cur] = ent->pfn + i;
>> +			cur++;
>> +
>> +			if (cur == HV_MAX_MODIFY_GPA_REP_COUNT || i == ent->count - 1) {
>> +				input->partition_id = HV_PARTITION_ID_SELF;
>> +				input->host_visibility = VMBUS_PAGE_NOT_VISIBLE;
>> +				input->reserved0 = 0;
>> +				input->reserved1 = 0;
>> +				hv_status = hv_do_rep_hypercall(
>> +					HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY,
>> +					cur, 0, input, NULL);
>> +				WARN_ON_ONCE(!hv_result_success(hv_status));
>> +				cur = 0;
>> +			}
>> +		}
>> +
>> +	};
>> +
>> +unlock:
>> +	raw_spin_unlock_irqrestore(&hv_list_enc_lock, flags);
>> +}
>> +EXPORT_SYMBOL_GPL(hv_ivm_clear_host_access);
>> +
>>  /*
>>   * hv_mark_gpa_visibility - Set pages visible to host via hvcall.
>>   *
>> @@ -475,6 +619,7 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>>  	struct hv_gpa_range_for_visibility *input;
>>  	u64 hv_status;
>>  	unsigned long flags;
>> +	int ret;
>> 
>>  	/* no-op if partition isolation is not enabled */
>>  	if (!hv_is_isolation_supported())
>> @@ -486,6 +631,14 @@ static int hv_mark_gpa_visibility(u16 count, const u64 pfn[],
>>  		return -EINVAL;
>>  	}
>> 
>> +	if (visibility == VMBUS_PAGE_NOT_VISIBLE) {
>> +		hv_list_enc_remove(pfn, count);
>> +	} else {
>> +		ret = hv_list_enc_add(pfn, count);
>> +		if (ret)
>> +			return ret;
>> +	}
>
> What's the strategy if there's a failure from the hypercall
> further down in this function? The list could then be out-of-sync
> with what the paravisor/hypervisor thinks.

I wrote this under the assumption that extra items on the list are not a
problem, but this is changing. What we don't really know is what a
HVCALL_MODIFY_SPARSE_GPA_PAGE_HOST_VISIBILITY failure actually means: if
we gave it a list of PFNs and it managed to change the visibility of
some of them before it hit a problem, will it revert back or do we end
up in an inconsistent state? If it's the latter, I think broken
accounting is not the only problem we have. So let's assume it's 'all or
nothing'.
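
With that assumption, undoing the accounting when the hypercall fails
should be enough to keep the list in sync, something like (sketch):

	if (!hv_result_success(hv_status)) {
		if (visibility != VMBUS_PAGE_NOT_VISIBLE)
			hv_list_enc_remove(pfn, count);
		return -EFAULT;
	}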

>
>> +
>>  	local_irq_save(flags);
>>  	input = *this_cpu_ptr(hyperv_pcpu_input_arg);
>> 
>> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
>> index abc4659f5809..6a988001e46f 100644
>> --- a/arch/x86/include/asm/mshyperv.h
>> +++ b/arch/x86/include/asm/mshyperv.h
>> @@ -263,10 +263,12 @@ static inline int hv_snp_boot_ap(u32 apic_id, unsigned long
>> start_ip,
>>  void hv_vtom_init(void);
>>  void hv_ivm_msr_write(u64 msr, u64 value);
>>  void hv_ivm_msr_read(u64 msr, u64 *value);
>> +void hv_ivm_clear_host_access(void);
>>  #else
>>  static inline void hv_vtom_init(void) {}
>>  static inline void hv_ivm_msr_write(u64 msr, u64 value) {}
>>  static inline void hv_ivm_msr_read(u64 msr, u64 *value) {}
>> +static inline void hv_ivm_clear_host_access(void) {}
>>  #endif
>> 
>>  static inline bool hv_is_synic_msr(unsigned int reg)
>> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
>> index 2ed5a1e89d69..2e891e108218 100644
>> --- a/drivers/hv/vmbus_drv.c
>> +++ b/drivers/hv/vmbus_drv.c
>> @@ -2784,6 +2784,7 @@ static void hv_kexec_handler(void)
>>  	/* Make sure conn_state is set as hv_synic_cleanup checks for it */
>>  	mb();
>>  	cpuhp_remove_state(hyperv_cpuhp_online);
>
> At this point, all vCPUs are still running. Changing the state of decrypted pages
> to encrypted has the potential to upset code running on those other vCPUs.
> They might try to access a page that has become encrypted using a PTE that
> indicates decrypted. And changing a page from decrypted to encrypted changes
> the memory contents of the page that would be seen by the other vCPU.
> Either situation could cause a panic, and ruin the kexec().
>
> It seems to me that it would be safer to do hv_ivm_clear_host_access()
> at the beginning of hyperv_cleanup(), before clearing the guest OS ID
> and the hypercall page. But maybe there's a reason that doesn't work
> that I'm missing.

I agree hyperv_cleanup() looks like a more suitable place, will move!

>
>> +	hv_ivm_clear_host_access();
>>  };
>> 
>>  static void hv_crash_handler(struct pt_regs *regs)
>> @@ -2799,6 +2800,7 @@ static void hv_crash_handler(struct pt_regs *regs)
>>  	cpu = smp_processor_id();
>>  	hv_stimer_cleanup(cpu);
>>  	hv_synic_disable_regs(cpu);
>
> Same here about waiting until only one vCPU is running.
>
>> +	hv_ivm_clear_host_access();
>>  };
>> 
>>  static int hv_synic_suspend(void)
>> --
>> 2.50.0
>

Thanks for the feedback, will try to address in v3!

-- 
Vitaly



* RE: [PATCH v2] x86/hyperv: Fix kdump on Azure CVMs
  2025-08-21  9:36   ` Vitaly Kuznetsov
@ 2025-08-21 15:45     ` Michael Kelley
  0 siblings, 0 replies; 5+ messages in thread
From: Michael Kelley @ 2025-08-21 15:45 UTC
  To: Vitaly Kuznetsov, linux-hyperv@vger.kernel.org
  Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	x86@kernel.org, linux-kernel@vger.kernel.org, Nuno Das Neves,
	Tianyu Lan, Li Tian, Philipp Rudo

From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Thursday, August 21, 2025 2:37 AM
> 
> Michael Kelley <mhklinux@outlook.com> writes:
> 
> > From: Vitaly Kuznetsov <vkuznets@redhat.com> Sent: Monday, August 18, 2025 2:54 AM

[snip]

> >>
> >> +/*
> >> + * Keep track of the PFN regions which were shared with the host. The access
> >> + * must be revoked upon kexec/kdump (see hv_ivm_clear_host_access()).
> >> + */
> >> +struct hv_enc_pfn_region {
> >> +	struct list_head list;
> >> +	u64 pfn;
> >> +	int count;
> >> +};
> >
> > I'm wondering if there's an existing kernel data structure that would handle
> > the requirements here. Did you look at using xarray()? It's probably not as
> > memory efficient since it presumably needs a separate entry for each PFN,
> > whereas your code below uses a single entry for a range of PFNs. But
> > maybe that's a worthwhile tradeoff to simplify the code and avoid some
> > of the messy issues I point out below.  Just a thought ....
> 
> I thought about it before I looked at what these regions really look
> like. Here's what I see on a DC2ads instance upon kdump (with debug
> printk added):
> 
> [   37.255921] hv_ivm_clear_host_access: PFN_START: 102548 COUNT:8
> [   37.256833] hv_ivm_clear_host_access: PFN_START: 10bc60 COUNT:16
> [   37.257743] hv_ivm_clear_host_access: PFN_START: 10bd00 COUNT:256
> [   37.259177] hv_ivm_clear_host_access: PFN_START: 10ada0 COUNT:255
> [   37.260639] hv_ivm_clear_host_access: PFN_START: 1097e8 COUNT:24
> [   37.261630] hv_ivm_clear_host_access: PFN_START: 103ce3 COUNT:45
> [   37.262741] hv_ivm_clear_host_access: PFN_START: 103ce1 COUNT:1
> 
> ... 57 more items with 1-4 PFNs ...
> 
> [   37.320659] hv_ivm_clear_host_access: PFN_START: 103c98 COUNT:1
> [   37.321611] hv_ivm_clear_host_access: PFN_START: 109d00 COUNT:4199
> [   37.331656] hv_ivm_clear_host_access: PFN_START: 10957f COUNT:129
> [   37.332902] hv_ivm_clear_host_access: PFN_START: 103c9b COUNT:2
> [   37.333811] hv_ivm_clear_host_access: PFN_START: 1000 COUNT:256
> [   37.335066] hv_ivm_clear_host_access: PFN_START: 100 COUNT:256
> [   37.336340] hv_ivm_clear_host_access: PFN_START: 100e00 COUNT:256
> [   37.337626] hv_ivm_clear_host_access: PFN_START: 7b000 COUNT:131072
> 
> Overall, the linked list contains 72 items of 32 bytes each, so we're
> consuming 2k of extra memory. Handling of such a short list should also
> be pretty fast.
> 
> If we switch to handling each PFN separately, that would be 136862
> items. I'm not exactly sure about xarray's memory consumption but I'm
> afraid we are talking megabytes for this case. This is the price
> every CVM user is going to pay. Also, the chance of getting into
> (interim) memory allocation problems is going to be much higher.
> 

OK, makes sense. The entry above with 4199 PFNs is probably the netvsc
receive buffer array. And the big entry with 131072 PFNs is almost certainly
the swiotlb, which I had forgotten about. That one really blows up the count,
and would be 262144 PFNs on bigger CVMs with 16 Gbytes or more of memory.
So, yes, having an xarray entry per PFN almost certainly gets too big.

Michael

