* [RFC PATCH 0/2] Support for Hyper-V's HvExtCallGetBootZeroedMemory()
@ 2026-01-12 11:31 Florian Schmidt
2026-01-12 11:31 ` [RFC PATCH 1/2] Add HvExtCallQueryCapabilities Florian Schmidt
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Florian Schmidt @ 2026-01-12 11:31 UTC (permalink / raw)
To: Paolo Bonzini, Zhao Liu, Marcelo Tosatti; +Cc: qemu-devel, Florian Schmidt
This patch series implements two new Hyper-V hypercalls. The primary goal is
to support "zeroed memory" enlightenment via HvExtCallGetBootZeroedMemory().
To do so, we also need to implement HvExtCallQueryCapabilities().
HvExtCallQueryCapabilities() is advertised through its own bit in EBX of
CPUID leaf 0x40000003; its return value then lets the guest query which
further "extended" hypercalls are available.
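From the guest's point of view, the discovery flow this series enables
looks roughly like the sketch below (cpuid() and hv_do_hypercall() stand
in for the guest's own helpers, and zeroed_ranges for its output buffer;
none of these are defined anywhere in this series):

    uint32_t eax, ebx, ecx, edx;
    cpuid(0x40000003, &eax, &ebx, &ecx, &edx);
    if (ebx & (1u << 20)) {                 /* HV_ENABLE_EXT_HYPERCALLS */
        uint64_t caps = 0;
        /* HvExtCallQueryCapabilities, call code 0x8001 */
        if (hv_do_hypercall(0x8001, &caps) == HV_STATUS_SUCCESS &&
            (caps & 1)) {  /* bit 0: HV_EXT_CAP_GET_BOOT_ZEROED_MEMORY */
            /* HvExtCallGetBootZeroedMemory, call code 0x8002, returns
             * the list of pre-zeroed GPA ranges */
            hv_do_hypercall(0x8002, &zeroed_ranges);
        }
    }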
HvExtCallGetBootZeroedMemory() allows a Windows guest to inquire which
areas of the memory provided by the hypervisor are already zeroed out,
meaning that there's no need for the guest to zero them again. This has
obvious performance benefits.
To get a rough idea of the performance benefit, consider a 4-core, 64GB
Windows 11 guest: an otherwise idle guest takes a bit less than 90
seconds from the start of the qemu process to finish zeroing and settle
down. Looking at CPU usage for each vCPU task 90 seconds after starting,
we get:
1559 498
1294 343
1451 314
4015 2729
By contrast, after enabling HvExtCallGetBootZeroedMemory(), CPU usage is
much lower:
1583 458
1441 361
1279 312
1337 264
These are taken from the respective /proc/<pid>/task/<tid>/stat entries,
denoting user and system time per "CPU X/KVM" task, in ticks.
In the first set, we can clearly see which vCPU the zeroing work fell to.
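(For completeness, a minimal sketch of how such a sample can be taken;
read_vcpu_ticks() is just an illustration, not part of the series. In
/proc/<pid>/task/<tid>/stat, utime and stime are fields 14 and 15,
counted after the parenthesised comm field, which is why the parsing
starts from the last ')':)

    #include <stdio.h>
    #include <string.h>

    /* Read utime/stime, in clock ticks, from a stat path. */
    static int read_vcpu_ticks(const char *path,
                               unsigned long *utime, unsigned long *stime)
    {
        char buf[512];
        FILE *f = fopen(path, "r");
        if (!f) {
            return -1;
        }
        if (!fgets(buf, sizeof(buf), f)) {
            fclose(f);
            return -1;
        }
        fclose(f);
        /* comm may contain spaces, so skip past the closing ')' first */
        char *p = strrchr(buf, ')');
        /* skip state, ppid, pgrp, session, tty_nr, tpgid, flags,
         * minflt, cminflt, majflt, cmajflt; then utime and stime */
        if (!p || sscanf(p + 1,
                         " %*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
                         utime, stime) != 2) {
            return -1;
        }
        return 0;
    }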
It also has the side benefit of not touching all pages at boot. This can be
useful if, for example, the VM is started with a balloon target already set.
In this case, we don't risk memory usage spikes on the host if the VM touches
all memory before the balloon can catch up.
Note that this is still an RFC series. There are a number of open questions,
most importantly, which memory areas to send back to the guest as
pre-zeroed. The Microsoft documentation is a bit sparse, but points out:
> Ranges can include memory that don’t exist and can overlap.
https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/hypercalls/hvextcallgetbootzeroedmemory
So for testing purposes, I just set all memory from 0 to the end of the
64-bit address space as pre-zeroed (a single range of 2^52 4KiB pages
starting at PFN 0). Anecdotally, this worked fine in my preliminary
testing so far, though it feels a bit cheeky. I'd appreciate
some suggestions here on how to handle this in the most robust way. For
example, I was worried about mapped memory, legacy craziness in the first MB of
address space, etc.
As a side note, I could not get this to work for hotplugging memory via adding
a memdimm0 at runtime; that memory would be zeroed out by the guest. It may
indeed just be about _BOOT_ZeroedMemory...
I also wonder whether HvExtCallQueryCapabilities() should have a prop bit,
like HvExtCallGetBootZeroedMemory(). Right now, HvExtCallQueryCapabilities()
is always enabled after this patch, and simply returns "no ext capabilities"
if HvExtCallGetBootZeroedMemory() is not enabled. This does technically
change guest-visible behaviour. [1] But then we'd have to implement a
dependency between the props: you can't have
HvExtCallGetBootZeroedMemory() without HvExtCallQueryCapabilities(),
because the guest would have no way to figure out that it's available.
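If we did add a prop bit for it, the dependency could presumably be
enforced with a check along these lines (a sketch only;
HYPERV_FEAT_EXT_HYPERCALLS is a made-up name for the hypothetical new
prop bit, while hyperv_feat_enabled() is the existing helper already
used elsewhere in this series):

    if (hyperv_feat_enabled(cpu, HYPERV_FEAT_BOOT_ZEROED_MEMORY) &&
        !hyperv_feat_enabled(cpu, HYPERV_FEAT_EXT_HYPERCALLS)) {
        error_setg(errp, "hv-boot-zeroed-mem requires hv-ext-hypercalls");
        return;
    }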
[1] There's a little quirk here, in case someone else gets stung by this:
Windows seems to have a long-standing bug/"feature", where it _always_
issues a HvExtCallQueryCapabilities() hypercall, even if it's not signalled
in the CPUID leaf. Except in those cases, it seems to ignore the return
value. See here for someone else who noticed that years ago:
https://lists.xenproject.org/archives/html/xen-devel/2017-03/msg02809.html
And it's still happening to this day.
Florian Schmidt (2):
Add HvExtCallQueryCapabilities
Add HvExtCallGetBootZeroedMemory
docs/system/i386/hyperv.rst | 5 +++
hw/hyperv/hyperv.c | 62 ++++++++++++++++++++++++++++++++
include/hw/hyperv/hyperv-proto.h | 12 +++++++
include/hw/hyperv/hyperv.h | 10 ++++++
target/i386/cpu.c | 2 ++
target/i386/cpu.h | 1 +
target/i386/kvm/hyperv-proto.h | 6 ++++
target/i386/kvm/hyperv.c | 24 +++++++++++++
target/i386/kvm/kvm.c | 3 ++
9 files changed, 125 insertions(+)
--
2.39.5
* [RFC PATCH 1/2] Add HvExtCallQueryCapabilities
2026-01-12 11:31 [RFC PATCH 0/2] Support for Hyper-V's HvExtCallGetBootZeroedMemory() Florian Schmidt
@ 2026-01-12 11:31 ` Florian Schmidt
2026-01-12 11:31 ` [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory Florian Schmidt
2026-02-02 14:26 ` [RFC PATCH 0/2] Support for Hyper-V's HvExtCallGetBootZeroedMemory() Florian Schmidt
2 siblings, 0 replies; 15+ messages in thread
From: Florian Schmidt @ 2026-01-12 11:31 UTC (permalink / raw)
To: Paolo Bonzini, Zhao Liu, Marcelo Tosatti; +Cc: qemu-devel, Florian Schmidt
On CPUID leaf 0x40000003, EBX bit 20 signals that we support
the HvExtCallQueryCapabilities hypercall. That hypercall returns a bit
field signalling which further extended hypercalls are supported (a way
to conserve CPUID leaf bits).
For now, we don't support any extended hypercalls, but we'll add
HvExtCallGetBootZeroedMemory in a followup patch.
Signed-off-by: Florian Schmidt <flosch@nutanix.com>
---
hw/hyperv/hyperv.c | 28 ++++++++++++++++++++++++++++
include/hw/hyperv/hyperv-proto.h | 1 +
include/hw/hyperv/hyperv.h | 5 +++++
target/i386/kvm/hyperv-proto.h | 1 +
target/i386/kvm/hyperv.c | 14 ++++++++++++++
target/i386/kvm/kvm.c | 3 +++
6 files changed, 52 insertions(+)
diff --git a/hw/hyperv/hyperv.c b/hw/hyperv/hyperv.c
index 27e323a819..13a42a68b2 100644
--- a/hw/hyperv/hyperv.c
+++ b/hw/hyperv/hyperv.c
@@ -699,6 +699,34 @@ int hyperv_set_event_flag_handler(uint32_t conn_id, EventNotifier *notifier)
return set_event_flag_handler(conn_id, notifier);
}
+uint16_t hyperv_ext_hcall_query_caps(uint64_t sup, uint64_t outgpa, bool fast)
+{
+ uint16_t ret;
+ uint64_t *supported = NULL;
+ hwaddr len;
+
+ if (fast) {
+ ret = HV_STATUS_INVALID_HYPERCALL_CODE;
+ goto cleanup;
+ }
+
+ len = sizeof(*supported);
+ supported = cpu_physical_memory_map(outgpa, &len, 1);
+ if (!supported || len < sizeof(*supported)) {
+ ret = HV_STATUS_INSUFFICIENT_MEMORY;
+ goto cleanup;
+ }
+
+ *supported = sup;
+ ret = HV_STATUS_SUCCESS;
+
+cleanup:
+ if (supported) {
+ cpu_physical_memory_unmap(supported, sizeof(*supported), 1, len);
+ }
+ return ret;
+}
+
uint16_t hyperv_hcall_signal_event(uint64_t param, bool fast)
{
EventFlagHandler *handler;
diff --git a/include/hw/hyperv/hyperv-proto.h b/include/hw/hyperv/hyperv-proto.h
index fffc5ce342..f1d1d2eb26 100644
--- a/include/hw/hyperv/hyperv-proto.h
+++ b/include/hw/hyperv/hyperv-proto.h
@@ -35,6 +35,7 @@
#define HV_POST_DEBUG_DATA 0x0069
#define HV_RETRIEVE_DEBUG_DATA 0x006a
#define HV_RESET_DEBUG_SESSION 0x006b
+#define HV_EXT_CALL_QUERY_CAPABILITIES 0x8001
#define HV_HYPERCALL_FAST (1u << 16)
/*
diff --git a/include/hw/hyperv/hyperv.h b/include/hw/hyperv/hyperv.h
index 63a8b65278..921e1623f7 100644
--- a/include/hw/hyperv/hyperv.h
+++ b/include/hw/hyperv/hyperv.h
@@ -96,6 +96,11 @@ uint16_t hyperv_hcall_retreive_dbg_data(uint64_t ingpa, uint64_t outgpa,
*/
uint16_t hyperv_hcall_post_dbg_data(uint64_t ingpa, uint64_t outgpa, bool fast);
+/*
+ * Process HVCALL_EXT_QUERY_CAPABILITIES hypercall.
+ */
+uint16_t hyperv_ext_hcall_query_caps(uint64_t sup, uint64_t outgpa, bool fast);
+
uint32_t hyperv_syndbg_send(uint64_t ingpa, uint32_t count);
uint32_t hyperv_syndbg_recv(uint64_t ingpa, uint32_t count);
void hyperv_syndbg_set_pending_page(uint64_t ingpa);
diff --git a/target/i386/kvm/hyperv-proto.h b/target/i386/kvm/hyperv-proto.h
index a9f056f2f3..4eb2955ac5 100644
--- a/target/i386/kvm/hyperv-proto.h
+++ b/target/i386/kvm/hyperv-proto.h
@@ -46,6 +46,7 @@
*/
#define HV_POST_MESSAGES (1u << 4)
#define HV_SIGNAL_EVENTS (1u << 5)
+#define HV_ENABLE_EXT_HYPERCALLS (1u << 20)
/*
* HV_CPUID_FEATURES.EDX bits
diff --git a/target/i386/kvm/hyperv.c b/target/i386/kvm/hyperv.c
index f7a81bd270..1ac5c26799 100644
--- a/target/i386/kvm/hyperv.c
+++ b/target/i386/kvm/hyperv.c
@@ -51,6 +51,15 @@ static void async_synic_update(CPUState *cs, run_on_cpu_data data)
bql_unlock();
}
+static uint64_t calc_supported_ext_hypercalls(X86CPU *cpu)
+{
+ uint64_t ret = 0;
+
+ /* For now, no extended hypercalls are supported. */
+
+ return ret;
+}
+
int kvm_hv_handle_exit(X86CPU *cpu, struct kvm_hyperv_exit *exit)
{
CPUX86State *env = &cpu->env;
@@ -108,6 +117,11 @@ int kvm_hv_handle_exit(X86CPU *cpu, struct kvm_hyperv_exit *exit)
exit->u.hcall.result =
hyperv_hcall_reset_dbg_session(out_param);
break;
+ case HV_EXT_CALL_QUERY_CAPABILITIES:
+ exit->u.hcall.result =
+ hyperv_ext_hcall_query_caps(calc_supported_ext_hypercalls(cpu),
+ out_param, fast);
+ break;
default:
exit->u.hcall.result = HV_STATUS_INVALID_HYPERCALL_CODE;
}
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 7b9b740a8e..5d8553ef0c 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1594,6 +1594,9 @@ static int hyperv_fill_cpuids(CPUState *cs,
/* Unconditionally required with any Hyper-V enlightenment */
c->eax |= HV_HYPERCALL_AVAILABLE;
+ /* No reason to not always support HvExtCallQueryCapabilities (?) */
+ c->ebx |= HV_ENABLE_EXT_HYPERCALLS;
+
/* SynIC and Vmbus devices require messages/signals hypercalls */
if (hyperv_feat_enabled(cpu, HYPERV_FEAT_SYNIC) &&
!cpu->hyperv_synic_kvm_only) {
--
2.39.5
* [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-01-12 11:31 [RFC PATCH 0/2] Support for Hyper-V's HvExtCallGetBootZeroedMemory() Florian Schmidt
2026-01-12 11:31 ` [RFC PATCH 1/2] Add HvExtCallQueryCapabilities Florian Schmidt
@ 2026-01-12 11:31 ` Florian Schmidt
2026-04-16 12:47 ` Paolo Bonzini
2026-02-02 14:26 ` [RFC PATCH 0/2] Support for Hyper-V's HvExtCallGetBootZeroedMemory() Florian Schmidt
2 siblings, 1 reply; 15+ messages in thread
From: Florian Schmidt @ 2026-01-12 11:31 UTC (permalink / raw)
To: Paolo Bonzini, Zhao Liu, Marcelo Tosatti; +Cc: qemu-devel, Florian Schmidt
This call allows a guest to ask the hypervisor which of its (guest
physical) memory ranges were already zeroed out by the hypervisor, which
means there's no need for the guest to zero them out again at boot.
For now, we simply send one entry back that says "all
64-bit-addressable memory is zeroed out". This seems to work fine.
Note that memory devices hotplugged later will still be zeroed out.
Signed-off-by: Florian Schmidt <flosch@nutanix.com>
---
docs/system/i386/hyperv.rst | 5 +++++
hw/hyperv/hyperv.c | 34 ++++++++++++++++++++++++++++++++
include/hw/hyperv/hyperv-proto.h | 11 +++++++++++
include/hw/hyperv/hyperv.h | 5 +++++
target/i386/cpu.c | 2 ++
target/i386/cpu.h | 1 +
target/i386/kvm/hyperv-proto.h | 5 +++++
target/i386/kvm/hyperv.c | 12 ++++++++++-
8 files changed, 74 insertions(+), 1 deletion(-)
diff --git a/docs/system/i386/hyperv.rst b/docs/system/i386/hyperv.rst
index 1c1de77feb..7d2be2109e 100644
--- a/docs/system/i386/hyperv.rst
+++ b/docs/system/i386/hyperv.rst
@@ -256,6 +256,11 @@ Existing enlightenments
Recommended: ``hv-evmcs`` (Intel)
+``hv-boot-zeroed-mem``
+ Enables the HvExtGetBootZeroedMemory hypercall. This allows a Windows guest to
+ inquire which memory has already been zeroed out by the host and thus doesn't
+ need to be zeroed out at boot again.
+
Supplementary features
----------------------
diff --git a/hw/hyperv/hyperv.c b/hw/hyperv/hyperv.c
index 13a42a68b2..d1b15c089e 100644
--- a/hw/hyperv/hyperv.c
+++ b/hw/hyperv/hyperv.c
@@ -727,6 +727,40 @@ cleanup:
return ret;
}
+uint16_t hyperv_ext_hcall_get_boot_zeroed_memory(uint64_t outgpa, bool fast)
+{
+ uint16_t ret;
+ struct hyperv_get_boot_zeroed_memory_output *zero_ranges = NULL;
+ hwaddr len;
+
+ if (fast) {
+ ret = HV_STATUS_INVALID_HYPERCALL_CODE;
+ goto cleanup;
+ }
+
+ len = sizeof(*zero_ranges);
+ zero_ranges = cpu_physical_memory_map(outgpa, &len, 1);
+ if (!zero_ranges || len < sizeof(*zero_ranges)) {
+ ret = HV_STATUS_INSUFFICIENT_MEMORY;
+ goto cleanup;
+ }
+
+ /*
+ * All memory we pass through will always be zeroed.
+ * (Check if that's actually true!)
+ */
+ zero_ranges->range_count = 1;
+ zero_ranges->ranges[0].start_pfn = 0x0;
+ zero_ranges->ranges[0].page_count = 0x10000000000000;
+ ret = HV_STATUS_SUCCESS;
+
+cleanup:
+ if (zero_ranges) {
+ cpu_physical_memory_unmap(zero_ranges, sizeof(*zero_ranges), 1, len);
+ }
+ return ret;
+}
+
uint16_t hyperv_hcall_signal_event(uint64_t param, bool fast)
{
EventFlagHandler *handler;
diff --git a/include/hw/hyperv/hyperv-proto.h b/include/hw/hyperv/hyperv-proto.h
index f1d1d2eb26..5bf5684d11 100644
--- a/include/hw/hyperv/hyperv-proto.h
+++ b/include/hw/hyperv/hyperv-proto.h
@@ -36,6 +36,7 @@
#define HV_RETRIEVE_DEBUG_DATA 0x006a
#define HV_RESET_DEBUG_SESSION 0x006b
#define HV_EXT_CALL_QUERY_CAPABILITIES 0x8001
+#define HV_EXT_CALL_GET_BOOT_ZEROED_MEMORY 0x8002
#define HV_HYPERCALL_FAST (1u << 16)
/*
@@ -192,4 +193,14 @@ struct hyperv_retrieve_debug_data_output {
uint32_t retrieved_count;
uint32_t remaining_count;
} __attribute__ ((__packed__));
+
+struct hyperv_get_boot_zeroed_memory_range {
+ uint64_t start_pfn;
+ uint64_t page_count;
+} __attribute__ ((__packed__));
+
+struct hyperv_get_boot_zeroed_memory_output {
+ uint64_t range_count;
+ struct hyperv_get_boot_zeroed_memory_range ranges[255];
+} __attribute__ ((__packed__));
#endif
diff --git a/include/hw/hyperv/hyperv.h b/include/hw/hyperv/hyperv.h
index 921e1623f7..54cd2fff72 100644
--- a/include/hw/hyperv/hyperv.h
+++ b/include/hw/hyperv/hyperv.h
@@ -101,6 +101,11 @@ uint16_t hyperv_hcall_post_dbg_data(uint64_t ingpa, uint64_t outgpa, bool fast);
*/
uint16_t hyperv_ext_hcall_query_caps(uint64_t sup, uint64_t outgpa, bool fast);
+/*
+ * Process HVCALL_EXT_GET_BOOT_ZEROED_MEMORY hypercall.
+ */
+uint16_t hyperv_ext_hcall_get_boot_zeroed_memory(uint64_t outgpa, bool fast);
+
uint32_t hyperv_syndbg_send(uint64_t ingpa, uint32_t count);
uint32_t hyperv_syndbg_recv(uint64_t ingpa, uint32_t count);
void hyperv_syndbg_set_pending_page(uint64_t ingpa);
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 37803cd724..d4160f3334 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -10492,6 +10492,8 @@ static const Property x86_cpu_properties[] = {
HYPERV_FEAT_TLBFLUSH_EXT, 0),
DEFINE_PROP_BIT64("hv-tlbflush-direct", X86CPU, hyperv_features,
HYPERV_FEAT_TLBFLUSH_DIRECT, 0),
+ DEFINE_PROP_BIT64("hv-boot-zeroed-mem", X86CPU, hyperv_features,
+ HYPERV_FEAT_BOOT_ZEROED_MEMORY, 0),
DEFINE_PROP_ON_OFF_AUTO("hv-no-nonarch-coresharing", X86CPU,
hyperv_no_nonarch_cs, ON_OFF_AUTO_OFF),
#ifdef CONFIG_SYNDBG
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index 2bbc977d90..a42eacd800 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -1463,6 +1463,7 @@ uint64_t x86_cpu_get_supported_feature_word(X86CPU *cpu, FeatureWord w);
#define HYPERV_FEAT_XMM_INPUT 18
#define HYPERV_FEAT_TLBFLUSH_EXT 19
#define HYPERV_FEAT_TLBFLUSH_DIRECT 20
+#define HYPERV_FEAT_BOOT_ZEROED_MEMORY 21
#ifndef HYPERV_SPINLOCK_NEVER_NOTIFY
#define HYPERV_SPINLOCK_NEVER_NOTIFY 0xFFFFFFFF
diff --git a/target/i386/kvm/hyperv-proto.h b/target/i386/kvm/hyperv-proto.h
index 4eb2955ac5..ec38b717e4 100644
--- a/target/i386/kvm/hyperv-proto.h
+++ b/target/i386/kvm/hyperv-proto.h
@@ -94,6 +94,11 @@
#define HV_NESTED_DIRECT_FLUSH (1u << 17)
#define HV_NESTED_MSR_BITMAP (1u << 19)
+/*
+ * HV_EXT_CALL_QUERY_CAPABILITIES bits
+ */
+#define HV_EXT_CAP_GET_BOOT_ZEROED_MEMORY (1u << 0)
+
/*
* Basic virtualized MSRs
*/
diff --git a/target/i386/kvm/hyperv.c b/target/i386/kvm/hyperv.c
index 1ac5c26799..f92ea7e0a2 100644
--- a/target/i386/kvm/hyperv.c
+++ b/target/i386/kvm/hyperv.c
@@ -55,7 +55,9 @@ static uint64_t calc_supported_ext_hypercalls(X86CPU *cpu)
{
uint64_t ret = 0;
- /* For now, no extended hypercalls are supported. */
+ if (hyperv_feat_enabled(cpu, HYPERV_FEAT_BOOT_ZEROED_MEMORY)) {
+ ret |= HV_EXT_CAP_GET_BOOT_ZEROED_MEMORY;
+ }
return ret;
}
@@ -122,6 +124,14 @@ int kvm_hv_handle_exit(X86CPU *cpu, struct kvm_hyperv_exit *exit)
hyperv_ext_hcall_query_caps(calc_supported_ext_hypercalls(cpu),
out_param, fast);
break;
+ case HV_EXT_CALL_GET_BOOT_ZEROED_MEMORY:
+ if (!hyperv_feat_enabled(cpu, HYPERV_FEAT_BOOT_ZEROED_MEMORY)) {
+ exit->u.hcall.result = HV_STATUS_INVALID_HYPERCALL_CODE;
+ } else {
+ exit->u.hcall.result =
+ hyperv_ext_hcall_get_boot_zeroed_memory(out_param, fast);
+ }
+ break;
default:
exit->u.hcall.result = HV_STATUS_INVALID_HYPERCALL_CODE;
}
--
2.39.5
* Re: [RFC PATCH 0/2] Support for Hyper-V's HvExtCallGetBootZeroedMemory()
2026-01-12 11:31 [RFC PATCH 0/2] Support for Hyper-V's HvExtCallGetBootZeroedMemory() Florian Schmidt
2026-01-12 11:31 ` [RFC PATCH 1/2] Add HvExtCallQueryCapabilities Florian Schmidt
2026-01-12 11:31 ` [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory Florian Schmidt
@ 2026-02-02 14:26 ` Florian Schmidt
2026-02-23 11:23 ` Florian Schmidt
2 siblings, 1 reply; 15+ messages in thread
From: Florian Schmidt @ 2026-02-02 14:26 UTC (permalink / raw)
To: Paolo Bonzini, Zhao Liu, Marcelo Tosatti; +Cc: qemu-devel
On 2026-01-12 11:31, Florian Schmidt wrote:
> This patch series implements two new Hyper-V hypercalls.
Friendly ping on this. Any concerns, any suggestions about open
questions such as these below?
> Note that this is still an RFC series. There are a number of open questions,
> most importantly, which memory areas to send back to the guest as
> pre-zeroed.
> I also wonder whether HvExtCallQueryCapabilities() should have a prop bit,
Thanks,
Florian
* Re: [RFC PATCH 0/2] Support for Hyper-V's HvExtCallGetBootZeroedMemory()
2026-02-02 14:26 ` [RFC PATCH 0/2] Support for Hyper-V's HvExtCallGetBootZeroedMemory() Florian Schmidt
@ 2026-02-23 11:23 ` Florian Schmidt
0 siblings, 0 replies; 15+ messages in thread
From: Florian Schmidt @ 2026-02-23 11:23 UTC (permalink / raw)
To: Paolo Bonzini, Zhao Liu, Marcelo Tosatti; +Cc: qemu-devel
On 2026-02-02 14:26, Florian Schmidt wrote:
> On 2026-01-12 11:31, Florian Schmidt wrote:
>> This patch series implements two new Hyper-V hypercalls.
>
> Friendly ping on this. Any concerns, any suggestions about open
> questions such as these below?
>
>> Note that this is still an RFC series. There are a number of open
>> questions,
>> most importantly, which memory areas to send back to the guest as
>> pre-zeroed.
>
>> I also wonder whether HvExtCallQueryCapabilities() should have a prop
>> bit,
Another ping. Does anybody have cycles to give me some feedback on the
RFC items, so I can start working towards a proper v1 version?
Cheers,
Florian
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-01-12 11:31 ` [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory Florian Schmidt
@ 2026-04-16 12:47 ` Paolo Bonzini
2026-04-16 15:33 ` Florian Schmidt
2026-04-17 5:51 ` Marcelo Tosatti
0 siblings, 2 replies; 15+ messages in thread
From: Paolo Bonzini @ 2026-04-16 12:47 UTC (permalink / raw)
To: Florian Schmidt, Zhao Liu, Marcelo Tosatti; +Cc: qemu-devel
On 1/12/26 12:31, Florian Schmidt wrote:
> This call allows a guest to ask the hypervisor which of its (guest
> physical) memory ranges were already zeroed out by the hypervisor, which
> means there's no need for the guest to zero them out again at boot.
>
> For now, we simply send one entry back that says "all
> 64-bit-addressable memory is zeroed out". This seems to work fine.
This is risky and depends on the firmware.
As discussed on IRC, there are multiple sources of writes and each of
them needs to be tracked, and we can say that any write potentially
makes the page nonzero.
The main two are QEMU and the guest, i.e. KVM.
Dirty page tracking is suitable for QEMU, but not so much for KVM. When
QEMU enables dirty page tracking in KVM, it does so lazily, i.e. all
pages are assumed dirty at the beginning. This is exactly the opposite
of what you want here (almost all pages are zero at the beginning). So
maybe only enable this hypercall when dirty page tracking uses a ring
buffer?
But as far as QEMU is concerned you could indeed add a fourth dirty
memory bitmap, DIRTY_MEMORY_NONZERO, and turn it off after the first
call to the hypercall (hopefully Windows only calls it once)?
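(For reference, a sketch of what that fourth client could look like; the
first three defines are the existing ones from include/exec/ramlist.h,
the last two lines are the hypothetical change:)

    #define DIRTY_MEMORY_VGA       0
    #define DIRTY_MEMORY_CODE      1
    #define DIRTY_MEMORY_MIGRATION 2
    #define DIRTY_MEMORY_NONZERO   3  /* new: set once a page may be nonzero */
    #define DIRTY_MEMORY_NUM       4  /* bumped from 3 */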
As an alternative to tracking dirty pages in KVM, it would be possible
to ask KVM to build a bitmap of pages that were mapped, even at high
granularity (e.g. with 32MB you'd fit the bitmap for 1TB of memory in a
single page!). QEMU could use dirty page tracking, and build a
combination (NOR) of the bitmap from KVM and QEMU's dirty page bitmap.
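(Spelling out the arithmetic behind that granularity claim, with one bit
per chunk:

    1 TiB / 32 MiB = 2^40 / 2^25 = 32768 chunks, i.e. 32768 bits
    32768 bits / 8 = 4096 bytes = exactly one 4 KiB page of bitmap)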
QEMU could also do the same high-granularity tracking, but that leaves
out other sources of writes like VFIO or vhost, both of which probably
matter to Nutanix :) and which would be stuck with 4k-granularity dirty
page tracking.
Sorry for the delay, I was hoping to have some better ideas.
Paolo
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-04-16 12:47 ` Paolo Bonzini
@ 2026-04-16 15:33 ` Florian Schmidt
2026-04-16 16:04 ` Paolo Bonzini
2026-04-23 2:16 ` Marcelo Tosatti
2026-04-17 5:51 ` Marcelo Tosatti
1 sibling, 2 replies; 15+ messages in thread
From: Florian Schmidt @ 2026-04-16 15:33 UTC (permalink / raw)
To: Paolo Bonzini, Zhao Liu, Marcelo Tosatti; +Cc: qemu-devel
Hi Paolo, thank you for your reply!
On 2026-04-16 13:47, Paolo Bonzini wrote:
> As discussed on IRC, there are multiple sources of writes and each of
> them needs to be tracked, and we can say that any write potentially
> makes the page nonzero.
>
> But as far as QEMU is concerned you could indeed add a fourth dirty
> memory bitmap, DIRTY_MEMORY_NONZERO, and turn it off after the first
> call to the hypercall (hopefully Windows only calls it once)?
I'm not sure we can rely on that. I'd have to double-check. Crucially,
I'm pretty sure some Windows versions may call this more than once, and
which of those results it then uses is the big question: if it's always
the first one, we could stop there, return "nothing" as answer for
further ones, and be good. If it's not the first one... that's a
problem, because I don't think we want to do this kind of tracking for
the whole lifetime of the guest?
> As an alternative to tracking dirty pages in KVM, it would be possible
> to ask KVM to build a bitmap of pages that were mapped, even at high
> granularity (e.g. with 32MB you'd fit the bitmap for 1TB of memory in a
> single page!). QEMU could use dirty page tracking, and build a
> combination (NOR) of the bitmap from KVM and QEMU's dirty page bitmap.
>
> QEMU could also do the same high-granularity tracking, but that leaves
> out other sources of writes like VFIO or vhost, both of which probably
> matter to Nutanix :) and which would be stuck with 4k-granularity dirty
> page tracking.
Are you thinking of creating a new KVM ioctl that would do that for us?
That's possible, but... isn't good old mincore enough in this case,
since qemu knows the host-virtual addresses of the guest memory? At the
cost of 8 times as much memory for the (temporary) data structure, since
it's a byte per page.
One thing I was wondering about was races. Unless we pause the guest
while we're scanning the tables, the guest could touch pages as we scan.
But: at the point the hypercall is invoked by the guest, the guest OS is
up by definition. So at this point, the guest OS must be aware of any
memory that is put to use, correct? Even including vfio/vhost stuff,
since the buffers used to write data into the guest would have been set
aside for that purpose by the guest. So even if we overestimate and
announce some pages as pre-zeroed, that shouldn't matter if the guest OS
already handed them out for some usage (and pre-zeroed them in the
meantime). What we really care about is not announcing any pages as
pre-zeroed that are in fact dirty, *and* that the guest OS does not
realise were ever dirtied.
... Famous last words and I'm not 100% sure, I appreciate any thoughts
on this.
Cheers,
Florian
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-04-16 15:33 ` Florian Schmidt
@ 2026-04-16 16:04 ` Paolo Bonzini
2026-04-20 10:51 ` Florian Schmidt
2026-04-23 2:16 ` Marcelo Tosatti
1 sibling, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2026-04-16 16:04 UTC (permalink / raw)
To: Florian Schmidt; +Cc: Zhao Liu, Marcelo Tosatti, qemu-devel
On Thu, Apr 16, 2026 at 17:33, Florian Schmidt <flosch@nutanix.com> wrote:
> Hi Paolo, thank you for your reply!
> I'm not sure we can rely on that. I'd have to double-check. Crucially,
> I'm pretty sure some Windows versions may call this more than once, and
> which of those results it then uses is the big question: if it's always
> the first one, we could stop there, return "nothing" as answer for
> further ones, and be good. If it's not the first one... that's a
> problem, because I don't think we want to do this kind of tracking for
> the whole lifetime of the guest?
>
Indeed absolutely not.
> > As an alternative to tracking dirty pages in KVM, it would be possible
> > to ask KVM to build a bitmap of pages that were mapped, even at high
> > granularity (e.g. with 32MB you'd fit the bitmap for 1TB of memory in a
> > single page!). QEMU could use dirty page tracking, and build a
> > combination (NOR) of the bitmap from KVM and QEMU's dirty page bitmap.
> >
> > QEMU could also do the same high-granularity tracking, but that leaves
> > out other sources of writes like VFIO or vhost, both of which probably
> > matter to Nutanix :) and which would be stuck with 4k-granularity dirty
> > page tracking.
>
> Are you thinking of creating a new KVM ioctl that would do that for us?
> That's possible, but... isn't good old mincore enough in this case,
> since qemu knows the host-virtual addresses of the guest memory?
Yes and no, respectively. For example, if you use preallocation, mincore
will always return no, and if you swap, it will incorrectly return yes.
> One thing I was wondering about was races. Unless we pause the guest
> while we're scanning the tables, the guest could touch pages as we scan.
> But: at the point the hypercall is invoked by the guest, the guest OS is
> up by definition. So at this point, the guest OS must be aware of any
> memory that is put to use, correct?
Yes, it probably would go through its free page pool and decide whether to
move it to the zeroed page pool based on the answer. Though the info in my
copy of Tanenbaum is 25 years old at this point. :)
> Even including vfio/vhost stuff,
> since the buffers used to write data into the guest would have been set
> aside for that purpose by the guest. So even if we overestimate and
> announce some pages as pre-zeroed, that shouldn't matter if the guest OS
> already handed them out for some usage (and pre-zeroed them in the
> meantime). What we really care about is not announcing any pages as
> pre-zeroed that are in fact dirty, *and* that the guest OS does not
> realise were ever dirtied.
>
Yes. The risky part is stuff that was written to in the past (such as
while the firmware ran) by VFIO and vhost.
Paolo
> ... Famous last words and I'm not 100% sure, I appreciate any thoughts
> on this.
>
> Cheers,
> Florian
>
>
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-04-16 12:47 ` Paolo Bonzini
2026-04-16 15:33 ` Florian Schmidt
@ 2026-04-17 5:51 ` Marcelo Tosatti
2026-04-20 9:25 ` Florian Schmidt
1 sibling, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2026-04-17 5:51 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Florian Schmidt, Zhao Liu, qemu-devel
On Thu, Apr 16, 2026 at 02:47:06PM +0200, Paolo Bonzini wrote:
> On 1/12/26 12:31, Florian Schmidt wrote:
> > This call allows a guest to ask the hypervisor which of its (guest
> > physical) memory ranges were already zeroed out by the hypervisor, which
> > means there's no need for the guest to zero them out again at boot.
> >
> > For now, we simply send one entry back that says "all
> > 64-bit-addressable memory is zeroed out". This seems to work fine.
>
> This is risky and depends on the firmware.
>
> As discussed on IRC, there are multiple sources of writes and each of them
> needs to be tracked, and we can say that any write potentially makes the
> page nonzero.
>
> The main two are QEMU and the guest, i.e. KVM.
>
> Dirty page tracking is suitable for QEMU, but not so much for KVM. When
> QEMU enables dirty page tracking in KVM, it does so lazily, i.e. all pages
> are assumed dirty at the beginning. This is exactly the opposite of what
> you want here (almost all pages are zero at the beginning). So maybe only
> enable this hypercall when dirty page tracking uses a ring buffer?
>
> But as far as QEMU is concerned you could indeed add a fourth dirty memory
> bitmap, DIRTY_MEMORY_NONZERO, and turn it off after the first call to the
> hypercall (hopefully Windows only calls it once)?
>
> As an alternative to tracking dirty pages in KVM, it would be possible to
> ask KVM to build a bitmap of pages that were mapped, even at high
> granularity (e.g. with 32MB you'd fit the bitmap for 1TB of memory in a
> single page!). QEMU could use dirty page tracking, and build a combination
> (NOR) of the bitmap from KVM and QEMU's dirty page bitmap.
>
> QEMU could also do the same high-granularity tracking, but that leaves out
> other sources of writes like VFIO or vhost, both of which probably matter to
> Nutanix :) and which would be stuck with 4k-granularity dirty page tracking.
>
> Sorry for the delay, I was hoping to have some better ideas.
>
> Paolo
Hi,
Is this to improve boot time of very large guests?
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-04-17 5:51 ` Marcelo Tosatti
@ 2026-04-20 9:25 ` Florian Schmidt
2026-04-20 15:01 ` Marcelo Tosatti
0 siblings, 1 reply; 15+ messages in thread
From: Florian Schmidt @ 2026-04-20 9:25 UTC (permalink / raw)
To: Marcelo Tosatti, Paolo Bonzini; +Cc: Zhao Liu, qemu-devel
Hi Marcelo,
On 2026-04-17 06:51, Marcelo Tosatti wrote:
> Is this to improve boot time of very large guests?
It depends a bit on how you define "boot time". It won't make a
difference if you measure the time from creating the VM until you start
the vCPUs. It may, however, make a huge difference if you measure the
time until a user can first log in.
What it will do is reduce the time the guest spends on writing memory,
mostly in the background, to zero out all pages it has gotten from the
hypervisor. The main advantage (other than saving CPU cycles that are
effectively wasted on an unneeded operation) is that if you don't
pre-allocate all the VM's memory, then the VM won't touch all pages after
boot, keeping the RSS down until the memory is actually used. That's for
example useful if you overcommit/oversubscribe VM memory.
Cheers,
Florian
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-04-16 16:04 ` Paolo Bonzini
@ 2026-04-20 10:51 ` Florian Schmidt
2026-04-21 22:46 ` Paolo Bonzini
0 siblings, 1 reply; 15+ messages in thread
From: Florian Schmidt @ 2026-04-20 10:51 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Zhao Liu, Marcelo Tosatti, qemu-devel
On 2026-04-16 17:04, Paolo Bonzini wrote:
> Yes and no, respectively. For example, if you use preallocation, mincore
> will always return no, and if you swap, it will incorrectly return yes.
Ah yes, of course, forgot about that property of mincore. So the new KVM
ioctl sounds neat for this. The one thing I'm slightly worried about is
the time it might take to get that in shape and merged. I guess if
you're in favour of such an interface, that would go a long way to
getting it accepted into the kernel at all?
Here's another option, though I'm not sure you'll like it: We can get
the same information from /proc/<pid>/pagemap, right? We just have to
check whether bits 62 and 63 are both 0 or not for every page. It's less
efficient, because we have to read 8 bytes per page, but qemu knows the
host-virtual addresses of the VM memory, so it can look it up there. I
feel it beats using dirty page tracking if we can't be sure when Windows
issues the hypercall. One option would be to have this as a fallback if
the KVM ioctl does't exist? The other option would be for us to use this
only internally to get some more experience and measurements in while we
work on the KVM ioctl.
Going forward, I think what I care about most for now is getting the
interfaces right. In this case, I guess that just means the names of the
command-line interfaces, and the question of whether we should have a
separate one for HvExtCallQueryCapabilities() and then implement
dependencies between them. Then, I can experiment with the backend
implementation and we can settle on a proper clean implementation for
getting this into QEMU, while experimenting with this internally as we
work on it.
Cheers,
Florian
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-04-20 9:25 ` Florian Schmidt
@ 2026-04-20 15:01 ` Marcelo Tosatti
0 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2026-04-20 15:01 UTC (permalink / raw)
To: Florian Schmidt; +Cc: Paolo Bonzini, Zhao Liu, qemu-devel
On Mon, Apr 20, 2026 at 10:25:43AM +0100, Florian Schmidt wrote:
> Hi Marcelo,
>
> On 2026-04-17 06:51, Marcelo Tosatti wrote:
> > Is this to improve boot time of very large guests?
>
> It depends a bit on how you define "boot time". It won't make a difference
> if you measure the time it takes to create the VM until you start the vCPUs.
> It may even make a huge difference if you measure the time until a user can
> first log in.
>
> What it will do is reduce the time the guest spends on writing memory,
> mostly in the background, to zero out all pages it has gotten from the
> hypervisor. The main advantage (other than saving CPU cycles that are
> effectively wasted on an unneeded operation) is that if you don't
> pre-allocate all the VM's memory, then the VM won't touch all pages after
> boot, keeping the RSS down until the memory is actually used. That's for
> example useful if you overcommit/oversubscribe VM memory.
>
> Cheers,
> Florian
Hi Florian,
That is useful!
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-04-20 10:51 ` Florian Schmidt
@ 2026-04-21 22:46 ` Paolo Bonzini
2026-05-05 15:57 ` Florian Schmidt
0 siblings, 1 reply; 15+ messages in thread
From: Paolo Bonzini @ 2026-04-21 22:46 UTC (permalink / raw)
To: Florian Schmidt; +Cc: Zhao Liu, Marcelo Tosatti, qemu-devel
On Mon, Apr 20, 2026 at 11:52 AM Florian Schmidt <flosch@nutanix.com> wrote:
> Here's another option, though I'm not sure you'll like it: We can get
> the same information from /proc/<pid>/pagemap, right? We just have to
> check whether bits 62 and 63 are both 0 or not for every page.
> It's less efficient, because we have to read 8 bytes per page
It's also completely useless for the very common case of pre-allocated
hugepages. That's the case where you can get the largest benefit,
because when pre-allocating you use threads on the host and do the
zeroing before the VM starts. For non-pre-allocated pages you still
pay the price of double zeroing, but guest and host do it one after
another while the VM is already running.
So I don't think there is any option other than the ioctl. I would
suggest experimenting to understand how Windows uses the hypercall;
and possibly looking at the QEMU write tracking part. The KVM changes
are relatively simple and for quick experimentation you can operate as
if the KVM bitmap is always entirely zero, not unlike this patch.
Paolo
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-04-16 15:33 ` Florian Schmidt
2026-04-16 16:04 ` Paolo Bonzini
@ 2026-04-23 2:16 ` Marcelo Tosatti
1 sibling, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2026-04-23 2:16 UTC (permalink / raw)
To: Florian Schmidt; +Cc: Paolo Bonzini, Zhao Liu, qemu-devel
On Thu, Apr 16, 2026 at 04:33:20PM +0100, Florian Schmidt wrote:
> Hi Paolo, thank you for your reply!
>
> On 2026-04-16 13:47, Paolo Bonzini wrote:
> > As discussed on IRC, there are multiple sources of writes and each of
> > them needs to be tracked, and we can say that any write potentially
> > makes the page nonzero.
> >
> > But as far as QEMU is concerned you could indeed add a fourth dirty
> > memory bitmap, DIRTY_MEMORY_NONZERO, and turn it off after the first
> > call to the hypercall (hopefully Windows only calls it once)?
>
> I'm not sure we can rely on that. I'd have to double-check. Crucially, I'm
> pretty sure some Windows versions may call this more than once, and which of
> those results it then uses is the big question: if it's always the first
> one, we could stop there, return "nothing" as answer for further ones, and
> be good. If it's not the first one... that's a problem, because I don't
> think we want to do this kind of tracking for the whole lifetime of the
> guest?
>
>
> > As an alternative to tracking dirty pages in KVM, it would be possible
> > to ask KVM to build a bitmap of pages that were mapped, even at high
> > granularity (e.g. with 32MB you'd fit the bitmap for 1TB of memory in a
> > single page!). QEMU could use dirty page tracking, and build a
> > combination (NOR) of the bitmap from KVM and QEMU's dirty page bitmap.
> >
> > QEMU could also do the same high-granularity tracking, but that leaves
> > out other sources of writes like VFIO or vhost, both of which probably
> > matter to Nutanix :) and which would be stuck with 4k-granularity dirty
> > page tracking.
>
> Are you thinking of creating a new KVM ioctl that would do that for us?
> That's possible, but... isn't good old mincore enough in this case, since
> qemu knows the host-virtual addresses of the guest memory? At the cost of 8
> times as much memory for the (temporary) data structure, since it's a byte
> per page.
>
> One thing I was wondering about was races. Unless we pause the guest while
> we're scanning the tables, the guest could touch pages as we scan. But: at
> the point the hypercall is invoked by the guest, the guest OS is up by
> definition. So at this point, the guest OS must be aware of any memory that
> is put to use, correct? Even including vfio/vhost stuff, since the buffers
> used to write data into the guest would have been set aside for that purpose
> by the guest. So even if we overestimate and announce some pages as
> pre-zeroed, that shouldn't matter if the guest OS already handed them out
> for some usage (and pre-zeroed them in the meantime). What we really care
> about is not announcing any pages as pre-zeroed that are in fact dirty,
> *and* that the guest OS does not realise were ever dirtied.
>
> ... Famous last words and I'm not 100% sure, I appreciate any thoughts on
> this.
>
> Cheers,
> Florian
>
Two definitions. The first one, from section 9.4.7 of the TLFS PDF:

  Hyper-V allocates zero-filled pages to a VM at creation time. The
  HvExtCallGetBootZeroedMemory hypercall can be used to query which GPA
  pages were zeroed by Hyper-V during creation. This can prevent the
  guest memory manager from having to redundantly zero GPA pages, which
  can reduce utilization and increase performance.

  This is an extended hypercall; its availability must be queried using
  HvExtCallQueryCapabilities.

  Wrapper Interface

    HV_STATUS
    HvExtCallGetBootZeroedMemory(
        __out UINT64 StartGpa,
        __out UINT64 PageCount
        );

  Native Interface

    HvExtCallGetBootZeroedMemory
    Call Code = 0x8002

    Input Parameters
      None.

    Output Parameters
      0  StartGpa (8 bytes)
      8  PageCount (8 bytes)

    StartGpa – the GPA address where the zeroed memory region begins.
    PageCount – the number of pages included in the zeroed memory region.
Second definition (the web):
https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/hypercalls/hvextcallgetbootzeroedmemory
The hypercall returns ranges that are known to be zeroed at the time the hypercall is made. Cacheable reads from reported ranges must return all zeroes.
Querying zeroed ranges may allow the virtual machine to avoid zeroing memory that was already zeroed by the hypervisor.
Ranges can include memory that don’t exist and can overlap. The hypervisor should attempt to report "best" / biggest zeroed ranges earlier in the list for optimal performance
====
If qemu uses mmap(MAP_ANONYMOUS), can you return the pages which have
not yet been faulted in?
You can inspect /proc/<pid>/pagemap to find which virtual pages within
the mmap region have no physical page backing (i.e., never been written
to, still zero-fill-on-demand).
Alternatively, /proc/<pid>/smaps gives a per-VMA summary, but for
fine-grained per-page resolution within a single large VMA, pagemap is
the tool:

    /* For each page in [start, end), read the 8-byte pagemap entry
     * (needs fcntl.h, stdbool.h, stdint.h, unistd.h): */
    int fd = open("/proc/<pid>/pagemap", O_RDONLY);
    uint64_t entry;
    for (uintptr_t addr = start; addr < end; addr += PAGE_SIZE) {
        pread(fd, &entry, sizeof(entry), (addr / PAGE_SIZE) * 8);
        bool present = entry & (1ULL << 63);  /* bit 63 = page present */
        bool swapped = entry & (1ULL << 62);  /* bit 62 = swapped */
        if (!present && !swapped) {
            /* page never faulted in: still zero-fill-on-demand */
        }
    }
    close(fd);
I suppose the OS is responsible for handling races?
Or rather, Windows assumes nothing has visibility into such memory
regions other than the CPUs (which it controls).
* Re: [RFC PATCH 2/2] Add HvExtCallGetBootZeroedMemory
2026-04-21 22:46 ` Paolo Bonzini
@ 2026-05-05 15:57 ` Florian Schmidt
0 siblings, 0 replies; 15+ messages in thread
From: Florian Schmidt @ 2026-05-05 15:57 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Zhao Liu, Marcelo Tosatti, qemu-devel
On 2026-04-21 23:46, Paolo Bonzini wrote:
> On Mon, Apr 20, 2026 at 11:52 AM Florian Schmidt <flosch@nutanix.com> wrote:
>> Here's another option, though I'm not sure you'll like it: We can get
>> the same information from /proc/<pid>/pagemap, right? We just have to
>> check whether bits 62 and 63 are both 0 or not for every page.
>> It's less efficient, because we have to read 8 bytes per page
>
> It's also completely useless for the very common case of pre-allocated
> hugepages. That's the case where you can get the largest benefit,
> because when pre-allocating you use threads on the host and do the
> zeroing before the VM starts. For non-pre-allocated pages you still
> pay the price of double zeroing, but guest and host do it one after
> another while the VM is already running.
Yes, that's fair. There's another reason this is dodgy that I had
totally forgotten about and only remembered the other week: shared
memory's swap state is not tracked properly in /proc/<pid>/pagemap at
all, so that's another situation where this won't work correctly. So I
agree, that approach is useless.
> So I don't think there is any option other than the ioctl. I would
> suggest experimenting to understand how Windows uses the hypercall;
> and possibly looking at the QEMU write tracking part. The KVM changes
> are relatively simple and for quick experimentation you can operate as
> if the KVM bitmap is always entirely zero, not unlike this patch.
I will look a bit more at Windows, but generally speaking, I don't think
we can make final, never-changing assertions about when or how often a
guest will use this hypercall.
Regarding the QEMU write tracking: I'm not very familiar with this code,
so I'm still working my way through it and wrapping my head around it
all. But at this point, I wonder whether there are advantages to not
reusing the current dirty tracking wholesale by adding a fourth option,
versus implementing something separate. What we care about is slightly
different from dirty tracking: we only care about memory that QEMU
touched, not about dirty-tracking guest pages via KVM, which is quite
tightly coupled with the other dirty tracking approaches.
It means we have to support another tracking mode outside the existing
DIRTY_MEMORY_* ones, but on the plus side: it allows us to easily set a
different granularity; to not allocate that memory unconditionally at
start even if the feature isn't enabled (though once we have the
different granularity, the overhead is much smaller); and we could skip
any hotplug support logic (having to extend the bitmap) since Windows
never enquires for hotplugged memory anyway (though maybe that will
change at some point in the future and could be neat to have eventually).
For migration, I'm still thinking through the implications, but one
option would be to just say "after a migration, nothing is pre-zeroed
anymore".
I'd be interested to hear your opinions on that.
Cheers,
Florian