[PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL

Linux Documentation
 help / color / mirror / Atom feed

* [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
@ 2026-04-28 19:12 David Woodhouse
  2026-04-29 10:36 ` Paul Durrant
  2026-06-25 23:09 ` Sean Christopherson
  0 siblings, 2 replies; 6+ messages in thread
From: David Woodhouse @ 2026-04-28 19:12 UTC (permalink / raw)
  To: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, David Woodhouse, Paul Durrant, kvm, linux-doc,
	linux-kernel, linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 9287 bytes --]

From: David Woodhouse <dwmw@amazon.co.uk>

Commit 3617c0ee7decb ("KVM: x86/xen: Only write Xen hypercall page for
guest writes to MSR") blocked host-initiated writes from triggering the
Xen hypercall page setup, to fix an SRCU usage violation when the
hypercall MSR index collides with a real MSR written during vCPU reset.

However, some VMMs legitimately need to trigger hypercall page setup
from host context. For example, a VMM may intercept the guest's MSR
write to track an epoch (for kexec/crash recovery), and then replay the
write as a host-initiated KVM_SET_MSRS to populate the hypercall page.
The host_initiated check breaks this use case.

Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE as a new vcpu attribute
that explicitly invokes kvm_xen_write_hypercall_page() under proper
locking. This gives userspace a safe interface to trigger hypercall page
setup without going through the MSR write path, preserving the
host_initiated defence in depth while restoring the lost functionality.

Fixes: 3617c0ee7dec ("KVM: x86/xen: Only write Xen hypercall page for guest writes to MSR")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 Documentation/virt/kvm/api.rst                | 11 +++
 arch/x86/include/uapi/asm/kvm.h               |  3 +
 arch/x86/kvm/x86.c                            |  3 +-
 arch/x86/kvm/xen.c                            |  7 ++
 .../selftests/kvm/x86/xen_vmcall_test.c       | 96 +++++++++++++++++++
 5 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 52bbbb553ce1..63423c375a78 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -5800,6 +5800,17 @@ KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR
   vector configured with HVM_PARAM_CALLBACK_IRQ. It is disabled by
   setting the vector to zero.
 
+KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
+  This attribute is available when the KVM_CAP_XEN_HVM ioctl indicates
+  support for KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE. It triggers
+  population of the Xen hypercall page at the guest physical address
+  specified in ``gpa``, just as if the guest had written to the
+  hypercall MSR. This is intended for VMMs that intercept the guest's
+  MSR write (e.g. to track an epoch for kexec/crash recovery) and need
+  to replay the write from host context. Direct host-initiated writes
+  via KVM_SET_MSRS are blocked for safety; this attribute provides the
+  correct alternative.
+
 
 4.129 KVM_XEN_VCPU_GET_ATTR
 ---------------------------
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5f2b30d0405c..977f3aa66c18 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -596,6 +596,7 @@ struct kvm_x86_mce {
 #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG	(1 << 6)
 #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE	(1 << 7)
 #define KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA	(1 << 8)
+#define KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE	(1 << 9)
 
 #define KVM_XEN_MSR_MIN_INDEX			0x40000000u
 #define KVM_XEN_MSR_MAX_INDEX			0x4fffffffu
@@ -704,6 +705,8 @@ struct kvm_xen_vcpu_attr {
 #define KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR	0x8
 /* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA */
 #define KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA	0x9
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE */
+#define KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE 0xa
 
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0a1b63c63d1a..3facf0429c0a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4891,7 +4891,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		    KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL |
 		    KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
 		    KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE |
-		    KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA;
+		    KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA |
+		    KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE;
 		if (sched_info_on())
 			r |= KVM_XEN_HVM_CONFIG_RUNSTATE |
 			     KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG;
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 91fd3673c09a..c16b4560c9e7 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -907,6 +907,13 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 {
 	int idx, r = -ENOENT;
 
+	/*
+	 * kvm_xen_write_hypercall_page() manages its own locking.
+	 * Handle it before taking xen_lock to avoid a deadlock.
+	 */
+	if (data->type == KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE)
+		return kvm_xen_write_hypercall_page(vcpu, data->u.gpa) ? -EIO : 0;
+
 	mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
 	idx = srcu_read_lock(&vcpu->kvm->srcu);
 
diff --git a/tools/testing/selftests/kvm/x86/xen_vmcall_test.c b/tools/testing/selftests/kvm/x86/xen_vmcall_test.c
index 2585087cdf5c..1536d510ab30 100644
--- a/tools/testing/selftests/kvm/x86/xen_vmcall_test.c
+++ b/tools/testing/selftests/kvm/x86/xen_vmcall_test.c
@@ -12,6 +12,8 @@
 #include "processor.h"
 #include "hyperv.h"
 
+#include <string.h>
+
 #define HCALL_REGION_GPA	0xc0000000ULL
 #define HCALL_REGION_SLOT	10
 
@@ -26,6 +28,10 @@
 #define HVCALL_SIGNAL_EVENT		0x005d
 #define HV_STATUS_INVALID_ALIGNMENT	4
 
+enum {
+	TEST_WRITE_HYPERCALL_PAGE = 1,
+};
+
 static void guest_code(void)
 {
 	unsigned long rax = INPUTVALUE;
@@ -76,17 +82,65 @@ static void guest_code(void)
 			     "r"(r8));
 	GUEST_ASSERT(rax == HV_STATUS_INVALID_ALIGNMENT);
 
+	/*
+	 * Test KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE: ask userspace
+	 * to set up MSR filtering, then write the MSR. The WRMSR will exit
+	 * to userspace (not populate the page). Userspace verifies the page
+	 * is empty, uses the attr to populate it, then resumes us.
+	 */
+	GUEST_SYNC(TEST_WRITE_HYPERCALL_PAGE);
+
+	__asm__ __volatile__("wrmsr" : : "c" (XEN_HYPERCALL_MSR),
+			     "a" (HCALL_REGION_GPA & 0xffffffff),
+			     "d" (HCALL_REGION_GPA >> 32));
+
+	/* Userspace populated the page via the attr — verify it works */
+	rax = INPUTVALUE;
+	rdi = ARGVALUE(1);
+	rsi = ARGVALUE(2);
+	rdx = ARGVALUE(3);
+	r10 = ARGVALUE(4);
+	r8 = ARGVALUE(5);
+	r9 = ARGVALUE(6);
+	__asm__ __volatile__("call *%1" : "=a"(rax) :
+			     "r"(HCALL_REGION_GPA + INPUTVALUE * 32),
+			     "a"(rax), "D"(rdi), "S"(rsi), "d"(rdx),
+			     "r"(r10), "r"(r8), "r"(r9));
+	GUEST_ASSERT(rax == RETVALUE);
+
 	GUEST_DONE();
 }
 
+static void setup_msr_filter(struct kvm_vm *vm)
+{
+	uint64_t deny_bits = 0;
+	struct kvm_msr_filter filter = {
+		.flags = KVM_MSR_FILTER_DEFAULT_ALLOW,
+		.ranges = {
+			{
+				.flags = KVM_MSR_FILTER_WRITE,
+				.nmsrs = 1,
+				.base = XEN_HYPERCALL_MSR,
+				.bitmap = (uint8_t *)&deny_bits,
+			},
+		},
+	};
+
+	vm_ioctl(vm, KVM_X86_SET_MSR_FILTER, &filter);
+}
+
 int main(int argc, char *argv[])
 {
 	unsigned int xen_caps;
 	struct kvm_vcpu *vcpu;
 	struct kvm_vm *vm;
+	bool msr_filter_ready = false;
 
 	xen_caps = kvm_check_cap(KVM_CAP_XEN_HVM);
 	TEST_REQUIRE(xen_caps & KVM_XEN_HVM_CONFIG_INTERCEPT_HCALL);
+	TEST_REQUIRE(xen_caps & KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE);
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_X86_USER_SPACE_MSR));
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_X86_MSR_FILTER));
 
 	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
 	vcpu_set_hv_cpuid(vcpu);
@@ -123,6 +177,36 @@ int main(int argc, char *argv[])
 			continue;
 		}
 
+		if (run->exit_reason == KVM_EXIT_X86_WRMSR) {
+			/* MSR filter caught the Xen hypercall MSR write */
+			TEST_ASSERT(msr_filter_ready,
+				    "Unexpected WRMSR exit before filter setup");
+			TEST_ASSERT_EQ(run->msr.index, XEN_HYPERCALL_MSR);
+
+			/*
+			 * The host_initiated check should have prevented
+			 * KVM from populating the page. Verify it's empty.
+			 */
+			uint8_t *hcall_page = addr_gpa2hva(vm, HCALL_REGION_GPA);
+			TEST_ASSERT_EQ(hcall_page[0], 0);
+
+			/*
+			 * Now use the attr to populate the page, as a
+			 * VMM would after intercepting the MSR write.
+			 */
+			struct kvm_xen_vcpu_attr attr = {
+				.type = KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE,
+				.u.gpa = HCALL_REGION_GPA,
+			};
+			vcpu_ioctl(vcpu, KVM_XEN_VCPU_SET_ATTR, &attr);
+
+			/* Verify the page is now populated */
+			TEST_ASSERT_EQ(hcall_page[0], 0xb8);
+
+			run->msr.error = 0;
+			continue;
+		}
+
 		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
 
 		switch (get_ucall(vcpu, &uc)) {
@@ -130,6 +214,18 @@ int main(int argc, char *argv[])
 			REPORT_GUEST_ASSERT(uc);
 			/* NOT REACHED */
 		case UCALL_SYNC:
+			TEST_ASSERT_EQ(uc.args[1], TEST_WRITE_HYPERCALL_PAGE);
+
+			/*
+			 * Guest is about to write the Xen MSR. Clear the
+			 * hypercall page, install MSR filter to intercept
+			 * the write, and enable userspace MSR exits.
+			 */
+			memset(addr_gpa2hva(vm, HCALL_REGION_GPA), 0, PAGE_SIZE);
+			vm_enable_cap(vm, KVM_CAP_X86_USER_SPACE_MSR,
+				      KVM_MSR_EXIT_REASON_FILTER);
+			setup_msr_filter(vm);
+			msr_filter_ready = true;
 			break;
 		case UCALL_DONE:
 			goto done;
-- 
2.43.0



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
  2026-04-28 19:12 [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE David Woodhouse
@ 2026-04-29 10:36 ` Paul Durrant
  2026-06-02 17:01   ` David Woodhouse
  2026-06-25 23:09 ` Sean Christopherson
  1 sibling, 1 reply; 6+ messages in thread
From: Paul Durrant @ 2026-04-29 10:36 UTC (permalink / raw)
  To: David Woodhouse, Paolo Bonzini, Jonathan Corbet, Shuah Khan,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, kvm, linux-doc,
	linux-kernel, linux-kselftest

On 28/04/2026 21:12, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit 3617c0ee7decb ("KVM: x86/xen: Only write Xen hypercall page for
> guest writes to MSR") blocked host-initiated writes from triggering the
> Xen hypercall page setup, to fix an SRCU usage violation when the
> hypercall MSR index collides with a real MSR written during vCPU reset.
> 
> However, some VMMs legitimately need to trigger hypercall page setup
> from host context. For example, a VMM may intercept the guest's MSR
> write to track an epoch (for kexec/crash recovery), and then replay the
> write as a host-initiated KVM_SET_MSRS to populate the hypercall page.
> The host_initiated check breaks this use case.
> 
> Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE as a new vcpu attribute
> that explicitly invokes kvm_xen_write_hypercall_page() under proper
> locking. This gives userspace a safe interface to trigger hypercall page
> setup without going through the MSR write path, preserving the
> host_initiated defence in depth while restoring the lost functionality.
> 
> Fixes: 3617c0ee7dec ("KVM: x86/xen: Only write Xen hypercall page for guest writes to MSR")
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   Documentation/virt/kvm/api.rst                | 11 +++
>   arch/x86/include/uapi/asm/kvm.h               |  3 +
>   arch/x86/kvm/x86.c                            |  3 +-
>   arch/x86/kvm/xen.c                            |  7 ++
>   .../selftests/kvm/x86/xen_vmcall_test.c       | 96 +++++++++++++++++++
>   5 files changed, 119 insertions(+), 1 deletion(-)
> 
Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
  2026-04-29 10:36 ` Paul Durrant
@ 2026-06-02 17:01   ` David Woodhouse
  0 siblings, 0 replies; 6+ messages in thread
From: David Woodhouse @ 2026-06-02 17:01 UTC (permalink / raw)
  To: xadimgnik
  Cc: bp, corbet, dave.hansen, dwmw2, hpa, kvm, linux-doc, linux-kernel,
	linux-kselftest, mingo, pbonzini, seanjc, skhan, tglx, x86

[-- Attachment #1: Type: text/plain, Size: 1800 bytes --]

On Wed, 29 Apr 2026 12:36:52 +0200, Paul Durrant wrote:
> On 28/04/2026 21:12, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > Commit 3617c0ee7decb ("KVM: x86/xen: Only write Xen hypercall page for
> > guest writes to MSR") blocked host-initiated writes from triggering the
> > Xen hypercall page setup, to fix an SRCU usage violation when the
> > hypercall MSR index collides with a real MSR written during vCPU reset.
> > 
> > However, some VMMs legitimately need to trigger hypercall page setup
> > from host context. For example, a VMM may intercept the guest's MSR
> > write to track an epoch (for kexec/crash recovery), and then replay the
> > write as a host-initiated KVM_SET_MSRS to populate the hypercall page.
> > The host_initiated check breaks this use case.
> > 
> > Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE as a new vcpu attribute
> > that explicitly invokes kvm_xen_write_hypercall_page() under proper
> > locking. This gives userspace a safe interface to trigger hypercall page
> > setup without going through the MSR write path, preserving the
> > host_initiated defence in depth while restoring the lost functionality.
> > 
> > Fixes: 3617c0ee7dec ("KVM: x86/xen: Only write Xen hypercall page for guest writes to MSR")
> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >   Documentation/virt/kvm/api.rst                | 11 +++
> >   arch/x86/include/uapi/asm/kvm.h               |  3 +
> >   arch/x86/kvm/x86.c                            |  3 +-
> >   arch/x86/kvm/xen.c                            |  7 ++
> >   .../selftests/kvm/x86/xen_vmcall_test.c       | 96 +++++++++++++++++++
> >   5 files changed, 119 insertions(+), 1 deletion(-)
> 
> Reviewed-by: Paul Durrant <paul@xen.org>

Ping? 

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
  2026-04-28 19:12 [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE David Woodhouse
  2026-04-29 10:36 ` Paul Durrant
@ 2026-06-25 23:09 ` Sean Christopherson
  2026-06-26  7:45   ` David Woodhouse
  1 sibling, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2026-06-25 23:09 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Paul Durrant, kvm, linux-doc, linux-kernel, linux-kselftest

On Tue, Apr 28, 2026, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit 3617c0ee7decb ("KVM: x86/xen: Only write Xen hypercall page for
> guest writes to MSR") blocked host-initiated writes from triggering the
> Xen hypercall page setup, to fix an SRCU usage violation when the
> hypercall MSR index collides with a real MSR written during vCPU reset.
> 
> However, some VMMs legitimately need to trigger hypercall page setup
> from host context. For example, a VMM may intercept the guest's MSR
> write to track an epoch (for kexec/crash recovery), and then replay the
> write as a host-initiated KVM_SET_MSRS to populate the hypercall page.
> The host_initiated check breaks this use case.
> 
> Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE as a new vcpu attribute
> that explicitly invokes kvm_xen_write_hypercall_page() under proper
> locking. This gives userspace a safe interface to trigger hypercall page
> setup without going through the MSR write path, preserving the
> host_initiated defence in depth while restoring the lost functionality.

This is all kinda silly.  Userspace provides KVM a blob, then userspace intercepts
the MSR write that triggers doing something with said blob, only to call back into
KVM to consume the blob that userspace provided in the first place.

Any chance we can deprecate KVM's kvm_xen_write_hypercall_page(), and instead
rely on userspace to fill the page?  This extra bit obviously isn't much code to
carry, but it's yet one more Xen thing to maintain, and we've accumulated a lot
of those over the years...

> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index 91fd3673c09a..c16b4560c9e7 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -907,6 +907,13 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
>  {
>  	int idx, r = -ENOENT;
>  
> +	/*
> +	 * kvm_xen_write_hypercall_page() manages its own locking.
> +	 * Handle it before taking xen_lock to avoid a deadlock.

Do we actually want the side effects that necessitate taking xen.xen_lock?  From
a uAPI perspective, it's odd to effectively bundle KVM_XEN_ATTR_TYPE_LONG_MODE
into KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE.

The other question is, why does kvm_xen_write_hypercall_page() drop xen_lock
when writing guest memory?  That seems odd and unnecessary.

> +	 */
> +	if (data->type == KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE)
> +		return kvm_xen_write_hypercall_page(vcpu, data->u.gpa) ? -EIO : 0;

-EIO is rather weird, wouldn't -EINVAL be more appropriate?  Ah, and both are
wrong if copying the blob fails.

> +
>  	mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
>  	idx = srcu_read_lock(&vcpu->kvm->srcu);

Speaking of writing memory, kvm_xen_write_hypercall_page() expects the caller
to be in a read-side SRCU critical section (I didn't actually run this with
PROVE_LOCKING=y, but I don't think I'm missing anything?)

So, if this uAPI is unavoidable seems like we want something like the below.
Either that or guard all of kvm_xen_write_hypercall_page() with a lock, and put
the entire thing in a helper so that KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
can be handled in a case-statement and doesn't need to grab SRCU on its own.

---
 arch/x86/include/uapi/asm/kvm.h |  3 ++
 arch/x86/kvm/x86.c              |  3 +-
 arch/x86/kvm/xen.c              | 64 ++++++++++++++++++++-------------
 3 files changed, 44 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 1585ec804066..7732b92a4db0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -598,6 +598,7 @@ struct kvm_x86_mce {
 #define KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG	(1 << 6)
 #define KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE	(1 << 7)
 #define KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA	(1 << 8)
+#define KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE	(1 << 9)
 
 #define KVM_XEN_MSR_MIN_INDEX			0x40000000u
 #define KVM_XEN_MSR_MAX_INDEX			0x4fffffffu
@@ -706,6 +707,8 @@ struct kvm_xen_vcpu_attr {
 #define KVM_XEN_VCPU_ATTR_TYPE_UPCALL_VECTOR	0x8
 /* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA */
 #define KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA	0x9
+/* Available with KVM_CAP_XEN_HVM / KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE */
+#define KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE 0xa
 
 /* Secure Encrypted Virtualization command */
 enum sev_cmd_id {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0626e835e9eb..ced19e84bf6c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2287,7 +2287,8 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		    KVM_XEN_HVM_CONFIG_EVTCHN_2LEVEL |
 		    KVM_XEN_HVM_CONFIG_EVTCHN_SEND |
 		    KVM_XEN_HVM_CONFIG_PVCLOCK_TSC_UNSTABLE |
-		    KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA;
+		    KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA |
+		    KVM_XEN_HVM_CONFIG_WRITE_HYPERCALL_PAGE;
 		if (sched_info_on())
 			r |= KVM_XEN_HVM_CONFIG_RUNSTATE |
 			     KVM_XEN_HVM_CONFIG_RUNSTATE_UPDATE_FLAG;
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index db10f12d10cf..b72845aa67e2 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -904,6 +904,8 @@ int kvm_xen_hvm_get_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data)
 	return r;
 }
 
+static int __kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data);
+
 int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 {
 	int idx, r = -ENOENT;
@@ -1138,7 +1140,9 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 			r = 0;
 		}
 		break;
-
+	case KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE:
+		r = __kvm_xen_write_hypercall_page(vcpu, data->u.gpa);
+		break;
 	default:
 		break;
 	}
@@ -1274,30 +1278,12 @@ int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 	return r;
 }
 
-int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
+static int __kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 {
-	struct kvm *kvm = vcpu->kvm;
 	u32 page_num = data & ~PAGE_MASK;
 	u64 page_addr = data & PAGE_MASK;
 	bool lm = is_long_mode(vcpu);
-	int r = 0;
-
-	mutex_lock(&kvm->arch.xen.xen_lock);
-	if (kvm->arch.xen.long_mode != lm) {
-		kvm->arch.xen.long_mode = lm;
-
-		/*
-		 * Re-initialize shared_info to put the wallclock in the
-		 * correct place.
-		 */
-		if (kvm->arch.xen.shinfo_cache.active &&
-		    kvm_xen_shared_info_init(kvm))
-			r = 1;
-	}
-	mutex_unlock(&kvm->arch.xen.xen_lock);
-
-	if (r)
-		return r;
+	struct kvm *kvm = vcpu->kvm;
 
 	/*
 	 * If Xen hypercall intercept is enabled, fill the hypercall
@@ -1310,7 +1296,7 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 		int i;
 
 		if (page_num)
-			return 1;
+			return -EINVAL;
 
 		/* mov imm32, %eax */
 		instructions[0] = 0xb8;
@@ -1329,7 +1315,7 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 			if (kvm_vcpu_write_guest(vcpu,
 						 page_addr + (i * sizeof(instructions)),
 						 instructions, sizeof(instructions)))
-				return 1;
+				return -EINVAL;
 		}
 	} else {
 		/*
@@ -1344,7 +1330,7 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 		int ret;
 
 		if (page_num >= blob_size)
-			return 1;
+			return -EINVAL;
 
 		blob_addr += page_num * PAGE_SIZE;
 
@@ -1355,11 +1341,39 @@ int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
 		ret = kvm_vcpu_write_guest(vcpu, page_addr, page, PAGE_SIZE);
 		kfree(page);
 		if (ret)
-			return 1;
+			return -EINVAL;
 	}
 	return 0;
 }
 
+
+int kvm_xen_write_hypercall_page(struct kvm_vcpu *vcpu, u64 data)
+{
+	struct kvm *kvm = vcpu->kvm;
+	bool lm = is_long_mode(vcpu);
+	int r = 0;
+
+	mutex_lock(&kvm->arch.xen.xen_lock);
+	if (kvm->arch.xen.long_mode != lm) {
+		kvm->arch.xen.long_mode = lm;
+
+		/*
+		 * Re-initialize shared_info to put the wallclock in the
+		 * correct place.
+		 */
+		if (kvm->arch.xen.shinfo_cache.active &&
+		    kvm_xen_shared_info_init(kvm))
+			r = 1;
+	}
+	mutex_unlock(&kvm->arch.xen.xen_lock);
+
+	if (r)
+		return r;
+
+
+	return __kvm_xen_write_hypercall_page(vcpu, data) ? 1 : 0;
+}
+
 int kvm_xen_hvm_config(struct kvm *kvm, struct kvm_xen_hvm_config *xhc)
 {
 	/* Only some feature flags need to be *enabled* by userspace */

base-commit: 8867ab1259b34261eaa96f1f3a2a092cd03e5a17
--

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
  2026-06-25 23:09 ` Sean Christopherson
@ 2026-06-26  7:45   ` David Woodhouse
  2026-06-26 13:40     ` Sean Christopherson
  0 siblings, 1 reply; 6+ messages in thread
From: David Woodhouse @ 2026-06-26  7:45 UTC (permalink / raw)
  To: Sean Christopherson, Gerd Hoffmann
  Cc: Paolo Bonzini, Jonathan Corbet, Shuah Khan, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Paul Durrant, kvm, linux-doc, linux-kernel, linux-kselftest

[-- Attachment #1: Type: text/plain, Size: 6198 bytes --]

On Thu, 2026-06-25 at 16:09 -0700, Sean Christopherson wrote:
> On Tue, Apr 28, 2026, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > Commit 3617c0ee7decb ("KVM: x86/xen: Only write Xen hypercall page for
> > guest writes to MSR") blocked host-initiated writes from triggering the
> > Xen hypercall page setup, to fix an SRCU usage violation when the
> > hypercall MSR index collides with a real MSR written during vCPU reset.
> > 
> > However, some VMMs legitimately need to trigger hypercall page setup
> > from host context. For example, a VMM may intercept the guest's MSR
> > write to track an epoch (for kexec/crash recovery), and then replay the
> > write as a host-initiated KVM_SET_MSRS to populate the hypercall page.
> > The host_initiated check breaks this use case.
> > 
> > Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE as a new vcpu attribute
> > that explicitly invokes kvm_xen_write_hypercall_page() under proper
> > locking. This gives userspace a safe interface to trigger hypercall page
> > setup without going through the MSR write path, preserving the
> > host_initiated defence in depth while restoring the lost functionality.
> 
> This is all kinda silly.  Userspace provides KVM a blob, then userspace intercepts
> the MSR write that triggers doing something with said blob, only to call back into
> KVM to consume the blob that userspace provided in the first place.
> 
> Any chance we can deprecate KVM's kvm_xen_write_hypercall_page(), and instead
> rely on userspace to fill the page?  This extra bit obviously isn't much code to
> carry, but it's yet one more Xen thing to maintain, and we've accumulated a lot
> of those over the years...

We don't actually use the 'blob' mode. That was added in commit
ffde22ac53b6d in 2009 with a comment saying, "A generic mechanism to
delegate MSR writes to userspace seems overkill and risks encouraging
similar MSR abuse in the future.  Thus this patch adds special support
for the Xen HVM MSR."

When João and I came along almost a decade later, in 23200b7a30de3
where we added hypercall interception support we said, "Since this
means KVM owns the ABI, dispense with the facility for the VMM to
provide its own copy of the hypercall pages; just fill them in directly
using VMCALL/VMMCALL as we do for the Hyper-V hypercall page."

I think we could probably rip out the blob mode without any fear of
breaking userspace. Even in 2018 I don't think we could even find the
alleged code from 2009 that used the old support. At least, not in
buildable and usable form?

> > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > index 91fd3673c09a..c16b4560c9e7 100644
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -907,6 +907,13 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> >  {
> >  	int idx, r = -ENOENT;
> >  
> > +	/*
> > +	 * kvm_xen_write_hypercall_page() manages its own locking.
> > +	 * Handle it before taking xen_lock to avoid a deadlock.
> 
> Do we actually want the side effects that necessitate taking xen.xen_lock?  From
> a uAPI perspective, it's odd to effectively bundle KVM_XEN_ATTR_TYPE_LONG_MODE
> into KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE.

That's *guest* ABI, and it's derived from Xen behaviour. Xen will
'latch' its idea of whether a guest VM is 32-bit or 64-bit, for the
purpose of shared data structures (shared_info page, vcpu_info,
runstate).

Xen latches this from the current mode of the running vCPU in *two*
places:
 • When the hypercall MSR is invoked
 • When the guest sets the event channel GSI (HVM_PARAM_CALLBACK_IRQ).

Thus far, the former has been handled in the kernel (in the code you're
looking at), while the latter is why we have the ioctl to explicitly
latch the guest's long_mode from userspace too, as userspace handles
the HVMOP_set_param calls.

> The other question is, why does kvm_xen_write_hypercall_page() drop xen_lock
> when writing guest memory?  That seems odd and unnecessary.

Huh? It takes the lock to do the thing that needs the lock, then drops
it. That is not "odd and unnecessary" at all.

You've been spending too long with these scope-guarded locks. I *hate*
them. I hate the way they slowly spread around the whole kernel, making
every lock holder hold their locks for just a *little* bit longer than
they need to, slowly increasing lock contention "just a little bit; it
doesn't matter" at a time. I hate the way they stop us thinking about
which locks are needed and in which order, and make it unclear whether
some action in the tail of a function actually *needed* the lock, or
was just caught up in it as collateral damage.

> > +	 */
> > +	if (data->type == KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE)
> > +		return kvm_xen_write_hypercall_page(vcpu, data->u.gpa) ? -EIO : 0;
> 
> -EIO is rather weird, wouldn't -EINVAL be more appropriate?  Ah, and both are
> wrong if copying the blob fails.

-EINVAL is more for "you asked me to do something that doesn't make
sense". -EIO is for "something went wrong when I tried".

Arguably, the thing that's most likely to go wrong is the
kvm_vcpu_write_guest() where it writes instructions[] to the guest, and
maybe that ought to be -EFAULT? But I'm not sure that's quite the right
semantic to return from the ioctl?

> > +
> >  	mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
> >  	idx = srcu_read_lock(&vcpu->kvm->srcu);
> 
> Speaking of writing memory, kvm_xen_write_hypercall_page() expects the caller
> to be in a read-side SRCU critical section (I didn't actually run this with
> PROVE_LOCKING=y, but I don't think I'm missing anything?)

Yes, good catch. Thanks.

> So, if this uAPI is unavoidable seems like we want something like the below.
> Either that or guard all of kvm_xen_write_hypercall_page() with a lock, and put
> the entire thing in a helper so that KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
> can be handled in a case-statement and doesn't need to grab SRCU on its own.

Makes sense (with the test, of course). Want me to put them together
and resend?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
  2026-06-26  7:45   ` David Woodhouse
@ 2026-06-26 13:40     ` Sean Christopherson
  0 siblings, 0 replies; 6+ messages in thread
From: Sean Christopherson @ 2026-06-26 13:40 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Gerd Hoffmann, Paolo Bonzini, Jonathan Corbet, Shuah Khan,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, kvm, linux-doc, linux-kernel,
	linux-kselftest

On Fri, Jun 26, 2026, David Woodhouse wrote:
> On Thu, 2026-06-25 at 16:09 -0700, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> > > index 91fd3673c09a..c16b4560c9e7 100644
> > > --- a/arch/x86/kvm/xen.c
> > > +++ b/arch/x86/kvm/xen.c
> > > @@ -907,6 +907,13 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
> > >  {
> > >  	int idx, r = -ENOENT;
> > >  
> > > +	/*
> > > +	 * kvm_xen_write_hypercall_page() manages its own locking.
> > > +	 * Handle it before taking xen_lock to avoid a deadlock.
> > 
> > Do we actually want the side effects that necessitate taking xen.xen_lock?  From
> > a uAPI perspective, it's odd to effectively bundle KVM_XEN_ATTR_TYPE_LONG_MODE
> > into KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE.
> 
> That's *guest* ABI, and it's derived from Xen behaviour. Xen will
> 'latch' its idea of whether a guest VM is 32-bit or 64-bit, for the
> purpose of shared data structures (shared_info page, vcpu_info,
> runstate).
> 
> Xen latches this from the current mode of the running vCPU in *two*
> places:
>  • When the hypercall MSR is invoked
>  • When the guest sets the event channel GSI (HVM_PARAM_CALLBACK_IRQ).
> 
> Thus far, the former has been handled in the kernel (in the code you're
> looking at), while the latter is why we have the ioctl to explicitly
> latch the guest's long_mode from userspace too, as userspace handles
> the HVMOP_set_param calls.

Right, and I'm pointing out that from a KVM uAPI perspective, bundling the first
one in a "write hypercall page" call is rather odd, especially since there's
already uAPI to handle the latching.

> > The other question is, why does kvm_xen_write_hypercall_page() drop xen_lock
> > when writing guest memory?  That seems odd and unnecessary.
> 
> Huh? It takes the lock to do the thing that needs the lock, then drops
> it. That is not "odd and unnecessary" at all.
>
> You've been spending too long with these scope-guarded locks.

No, I'm asking why KVM doesn't serialize the writes to guest memory.  Usually
when KVM writes to guest memory, KVM is emulating something that is very much
vCPU-specific, and so if there are races it's the guest's problem to deal with.

The Xen MSR here is clearly VM-scoped though, which is why it feels odd to take
a per-VM lock, and then deliberately drop the lock before completing the operation,
In practice it shouldn't matter, since it sounds like the same repeating 16 byte
pattern will be written every time, but it was a bit head-scratching when reading
the code.

> > > +	if (data->type == KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE)
> > > +		return kvm_xen_write_hypercall_page(vcpu, data->u.gpa) ? -EIO : 0;
> > 
> > -EIO is rather weird, wouldn't -EINVAL be more appropriate?  Ah, and both are
> > wrong if copying the blob fails.
> 
> -EINVAL is more for "you asked me to do something that doesn't make sense".
> -EIO is for "something went wrong when I tried".

Sure, but KVM returns EINVAL for pretty much every ioctl (or ioctl-like thing)
if userspace provides bad input, e.g. for the @data param.
 
> Arguably, the thing that's most likely to go wrong is the
> kvm_vcpu_write_guest() where it writes instructions[] to the guest, and
> maybe that ought to be -EFAULT?

Heh, ya, I just say that too when looking at the code again.

> But I'm not sure that's quite the right semantic to return from the ioctl?

We can/should return whatever kvm_vcpu_write_guest() returns, i.e. literally
return its result directly.  Which of course is only ever going to be -EFAULT,
but in the extremely unlikely case that ever changes, we won't have to worry
about creating misleading behavior in the Xen code.

> > >  	mutex_lock(&vcpu->kvm->arch.xen.xen_lock);
> > >  	idx = srcu_read_lock(&vcpu->kvm->srcu);
> > 
> > Speaking of writing memory, kvm_xen_write_hypercall_page() expects the caller
> > to be in a read-side SRCU critical section (I didn't actually run this with
> > PROVE_LOCKING=y, but I don't think I'm missing anything?)
> 
> Yes, good catch. Thanks.
> 
> > So, if this uAPI is unavoidable seems like we want something like the below.
> > Either that or guard all of kvm_xen_write_hypercall_page() with a lock, and put
> > the entire thing in a helper so that KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE
> > can be handled in a case-statement and doesn't need to grab SRCU on its own.
> 
> Makes sense (with the test, of course). Want me to put them together
> and resend?

Yes please.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-06-26 13:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-04-28 19:12 [PATCH] KVM: x86/xen: Add KVM_XEN_VCPU_ATTR_TYPE_WRITE_HYPERCALL_PAGE David Woodhouse
2026-04-29 10:36 ` Paul Durrant
2026-06-02 17:01   ` David Woodhouse
2026-06-25 23:09 ` Sean Christopherson
2026-06-26  7:45   ` David Woodhouse
2026-06-26 13:40     ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox