public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/3] Fix a lost async pagefault notification when the guest is using SMM
@ 2025-08-13 19:23 Maxim Levitsky
  2025-08-13 19:23 ` [PATCH 1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled Maxim Levitsky
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Maxim Levitsky @ 2025-08-13 19:23 UTC (permalink / raw)
  To: kvm
  Cc: Sean Christopherson, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, Paolo Bonzini, x86, Borislav Petkov,
	linux-kernel, Maxim Levitsky

Recently we debugged a customer case in which the guest VM was showing
tasks permanently stuck in kvm_async_pf_task_wait_schedule().

This was traced to incorrect flushing of the async pagefault queue,
which kvm_post_set_cr0() performs on entry to real mode.

The flushing code, kvm_clear_async_pf_completion_queue(), does wait for all
#APF work items to complete, but then it wipes the 'done' queue without
notifying the guest.

Such an approach is acceptable if the guest is being rebooted or has
decided to disable APF, but it leads to failures if the entry to real
mode was caused by SMM, because in that case the guest intends to continue
using APF after returning from the SMM handler.

Amusingly, and on top of this, the SMM entry code doesn't call
kvm_set_cr0() (and consequently doesn't call kvm_post_set_cr0() either);
only the SMM exit code does.

During SMM entry, the SMM code calls .set_cr0 instead, with the intention
of bypassing various architectural checks that could otherwise fail.

One example of such a check is the #GP raised on an attempt to disable
paging while long mode is active.
To do this legally, the guest must first switch to compatibility mode and
only then disable paging.

The question of whether this bypass could be eliminated is a side topic
that is probably worth discussing separately.

Back to the topic: kvm_set_cr0() is still called during SMM handling,
specifically during the exit from SMM, by emulator_leave_smm():

It is called once with CR0.PE == 0, to set up a baseline real-mode
environment, and then a second time with the original CR0 value.

Even more amusingly, both of these calls usually result in the APF queue
being flushed, because the code in kvm_post_set_cr0() doesn't distinguish
between entry to and exit from protected mode, and the SMM handler usually
enables protection and paging itself and then exits without bothering to
return to real mode first.

To fix this problem, I think the best solution is to drop the call to
kvm_clear_async_pf_completion_queue() from kvm_post_set_cr0() altogether,
and instead raise KVM_REQ_APF_READY when protected mode is re-established.

Existing APF requests should have no problem completing while the guest is
in SMM, and APF completion event injection should work too, because the
SMM handler *ought* not to enable interrupts; otherwise things would go
south very quickly.

This change also brings the logic in line with what KVM does when the
guest disables the APIC: KVM likewise raises KVM_REQ_APF_READY when the
APIC is re-enabled.

In addition, I also included a few fixes for semi-theoretical bugs I found
while debugging this.

Best regards,
        Maxim Levitsky

Maxim Levitsky (3):
  KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is
    not enabled
  KVM: x86: Fix a semi theoretical bug in
    kvm_arch_async_page_present_queued
  KVM: x86: Fix the interaction between SMM and the asynchronous
    pagefault

 arch/x86/kvm/x86.c | 22 +++++++++++++++-------
 arch/x86/kvm/x86.h |  1 +
 2 files changed, 16 insertions(+), 7 deletions(-)

-- 
2.49.0



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled
  2025-08-13 19:23 [PATCH 0/3] Fix a lost async pagefault notification when the guest is using SMM Maxim Levitsky
@ 2025-08-13 19:23 ` Maxim Levitsky
  2025-08-13 19:23 ` [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued Maxim Levitsky
  2025-08-13 19:23 ` [PATCH 3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault Maxim Levitsky
  2 siblings, 0 replies; 14+ messages in thread
From: Maxim Levitsky @ 2025-08-13 19:23 UTC (permalink / raw)
  To: kvm
  Cc: Sean Christopherson, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, Paolo Bonzini, x86, Borislav Petkov,
	linux-kernel, Maxim Levitsky

KVM flushes the APF queue completely when asynchronous page faults are
disabled, so this case should not occur.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a1c49bc681c4..9018d56b4b0a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13466,7 +13466,7 @@ void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
 
 bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
 {
-	if (!kvm_pv_async_pf_enabled(vcpu))
+	if (WARN_ON_ONCE(!kvm_pv_async_pf_enabled(vcpu)))
 		return true;
 	else
 		return kvm_lapic_enabled(vcpu) && apf_pageready_slot_free(vcpu);
-- 
2.49.0



* [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-08-13 19:23 [PATCH 0/3] Fix a lost async pagefault notification when the guest is using SMM Maxim Levitsky
  2025-08-13 19:23 ` [PATCH 1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled Maxim Levitsky
@ 2025-08-13 19:23 ` Maxim Levitsky
  2025-08-18 18:08   ` Sean Christopherson
  2025-09-23 15:58   ` Paolo Bonzini
  2025-08-13 19:23 ` [PATCH 3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault Maxim Levitsky
  2 siblings, 2 replies; 14+ messages in thread
From: Maxim Levitsky @ 2025-08-13 19:23 UTC (permalink / raw)
  To: kvm
  Cc: Sean Christopherson, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, Paolo Bonzini, x86, Borislav Petkov,
	linux-kernel, Maxim Levitsky

Fix a semi-theoretical race condition in the reading of page_ready_pending
in kvm_arch_async_page_present_queued().

Only trust the value of page_ready_pending if the vCPU is about to
enter guest mode (vcpu->mode).

To achieve this, read vcpu->mode using smp_load_acquire(), which is
paired with smp_store_release() in vcpu_enter_guest().

Then, only if vcpu->mode is IN_GUEST_MODE, trust the value of
page_ready_pending, because it was written beforehand and therefore its
correct value is visible.

Also, if the above-mentioned check passes, avoid raising the request
on the target vCPU.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/x86.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9018d56b4b0a..3d45a4cd08a4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 
 void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
 {
-	kvm_make_request(KVM_REQ_APF_READY, vcpu);
-	if (!vcpu->arch.apf.pageready_pending)
+	/* Pairs with smp_store_release in vcpu_enter_guest. */
+	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
+	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
+
+	if (!in_guest_mode || !page_ready_pending) {
+		kvm_make_request(KVM_REQ_APF_READY, vcpu);
 		kvm_vcpu_kick(vcpu);
+	}
 }
 
 bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
-- 
2.49.0



* [PATCH 3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault
  2025-08-13 19:23 [PATCH 0/3] Fix a lost async pagefault notification when the guest is using SMM Maxim Levitsky
  2025-08-13 19:23 ` [PATCH 1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled Maxim Levitsky
  2025-08-13 19:23 ` [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued Maxim Levitsky
@ 2025-08-13 19:23 ` Maxim Levitsky
  2025-08-18 18:20   ` Sean Christopherson
  2 siblings, 1 reply; 14+ messages in thread
From: Maxim Levitsky @ 2025-08-13 19:23 UTC (permalink / raw)
  To: kvm
  Cc: Sean Christopherson, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, Paolo Bonzini, x86, Borislav Petkov,
	linux-kernel, Maxim Levitsky

Currently a #SMI can cause KVM to drop an #APF ready event, subsequently
causing the guest to never resume the task that is waiting for it.
This can result in tasks becoming permanently stuck within the guest.

This happens because KVM flushes the APF queue without notifying the guest
of completed APF requests when the guest exits to real mode.

And the SMM exit code calls kvm_set_cr0() with CR0.PE == 0, which triggers
this code.

It must be noted that while this flush is reasonable for an actual real
mode entry, here it actually achieves nothing, because on SMM exit it is
too late to flush this queue.

To fix this, avoid doing this flush altogether, and handle the real
mode entry/exits in the same way KVM already handles the APIC
enable/disable events:

APF completion events are not injected while APIC is disabled,
and once APIC is re-enabled, KVM raises the KVM_REQ_APF_READY request
which causes the first pending #APF ready event to be injected prior
to entry to the guest mode.

This change also has the side benefit of preserving #APF events if the
guest temporarily enters real mode - for example, to call firmware -
although such usage should be extremely rare in modern operating systems.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/x86.c | 11 +++++++----
 arch/x86/kvm/x86.h |  1 +
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3d45a4cd08a4..5dfe166025bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1118,15 +1118,18 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 	}
 
 	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
-		kvm_clear_async_pf_completion_queue(vcpu);
-		kvm_async_pf_hash_reset(vcpu);
-
 		/*
 		 * Clearing CR0.PG is defined to flush the TLB from the guest's
 		 * perspective.
 		 */
 		if (!(cr0 & X86_CR0_PG))
 			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+
+		/*
+		 * Re-check APF completion events, when the guest re-enables paging.
+		 */
+		if ((cr0 & X86_CR0_PG) && kvm_pv_async_pf_enabled(vcpu))
+			kvm_make_request(KVM_REQ_APF_READY, vcpu);
 	}
 
 	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
@@ -3547,7 +3550,7 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return 0;
 }
 
-static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
+bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
 {
 	u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index bcfd9b719ada..3949c938a88d 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -698,5 +698,6 @@ int ____kvm_emulate_hypercall(struct kvm_vcpu *vcpu, int cpl,
 })
 
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
+bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu);
 
 #endif
-- 
2.49.0



* Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-08-13 19:23 ` [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued Maxim Levitsky
@ 2025-08-18 18:08   ` Sean Christopherson
  2025-09-23 15:58   ` Paolo Bonzini
  1 sibling, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2025-08-18 18:08 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Dave Hansen, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, x86, Borislav Petkov, linux-kernel

On Wed, Aug 13, 2025, Maxim Levitsky wrote:
> Fix a semi theoretical race condition in reading of page_ready_pending
> in kvm_arch_async_page_present_queued.

This needs to explain what can actually go wrong if the race is "hit".  After
staring at all of this for far, far too long, I'm 99.9% confident the race is
benign.

If the worker "incorrectly" sees pageready_pending as %false, then the result
is simply a spurious kick, and that spurious kick is all but guaranteed to be a
nop since if kvm_arch_async_page_present() is setting the flag, then (a) the
vCPU isn't blocking and (b) isn't IN_GUEST_MODE and thus won't be IPI'd.

If the worker incorrectly sees pageready_pending as %true, then the vCPU has
*just* written MSR_KVM_ASYNC_PF_ACK, and is guaranteed to observe and process
KVM_REQ_APF_READY before re-entering the guest, and the sole purpose of the kick
is to ensure the request is processed.

> Only trust the value of page_ready_pending if the guest is about to
> enter guest mode (vcpu->mode).

This is misleading, e.g. IN_GUEST_MODE can be true if the vCPU just *exited*.
All IN_GUEST_MODE says is that the vCPU task is somewhere in KVM's inner run loop.

> To achieve this, read the vcpu->mode using smp_load_acquire which is
> paired with smp_store_release in vcpu_enter_guest.
> 
> Then only if vcpu_mode is IN_GUEST_MODE, trust the value of the
> page_ready_pending because it was written before and therefore its correct
> value is visible.
> 
> Also if the above mentioned check is true, avoid raising the request
> on the target vCPU.

Why?  At worst, a dangling KVM_REQ_APF_READY will cause KVM to bail from the
fastpath when it's not strictly necessary to do so.  On the other hand, a missing
request could hang the guest.  So I don't see any reason to try and be super
precise when setting KVM_REQ_APF_READY.

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>  
>  void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>  {
> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> -	if (!vcpu->arch.apf.pageready_pending)
> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);

In terms of arch.apf.pageready_pending being modified, it's not IN_GUEST_MODE
that's special, it's OUTSIDE_GUEST_MODE that's special, because that's the only
time the task that holds vcpu->mutex can clear pageready_pending.

> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);

This should be paired with WRITE_ONCE() on the vCPU.

> +
> +	if (!in_guest_mode || !page_ready_pending) {
> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>  		kvm_vcpu_kick(vcpu);
> +	}

Given that the race is guaranteed to be benign (assuming my analysis is correct),
I definitely think there should be a comment here explaining that pageready_pending
is "technically unstable".  Otherwise, it takes a lot of staring to understand
what this code is actually doing.

I also think it makes sense to do the bare minimum for OUTSIDE_GUEST_MODE, which
is to wake the vCPU.  Because calling kvm_vcpu_kick() when the vCPU is known to
not be IN_GUEST_MODE is weird.

For the code+comment, how about this?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bdf7ef0b535..d721fab3418d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4000,7 +4000,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
                if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT))
                        return 1;
                if (data & 0x1) {
-                       vcpu->arch.apf.pageready_pending = false;
+                       WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
                        kvm_check_async_pf_completion(vcpu);
                }
                break;
@@ -13457,7 +13457,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
        if ((work->wakeup_all || work->notpresent_injected) &&
            kvm_pv_async_pf_enabled(vcpu) &&
            !apf_put_user_ready(vcpu, work->arch.token)) {
-               vcpu->arch.apf.pageready_pending = true;
+               WRITE_ONCE(vcpu->arch.apf.pageready_pending, true);
                kvm_apic_set_irq(vcpu, &irq, NULL);
        }
 
@@ -13468,7 +13468,20 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
 void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
 {
        kvm_make_request(KVM_REQ_APF_READY, vcpu);
-       if (!vcpu->arch.apf.pageready_pending)
+
+       /*
+        * Don't kick the vCPU if it has an outstanding "page ready" event as
+        * KVM won't be able to deliver the next "page ready" token until the
+        * outstanding one is handled.  Ignore pageready_pending if the vCPU is
+        * outside "guest mode", i.e. if KVM might be sending "page ready" or
+        * servicing a MSR_KVM_ASYNC_PF_ACK write, as the flag is technically
+        * unstable.  However, in that case, there's obviously no need to kick
+        * the vCPU out of the guest, so just ensure the vCPU is awakened if
+        * it's blocking.
+        */
+       if (smp_load_acquire(&vcpu->mode) == OUTSIDE_GUEST_MODE)
+               kvm_vcpu_wake_up(vcpu);
+       else if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
                kvm_vcpu_kick(vcpu);
 }


* Re: [PATCH 3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault
  2025-08-13 19:23 ` [PATCH 3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault Maxim Levitsky
@ 2025-08-18 18:20   ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2025-08-18 18:20 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Dave Hansen, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
	Paolo Bonzini, x86, Borislav Petkov, linux-kernel

On Wed, Aug 13, 2025, Maxim Levitsky wrote:
> Currently a #SMI can cause KVM to drop an #APF ready event and
> subsequently causes the guest to never resume the task that is waiting
> for it.
> This can result in tasks becoming permanently stuck within the guest.
> 
> This happens because KVM flushes the APF queue without notifying the guest
> of completed APF requests when the guest exits to real mode.
> 
> And the SMM exit code calls kvm_set_cr0 with CR0.PE == 0, which triggers
> this code.
> 
> It must be noted that while this flush is reasonable to do for the actual
> real mode entry, it actually achieves nothing because it is too late to
> flush this queue on SMM exit.
> 
> To fix this, avoid doing this flush altogether, and handle the real
> mode entry/exits in the same way KVM already handles the APIC
> enable/disable events:
> 
> APF completion events are not injected while APIC is disabled,
> and once APIC is re-enabled, KVM raises the KVM_REQ_APF_READY request
> which causes the first pending #APF ready event to be injected prior
> to entry to the guest mode.
> 
> This change also has the side benefit of preserving #APF events if the
> guest temporarily enters real mode - for example, to call firmware -
> although such usage should be extremely rare in modern operating systems.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/x86.c | 11 +++++++----
>  arch/x86/kvm/x86.h |  1 +
>  2 files changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3d45a4cd08a4..5dfe166025bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1118,15 +1118,18 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  	}
>  
>  	if ((cr0 ^ old_cr0) & X86_CR0_PG) {
> -		kvm_clear_async_pf_completion_queue(vcpu);
> -		kvm_async_pf_hash_reset(vcpu);
> -
>  		/*
>  		 * Clearing CR0.PG is defined to flush the TLB from the guest's
>  		 * perspective.
>  		 */
>  		if (!(cr0 & X86_CR0_PG))
>  			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
> +
> +		/*
> +		 * Re-check APF completion events, when the guest re-enables paging.
> +		 */
> +		if ((cr0 & X86_CR0_PG) && kvm_pv_async_pf_enabled(vcpu))

I'm tempted to make this an elif, i.e.

		if (!(cr0 & X86_CR0_PG))
			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
		else if (kvm_pv_async_pf_enabled(vcpu))
			kvm_make_request(KVM_REQ_APF_READY, vcpu);

In theory, that could set us up to fail if another CR0.PG=1 case is added, but I
like to think future us will be smart enough to turn it into:

		if (!(cr0 & X86_CR0_PG)) {
			kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
		} else {
			if (kvm_pv_async_pf_enabled(vcpu))
				kvm_make_request(KVM_REQ_APF_READY, vcpu);

			if (<other thing>)
				...
		}


> +			kvm_make_request(KVM_REQ_APF_READY, vcpu);
>  	}
>  
>  	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
> @@ -3547,7 +3550,7 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	return 0;
>  }
>  
> -static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
> +bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)

This is in the same file, there's no reason/need to expose this via x86.h.  The
overall diff is small enough that I'm comfortable hoisting this "up" as part of
the fix, especially since this needs to go to stable@.

If we use an elif, this?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bdf7ef0b535..2bc41e562314 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1030,6 +1030,13 @@ bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr)
 }
 EXPORT_SYMBOL_GPL(kvm_require_dr);
 
+static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
+{
+       u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
+
+       return (vcpu->arch.apf.msr_en_val & mask) == mask;
+}
+
 static inline u64 pdptr_rsvd_bits(struct kvm_vcpu *vcpu)
 {
        return vcpu->arch.reserved_gpa_bits | rsvd_bits(5, 8) | rsvd_bits(1, 2);
@@ -1122,15 +1129,15 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
        }
 
        if ((cr0 ^ old_cr0) & X86_CR0_PG) {
-               kvm_clear_async_pf_completion_queue(vcpu);
-               kvm_async_pf_hash_reset(vcpu);
-
                /*
                 * Clearing CR0.PG is defined to flush the TLB from the guest's
-                * perspective.
+                * perspective.  If the guest is (re)enabling, check for async
+                * #PFs that were completed while paging was disabled.
                 */
                if (!(cr0 & X86_CR0_PG))
                        kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+               else if (kvm_pv_async_pf_enabled(vcpu))
+                       kvm_make_request(KVM_REQ_APF_READY, vcpu);
        }
 
        if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
@@ -3524,13 +3531,6 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
        return 0;
 }
 
-static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
-{
-       u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
-
-       return (vcpu->arch.apf.msr_en_val & mask) == mask;
-}
-
 static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
 {
        gpa_t gpa = data & ~0x3f;


* Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-08-13 19:23 ` [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued Maxim Levitsky
  2025-08-18 18:08   ` Sean Christopherson
@ 2025-09-23 15:58   ` Paolo Bonzini
  2025-09-23 16:23     ` Paolo Bonzini
  2025-09-23 18:55     ` Sean Christopherson
  1 sibling, 2 replies; 14+ messages in thread
From: Paolo Bonzini @ 2025-09-23 15:58 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Sean Christopherson, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, x86, Borislav Petkov, linux-kernel

On 8/13/25 21:23, Maxim Levitsky wrote:
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9018d56b4b0a..3d45a4cd08a4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>   
>   void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>   {
> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> -	if (!vcpu->arch.apf.pageready_pending)
> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> +
> +	if (!in_guest_mode || !page_ready_pending) {
> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>   		kvm_vcpu_kick(vcpu);
> +	}

Unlike Sean, I think the race exists in the abstract and is not benign, but
there are already enough memory barriers to tame it.

That said, in_guest_mode is a red herring.  The way I look at it is
through the very common wake/sleep (or kick/check) pattern that has
smp_mb() on both sides.

The pair you are considering consists of this  function (the "kick
side"), and the wrmsr handler for MSR_KVM_ASYNC_PF_ACK (the "check
side").  Let's see how the synchronization between the two sides
happens:

- here, you need to check whether to inject the interrupt.  It looks
like this:

   kvm_make_request(KVM_REQ_APF_READY, vcpu);
   smp_mb();
   if (!READ_ONCE(page_ready_pending))
     kvm_vcpu_kick(vcpu);

- on the other side, after clearing page_ready_pending, there will be a
check for a wakeup:

   WRITE_ONCE(page_ready_pending, false);
   smp_mb();
   if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
     kvm_check_async_pf_completion(vcpu)

except that the "if" is not in kvm_set_msr_common(); it will happen
naturally as part of the first re-entry.


So let's look at the changes you need to make, in order to make the code
look like the above.

- using READ_ONCE/WRITE_ONCE for pageready_pending never hurts

- here in kvm_arch_async_page_present_queued(), a smp_mb__after_atomic()
(compiler barrier on x86) is missing after kvm_make_request():

         kvm_make_request(KVM_REQ_APF_READY, vcpu);
	/*
	 * Tell vCPU to wake up before checking if they need an
	 * interrupt.  Pairs with any memory barrier between
	 * the clearing of pageready_pending and vCPU entry.
	 */
	smp_mb__after_atomic();
         if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
                 kvm_vcpu_kick(vcpu);

- in kvm_set_msr_common(), there are two possibilities.
The easy one is to just use smp_store_mb() to clear
vcpu->arch.apf.pageready_pending.  The other would be a comment
like this:

	WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
	/*
	 * Ensure they know to wake this vCPU up, before the vCPU
	 * next checks KVM_REQ_APF_READY.  Use an existing memory
	 * barrier between here and the next kvm_request_pending(),
	 * for example in vcpu_run().
	 */
	/* smp_mb(); */

plus a memory barrier in common code like this:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 706b6fd56d3c..e302c617e4b2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11236,6 +11236,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
  		if (r <= 0)
  			break;
  
+		/*
+		 * Provide a memory barrier between handle_exit and the
+		 * kvm_request_pending() read in vcpu_enter_guest().  It
+		 * pairs with any barrier after kvm_make_request(), for
+		 * example in kvm_arch_async_page_present_queued().
+		 */
+		smp_mb__before_atomic();
  		kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
  		if (kvm_xen_has_pending_events(vcpu))
  			kvm_xen_inject_pending_events(vcpu);


The only advantage of this second, more complex approach is that
it shows *why* the race was not happening.  The 50 clock cycles
saved on an MSR write are not worth the extra complication, and
on a quick grep I could not find other cases which rely on the same
implicit barriers.  So I'd say use smp_store_mb(), with a comment
about the pairing with kvm_arch_async_page_present_queued(); and write
in the commit message that the race wasn't happening thanks to unrelated
memory barriers between handle_exit and the kvm_request_pending()
read in vcpu_enter_guest.

Thanks,

Paolo



* Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-09-23 15:58   ` Paolo Bonzini
@ 2025-09-23 16:23     ` Paolo Bonzini
  2025-09-23 18:55     ` Sean Christopherson
  1 sibling, 0 replies; 14+ messages in thread
From: Paolo Bonzini @ 2025-09-23 16:23 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: Sean Christopherson, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, x86, Borislav Petkov, linux-kernel

On 9/23/25 17:58, Paolo Bonzini wrote:
> - on the other side, after clearing page_ready_pending, there will be a
> check for a wakeup:
> 
>    WRITE_ONCE(page_ready_pending, false);
>    smp_mb();
>    if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
>      kvm_check_async_pf_completion(vcpu)
> 
> except that the "if" is not in kvm_set_msr_common(); it will happen
> naturally as part of the first re-entry.

Important thing that I forgot to mention: the above only covers the race 
case.  There's also the case where KVM_REQ_APF_READY has been cleared 
already, and for that one the call to kvm_check_async_pf_completion() is 
*also* needed in kvm_set_msr_common().

Paolo

> 
> So let's look at the changes you need to make, in order to make the code
> look like the above.
> 
> - using READ_ONCE/WRITE_ONCE for pageready_pending never hurts
> 
> - here in kvm_arch_async_page_present_queued(), a smp_mb__after_atomic()
> (compiler barrier on x86) is missing after kvm_make_request():
> 
>          kvm_make_request(KVM_REQ_APF_READY, vcpu);
>      /*
>       * Tell vCPU to wake up before checking if they need an
>       * interrupt.  Pairs with any memory barrier between
>       * the clearing of pageready_pending and vCPU entry.
>       */
>      smp_mb__after_atomic();
>          if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
>                  kvm_vcpu_kick(vcpu);
> 
> - in kvm_set_msr_common(), there are two possibilities.
> The easy one is to just use smp_store_mb() to clear
> vcpu->arch.apf.pageready_pending.  The other would be a comment
> like this:
> 
>      WRITE_ONCE(vcpu->arch.apf.pageready_pending, false);
>      /*
>       * Ensure they know to wake this vCPU up, before the vCPU
>       * next checks KVM_REQ_APF_READY.  Use an existing memory
>       * barrier between here and the next kvm_request_pending(),
>       * for example in vcpu_run().
>       */
>      /* smp_mb(); */
> 
> plus a memory barrier in common code like this:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 706b6fd56d3c..e302c617e4b2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11236,6 +11236,11 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>           if (r <= 0)
>               break;
> 
> +        /*
> +         * Provide a memory barrier between handle_exit and the
> +         * kvm_request_pending() read in vcpu_enter_guest().  It
> +         * pairs with any barrier after kvm_make_request(), for
> +         * example in kvm_arch_async_page_present_queued().
> +         */
> +        smp_mb__before_atomic();
>           kvm_clear_request(KVM_REQ_UNBLOCK, vcpu);
>           if (kvm_xen_has_pending_events(vcpu))
>               kvm_xen_inject_pending_events(vcpu);
> 
> 
> The only advantage of this second, more complex approach is that
> it shows *why* the race was not happening.  The 50 clock cycles
> saved on an MSR write are not worth the extra complication, and
> on a quick grep I could not find other cases which rely on the same
> implicit barriers.  So I'd say use smp_store_mb(), with a comment
> about the pairing with kvm_arch_async_page_present_queued(); and write
> in the commit message that the race wasn't happening thanks to unrelated
> memory barriers between handle_exit and the kvm_request_pending()
> read in vcpu_enter_guest.
> 
> Thanks,
> 
> Paolo
> 
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-09-23 15:58   ` Paolo Bonzini
  2025-09-23 16:23     ` Paolo Bonzini
@ 2025-09-23 18:55     ` Sean Christopherson
  2025-09-23 19:28       ` Paolo Bonzini
  1 sibling, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2025-09-23 18:55 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Maxim Levitsky, kvm, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, x86, Borislav Petkov, linux-kernel

On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> On 8/13/25 21:23, Maxim Levitsky wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 9018d56b4b0a..3d45a4cd08a4 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> >   void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> >   {
> > -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > -	if (!vcpu->arch.apf.pageready_pending)
> > +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> > +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > +
> > +	if (!in_guest_mode || !page_ready_pending) {
> > +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
> >   		kvm_vcpu_kick(vcpu);
> > +	}
> 
> Unlike Sean, I think the race exists in abstract and is not benign

How is it not benign?  I never said the race doesn't exist, I said that consuming
a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
is benign.


* Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-09-23 18:55     ` Sean Christopherson
@ 2025-09-23 19:28       ` Paolo Bonzini
  2025-09-23 23:14         ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Paolo Bonzini @ 2025-09-23 19:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, kvm, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, x86, Borislav Petkov, linux-kernel

On 9/23/25 20:55, Sean Christopherson wrote:
> On Tue, Sep 23, 2025, Paolo Bonzini wrote:
>> On 8/13/25 21:23, Maxim Levitsky wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 9018d56b4b0a..3d45a4cd08a4 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
>>>    void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
>>>    {
>>> -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
>>> -	if (!vcpu->arch.apf.pageready_pending)
>>> +	/* Pairs with smp_store_release in vcpu_enter_guest. */
>>> +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
>>> +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
>>> +
>>> +	if (!in_guest_mode || !page_ready_pending) {
>>> +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
>>>    		kvm_vcpu_kick(vcpu);
>>> +	}
>>
>> Unlike Sean, I think the race exists in abstract and is not benign
> 
> How is it not benign?  I never said the race doesn't exist, I said that consuming
> a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
> is benign.

In principle there is a possibility that a KVM_REQ_APF_READY is missed.
Just by the reading of the specs, without an smp_mb__after_atomic() this
is broken:

	kvm_make_request(KVM_REQ_APF_READY, vcpu);
	if (!vcpu->arch.apf.pageready_pending)
    		kvm_vcpu_kick(vcpu);

It won't happen in practice because set_bit() is written as asm with a
"memory" clobber, and because x86's set_bit() also prevents reordering at
the processor level, etc.

In other words the race is only avoided by the fact that compiler 
reorderings are prevented even in cases that memory-barriers.txt does 
not promise.

Paolo



* Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-09-23 19:28       ` Paolo Bonzini
@ 2025-09-23 23:14         ` Sean Christopherson
  2025-10-27 15:00           ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2025-09-23 23:14 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Maxim Levitsky, kvm, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, x86, Borislav Petkov, linux-kernel

On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> On 9/23/25 20:55, Sean Christopherson wrote:
> > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > On 8/13/25 21:23, Maxim Levitsky wrote:
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 9018d56b4b0a..3d45a4cd08a4 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> > > >    void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> > > >    {
> > > > -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > -	if (!vcpu->arch.apf.pageready_pending)
> > > > +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> > > > +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > > > +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > > > +
> > > > +	if (!in_guest_mode || !page_ready_pending) {
> > > > +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > >    		kvm_vcpu_kick(vcpu);
> > > > +	}
> > > 
> > > Unlike Sean, I think the race exists in abstract and is not benign
> > 
> > How is it not benign?  I never said the race doesn't exist, I said that consuming
> > a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
> > is benign.
> 
> In principle there is a possibility that a KVM_REQ_APF_READY is missed.

I think you mean a kick (wakeup or IPI) is missed, not that the APF_READY itself
is missed.  I.e. KVM_REQ_APF_READY will never be lost, KVM just might enter the
guest or schedule out the vCPU with the flag set.

All in all, I think we're in violent agreement.  I agree that kvm_vcpu_kick()
could be missed (theoretically), but I'm saying that missing the kick would be
benign due to a myriad of other barriers and checks, i.e. that the vCPU is
guaranteed to see KVM_REQ_APF_READY anyways.

E.g. my suggestion earlier regarding OUTSIDE_GUEST_MODE was to rely on the
smp_mb__after_srcu_read_{,un}lock() barriers in vcpu_enter_guest() to ensure
KVM_REQ_APF_READY would be observed before trying VM-Enter, and that if KVM might
be in the process of emulating HLT (blocking), that either KVM_REQ_APF_READY is
visible to the vCPU or that kvm_arch_async_page_present() wakes the vCPU.  Oh,
hilarious, async_pf_execute() also does an unconditional __kvm_vcpu_wake_up().

Huh.  But isn't that a real bug?  KVM doesn't consider KVM_REQ_APF_READY to be a
wake event, so isn't this an actual race?

  vCPU                                  async #PF
  kvm_check_async_pf_completion()
  pageready_pending = false
  VM-Enter
  HLT
  VM-Exit
                                        kvm_make_request(KVM_REQ_APF_READY, vcpu)
                                        kvm_vcpu_kick(vcpu)  // nop as the vCPU isn't blocking, yet
                                        __kvm_vcpu_wake_up() // nop for the same reason
  vcpu_block()
  <hang>

On x86, the "page ready" IRQ is only injected from vCPU context, so AFAICT nothing
is guaranteed to wake the vCPU in the above sequence.



> Just by the reading of the specs, without an smp_mb__after_atomic() this is
> broken:
> 
> 	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> 	if (!vcpu->arch.apf.pageready_pending)
>    		kvm_vcpu_kick(vcpu);
> 
> It won't happen in practice because set_bit() is written as asm with a
> "memory" clobber, and because x86's set_bit() also prevents reordering at
> the processor level, etc.
> 
> In other words the race is only avoided by the fact that compiler
> reorderings are prevented even in cases that memory-barriers.txt does not
> promise.




* Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-09-23 23:14         ` Sean Christopherson
@ 2025-10-27 15:00           ` Sean Christopherson
  2025-10-30 19:57             ` mlevitsk
  0 siblings, 1 reply; 14+ messages in thread
From: Sean Christopherson @ 2025-10-27 15:00 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Maxim Levitsky, kvm, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, x86, Borislav Petkov, linux-kernel

On Tue, Sep 23, 2025, Sean Christopherson wrote:
> On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > On 9/23/25 20:55, Sean Christopherson wrote:
> > > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > > On 8/13/25 21:23, Maxim Levitsky wrote:
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 9018d56b4b0a..3d45a4cd08a4 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> > > > >    void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> > > > >    {
> > > > > -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > > -	if (!vcpu->arch.apf.pageready_pending)
> > > > > +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> > > > > +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > > > > +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > > > > +
> > > > > +	if (!in_guest_mode || !page_ready_pending) {
> > > > > +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > >    		kvm_vcpu_kick(vcpu);
> > > > > +	}
> > > > 
> > > > Unlike Sean, I think the race exists in abstract and is not benign
> > > 
> > > How is it not benign?  I never said the race doesn't exist, I said that consuming
> > > a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
> > > is benign.
> > 
> > In principle there is a possibility that a KVM_REQ_APF_READY is missed.
> 
> I think you mean a kick (wakeup or IPI) is missed, not that the APF_READY itself
> is missed.  I.e. KVM_REQ_APF_READY will never be lost, KVM just might enter the
> guest or schedule out the vCPU with the flag set.
> 
> All in all, I think we're in violent agreement.  I agree that kvm_vcpu_kick()
> could be missed (theoretically), but I'm saying that missing the kick would be
> benign due to a myriad of other barriers and checks, i.e. that the vCPU is
> guaranteed to see KVM_REQ_APF_READY anyways.
> 
> E.g. my suggestion earlier regarding OUTSIDE_GUEST_MODE was to rely on the
> smp_mb__after_srcu_read_{,un}lock() barriers in vcpu_enter_guest() to ensure
> KVM_REQ_APF_READY would be observed before trying VM-Enter, and that if KVM might
> be in the process of emulating HLT (blocking), that either KVM_REQ_APF_READY is
> visible to the vCPU or that kvm_arch_async_page_present() wakes the vCPU.  Oh,
> hilarious, async_pf_execute() also does an unconditional __kvm_vcpu_wake_up().
> 
> Huh.  But isn't that a real bug?  KVM doesn't consider KVM_REQ_APF_READY to be a
> wake event, so isn't this an actual race?
> 
>   vCPU                                  async #PF
>   kvm_check_async_pf_completion()
>   pageready_pending = false
>   VM-Enter
>   HLT
>   VM-Exit
>                                         kvm_make_request(KVM_REQ_APF_READY, vcpu)
>                                         kvm_vcpu_kick(vcpu)  // nop as the vCPU isn't blocking, yet
>                                         __kvm_vcpu_wake_up() // nop for the same reason
>   vcpu_block()
>   <hang>
> 
> On x86, the "page ready" IRQ is only injected from vCPU context, so AFAICT nothing
> is guaranteed to wake the vCPU in the above sequence.

Gah, KVM checks async_pf.done instead of the request.  So I don't think there's
a bug, just weird code.

  bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
  {
	if (!list_empty_careful(&vcpu->async_pf.done))  <===
		return true;


* Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-10-27 15:00           ` Sean Christopherson
@ 2025-10-30 19:57             ` mlevitsk
  2025-10-30 20:28               ` Sean Christopherson
  0 siblings, 1 reply; 14+ messages in thread
From: mlevitsk @ 2025-10-30 19:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, Dave Hansen, H. Peter Anvin, Ingo Molnar, Thomas Gleixner,
	x86, Borislav Petkov, linux-kernel

On Mon, 2025-10-27 at 08:00 -0700, Sean Christopherson wrote:
> On Tue, Sep 23, 2025, Sean Christopherson wrote:
> > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > On 9/23/25 20:55, Sean Christopherson wrote:
> > > > On Tue, Sep 23, 2025, Paolo Bonzini wrote:
> > > > > On 8/13/25 21:23, Maxim Levitsky wrote:
> > > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > > index 9018d56b4b0a..3d45a4cd08a4 100644
> > > > > > --- a/arch/x86/kvm/x86.c
> > > > > > +++ b/arch/x86/kvm/x86.c
> > > > > > @@ -13459,9 +13459,14 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
> > > > > >    void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
> > > > > >    {
> > > > > > -	kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > > > -	if (!vcpu->arch.apf.pageready_pending)
> > > > > > +	/* Pairs with smp_store_release in vcpu_enter_guest. */
> > > > > > +	bool in_guest_mode = (smp_load_acquire(&vcpu->mode) == IN_GUEST_MODE);
> > > > > > +	bool page_ready_pending = READ_ONCE(vcpu->arch.apf.pageready_pending);
> > > > > > +
> > > > > > +	if (!in_guest_mode || !page_ready_pending) {
> > > > > > +		kvm_make_request(KVM_REQ_APF_READY, vcpu);
> > > > > >    		kvm_vcpu_kick(vcpu);
> > > > > > +	}
> > > > > 
> > > > > Unlike Sean, I think the race exists in abstract and is not benign
> > > > 
> > > > How is it not benign?  I never said the race doesn't exist, I said that consuming
> > > > a stale vcpu->arch.apf.pageready_pending in kvm_arch_async_page_present_queued()
> > > > is benign.
> > > 
> > > In principle there is a possibility that a KVM_REQ_APF_READY is missed.
> > 
> > I think you mean a kick (wakeup or IPI) is missed, not that the APF_READY itself
> > is missed.  I.e. KVM_REQ_APF_READY will never be lost, KVM just might enter the
> > guest or schedule out the vCPU with the flag set.
> > 
> > All in all, I think we're in violent agreement.  I agree that kvm_vcpu_kick()
> > could be missed (theoretically), but I'm saying that missing the kick would be
> > benign due to a myriad of other barriers and checks, i.e. that the vCPU is
> > guaranteed to see KVM_REQ_APF_READY anyways.
> > 
> > E.g. my suggestion earlier regarding OUTSIDE_GUEST_MODE was to rely on the
> > smp_mb__after_srcu_read_{,un}lock() barriers in vcpu_enter_guest() to ensure
> > KVM_REQ_APF_READY would be observed before trying VM-Enter, and that if KVM might
> > be in the process of emulating HLT (blocking), that either KVM_REQ_APF_READY is
> > visible to the vCPU or that kvm_arch_async_page_present() wakes the vCPU.  Oh,
> > hilarious, async_pf_execute() also does an unconditional __kvm_vcpu_wake_up().
> > 
> > Huh.  But isn't that a real bug?  KVM doesn't consider KVM_REQ_APF_READY to be a
> > wake event, so isn't this an actual race?
> > 
> >   vCPU                                  async #PF
> >   kvm_check_async_pf_completion()
> >   pageready_pending = false
> >   VM-Enter
> >   HLT
> >   VM-Exit
> >                                         kvm_make_request(KVM_REQ_APF_READY, vcpu)
> >                                         kvm_vcpu_kick(vcpu)  // nop as the vCPU isn't blocking, yet
> >                                         __kvm_vcpu_wake_up() // nop for the same reason
> >   vcpu_block()
> >   <hang>
> > 
> > On x86, the "page ready" IRQ is only injected from vCPU context, so AFAICT nothing
> > is guaranteed to wake the vCPU in the above sequence.
> 
> Gah, KVM checks async_pf.done instead of the request.  So I don't think there's
> a bug, just weird code.

Hi!

Note that I posted a v2 of this patch series. Do I need to drop this patch, or is it better to keep
it? (The patch should still be correct, though I think it may be overkill.)

What do you think? 

Can we have patch 3 of the v2 merged, as it fixes a real issue which actually
causes random and hard-to-debug failures?

Best regards,
	Maxim Levitsky

> 
>   bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>   {
> 	if (!list_empty_careful(&vcpu->async_pf.done))  <===
> 		return true;
> 



* Re: [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
  2025-10-30 19:57             ` mlevitsk
@ 2025-10-30 20:28               ` Sean Christopherson
  0 siblings, 0 replies; 14+ messages in thread
From: Sean Christopherson @ 2025-10-30 20:28 UTC (permalink / raw)
  To: mlevitsk
  Cc: Paolo Bonzini, kvm, Dave Hansen, H. Peter Anvin, Ingo Molnar,
	Thomas Gleixner, x86, Borislav Petkov, linux-kernel

On Thu, Oct 30, 2025, mlevitsk@redhat.com wrote:
> On Mon, 2025-10-27 at 08:00 -0700, Sean Christopherson wrote:
> > On Tue, Sep 23, 2025, Sean Christopherson wrote:
> > > On x86, the "page ready" IRQ is only injected from vCPU context, so AFAICT nothing
> > > is guaranteed to wake the vCPU in the above sequence.
> > 
> > Gah, KVM checks async_pf.done instead of the request.  So I don't think there's
> > a bug, just weird code.
> 
> Hi!
> 
> Note that I posted a v2 of this patch series.

I got 'em, and looked at them in depth (which is how I figured out the above
weirdness with async_pf.done).  They're sitting in my "for_next" folder, I just
haven't spent any time on applying+testing upstream patches this week (I expect
to get to your series tomorrow, or early next week).

> Do I need to drop this patch, or is it better to keep it? (The patch should still
> be correct, though I think it may be overkill.)

It's probably overkill, but there's no real downside, so I'm inclined to apply
the v2 version (and am planning on doing so).


end of thread, other threads:[~2025-10-30 20:28 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-13 19:23 [PATCH 0/3] Fix a lost async pagefault notification when the guest is using SMM Maxim Levitsky
2025-08-13 19:23 ` [PATCH 1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled Maxim Levitsky
2025-08-13 19:23 ` [PATCH 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued Maxim Levitsky
2025-08-18 18:08   ` Sean Christopherson
2025-09-23 15:58   ` Paolo Bonzini
2025-09-23 16:23     ` Paolo Bonzini
2025-09-23 18:55     ` Sean Christopherson
2025-09-23 19:28       ` Paolo Bonzini
2025-09-23 23:14         ` Sean Christopherson
2025-10-27 15:00           ` Sean Christopherson
2025-10-30 19:57             ` mlevitsk
2025-10-30 20:28               ` Sean Christopherson
2025-08-13 19:23 ` [PATCH 3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault Maxim Levitsky
2025-08-18 18:20   ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox