* [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM
@ 2025-10-15 3:32 Maxim Levitsky
2025-10-15 3:32 ` [PATCH v2 1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled Maxim Levitsky
` (3 more replies)
0 siblings, 4 replies; 6+ messages in thread
From: Maxim Levitsky @ 2025-10-15 3:32 UTC (permalink / raw)
To: kvm
Cc: x86, Thomas Gleixner, Paolo Bonzini, Ingo Molnar, linux-kernel,
Sean Christopherson, Borislav Petkov, H. Peter Anvin, Dave Hansen,
Maxim Levitsky
Recently we debugged a customer case in which the guest VM was showing
tasks permanently stuck in kvm_async_pf_task_wait_schedule.
This was traced to an incorrect flush of the async pagefault queue,
performed by kvm_post_set_cr0 on entry to real mode.
The function that does the flush, kvm_clear_async_pf_completion_queue, does
wait for all #APF tasks to complete, but it then wipes the 'done' queue
without notifying the guest.
Such an approach is acceptable if the guest is being rebooted or if
it has decided to disable APF, but it leads to failures if the entry to real
mode was caused by SMM, because in this case the guest intends to continue
using APF after returning from the SMM handler.
Amusingly, and on top of this, the SMM entry code doesn't call
kvm_set_cr0 (and consequently doesn't call kvm_post_set_cr0 either);
only the SMM exit code does.
During SMM entry, the SMM code calls .set_cr0 instead, with the intention
of bypassing various architectural checks that could otherwise fail.
One example of such a check is the #GP raised on an attempt to disable paging
while long mode is active.
Architecturally, software must first switch to compatibility mode and only
then disable paging.
Whether this bypass can be eliminated is a side topic
that is probably worth discussing separately.
Back to the topic: kvm_set_cr0 is still called during SMM handling,
specifically on exit from SMM, by emulator_leave_smm.
It is called once with CR0.PE clear, to set up a baseline real-mode
environment, and then a second time with the original CR0 value.
Even more amusingly, both calls usually result in the APF queue being
flushed, because the code in kvm_post_set_cr0 doesn't distinguish between
entering and leaving protected mode, and an SMM handler usually enables
protection and paging and then exits SMM without first switching back to
real mode.
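The double flush described above can be illustrated with a small standalone
model. This is only a sketch, not KVM code: the model_* names and the
struct are hypothetical, and it assumes that the first CR0 write on SMM exit
clears both PE and PG and that the pre-series check fires on any CR0.PG
toggle.

```c
#include <assert.h>

/* Architectural CR0 bit positions: PE is bit 0, PG is bit 31. */
#define CR0_PE (1UL << 0)
#define CR0_PG (1UL << 31)

/* Hypothetical model state; counts how often the APF queue is wiped. */
struct model_vcpu {
	unsigned long cr0;
	int apf_flushes;
};

/* Models the pre-series check: any CR0.PG toggle wipes the APF queue. */
static void model_set_cr0_old(struct model_vcpu *v, unsigned long cr0)
{
	unsigned long old_cr0 = v->cr0;

	v->cr0 = cr0;
	if ((cr0 ^ old_cr0) & CR0_PG)
		v->apf_flushes++;
}

/* Models the two CR0 writes on SMM exit and returns the flush count. */
static int model_leave_smm_old(unsigned long smm_cr0, unsigned long saved_cr0)
{
	struct model_vcpu v = { .cr0 = smm_cr0, .apf_flushes = 0 };

	/* First write: baseline real-mode environment (assumed PE=0, PG=0). */
	model_set_cr0_old(&v, smm_cr0 & ~(CR0_PE | CR0_PG));
	/* Second write: restore the CR0 value saved at SMM entry. */
	model_set_cr0_old(&v, saved_cr0);
	return v.apf_flushes;
}

/* Walk-through: a handler that enabled paging triggers two flushes. */
static void model_leave_smm_walkthrough(void)
{
	assert(model_leave_smm_old(CR0_PE | CR0_PG, CR0_PE | CR0_PG) == 2);
}
```

In this model a handler that stays in real mode still causes one flush,
because the restore of the saved CR0 value toggles CR0.PG on its own.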
To fix this problem, I think the best solution is to drop the call to
kvm_clear_async_pf_completion_queue from kvm_post_set_cr0 altogether,
and instead raise KVM_REQ_APF_READY when protected mode
is re-established.
Existing APF requests should have no problem completing while the guest is
in SMM, and injection of the APF completion event should work too,
because an SMM handler *ought* not to enable interrupts; otherwise
things would go south very quickly.
This change also brings the logic in line with what KVM
does when the guest disables the APIC:
KVM raises KVM_REQ_APF_READY when the APIC is re-enabled.
In addition to this, I also included a few fixes for a few semi-theoretical
bugs I found while debugging this.
V2: incorporated review feedback from Paolo Bonzini and Sean Christopherson.
Best regards,
Maxim Levitsky
Maxim Levitsky (3):
KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is
not enabled
KVM: x86: Fix a semi theoretical bug in
kvm_arch_async_page_present_queued
KVM: x86: Fix the interaction between SMM and the asynchronous
pagefault
arch/x86/kvm/x86.c | 37 +++++++++++++++++++++++--------------
1 file changed, 23 insertions(+), 14 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 6+ messages in thread
* [PATCH v2 1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled
2025-10-15 3:32 [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM Maxim Levitsky
@ 2025-10-15 3:32 ` Maxim Levitsky
2025-10-15 3:32 ` [PATCH v2 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued Maxim Levitsky
` (2 subsequent siblings)
3 siblings, 0 replies; 6+ messages in thread
From: Maxim Levitsky @ 2025-10-15 3:32 UTC (permalink / raw)
To: kvm
Cc: x86, Thomas Gleixner, Paolo Bonzini, Ingo Molnar, linux-kernel,
Sean Christopherson, Borislav Petkov, H. Peter Anvin, Dave Hansen,
Maxim Levitsky
KVM flushes the APF queue completely when the asynchronous pagefault is
disabled, therefore this case should not occur.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
arch/x86/kvm/x86.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b8138bd4857..22024de00cbd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13896,7 +13896,7 @@ void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
bool kvm_arch_can_dequeue_async_page_present(struct kvm_vcpu *vcpu)
{
- if (!kvm_pv_async_pf_enabled(vcpu))
+ if (WARN_ON_ONCE(!kvm_pv_async_pf_enabled(vcpu)))
return true;
else
return kvm_lapic_enabled(vcpu) && apf_pageready_slot_free(vcpu);
--
2.49.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH v2 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
2025-10-15 3:32 [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM Maxim Levitsky
2025-10-15 3:32 ` [PATCH v2 1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled Maxim Levitsky
@ 2025-10-15 3:32 ` Maxim Levitsky
2025-10-15 3:32 ` [PATCH v2 3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault Maxim Levitsky
2025-11-10 15:37 ` [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM Sean Christopherson
3 siblings, 0 replies; 6+ messages in thread
From: Maxim Levitsky @ 2025-10-15 3:32 UTC (permalink / raw)
To: kvm
Cc: x86, Thomas Gleixner, Paolo Bonzini, Ingo Molnar, linux-kernel,
Sean Christopherson, Borislav Petkov, H. Peter Anvin, Dave Hansen,
Maxim Levitsky
Fix a semi-theoretical race condition related to a lack of memory barriers
when dealing with vcpu->arch.apf.pageready_pending.
We have the following lockless code implementing the sleep/wake pattern:
kvm_arch_async_page_present_queued() running in workqueue context:
	kvm_make_request(KVM_REQ_APF_READY, vcpu);
	/* memory barrier is missing here */
	if (!vcpu->arch.apf.pageready_pending)
		kvm_vcpu_kick(vcpu);
And vCPU code running:
	kvm_set_msr_common()
		vcpu->arch.apf.pageready_pending = false;
		/* memory barrier is missing here */
And later, in vcpu_enter_guest():
	if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
		kvm_check_async_pf_completion(vcpu)
Add the missing full memory barriers in both places to avoid the theoretical
case of not kicking the vCPU thread.
Note that the bug is mostly theoretical because kvm_make_request
already uses an atomic bit operation, which is serializing on x86, so the
smp_mb__after_atomic() after it is a NOP there and serves only as
documentation.
The second missing barrier, between kvm_set_msr_common and
vcpu_enter_guest, isn't strictly needed because KVM executes several
barriers between these two functions; however, it still makes
sense to have an explicit barrier to be on the safe side.
Finally, also use READ_ONCE/WRITE_ONCE.
Thanks a lot to Paolo for the help with this patch.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
arch/x86/kvm/x86.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)
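The barrier pairing described in the commit message can be sketched in
portable C11 atomics. This is an illustrative model only, not the kernel
code: the model_* names, the request bit value and the kick counter are
hypothetical, seq_cst fences stand in for the kernel's full barriers, and
the vCPU side is compressed into a single function even though in KVM the
request check happens later in vcpu_enter_guest().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_REQ_APF_READY 1u	/* hypothetical request bit */

struct model_vcpu {
	atomic_uint requests;		/* stands in for the vCPU request bits */
	atomic_bool pageready_pending;	/* stands in for apf.pageready_pending */
	int kicks;			/* counts modelled vCPU kicks */
};

static void model_init(struct model_vcpu *v, bool pending)
{
	atomic_init(&v->requests, 0u);
	atomic_init(&v->pageready_pending, pending);
	v->kicks = 0;
}

/* Completion side: publish the request, fence, then read the flag. */
static void model_page_present_queued(struct model_vcpu *v)
{
	atomic_fetch_or_explicit(&v->requests, MODEL_REQ_APF_READY,
				 memory_order_relaxed);
	/* Full barrier: orders the request store before the flag load. */
	atomic_thread_fence(memory_order_seq_cst);
	if (!atomic_load_explicit(&v->pageready_pending, memory_order_relaxed))
		v->kicks++;
}

/* vCPU side: clear the flag, fence, then test-and-clear the request. */
static bool model_ack_and_check(struct model_vcpu *v)
{
	unsigned int old;

	atomic_store_explicit(&v->pageready_pending, false,
			      memory_order_relaxed);
	/* Full barrier: orders the flag store before the request load. */
	atomic_thread_fence(memory_order_seq_cst);
	old = atomic_fetch_and_explicit(&v->requests, ~MODEL_REQ_APF_READY,
					memory_order_relaxed);
	return (old & MODEL_REQ_APF_READY) != 0;
}

/* Sequential walk-through of the two sides. */
static void model_walkthrough(void)
{
	struct model_vcpu v;

	model_init(&v, true);
	model_page_present_queued(&v);	/* flag still set: no kick */
	assert(v.kicks == 0);
	assert(model_ack_and_check(&v));	/* vCPU side sees the request */
}
```

With both fences in place, when the two sides race, either the completion
side observes the cleared flag and kicks, or the vCPU side observes the
request; without them, both loads could in principle read stale values and
the wakeup would be lost.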
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 22024de00cbd..0fc7171ced26 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4184,7 +4184,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
if (!guest_pv_has(vcpu, KVM_FEATURE_ASYNC_PF_INT))
return 1;
if (data & 0x1) {
- vcpu->arch.apf.pageready_pending = false;
+
+ /* Pairs with a memory barrier in kvm_arch_async_page_present_queued */
+ smp_store_mb(vcpu->arch.apf.pageready_pending, false);
+
kvm_check_async_pf_completion(vcpu);
}
break;
@@ -13879,7 +13882,7 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
if ((work->wakeup_all || work->notpresent_injected) &&
kvm_pv_async_pf_enabled(vcpu) &&
!apf_put_user_ready(vcpu, work->arch.token)) {
- vcpu->arch.apf.pageready_pending = true;
+ WRITE_ONCE(vcpu->arch.apf.pageready_pending, true);
kvm_apic_set_irq(vcpu, &irq, NULL);
}
@@ -13890,7 +13893,11 @@ void kvm_arch_async_page_present(struct kvm_vcpu *vcpu,
void kvm_arch_async_page_present_queued(struct kvm_vcpu *vcpu)
{
kvm_make_request(KVM_REQ_APF_READY, vcpu);
- if (!vcpu->arch.apf.pageready_pending)
+
+ /* Pairs with smp_store_mb in kvm_set_msr_common */
+ smp_mb__after_atomic();
+
+ if (!READ_ONCE(vcpu->arch.apf.pageready_pending))
kvm_vcpu_kick(vcpu);
}
--
2.49.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* [PATCH v2 3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault
2025-10-15 3:32 [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM Maxim Levitsky
2025-10-15 3:32 ` [PATCH v2 1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled Maxim Levitsky
2025-10-15 3:32 ` [PATCH v2 2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued Maxim Levitsky
@ 2025-10-15 3:32 ` Maxim Levitsky
2025-11-10 15:37 ` [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM Sean Christopherson
3 siblings, 0 replies; 6+ messages in thread
From: Maxim Levitsky @ 2025-10-15 3:32 UTC (permalink / raw)
To: kvm
Cc: x86, Thomas Gleixner, Paolo Bonzini, Ingo Molnar, linux-kernel,
Sean Christopherson, Borislav Petkov, H. Peter Anvin, Dave Hansen,
Maxim Levitsky
Currently a #SMI can cause KVM to drop an #APF ready event, which
subsequently causes the guest to never resume the task that is waiting
for it.
This can result in tasks becoming permanently stuck within the guest.
This happens because KVM flushes the APF queue without notifying the guest
of completed APF requests when the guest exits to real mode,
and the SMM exit code calls kvm_set_cr0 with CR0.PE == 0, which triggers
this code.
It must be noted that while this flush is reasonable for an actual
real mode entry, on SMM exit it achieves nothing because it is too late to
flush the queue at that point.
To fix this, avoid doing this flush altogether, and handle the real
mode entry/exits in the same way KVM already handles the APIC
enable/disable events:
APF completion events are not injected while the APIC is disabled,
and once the APIC is re-enabled, KVM raises the KVM_REQ_APF_READY request,
which causes the first pending #APF ready event to be injected prior
to entry to guest mode.
This change also has the side benefit of preserving #APF events if the
guest temporarily enters real mode - for example, to call firmware -
although such usage should be extremely rare in modern operating systems.
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
arch/x86/kvm/x86.c | 22 ++++++++++++----------
1 file changed, 12 insertions(+), 10 deletions(-)
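The decision logic that the diff below adds to kvm_post_set_cr0 can be
modelled as a small pure function. This is a sketch for illustration, not
the kernel code: the model_* names and the enum are hypothetical, and the
two MSR bit values are assumptions about the KVM paravirt ABI rather than
something stated in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CR0_PG (1UL << 31)	/* architectural paging-enable bit */

/* Assumed bit values for the async PF enable MSR; illustrative only. */
#define MODEL_ASYNC_PF_ENABLED		(1ULL << 0)
#define MODEL_ASYNC_PF_DELIVERY_AS_INT	(1ULL << 3)

/* Hypothetical stand-ins for the requests KVM would raise. */
enum model_req {
	MODEL_REQ_NONE,
	MODEL_REQ_TLB_FLUSH_GUEST,
	MODEL_REQ_APF_READY,
};

/* Both bits must be set for interrupt-based APF to count as enabled. */
static bool model_pv_async_pf_enabled(uint64_t msr_en_val)
{
	uint64_t mask = MODEL_ASYNC_PF_ENABLED | MODEL_ASYNC_PF_DELIVERY_AS_INT;

	return (msr_en_val & mask) == mask;
}

/* Models the CR0.PG branch after the patch: no queue flush anywhere. */
static enum model_req model_post_set_cr0(unsigned long old_cr0,
					 unsigned long cr0,
					 uint64_t msr_en_val)
{
	if (!((cr0 ^ old_cr0) & CR0_PG))
		return MODEL_REQ_NONE;		/* PG unchanged */
	if (!(cr0 & CR0_PG))
		return MODEL_REQ_TLB_FLUSH_GUEST;	/* paging disabled */
	if (model_pv_async_pf_enabled(msr_en_val))
		return MODEL_REQ_APF_READY;	/* paging re-enabled */
	return MODEL_REQ_NONE;
}

/* Walk-through: re-enabling paging with APF on re-checks completions. */
static void model_post_set_cr0_walkthrough(void)
{
	uint64_t en = MODEL_ASYNC_PF_ENABLED | MODEL_ASYNC_PF_DELIVERY_AS_INT;

	assert(model_post_set_cr0(0UL, CR0_PG, en) == MODEL_REQ_APF_READY);
}
```

The point of the model is that completed #APF events are never discarded on
a CR0.PG transition; they are simply re-examined once paging is back on.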
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0fc7171ced26..ec96328634ed 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1046,6 +1046,13 @@ bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr)
}
EXPORT_SYMBOL_FOR_KVM_INTERNAL(kvm_require_dr);
+static bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
+{
+ u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
+
+ return (vcpu->arch.apf.msr_en_val & mask) == mask;
+}
+
static inline u64 pdptr_rsvd_bits(struct kvm_vcpu *vcpu)
{
return vcpu->arch.reserved_gpa_bits | rsvd_bits(5, 8) | rsvd_bits(1, 2);
@@ -1138,15 +1145,17 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
}
if ((cr0 ^ old_cr0) & X86_CR0_PG) {
- kvm_clear_async_pf_completion_queue(vcpu);
- kvm_async_pf_hash_reset(vcpu);
-
/*
* Clearing CR0.PG is defined to flush the TLB from the guest's
* perspective.
*/
if (!(cr0 & X86_CR0_PG))
kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
+ /*
+ * Re-check APF completion events, when the guest re-enables paging.
+ */
+ else if (kvm_pv_async_pf_enabled(vcpu))
+ kvm_make_request(KVM_REQ_APF_READY, vcpu);
}
if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
@@ -3651,13 +3660,6 @@ static int set_msr_mce(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
return 0;
}
-static inline bool kvm_pv_async_pf_enabled(struct kvm_vcpu *vcpu)
-{
- u64 mask = KVM_ASYNC_PF_ENABLED | KVM_ASYNC_PF_DELIVERY_AS_INT;
-
- return (vcpu->arch.apf.msr_en_val & mask) == mask;
-}
-
static int kvm_pv_enable_async_pf(struct kvm_vcpu *vcpu, u64 data)
{
gpa_t gpa = data & ~0x3f;
--
2.49.0
^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM
2025-10-15 3:32 [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM Maxim Levitsky
` (2 preceding siblings ...)
2025-10-15 3:32 ` [PATCH v2 3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault Maxim Levitsky
@ 2025-11-10 15:37 ` Sean Christopherson
2025-11-10 18:19 ` mlevitsk
3 siblings, 1 reply; 6+ messages in thread
From: Sean Christopherson @ 2025-11-10 15:37 UTC (permalink / raw)
To: Sean Christopherson, kvm, Maxim Levitsky
Cc: x86, Thomas Gleixner, Paolo Bonzini, Ingo Molnar, linux-kernel,
Borislav Petkov, H. Peter Anvin, Dave Hansen
On Tue, 14 Oct 2025 23:32:55 -0400, Maxim Levitsky wrote:
> Recently we debugged a customer case in which the guest VM was showing
> tasks permanently stuck in the kvm_async_pf_task_wait_schedule.
>
> This was traced to the incorrect flushing of the async pagefault queue,
> which was done during the real mode entry by the kvm_post_set_cr0.
>
> This code, the kvm_clear_async_pf_completion_queue does wait for all #APF
> tasks to complete but then it proceeds to wipe the 'done' queue without
> notifying the guest.
>
> [...]
Applied 2 and 3 to kvm-x86 misc. The async #PF delivery path is also used by
the host-only version of async #PF (where KVM puts the vCPU into HLT instead of
letting the kernel schedule() in I/O), and so it's entirely expected that KVM
will dequeue completed async #PFs when the PV version is disabled.
https://lore.kernel.org/all/aQ5BiLBWGKcMe-mM@google.com
[1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled
[DROP]
[2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
https://github.com/kvm-x86/linux/commit/68c35f89d016
[3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault
https://github.com/kvm-x86/linux/commit/ab4e41eb9fab
--
https://github.com/kvm-x86/linux/tree/next
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM
2025-11-10 15:37 ` [PATCH v2 0/3] Fix a lost async pagefault notification when the guest is using SMM Sean Christopherson
@ 2025-11-10 18:19 ` mlevitsk
0 siblings, 0 replies; 6+ messages in thread
From: mlevitsk @ 2025-11-10 18:19 UTC (permalink / raw)
To: Sean Christopherson, kvm
Cc: x86, Thomas Gleixner, Paolo Bonzini, Ingo Molnar, linux-kernel,
Borislav Petkov, H. Peter Anvin, Dave Hansen
On Mon, 2025-11-10 at 07:37 -0800, Sean Christopherson wrote:
> On Tue, 14 Oct 2025 23:32:55 -0400, Maxim Levitsky wrote:
> > Recently we debugged a customer case in which the guest VM was showing
> > tasks permanently stuck in the kvm_async_pf_task_wait_schedule.
> >
> > This was traced to the incorrect flushing of the async pagefault queue,
> > which was done during the real mode entry by the kvm_post_set_cr0.
> >
> > This code, the kvm_clear_async_pf_completion_queue does wait for all #APF
> > tasks to complete but then it proceeds to wipe the 'done' queue without
> > notifying the guest.
> >
> > [...]
>
> Applied 2 and 3 to kvm-x86 misc. The async #PF delivery path is also used by
> the host-only version of async #PF (where KVM puts the vCPU into HLT instead of
> letting the kernel schedule() in I/O), and so it's entirely expected that KVM
> will dequeue completed async #PFs when the PV version is disabled.
True, sorry for the confusion.
Thanks,
Best regards,
Maxim Levitsky
>
> https://lore.kernel.org/all/aQ5BiLBWGKcMe-mM@google.com
>
> [1/3] KVM: x86: Warn if KVM tries to deliver an #APF completion when APF is not enabled
> [DROP]
> [2/3] KVM: x86: Fix a semi theoretical bug in kvm_arch_async_page_present_queued
> https://github.com/kvm-x86/linux/commit/68c35f89d016
> [3/3] KVM: x86: Fix the interaction between SMM and the asynchronous pagefault
> https://github.com/kvm-x86/linux/commit/ab4e41eb9fab
>
> --
> https://github.com/kvm-x86/linux/tree/next
>
^ permalink raw reply [flat|nested] 6+ messages in thread