* [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS
From: Fei Li @ 2025-08-27 15:27 UTC
To: seanjc, pbonzini, tglx, mingo, bp, dave.hansen, liran.alon, hpa,
wanpeng.li
Cc: kvm, x86, linux-kernel, Fei Li, stable
Commit ff90afa75573 ("KVM: x86: Evaluate latched_init in
KVM_SET_VCPU_EVENTS when vCPU not in SMM") changed the KVM_SET_VCPU_EVENTS
handler to set the pending LAPIC INIT event regardless of whether the vCPU
is in SMM mode.

However, latching INIT without checking the CPU state introduces a race
condition that can lose an INIT event. This is fatal during VM startup
because it can leave an AP unable to ever switch to non-root mode. Just as
commit f4ef19108608 ("KVM: X86: Fix loss of pending INIT due to race")
described:
BSP                                     AP
                                        kvm_vcpu_ioctl_x86_get_vcpu_events
                                          events->smi.latched_init = 0

                                        kvm_vcpu_block
                                          kvm_vcpu_check_block
                                            schedule

send INIT to AP
                                        kvm_vcpu_ioctl_x86_set_vcpu_events
                                        (e.g. `info registers -a` when VM starts/reboots)
                                          if (events->smi.latched_init == 0)
                                            clear INIT in pending_events

                                        kvm_apic_accept_events
                                          test_bit(KVM_APIC_INIT, &pe) == false
                                        vcpu->arch.mp_state stays UNINITIALIZED

send SIPI to AP
                                        kvm_apic_accept_events
                                          test_bit(KVM_APIC_SIPI, &pe) == false
                                        vcpu->arch.mp_state will never change to RUNNABLE
                                        (expected: UNINITIALIZED => INIT_RECEIVED => RUNNABLE)
                                        AP will never switch to non-root operation
When this race hits, the VM hangs. E.g., the BSP loops in SeaBIOS's SMPLock
and the AP is never reset; the qemu hmp command "info registers -a" shows:
CPU#0
EAX=00000002 EBX=00000002 ECX=00000000 EDX=00020000
ESI=00000000 EDI=00000000 EBP=00000008 ESP=00006c6c
EIP=000ef570 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
......
CPU#1
EAX=00000000 EBX=00000000 ECX=00000000 EDX=00080660
ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0000 00000000 0000ffff 00009300
CS =f000 ffff0000 0000ffff 00009b00
......
Fix this by handling latched INITs only in specific CPU states (SMM,
VMX non-root mode, SVM with GIF=0) in KVM_SET_VCPU_EVENTS.
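For reference, the gating helper lives in arch/x86/kvm/lapic.h and is
roughly the following (the exact static-call spelling varies across
kernel versions, so treat this as a sketch):

  static inline bool kvm_apic_init_sipi_allowed(struct kvm_vcpu *vcpu)
  {
          /*
           * INIT/SIPI are deliverable only outside SMM, and only when
           * the vendor callback does not block them (VMX non-root
           * mode, SVM with GIF=0).
           */
          return !is_smm(vcpu) &&
                 !kvm_x86_call(apic_init_signal_blocked)(vcpu);
  }

With the one-line change below, userspace writes to smi.latched_init take
effect only while INIT delivery is actually blocked, i.e. while an INIT
can legitimately sit latched in pending_events.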
Cc: stable@vger.kernel.org
Fixes: ff90afa75573 ("KVM: x86: Evaluate latched_init in KVM_SET_VCPU_EVENTS when vCPU not in SMM")
Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
---
arch/x86/kvm/x86.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a1c49bc681c46..7001b2af00ed1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5556,7 +5556,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
return -EINVAL;
#endif
- if (lapic_in_kernel(vcpu)) {
+ if (!kvm_apic_init_sipi_allowed(vcpu) && lapic_in_kernel(vcpu)) {
if (events->smi.latched_init)
set_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
else
--
2.39.2 (Apple Git-143)
* Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS
From: Sean Christopherson @ 2025-08-27 16:01 UTC
To: Fei Li
Cc: pbonzini, tglx, mingo, bp, dave.hansen, liran.alon, hpa,
wanpeng.li, kvm, x86, linux-kernel, stable
On Wed, Aug 27, 2025, Fei Li wrote:
> Commit ff90afa75573 ("KVM: x86: Evaluate latched_init in
> KVM_SET_VCPU_EVENTS when vCPU not in SMM") changed the KVM_SET_VCPU_EVENTS
> handler to set the pending LAPIC INIT event regardless of whether the vCPU
> is in SMM mode.
>
> However, latching INIT without checking the CPU state introduces a race
> condition that can lose an INIT event. This is fatal during VM startup
> because it can leave an AP unable to ever switch to non-root mode. Just as
> commit f4ef19108608 ("KVM: X86: Fix loss of pending INIT due to race")
> described:
> BSP                                     AP
>                                         kvm_vcpu_ioctl_x86_get_vcpu_events
>                                           events->smi.latched_init = 0
>
>                                         kvm_vcpu_block
>                                           kvm_vcpu_check_block
>                                             schedule
>
> send INIT to AP
>                                         kvm_vcpu_ioctl_x86_set_vcpu_events
>                                         (e.g. `info registers -a` when VM starts/reboots)
>                                           if (events->smi.latched_init == 0)
>                                             clear INIT in pending_events
This is a QEMU bug, no? IIUC, it's invoking kvm_vcpu_ioctl_x86_set_vcpu_events()
with stale data. I'm also a bit confused as to how QEMU is even gaining control
of the vCPU to emit KVM_SET_VCPU_EVENTS if the vCPU is in kvm_vcpu_block().
>                                         kvm_apic_accept_events
>                                           test_bit(KVM_APIC_INIT, &pe) == false
>                                         vcpu->arch.mp_state stays UNINITIALIZED
>
> send SIPI to AP
>                                         kvm_apic_accept_events
>                                           test_bit(KVM_APIC_SIPI, &pe) == false
>                                         vcpu->arch.mp_state will never change to RUNNABLE
>                                         (expected: UNINITIALIZED => INIT_RECEIVED => RUNNABLE)
>                                         AP will never switch to non-root operation
>
> When this race hits, the VM hangs. E.g., the BSP loops in SeaBIOS's SMPLock
> and the AP is never reset; the qemu hmp command "info registers -a" shows:
> CPU#0
> EAX=00000002 EBX=00000002 ECX=00000000 EDX=00020000
> ESI=00000000 EDI=00000000 EBP=00000008 ESP=00006c6c
> EIP=000ef570 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ......
> CPU#1
> EAX=00000000 EBX=00000000 ECX=00000000 EDX=00080660
> ESI=00000000 EDI=00000000 EBP=00000000 ESP=00000000
> EIP=0000fff0 EFL=00000002 [-------] CPL=0 II=0 A20=1 SMM=0 HLT=0
> ES =0000 00000000 0000ffff 00009300
> CS =f000 ffff0000 0000ffff 00009b00
> ......
>
> Fix this by handling latched INITs only in specific CPU states (SMM,
> VMX non-root mode, SVM with GIF=0) in KVM_SET_VCPU_EVENTS.
>
> Cc: stable@vger.kernel.org
> Fixes: ff90afa75573 ("KVM: x86: Evaluate latched_init in KVM_SET_VCPU_EVENTS when vCPU not in SMM")
> Signed-off-by: Fei Li <lifei.shirley@bytedance.com>
> ---
> arch/x86/kvm/x86.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a1c49bc681c46..7001b2af00ed1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5556,7 +5556,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
> return -EINVAL;
> #endif
>
> - if (lapic_in_kernel(vcpu)) {
> + if (!kvm_apic_init_sipi_allowed(vcpu) && lapic_in_kernel(vcpu)) {
> if (events->smi.latched_init)
> set_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
> else
> --
> 2.39.2 (Apple Git-143)
>
* Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS
From: Paolo Bonzini @ 2025-08-27 16:08 UTC
To: Sean Christopherson
Cc: Fei Li, tglx, mingo, bp, dave.hansen, liran.alon, hpa, wanpeng.li,
kvm, x86, linux-kernel, stable
On Wed, Aug 27, 2025 at 6:01 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Aug 27, 2025, Fei Li wrote:
> > Commit ff90afa75573 ("KVM: x86: Evaluate latched_init in
> > KVM_SET_VCPU_EVENTS when vCPU not in SMM") changed the KVM_SET_VCPU_EVENTS
> > handler to set the pending LAPIC INIT event regardless of whether the vCPU
> > is in SMM mode.
> >
> > However, latching INIT without checking the CPU state introduces a race
> > condition that can lose an INIT event. This is fatal during VM startup
> > because it can leave an AP unable to ever switch to non-root mode. Just as
> > commit f4ef19108608 ("KVM: X86: Fix loss of pending INIT due to race")
> > described:
> > BSP                                     AP
> >                                         kvm_vcpu_ioctl_x86_get_vcpu_events
> >                                           events->smi.latched_init = 0
> >
> >                                         kvm_vcpu_block
> >                                           kvm_vcpu_check_block
> >                                             schedule
> >
> > send INIT to AP
> >                                         kvm_vcpu_ioctl_x86_set_vcpu_events
> >                                         (e.g. `info registers -a` when VM starts/reboots)
> >                                           if (events->smi.latched_init == 0)
> >                                             clear INIT in pending_events
>
> This is a QEMU bug, no?
I think I agree.
> IIUC, it's invoking kvm_vcpu_ioctl_x86_set_vcpu_events()
> with stale data.
More precisely, it's not expecting other vCPUs to change the pending
events asynchronously.
> I'm also a bit confused as to how QEMU is even gaining control
> of the vCPU to emit KVM_SET_VCPU_EVENTS if the vCPU is in
> kvm_vcpu_block().
With a signal. :)
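Roughly: QEMU keeps SIG_IPI blocked on the vCPU thread except while inside
the KVM_RUN ioctl (via KVM_SET_SIGNAL_MASK), so a pthread_kill() from
another thread makes KVM see signal_pending(), leave kvm_vcpu_block() and
return -EINTR. A sketch of the mechanism, not the exact QEMU code:

  sigset_t set;

  sigfillset(&set);
  sigdelset(&set, SIG_IPI);            /* unblocked only during KVM_RUN */
  /* kmask: a struct kvm_signal_mask wrapping 'set' */
  ioctl(vcpu_fd, KVM_SET_SIGNAL_MASK, kmask);
  ...
  pthread_kill(vcpu_thread, SIG_IPI);  /* kick: KVM_RUN returns -EINTR */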
Paolo
* Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS
From: Fei Li @ 2025-08-28 15:13 UTC
To: Paolo Bonzini, Sean Christopherson
Cc: tglx, mingo, bp, dave.hansen, liran.alon, hpa, wanpeng.li, kvm,
x86, linux-kernel, stable
On 8/28/25 12:08 AM, Paolo Bonzini wrote:
> On Wed, Aug 27, 2025 at 6:01 PM Sean Christopherson <seanjc@google.com> wrote:
>> On Wed, Aug 27, 2025, Fei Li wrote:
>>> Commit ff90afa75573 ("KVM: x86: Evaluate latched_init in
>>> KVM_SET_VCPU_EVENTS when vCPU not in SMM") changed the KVM_SET_VCPU_EVENTS
>>> handler to set the pending LAPIC INIT event regardless of whether the vCPU
>>> is in SMM mode.
>>>
>>> However, latching INIT without checking the CPU state introduces a race
>>> condition that can lose an INIT event. This is fatal during VM startup
>>> because it can leave an AP unable to ever switch to non-root mode. Just as
>>> commit f4ef19108608 ("KVM: X86: Fix loss of pending INIT due to race")
>>> described:
>>> BSP                                     AP
>>>                                         kvm_vcpu_ioctl_x86_get_vcpu_events
>>>                                           events->smi.latched_init = 0
>>>
>>>                                         kvm_vcpu_block
>>>                                           kvm_vcpu_check_block
>>>                                             schedule
>>>
>>> send INIT to AP
>>>                                         kvm_vcpu_ioctl_x86_set_vcpu_events
>>>                                         (e.g. `info registers -a` when VM starts/reboots)
>>>                                           if (events->smi.latched_init == 0)
>>>                                             clear INIT in pending_events
>> This is a QEMU bug, no?
> I think I agree.
Actually this is a bug triggered by a monitor tool in our production
environment. The monitor executes the 'info registers -a' hmp command at a
fixed frequency, even during the VM startup process, which makes some APs
stay in KVM_MP_STATE_UNINITIALIZED forever. But this race only occurs with
extremely low probability, about 1~2 VM hangs per week.

Considering that other emulators, like cloud-hypervisor and firecracker,
may also have similar potential race issues, I think KVM had better do some
handling. But anyway, I will check the Qemu code to avoid such a race.
Thanks for both of your comments. 🙂
Have a nice day, thanks
Fei
>
>> IIUC, it's invoking kvm_vcpu_ioctl_x86_set_vcpu_events()
>> with stale data.
> More precisely, it's not expecting other vCPUs to change the pending
> events asynchronously.
Yes, I will sort out a more complete calling sequence later.
>
>> I'm also a bit confused as to how QEMU is even gaining control
>> of the vCPU to emit KVM_SET_VCPU_EVENTS if the vCPU is in
>> kvm_vcpu_block().
> With a signal. :)
>
> Paolo
>
* Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS
From: Paolo Bonzini @ 2025-08-28 16:44 UTC
To: Fei Li
Cc: Sean Christopherson, tglx, mingo, bp, dave.hansen, liran.alon,
hpa, wanpeng.li, kvm, x86, linux-kernel, stable
On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@bytedance.com> wrote:
> Actually this is a bug triggered by a monitor tool in our production
> environment. The monitor executes the 'info registers -a' hmp command at a
> fixed frequency, even during the VM startup process, which makes some APs
> stay in KVM_MP_STATE_UNINITIALIZED forever. But this race only occurs with
> extremely low probability, about 1~2 VM hangs per week.
>
> Considering that other emulators, like cloud-hypervisor and firecracker,
> may also have similar potential race issues, I think KVM had better do some
> handling. But anyway, I will check the Qemu code to avoid such a race.
> Thanks for both of your comments. 🙂
If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in
similar cases, that of course would help understanding the situation
better.
In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all
vCPUs have halted.
Paolo
* Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS
From: Fei Li @ 2025-09-05 14:59 UTC
To: Paolo Bonzini, Sean Christopherson
Cc: tglx, mingo, bp, dave.hansen, liran.alon, hpa, wanpeng.li, kvm,
x86, linux-kernel, stable
On 8/29/25 12:44 AM, Paolo Bonzini wrote:
> On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@bytedance.com> wrote:
>> Actually this is a bug triggered by a monitor tool in our production
>> environment. The monitor executes the 'info registers -a' hmp command at a
>> fixed frequency, even during the VM startup process, which makes some APs
>> stay in KVM_MP_STATE_UNINITIALIZED forever. But this race only occurs with
>> extremely low probability, about 1~2 VM hangs per week.
>>
>> Considering that other emulators, like cloud-hypervisor and firecracker,
>> may also have similar potential race issues, I think KVM had better do some
>> handling. But anyway, I will check the Qemu code to avoid such a race.
>> Thanks for both of your comments. 🙂
> If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in
> similar cases, that of course would help understanding the situation
> better.
>
> In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all
> vCPUs have halted.
>
> Paolo
>
Hi Paolo and Sean,
Sorry for the late response, I have been a little busy with other things
recently. The complete calling processes for the bad case are as follows:
`info registers -a` hmp per 2ms[1]
AP(vcpu1) thread[2]
BSP(vcpu0) send INIT/SIPI[3]

[2] KVM: KVM_RUN and then schedule() in kvm_vcpu_block() loop

[1] for each cpu: cpu_synchronize_state
      if !qemu_thread_is_self()
        1. insert to cpu->work_list, and handle asynchronously
        2. then kick the AP(vcpu1) by sending SIG_IPI/SIGUSR1 signal

[2] KVM: checks signal_pending, breaks loop and returns -EINTR
    Qemu: break kvm_cpu_exec loop, run
    1. qemu_wait_io_event()
       => process_queued_cpu_work => cpu->work_list.func()
          i.e. do_kvm_cpu_synchronize_state() callback
          => kvm_arch_get_registers
             => kvm_get_mp_state
                /* KVM: get_mpstate also calls kvm_apic_accept_events() to handle INIT and SIPI */
          => cpu->vcpu_dirty = true;
       // end of qemu_wait_io_event

[3] SeaBIOS: BSP enters non-root mode and runs reset_vector() in SeaBIOS.
    send INIT and then SIPI by writing APIC_ICR during smp_scan
    KVM: BSP(vcpu0) exits, then
    => handle_apic_write
       => kvm_lapic_reg_write
          => kvm_apic_send_ipi to all APs
             => for each AP: __apic_accept_irq, e.g. for AP(vcpu1)
                => case APIC_DM_INIT:
                   apic->pending_events = (1UL << KVM_APIC_INIT) (not kick the AP yet)
                => case APIC_DM_STARTUP:
                   set_bit(KVM_APIC_SIPI, &apic->pending_events) (not kick the AP yet)

[2] 2. kvm_cpu_exec()
    => if (cpu->vcpu_dirty):
       => kvm_arch_put_registers
          => kvm_put_vcpu_events
    KVM: kvm_vcpu_ioctl_x86_set_vcpu_events
    => clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
       i.e. pending_events changes from 11b to 10b
    // end of kvm_vcpu_ioctl_x86_set_vcpu_events
    Qemu: => after put_registers, cpu->vcpu_dirty = false;
    => kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
    KVM: KVM_RUN
    => schedule() in kvm_vcpu_block() until Qemu's next SIG_IPI/SIGUSR1 signal
    /* But AP(vcpu1)'s mp_state will never change from KVM_MP_STATE_UNINITIALIZED
       to KVM_MP_STATE_INIT_RECEIVED, even then to KVM_MP_STATE_RUNNABLE without
       handling INIT inside kvm_apic_accept_events(), considering BSP will never
       send INIT/SIPI again during smp_scan. Then AP(vcpu1) will never enter
       non-root mode */

[3] SeaBIOS: waits CountCPUs == expected_cpus_count and loops forever
    i.e. the AP(vcpu1) stays: EIP=0000fff0 && CS =f000 ffff0000
    and BSP(vcpu0) appears 100% utilized as it is in a while loop.
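Stripped of the QEMU specifics, this is a classic lost update on
apic->pending_events; a minimal sketch, with hypothetical helper names:

  events = get_vcpu_events(ap);   /* snapshot: latched_init == 0 */
                                  /* ... BSP's INIT lands: KVM sets
                                     KVM_APIC_INIT in pending_events ... */
  set_vcpu_events(ap, events);    /* stale write-back clears the
                                     pending INIT */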
As for other emulators (like cloud-hypervisor and firecracker), there is
no interactive command like 'info registers -a'.
But sorry again that I haven't had time to check code to confirm whether
they invoke KVM_SET_VCPU_EVENTS in similar cases, maybe later. :)
Have a nice day, thanks
Fei
* Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS
From: Fei Li @ 2025-09-08 14:55 UTC
To: Paolo Bonzini, Sean Christopherson
Cc: tglx, mingo, bp, dave.hansen, liran.alon, hpa, wanpeng.li, kvm,
x86, linux-kernel, stable
On 9/5/25 10:59 PM, Fei Li wrote:
>
> On 8/29/25 12:44 AM, Paolo Bonzini wrote:
>> On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@bytedance.com>
>> wrote:
>>> Actually this is a bug triggered by a monitor tool in our production
>>> environment. The monitor executes the 'info registers -a' hmp command at a
>>> fixed frequency, even during the VM startup process, which makes some APs
>>> stay in KVM_MP_STATE_UNINITIALIZED forever. But this race only occurs with
>>> extremely low probability, about 1~2 VM hangs per week.
>>>
>>> Considering that other emulators, like cloud-hypervisor and firecracker,
>>> may also have similar potential race issues, I think KVM had better do some
>>> handling. But anyway, I will check the Qemu code to avoid such a race.
>>> Thanks for both of your comments. 🙂
>> If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in
>> similar cases, that of course would help understanding the situation
>> better.
>>
>> In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all
>> vCPUs have halted.
>>
>> Paolo
>>
> Hi Paolo and Sean,
>
>
> Sorry for the late response, I have been a little busy with other
> things recently. The complete calling processes for the bad case are
> as follows:
>
> `info registers -a` hmp per 2ms[1]
> AP(vcpu1) thread[2]
> BSP(vcpu0) send INIT/SIPI[3]
>
> [2] KVM: KVM_RUN and then schedule() in kvm_vcpu_block() loop
>
> [1] for each cpu: cpu_synchronize_state
>       if !qemu_thread_is_self()
>         1. insert to cpu->work_list, and handle asynchronously
>         2. then kick the AP(vcpu1) by sending SIG_IPI/SIGUSR1 signal
>
> [2] KVM: checks signal_pending, breaks loop and returns -EINTR
>     Qemu: break kvm_cpu_exec loop, run
>     1. qemu_wait_io_event()
>        => process_queued_cpu_work => cpu->work_list.func()
>           i.e. do_kvm_cpu_synchronize_state() callback
>           => kvm_arch_get_registers
>              => kvm_get_mp_state
>                 /* KVM: get_mpstate also calls kvm_apic_accept_events() to handle INIT and SIPI */
>           => cpu->vcpu_dirty = true;
>        // end of qemu_wait_io_event
>
> [3] SeaBIOS: BSP enters non-root mode and runs reset_vector() in SeaBIOS.
>     send INIT and then SIPI by writing APIC_ICR during smp_scan
>     KVM: BSP(vcpu0) exits, then
>     => handle_apic_write
>        => kvm_lapic_reg_write
>           => kvm_apic_send_ipi to all APs
>              => for each AP: __apic_accept_irq, e.g. for AP(vcpu1)
>                 => case APIC_DM_INIT:
>                    apic->pending_events = (1UL << KVM_APIC_INIT) (not kick the AP yet)
>                 => case APIC_DM_STARTUP:
>                    set_bit(KVM_APIC_SIPI, &apic->pending_events) (not kick the AP yet)
>
> [2] 2. kvm_cpu_exec()
>     => if (cpu->vcpu_dirty):
>        => kvm_arch_put_registers
>           => kvm_put_vcpu_events
>     KVM: kvm_vcpu_ioctl_x86_set_vcpu_events
>     => clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
>        i.e. pending_events changes from 11b to 10b
>     // end of kvm_vcpu_ioctl_x86_set_vcpu_events
>     Qemu: => after put_registers, cpu->vcpu_dirty = false;
>     => kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
>     KVM: KVM_RUN
>     => schedule() in kvm_vcpu_block() until Qemu's next SIG_IPI/SIGUSR1 signal
>     /* But AP(vcpu1)'s mp_state will never change from KVM_MP_STATE_UNINITIALIZED
>        to KVM_MP_STATE_INIT_RECEIVED, even then to KVM_MP_STATE_RUNNABLE without
>        handling INIT inside kvm_apic_accept_events(), considering BSP will never
>        send INIT/SIPI again during smp_scan. Then AP(vcpu1) will never enter
>        non-root mode */
>
> [3] SeaBIOS: waits CountCPUs == expected_cpus_count and loops forever
>     i.e. the AP(vcpu1) stays: EIP=0000fff0 && CS =f000 ffff0000
>     and BSP(vcpu0) appears 100% utilized as it is in a while loop.
>
> As for other emulators (like cloud-hypervisor and firecracker), there
> is no interactive command like 'info registers -a'.
> But sorry again that I haven't had time to check code to confirm
> whether they invoke KVM_SET_VCPU_EVENTS in similar cases, maybe later. :)
>
>
> Have a nice day, thanks
> Fei
>
By the way, this doesn't seem to be a Qemu bug, since calling "info
registers -a" is allowed regardless of the vcpu state (including when
the VM is in the bootloader). Thus the INIT should not be latched in
this case. To fix this, I think we need to add the
kvm_apic_init_sipi_allowed() condition: only latch INITs in specific CPU
states. Or change mp_state to KVM_MP_STATE_INIT_RECEIVED and then clear
INIT here. Should I send a v2 patch with a clearer commit message?
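A rough sketch of that second option (hypothetical and untested; the real
kvm_apic_accept_events() also resets the vCPU on INIT, so this only
illustrates the idea):

  if (lapic_in_kernel(vcpu)) {
          if (events->smi.latched_init) {
                  set_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
          } else if (test_bit(KVM_APIC_INIT,
                              &vcpu->arch.apic->pending_events) &&
                     kvm_apic_init_sipi_allowed(vcpu)) {
                  /* Accept the INIT instead of silently dropping it. */
                  vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
                  clear_bit(KVM_APIC_INIT,
                            &vcpu->arch.apic->pending_events);
          }
  }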
Have a nice day, thanks
Fei
* Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS
From: Sean Christopherson @ 2025-09-08 16:14 UTC
To: Fei Li
Cc: Paolo Bonzini, tglx, mingo, bp, dave.hansen, liran.alon, hpa,
wanpeng.li, kvm, x86, linux-kernel, stable
On Mon, Sep 08, 2025, Fei Li wrote:
>
> On 9/5/25 10:59 PM, Fei Li wrote:
> >
> > On 8/29/25 12:44 AM, Paolo Bonzini wrote:
> > > On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@bytedance.com>
> > > wrote:
> > > > Actually this is a bug triggered by a monitor tool in our production
> > > > environment. The monitor executes the 'info registers -a' hmp command at a
> > > > fixed frequency, even during the VM startup process, which makes some APs
> > > > stay in KVM_MP_STATE_UNINITIALIZED forever. But this race only occurs with
> > > > extremely low probability, about 1~2 VM hangs per week.
> > > >
> > > > Considering that other emulators, like cloud-hypervisor and firecracker,
> > > > may also have similar potential race issues, I think KVM had better do some
> > > > handling. But anyway, I will check the Qemu code to avoid such a race.
> > > > Thanks for both of your comments. 🙂
> > > If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in
> > > similar cases, that of course would help understanding the situation
> > > better.
> > >
> > > In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all
> > > vCPUs have halted.
> > >
> > > Paolo
> > >
Replacing the original message with a decently formatted version. Please try to
format your emails for plain text, I assume something in your mail system inserted
a pile of line wraps and made the entire thing all but unreadable.
> > `info registers -a` hmp per 2ms[1]
> > AP(vcpu1) thread[2]
> > BSP(vcpu0) send INIT/SIPI[3]
> >
> > [1] for each cpu: cpu_synchronize_state
> > if !qemu_thread_is_self()
> > 1. insert to cpu->work_list, and handle asynchronously
> > 2. then kick the AP(vcpu1) by sending SIG_IPI/SIGUSR1 signal
> >
> > [2] KVM: KVM_RUN and then schedule() in kvm_vcpu_block() loop
> > KVM: checks signal_pending, breaks loop and returns -EINTR
> > Qemu: break kvm_cpu_exec loop, run
> > 1. qemu_wait_io_event()
> > => process_queued_cpu_work => cpu->work_list.func()
> > i.e. do_kvm_cpu_synchronize_state() callback
> > => kvm_arch_get_registers
> > => kvm_get_mp_state
> > /* KVM: get_mpstate also calls kvm_apic_accept_events() to handle INIT and SIPI */
> > => cpu->vcpu_dirty = true;
> > // end of qemu_wait_io_event
> >
> > [3] SeaBIOS: BSP enters non-root mode and runs reset_vector() in SeaBIOS.
> > send INIT and then SIPI by writing APIC_ICR during smp_scan
> > KVM: BSP(vcpu0) exits, then
> > => handle_apic_write
> > => kvm_lapic_reg_write
> > => kvm_apic_send_ipi to all APs
> > => for each AP: __apic_accept_irq, e.g. for AP(vcpu1)
> > => case APIC_DM_INIT:
> > apic->pending_events = (1UL << KVM_APIC_INIT) (not kick the AP yet)
> > => case APIC_DM_STARTUP:
> > set_bit(KVM_APIC_SIPI, &apic->pending_events) (not kick the AP yet)
> >
> > [2] 2. kvm_cpu_exec()
> > => if (cpu->vcpu_dirty):
> > => kvm_arch_put_registers
> > => kvm_put_vcpu_events
> > KVM: kvm_vcpu_ioctl_x86_set_vcpu_events
> > => clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
> > i.e. pending_events changes from 11b to 10b
> > // end of kvm_vcpu_ioctl_x86_set_vcpu_events
Qemu is clearly "putting" stale data here.
> > Qemu: => after put_registers, cpu->vcpu_dirty = false;
> > => kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
> > KVM: KVM_RUN
> > => schedule() in kvm_vcpu_block() until Qemu's next SIG_IPI/SIGUSR1 signal
> > /* But AP(vcpu1)'s mp_state will never change from KVM_MP_STATE_UNINITIALIZED
> > to KVM_MP_STATE_INIT_RECEIVED, even then to KVM_MP_STATE_RUNNABLE without
> > handling INIT inside kvm_apic_accept_events(), considering BSP will never
> > send INIT/SIPI again during smp_scan. Then AP(vcpu1) will never enter
> > non-root mode */
> >
> > [3] SeaBIOS: waits CountCPUs == expected_cpus_count and loops forever
> > i.e. the AP(vcpu1) stays: EIP=0000fff0 && CS =f000 ffff0000
> > and BSP(vcpu0) appears 100% utilized as it is in a while loop.
> By the way, this doesn't seem to be a Qemu bug, since calling "info
> registers -a" is allowed regardless of the vcpu state (including when the VM
> is in the bootloader). Thus the INIT should not be latched in this case.
No, this is a Qemu bug. It is the VMM's responsibility to ensure it doesn't load
stale data into a vCPU. There is simply no way for KVM to do the right thing,
because KVM can't know if userspace _wants_ to clobber events versus when userspace
is racing, as in this case.
E.g. the exact same race exists with NMIs.
1. kvm_vcpu_ioctl_x86_get_vcpu_events()
vcpu->arch.nmi_queued = 0
vcpu->arch.nmi_pending = 0
kvm_vcpu_events.pending = 0
2. kvm_inject_nmi()
vcpu->arch.nmi_queued = 1
vcpu->arch.nmi_pending = 0
kvm_vcpu_events.pending = 0
3. kvm_vcpu_ioctl_x86_set_vcpu_events()
vcpu->arch.nmi_queued = 0 // Moved to nmi_pending by process_nmi()
vcpu->arch.nmi_pending = 0 // Explicitly cleared after process_nmi() when KVM_VCPUEVENT_VALID_NMI_PENDING
kvm_vcpu_events.pending = 0 // Stale data
But for NMI, Qemu avoids clobbering state thanks to a 15+ year old commit that
specifically avoids clobbering NMI *and SIPI* when not putting "reset" state:
commit ea64305139357e89f58fc05ff5d48dc233d44d87
Author: Jan Kiszka <jan.kiszka@siemens.com>
AuthorDate: Mon Mar 1 19:10:31 2010 +0100
Commit: Marcelo Tosatti <mtosatti@redhat.com>
CommitDate: Thu Mar 4 00:29:30 2010 -0300
KVM: x86: Restrict writeback of VCPU state
Do not write nmi_pending, sipi_vector, and mpstate unless we at least go
through a reset. And TSC as well as KVM wallclocks should only be
written on full sync, otherwise we risk to drop some time on state
read-modify-write.
if (level >= KVM_PUT_RESET_STATE) { <=========================
events.flags |= KVM_VCPUEVENT_VALID_NMI_PENDING;
if (env->mp_state == KVM_MP_STATE_SIPI_RECEIVED) {
events.flags |= KVM_VCPUEVENT_VALID_SIPI_VECTOR;
}
}
Presumably "SMIs" need the same treatment, e.g.
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 6c749d4ee8..f5bc0f9327 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5033,7 +5033,7 @@ static int kvm_put_vcpu_events(X86CPU *cpu, int level)
events.sipi_vector = env->sipi_vector;
- if (has_msr_smbase) {
+ if (has_msr_smbase && level >= KVM_PUT_RESET_STATE) {
events.flags |= KVM_VCPUEVENT_VALID_SMM;
events.smi.smm = !!(env->hflags & HF_SMM_MASK);
events.smi.smm_inside_nmi = !!(env->hflags2 & HF2_SMM_INSIDE_NMI_MASK);
* Re: [External] Re: [PATCH] KVM: x86: Latch INITs only in specific CPU states in KVM_SET_VCPU_EVENTS
From: Fei Li @ 2025-09-09 4:15 UTC
To: Sean Christopherson
Cc: Paolo Bonzini, tglx, mingo, bp, dave.hansen, liran.alon, hpa,
wanpeng.li, kvm, x86, linux-kernel, stable
On 9/9/25 12:14 AM, Sean Christopherson wrote:
> On Mon, Sep 08, 2025, Fei Li wrote:
>>
>> On 9/5/25 10:59 PM, Fei Li wrote:
>>>
>>> On 8/29/25 12:44 AM, Paolo Bonzini wrote:
>>>> On Thu, Aug 28, 2025 at 5:13 PM Fei Li <lifei.shirley@bytedance.com>
>>>> wrote:
>>>>> Actually this is a bug triggered by a monitor tool in our production
>>>>> environment. The monitor executes the 'info registers -a' hmp command at a
>>>>> fixed frequency, even during the VM startup process, which makes some APs
>>>>> stay in KVM_MP_STATE_UNINITIALIZED forever. But this race only occurs with
>>>>> extremely low probability, about 1~2 VM hangs per week.
>>>>>
>>>>> Considering that other emulators, like cloud-hypervisor and firecracker,
>>>>> may also have similar potential race issues, I think KVM had better do some
>>>>> handling. But anyway, I will check the Qemu code to avoid such a race.
>>>>> Thanks for both of your comments. 🙂
>>>> If you can check whether other emulators invoke KVM_SET_VCPU_EVENTS in
>>>> similar cases, that of course would help understanding the situation
>>>> better.
>>>>
>>>> In QEMU, it is possible to delay KVM_GET_VCPU_EVENTS until after all
>>>> vCPUs have halted.
>>>>
>>>> Paolo
>>>>
>
> Replacing the original message with a decently formatted version. Please try to
> format your emails for plain text, I assume something in your mail system inserted
> a pile of line wraps and made the entire thing all but unreadable.
Sure, sorry for the inconvenience.
>
>>> `info registers -a` hmp per 2ms[1]
>>> AP(vcpu1) thread[2]
>>> BSP(vcpu0) send INIT/SIPI[3]
>>>
>>> [1] for each cpu: cpu_synchronize_state
>>> if !qemu_thread_is_self()
>>> 1. insert to cpu->work_list, and handle asynchronously
>>> 2. then kick the AP(vcpu1) by sending SIG_IPI/SIGUSR1 signal
>>>
>>> [2] KVM: KVM_RUN and then schedule() in kvm_vcpu_block() loop
>>> KVM: checks signal_pending, breaks loop and returns -EINTR
>>> Qemu: break kvm_cpu_exec loop, run
>>> 1. qemu_wait_io_event()
>>> => process_queued_cpu_work => cpu->work_list.func()
>>> i.e. do_kvm_cpu_synchronize_state() callback
>>> => kvm_arch_get_registers
>>> => kvm_get_mp_state
>>> /* KVM: get_mpstate also calls kvm_apic_accept_events() to handle INIT and SIPI */
>>> => cpu->vcpu_dirty = true;
>>> // end of qemu_wait_io_event
>>>
>>> [3] SeaBIOS: BSP enters non-root mode and runs reset_vector() in SeaBIOS.
>>> send INIT and then SIPI by writing APIC_ICR during smp_scan
>>> KVM: BSP(vcpu0) exits, then
>>> => handle_apic_write
>>> => kvm_lapic_reg_write
>>> => kvm_apic_send_ipi to all APs
>>> => for each AP: __apic_accept_irq, e.g. for AP(vcpu1)
>>> => case APIC_DM_INIT:
>>> apic->pending_events = (1UL << KVM_APIC_INIT) (not kick the AP yet)
>>> => case APIC_DM_STARTUP:
>>> set_bit(KVM_APIC_SIPI, &apic->pending_events) (not kick the AP yet)
>>>
>>> [2] 2. kvm_cpu_exec()
>>> => if (cpu->vcpu_dirty):
>>> => kvm_arch_put_registers
>>> => kvm_put_vcpu_events
>>> KVM: kvm_vcpu_ioctl_x86_set_vcpu_events
>>> => clear_bit(KVM_APIC_INIT, &vcpu->arch.apic->pending_events);
>>> i.e. pending_events changes from 11b to 10b
>>> // end of kvm_vcpu_ioctl_x86_set_vcpu_events
>
> Qemu is clearly "putting" stale data here.
>
>>> Qemu: => after put_registers, cpu->vcpu_dirty = false;
>>> => kvm_vcpu_ioctl(cpu, KVM_RUN, 0)
>>> KVM: KVM_RUN
>>> => schedule() in kvm_vcpu_block() until Qemu's next SIG_IPI/SIGUSR1 signal
>>> /* But AP(vcpu1)'s mp_state will never change from KVM_MP_STATE_UNINITIALIZED
>>> to KVM_MP_STATE_INIT_RECEIVED, even then to KVM_MP_STATE_RUNNABLE without
>>> handling INIT inside kvm_apic_accept_events(), considering BSP will never
>>> send INIT/SIPI again during smp_scan. Then AP(vcpu1) will never enter
>>> non-root mode */
>>>
>>> [3] SeaBIOS: waits CountCPUs == expected_cpus_count and loops forever
>>> i.e. the AP(vcpu1) stays: EIP=0000fff0 && CS =f000 ffff0000
>>> and BSP(vcpu0) appears 100% utilized as it is in a while loop.
>
>> By the way, this doesn't seem to be a Qemu bug, since calling "info
>> registers -a" is allowed regardless of the vcpu state (including when the VM
>> is in the bootloader). Thus the INIT should not be latched in this case.
>
> No, this is a Qemu bug. It is the VMM's responsibility to ensure it doesn't load
> stale data into a vCPU. There is simply no way for KVM to do the right thing,
> because KVM can't know if userspace _wants_ to clobber events versus when userspace
> is racing, as in this case.
>
> E.g. the exact same race exists with NMIs.
>
> 1. kvm_vcpu_ioctl_x86_get_vcpu_events()
> vcpu->arch.nmi_queued = 0
> vcpu->arch.nmi_pending = 0
> kvm_vcpu_events.pending = 0
>
> 2. kvm_inject_nmi()
> vcpu->arch.nmi_queued = 1
> vcpu->arch.nmi_pending = 0
> kvm_vcpu_events.pending = 0
>
> 3. kvm_vcpu_ioctl_x86_set_vcpu_events()
> vcpu->arch.nmi_queued = 0 // Moved to nmi_pending by process_nmi()
> vcpu->arch.nmi_pending = 0 // Explicitly cleared after process_nmi() when KVM_VCPUEVENT_VALID_NMI_PENDING
> kvm_vcpu_events.pending = 0 // Stale data
>
> But for NMI, Qemu avoids clobbering state thanks to a 15+ year old commit that
> specifically avoids clobbering NMI *and SIPI* when not putting "reset" state:
>
> commit ea64305139357e89f58fc05ff5d48dc233d44d87
> Author: Jan Kiszka <jan.kiszka@siemens.com>
> AuthorDate: Mon Mar 1 19:10:31 2010 +0100
> Commit: Marcelo Tosatti <mtosatti@redhat.com>
> CommitDate: Thu Mar 4 00:29:30 2010 -0300
>
> KVM: x86: Restrict writeback of VCPU state
>
> Do not write nmi_pending, sipi_vector, and mpstate unless we at least go
> through a reset. And TSC as well as KVM wallclocks should only be
> written on full sync, otherwise we risk to drop some time on state
> read-modify-write.
>
> if (level >= KVM_PUT_RESET_STATE) { <=========================
> events.flags |= KVM_VCPUEVENT_VALID_NMI_PENDING;
> if (env->mp_state == KVM_MP_STATE_SIPI_RECEIVED) {
> events.flags |= KVM_VCPUEVENT_VALID_SIPI_VECTOR;
> }
> }
>
> Presumably "SMIs" need the same treatment, e.g.
>
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 6c749d4ee8..f5bc0f9327 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -5033,7 +5033,7 @@ static int kvm_put_vcpu_events(X86CPU *cpu, int level)
>
> events.sipi_vector = env->sipi_vector;
>
> - if (has_msr_smbase) {
> + if (has_msr_smbase && level >= KVM_PUT_RESET_STATE) {
> events.flags |= KVM_VCPUEVENT_VALID_SMM;
> events.smi.smm = !!(env->hflags & HF_SMM_MASK);
> events.smi.smm_inside_nmi = !!(env->hflags2 & HF2_SMM_INSIDE_NMI_MASK);
I see, this is indeed a feasible solution. I will send a patch to the
Qemu community to seek more advice. Thanks for your suggestions.
Have a nice day, thanks a lot
Fei