* KVM: x86: Racy mp_state manipulations
@ 2013-03-03 16:48 Jan Kiszka
2013-03-04 14:12 ` Paolo Bonzini
0 siblings, 1 reply; 6+ messages in thread
From: Jan Kiszka @ 2013-03-03 16:48 UTC (permalink / raw)
To: Gleb Natapov, Marcelo Tosatti; +Cc: kvm
[-- Attachment #1: Type: text/plain, Size: 1335 bytes --]
Hi all,
KVM's mp_state on x86 is usually manipulated over the context of the
VCPU. Therefore, no locking is required. There are unfortunately two
exceptions, and one of them is definitely broken: INIT and SIPI delivery.
The lapic may set mp_state over the context of the sending VCPU. For
SIPI, it first checks if the mp_state is INIT_RECEIVED before updating
it to SIPI_RECEIVED. We can only race here with user space setting the
state in parallel, I suppose. Probably harmless in practice.
What is critical is the update on INIT. That signal is asynchronous to
the target VCPU state. And we can loose it:
vcpu 1 vcpu 2
------ ------
hlt;
vmexit
__apic_accept_irq(APIC_DM_INIT)
mp_state = KVM_MP_STATE_INIT_RECEIVED
mp_state = KVM_MP_STATE_HALTED
And there it goes, our INIT state. I've triggered this under heavy INIT
load and my nVMX patch for processing it while in VMXON.
I'm currently considering options to fix this:
- through a lock at mp_state manipulations, check under the lock that
we don't perform invalid state transitions (e.g. INIT->HLT)
- signal the INIT via some KVM_REQ_INIT to the target VCPU, fully
localizing mp_state updates, the same could be done with SIPI, just
to play safe
I'm leaning toward the latter ATM, Any thoughts or other idea?
Jan
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 263 bytes --]
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: KVM: x86: Racy mp_state manipulations
2013-03-03 16:48 KVM: x86: Racy mp_state manipulations Jan Kiszka
@ 2013-03-04 14:12 ` Paolo Bonzini
2013-03-04 14:28 ` Jan Kiszka
2013-03-05 7:28 ` Gleb Natapov
0 siblings, 2 replies; 6+ messages in thread
From: Paolo Bonzini @ 2013-03-04 14:12 UTC (permalink / raw)
To: Jan Kiszka; +Cc: Gleb Natapov, Marcelo Tosatti, kvm
Il 03/03/2013 17:48, Jan Kiszka ha scritto:
> Hi all,
>
> KVM's mp_state on x86 is usually manipulated over the context of the
> VCPU. Therefore, no locking is required. There are unfortunately two
> exceptions, and one of them is definitely broken: INIT and SIPI delivery.
>
> The lapic may set mp_state over the context of the sending VCPU. For
> SIPI, it first checks if the mp_state is INIT_RECEIVED before updating
> it to SIPI_RECEIVED. We can only race here with user space setting the
> state in parallel, I suppose. Probably harmless in practice.
Still it would be better to add an smp_wmb/smp_rmb pair between accesses
of mp_state and sipi_vector.
Also, Io
> What is critical is the update on INIT. That signal is asynchronous to
> the target VCPU state. And we can loose it:
>
> vcpu 1 vcpu 2
> ------ ------
> hlt;
> vmexit
> __apic_accept_irq(APIC_DM_INIT)
> mp_state = KVM_MP_STATE_INIT_RECEIVED
> mp_state = KVM_MP_STATE_HALTED
>
> And there it goes, our INIT state. I've triggered this under heavy INIT
> load and my nVMX patch for processing it while in VMXON.
>
> I'm currently considering options to fix this:
>
> - through a lock at mp_state manipulations, check under the lock that
> we don't perform invalid state transitions (e.g. INIT->HLT)
> - signal the INIT via some KVM_REQ_INIT to the target VCPU, fully
> localizing mp_state updates, the same could be done with SIPI, just
> to play safe
>
> I'm leaning toward the latter ATM, Any thoughts or other idea?
The latter makes sense since it's not a fast path, but the only
transition that is acceptable to KVM_MP_STATE_HALTED is from
KVM_MP_STATE_RUNNABLE:
from \ to RUNNABLE UNINIT INIT HALTED SIPI
RUNNABLE n/a yes yes yes NO
UNINIT NO n/a yes NO NO
INIT NO yes n/a NO yes
HALTED yes yes yes n/a NO
SIPI yes yes yes NO n/a
so for this particular bug it should also work to use a cmpxchg when
setting KVM_MP_STATE_HALTED. Same for INIT->SIPI, since writes to
sipi_vector are harmless.
BTW, what happens when you send an INIT IPI to the bootstrap processor?
This may be interesting if we want to emulate soft resets correctly in
QEMU; KVM makes it go to wait-for-SIPI state if I read the code
correctly, but that is wrong.
Paolo
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: KVM: x86: Racy mp_state manipulations
2013-03-04 14:12 ` Paolo Bonzini
@ 2013-03-04 14:28 ` Jan Kiszka
2013-03-04 15:08 ` Paolo Bonzini
2013-03-05 7:28 ` Gleb Natapov
1 sibling, 1 reply; 6+ messages in thread
From: Jan Kiszka @ 2013-03-04 14:28 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Gleb Natapov, Marcelo Tosatti, kvm
On 2013-03-04 15:12, Paolo Bonzini wrote:
> Il 03/03/2013 17:48, Jan Kiszka ha scritto:
>> Hi all,
>>
>> KVM's mp_state on x86 is usually manipulated over the context of the
>> VCPU. Therefore, no locking is required. There are unfortunately two
>> exceptions, and one of them is definitely broken: INIT and SIPI delivery.
>>
>> The lapic may set mp_state over the context of the sending VCPU. For
>> SIPI, it first checks if the mp_state is INIT_RECEIVED before updating
>> it to SIPI_RECEIVED. We can only race here with user space setting the
>> state in parallel, I suppose. Probably harmless in practice.
>
> Still it would be better to add an smp_wmb/smp_rmb pair between accesses
> of mp_state and sipi_vector.
Do we need a mb between sipi_vector assignment and kvm_make_request (see
my patch to fix this issue)?
>
> Also, Io
>> What is critical is the update on INIT. That signal is asynchronous to
>> the target VCPU state. And we can loose it:
>>
>> vcpu 1 vcpu 2
>> ------ ------
>> hlt;
>> vmexit
>> __apic_accept_irq(APIC_DM_INIT)
>> mp_state = KVM_MP_STATE_INIT_RECEIVED
>> mp_state = KVM_MP_STATE_HALTED
>>
>> And there it goes, our INIT state. I've triggered this under heavy INIT
>> load and my nVMX patch for processing it while in VMXON.
>>
>> I'm currently considering options to fix this:
>>
>> - through a lock at mp_state manipulations, check under the lock that
>> we don't perform invalid state transitions (e.g. INIT->HLT)
>> - signal the INIT via some KVM_REQ_INIT to the target VCPU, fully
>> localizing mp_state updates, the same could be done with SIPI, just
>> to play safe
>>
>> I'm leaning toward the latter ATM, Any thoughts or other idea?
>
> The latter makes sense since it's not a fast path, but the only
> transition that is acceptable to KVM_MP_STATE_HALTED is from
> KVM_MP_STATE_RUNNABLE:
>
> from \ to RUNNABLE UNINIT INIT HALTED SIPI
> RUNNABLE n/a yes yes yes NO
> UNINIT NO n/a yes NO NO
> INIT NO yes n/a NO yes
> HALTED yes yes yes n/a NO
> SIPI yes yes yes NO n/a
>
> so for this particular bug it should also work to use a cmpxchg when
> setting KVM_MP_STATE_HALTED. Same for INIT->SIPI, since writes to
> sipi_vector are harmless.
OK, but I already went for request bits. :)
>
> BTW, what happens when you send an INIT IPI to the bootstrap processor?
> This may be interesting if we want to emulate soft resets correctly in
> QEMU; KVM makes it go to wait-for-SIPI state if I read the code
> correctly, but that is wrong.
Where is this restriction specified? How do you reset the BP without
resetting the while system then?
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: KVM: x86: Racy mp_state manipulations
2013-03-04 14:28 ` Jan Kiszka
@ 2013-03-04 15:08 ` Paolo Bonzini
2013-03-04 15:19 ` Jan Kiszka
0 siblings, 1 reply; 6+ messages in thread
From: Paolo Bonzini @ 2013-03-04 15:08 UTC (permalink / raw)
To: Jan Kiszka; +Cc: Gleb Natapov, Marcelo Tosatti, kvm
Il 04/03/2013 15:28, Jan Kiszka ha scritto:
> On 2013-03-04 15:12, Paolo Bonzini wrote:
>> Il 03/03/2013 17:48, Jan Kiszka ha scritto:
>>> Hi all,
>>>
>>> KVM's mp_state on x86 is usually manipulated over the context of the
>>> VCPU. Therefore, no locking is required. There are unfortunately two
>>> exceptions, and one of them is definitely broken: INIT and SIPI delivery.
>>>
>>> The lapic may set mp_state over the context of the sending VCPU. For
>>> SIPI, it first checks if the mp_state is INIT_RECEIVED before updating
>>> it to SIPI_RECEIVED. We can only race here with user space setting the
>>> state in parallel, I suppose. Probably harmless in practice.
>>
>> Still it would be better to add an smp_wmb/smp_rmb pair between accesses
>> of mp_state and sipi_vector.
>
> Do we need a mb between sipi_vector assignment and kvm_make_request (see
> my patch to fix this issue)?
Yes. kvm_make_request does a set_bit, which has no implicit barrier.
>> BTW, what happens when you send an INIT IPI to the bootstrap processor?
>> This may be interesting if we want to emulate soft resets correctly in
>> QEMU; KVM makes it go to wait-for-SIPI state if I read the code
>> correctly, but that is wrong.
>
> Where is this restriction specified?
I now found it explicitly in "8.4.2 MP Initialization Protocol
Requirements and Restrictions": "If the MP protocol has completed and a
BSP is chosen, subsequent INITs (either to a specific processor or
system wide) do not cause the MP protocol to be repeated. Instead,
each logical processor examines its BSP flag (in the IA32_APIC_BASE MSR)
to determine whether it should execute the BIOS boot-strap code (if it
is the BSP) or enter a wait-for-SIPI state (if it is an AP)".
In my previous message I asked because I only found two implicit references:
- in "9.1 Initialization Overview", where it says for a hardware reset:
"A hardware reset sets each processor’s registers to a known state and
places the processor in real-address mode. [...] the bootstrap processor
(BSP) then immediately starts executing software-initialization code in
the current code segment beginning at the offset in the EIP register.
The application (non-BSP) processors (APs) go into a Wait For Startup
IPI (SIPI) state while the BSP is executing initialization code."
And then: "Asserting the INIT# pin on the processor invokes a similar
response to a hardware reset." The two together say that asserting the
INIT# pin makes the BSP start running from 0xFFFFFFF0.
- in "10.4.7.1 Local APIC State After Power-Up or Reset" where it says
"If the processor is the only processor in the system or it is the BSP
[...] the local APIC will respond normally to INIT and NMI messages, to
INIT# signals and to STPCLK# signals. If the processor is in an MP
system and has been designated as an AP; the local APIC will respond the
same as for the BSP. In addition, it will respond to SIPI messages."
Apart from the funky semicolon placement, this means that (1) sending an
INIT to the BSP is legal; (2) the BSP will not respond to a SIPI.
There is of course :) a possible reference to the contrary in "10.4.7.3
Local APIC State After an INIT Reset ('Wait-for-SIPI' State)". However,
that's where there is a reference to 8.4.2 so I don't think it counts.
> How do you reset the BP without resetting the while system then?
You can reset it, but it will immediately restart execution from 0xFFFF0000.
Paolo
^ permalink raw reply [flat|nested] 6+ messages in thread* Re: KVM: x86: Racy mp_state manipulations
2013-03-04 15:08 ` Paolo Bonzini
@ 2013-03-04 15:19 ` Jan Kiszka
0 siblings, 0 replies; 6+ messages in thread
From: Jan Kiszka @ 2013-03-04 15:19 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Gleb Natapov, Marcelo Tosatti, kvm
On 2013-03-04 16:08, Paolo Bonzini wrote:
> Il 04/03/2013 15:28, Jan Kiszka ha scritto:
>> On 2013-03-04 15:12, Paolo Bonzini wrote:
>>> Il 03/03/2013 17:48, Jan Kiszka ha scritto:
>>>> Hi all,
>>>>
>>>> KVM's mp_state on x86 is usually manipulated over the context of the
>>>> VCPU. Therefore, no locking is required. There are unfortunately two
>>>> exceptions, and one of them is definitely broken: INIT and SIPI delivery.
>>>>
>>>> The lapic may set mp_state over the context of the sending VCPU. For
>>>> SIPI, it first checks if the mp_state is INIT_RECEIVED before updating
>>>> it to SIPI_RECEIVED. We can only race here with user space setting the
>>>> state in parallel, I suppose. Probably harmless in practice.
>>>
>>> Still it would be better to add an smp_wmb/smp_rmb pair between accesses
>>> of mp_state and sipi_vector.
>>
>> Do we need a mb between sipi_vector assignment and kvm_make_request (see
>> my patch to fix this issue)?
>
> Yes. kvm_make_request does a set_bit, which has no implicit barrier.
>
>>> BTW, what happens when you send an INIT IPI to the bootstrap processor?
>>> This may be interesting if we want to emulate soft resets correctly in
>>> QEMU; KVM makes it go to wait-for-SIPI state if I read the code
>>> correctly, but that is wrong.
>>
>> Where is this restriction specified?
>
> I now found it explicitly in "8.4.2 MP Initialization Protocol
> Requirements and Restrictions": "If the MP protocol has completed and a
> BSP is chosen, subsequent INITs (either to a specific processor or
> system wide) do not cause the MP protocol to be repeated. Instead,
> each logical processor examines its BSP flag (in the IA32_APIC_BASE MSR)
> to determine whether it should execute the BIOS boot-strap code (if it
> is the BSP) or enter a wait-for-SIPI state (if it is an AP)".
>
>
> In my previous message I asked because I only found two implicit references:
>
> - in "9.1 Initialization Overview", where it says for a hardware reset:
> "A hardware reset sets each processor’s registers to a known state and
> places the processor in real-address mode. [...] the bootstrap processor
> (BSP) then immediately starts executing software-initialization code in
> the current code segment beginning at the offset in the EIP register.
> The application (non-BSP) processors (APs) go into a Wait For Startup
> IPI (SIPI) state while the BSP is executing initialization code."
>
> And then: "Asserting the INIT# pin on the processor invokes a similar
> response to a hardware reset." The two together say that asserting the
> INIT# pin makes the BSP start running from 0xFFFFFFF0.
>
> - in "10.4.7.1 Local APIC State After Power-Up or Reset" where it says
> "If the processor is the only processor in the system or it is the BSP
> [...] the local APIC will respond normally to INIT and NMI messages, to
> INIT# signals and to STPCLK# signals. If the processor is in an MP
> system and has been designated as an AP; the local APIC will respond the
> same as for the BSP. In addition, it will respond to SIPI messages."
>
> Apart from the funky semicolon placement, this means that (1) sending an
> INIT to the BSP is legal; (2) the BSP will not respond to a SIPI.
>
>
> There is of course :) a possible reference to the contrary in "10.4.7.3
> Local APIC State After an INIT Reset ('Wait-for-SIPI' State)". However,
> that's where there is a reference to 8.4.2 so I don't think it counts.
>
>> How do you reset the BP without resetting the while system then?
>
> You can reset it, but it will immediately restart execution from 0xFFFF0000.
Yes, makes sense. Checking QEMU... yeah, it resets the CPU to BP state
on INIT, keeping APs halted and adjusting their CS on SIPI.
I will try to fix this separately.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SDP-DE
Corporate Competence Center Embedded Linux
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: KVM: x86: Racy mp_state manipulations
2013-03-04 14:12 ` Paolo Bonzini
2013-03-04 14:28 ` Jan Kiszka
@ 2013-03-05 7:28 ` Gleb Natapov
1 sibling, 0 replies; 6+ messages in thread
From: Gleb Natapov @ 2013-03-05 7:28 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Jan Kiszka, Marcelo Tosatti, kvm
On Mon, Mar 04, 2013 at 03:12:45PM +0100, Paolo Bonzini wrote:
> The latter makes sense since it's not a fast path, but the only
> transition that is acceptable to KVM_MP_STATE_HALTED is from
> KVM_MP_STATE_RUNNABLE:
>
> from \ to RUNNABLE UNINIT INIT HALTED SIPI
> RUNNABLE n/a yes yes yes NO
> UNINIT NO n/a yes NO NO
> INIT NO yes n/a NO yes
> HALTED yes yes yes n/a NO
> SIPI yes yes yes NO n/a
>
About UNINIT column. It should be NO for all of them. vcpu enters UNINIT
state only after creation and never returns to it.
--
Gleb.
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2013-03-05 7:28 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-03-03 16:48 KVM: x86: Racy mp_state manipulations Jan Kiszka
2013-03-04 14:12 ` Paolo Bonzini
2013-03-04 14:28 ` Jan Kiszka
2013-03-04 15:08 ` Paolo Bonzini
2013-03-04 15:19 ` Jan Kiszka
2013-03-05 7:28 ` Gleb Natapov
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox