* VMX: System lock-up in guest mode, BIOS under suspect
@ 2010-10-01 16:30 Jan Kiszka
2010-10-02 17:25 ` Avi Kivity
From: Jan Kiszka @ 2010-10-01 16:30 UTC
To: kvm
Hi,
for the past few days I've been trying to understand a very strange hard
lock-up of some Intel i7 boxes when running our 16-bit guest OS under
KVM. After applying some instrumentation before and after the VM entry
(e.g. direct write to VGA memory), it turned out that the system is
apparently stuck inside guest mode!
I double-checked that VM exits on external IRQs and NMIs are properly
enabled in the VMCS - they are. I also tried to capture any potential
last words via serial console and even via remote DMA over Firewire -
nothing. This likely means that not only the one core in guest mode is
stuck but all the others as well (note: the freeze is reproducible both
in UP and SMP mode). Very uncommon for an OS crash I would say...
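The VMCS check described above can be sketched as follows. This is a minimal sketch in Python, assuming the raw 32-bit value of the pin-based VM-execution controls has already been obtained somehow (for example via a debug print in the host kernel). The bit positions follow the Intel SDM; the function name and the sample values are hypothetical and only serve as illustration.

```python
# Bit positions in the pin-based VM-execution controls (Intel SDM).
PIN_BASED_EXT_INTR_EXITING = 1 << 0  # external interrupts cause VM exits
PIN_BASED_NMI_EXITING = 1 << 3       # NMIs cause VM exits


def check_exit_controls(pin_based_controls: int) -> dict:
    """Report whether IRQ and NMI exiting are set in a raw control value."""
    return {
        "external_interrupt_exiting": bool(
            pin_based_controls & PIN_BASED_EXT_INTR_EXITING
        ),
        "nmi_exiting": bool(pin_based_controls & PIN_BASED_NMI_EXITING),
    }


if __name__ == "__main__":
    # Hypothetical sample value with bits 0..4 set, i.e. both exiting bits on.
    print(check_exit_controls(0x1F))
    # Hypothetical sample value with only bits 1, 2 and 4 set: both bits off.
    print(check_exit_controls(0x16))
```

Both flags need to be set for the host to regain control on external IRQs and NMIs while the guest is running.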
So I decided to go for some nice conspiracy theory and put SMIs and
related BIOS code under suspect. Interestingly, this worked out:
After disabling all SMIs on my box (Fujitsu Celsius H700) via the
chipset register, the hard freezes have not recurred so far. My
customer was able to confirm this on a Lenovo notebook as well. We
are currently collecting data about the affected systems to correlate
it, and we are performing longer test runs.
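The message does not name the chipset register that was used. As a rough sketch only, and assuming an Intel ICH/PCH-style layout (PMBASE in the LPC bridge config space at offset 0x40, SMI_EN at PMBASE + 0x30, global SMI enable in bit 0), the address and value computation could look like this in Python. The helper names are hypothetical, and the code deliberately performs no port I/O. Firmware may also lock this bit, in which case clearing it has no effect.

```python
# ASSUMPTION: Intel ICH/PCH-style power management register layout.
PMBASE_ADDR_MASK = 0xFF80  # bits 15:7 of PMBASE hold the I/O base address
SMI_EN_OFFSET = 0x30       # SMI_EN register offset relative to PMBASE
GBL_SMI_EN = 1 << 0        # global SMI enable bit in SMI_EN


def smi_en_port(pmbase_cfg_value: int) -> int:
    """Derive the I/O port of SMI_EN from the raw PMBASE config value."""
    return (pmbase_cfg_value & PMBASE_ADDR_MASK) + SMI_EN_OFFSET


def with_smis_disabled(smi_en_value: int) -> int:
    """Return the 32-bit SMI_EN value with the global enable bit cleared."""
    return smi_en_value & ~GBL_SMI_EN & 0xFFFFFFFF


if __name__ == "__main__":
    # Hypothetical raw values, for illustration only.
    print(hex(smi_en_port(0x0401)))
    print(hex(with_smis_disabled(0x0000203B)))
```

Disabling SMIs this way is a debugging measure only, since the firmware may rely on them for tasks such as thermal management.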
Nevertheless, I would like to collect some first comments on this. I'm
specifically wondering...
- if there is anything the host OS can mess up to make VM exits crash
on the way into SMM or out again (I cannot imagine how, as the SMM monitor
should always be able to run, at least in the absence of CPU
errata).
- what the SMM monitor could do wrong to cause such a crash,
especially as it looks like the hardware does all the switching for
it.
- if there could still be some KVM crash around host<->guest switching
that just happens to be triggered by the SMI noise and that affects
the whole system (including cores that do not host KVM threads).
Any ideas warmly welcome!
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
* Re: VMX: System lock-up in guest mode, BIOS under suspect
2010-10-01 16:30 VMX: System lock-up in guest mode, BIOS under suspect Jan Kiszka
@ 2010-10-02 17:25 ` Avi Kivity
2010-10-04 8:41 ` Jan Kiszka
From: Avi Kivity @ 2010-10-02 17:25 UTC
To: Jan Kiszka; +Cc: kvm
On 10/01/2010 06:30 PM, Jan Kiszka wrote:
> Hi,
>
> for the past few days I've been trying to understand a very strange hard
> lock-up of some Intel i7 boxes when running our 16-bit guest OS under
> KVM. After applying some instrumentation before and after the VM entry
> (e.g. direct write to VGA memory), it turned out that the system is
> apparently stuck inside guest mode!
Strictly speaking, it could also be a crash in the small window between
vmexit and your writes. However, it's likely to be as you say.
> I double-checked that VM exits on external IRQs and NMIs are properly
> enabled in the VMCS - they are. I also tried to capture any potential
> last words via serial console and even via remote DMA over Firewire -
> nothing. This likely means that not only the one core in guest mode is
> stuck but all the others as well (note: the freeze is reproducible both
> in UP and SMP mode). Very uncommon for an OS crash I would say...
>
> So I decided to go for some nice conspiracy theory and put SMIs and
> related BIOS code under suspect. Interestingly, this worked out:
>
> After disabling all SMIs on my box (Fujitsu Celsius H700) via the
> chipset register, the hard freezes have not recurred so far. My
> customer was able to confirm this on a Lenovo notebook as well. We
> are currently collecting data about the affected systems to correlate
> it, and we are performing longer test runs.
>
> Nevertheless, I would like to collect some first comments on this. I'm
> specifically wondering...
>
> - if there is anything the host OS can mess up to make VM exits crash
> on the way into SMM or out again (I cannot imagine how, as the SMM monitor
> should always be able to run, at least in the absence of CPU
> errata).
Yes. It's basically a small hypervisor, and the host OS is its guest.
So a well written SMM handler should not depend on any OS setting.
Whether they're actually tested this way is another matter.
> - what the SMM monitor could do wrong to cause such a crash,
> especially as it looks like the hardware does all the switching for
> it.
Looks like SMM saves some handler-visible state when EPT is enabled.
Are all your failures on EPT-capable hosts? If so, what happens when
EPT is disabled?
> - if there could still be some KVM crash around host<->guest switching
> that just happens to be triggered by the SMI noise and that affects
> the whole system (including cores that do not host KVM threads).
>
> Any ideas warmly welcome!
Besides trying with ept=0, I suggest looking for machines that have SMIs
but do not crash. If we find them, this seems to indicate a badly
written SMM handler. If not, then there may be a systemic problem with
kvm (or perhaps all SMM handlers are badly written).
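One way to tell whether a machine receives SMIs at all is to sample the SMI counter MSR before and after a test run. The following is a minimal sketch in Python, assuming a Nehalem-or-later Intel CPU that implements MSR_SMI_COUNT (0x34) and a Linux host with the msr driver loaded so that /dev/cpu/0/msr exists (root is required). The function name is hypothetical.

```python
import os
import struct

MSR_SMI_COUNT = 0x34  # ASSUMPTION: CPU implements this MSR (Nehalem and later)


def read_smi_count(msr_path: str = "/dev/cpu/0/msr") -> int:
    """Read the SMI counter; the msr device uses the MSR number as offset."""
    fd = os.open(msr_path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, MSR_SMI_COUNT)
    finally:
        os.close(fd)
    (value,) = struct.unpack("<Q", raw)
    return value & 0xFFFFFFFF  # the count lives in the low 32 bits


if __name__ == "__main__":
    print("SMIs seen so far on CPU 0:", read_smi_count())
```

A host whose counter keeps rising during guest runs without locking up would be a candidate for the "has SMIs but does not crash" group.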
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
* Re: VMX: System lock-up in guest mode, BIOS under suspect
2010-10-02 17:25 ` Avi Kivity
@ 2010-10-04 8:41 ` Jan Kiszka
From: Jan Kiszka @ 2010-10-04 8:41 UTC
To: Avi Kivity; +Cc: kvm
On 02.10.2010 19:25, Avi Kivity wrote:
> On 10/01/2010 06:30 PM, Jan Kiszka wrote:
>> Hi,
>>
>> for the past few days I've been trying to understand a very strange hard
>> lock-up of some Intel i7 boxes when running our 16-bit guest OS under
>> KVM. After applying some instrumentation before and after the VM entry
>> (e.g. direct write to VGA memory), it turned out that the system is
>> apparently stuck inside guest mode!
>
> Strictly speaking, it could also be a crash in the small window between
> vmexit and your writes. However, it's likely to be as you say.
>
>> I double-checked that VM exits on external IRQs and NMIs are properly
>> enabled in the VMCS - they are. I also tried to capture any potential
>> last words via serial console and even via remote DMA over Firewire -
>> nothing. This likely means that not only the one core in guest mode is
>> stuck but all the others as well (note: the freeze is reproducible both
>> in UP and SMP mode). Very uncommon for an OS crash I would say...
>>
>> So I decided to go for some nice conspiracy theory and put SMIs and
>> related BIOS code under suspect. Interestingly, this worked out:
>>
>> After disabling all SMIs on my box (Fujitsu Celsius H700) via the
>> chipset register, the hard freezes have not recurred so far. My
>> customer was able to confirm this on a Lenovo notebook as well. We
>> are currently collecting data about the affected systems to correlate
>> it, and we are performing longer test runs.
>>
>> Nevertheless, I would like to collect some first comments on this. I'm
>> specifically wondering...
>>
>> - if there is anything the host OS can mess up to make VM exits crash
>> on the way into SMM or out again (I cannot imagine how, as the SMM monitor
>> should always be able to run, at least in the absence of CPU
>> errata).
>
> Yes. It's basically a small hypervisor, and the host OS is its guest.
> So a well written SMM handler should not depend on any OS setting.
> Whether they're actually tested this way is another matter.
>
>> - what the SMM monitor could do wrong to cause such a crash,
>> especially as it looks like the hardware does all the switching for
>> it.
>
> Looks like SMM saves some handler-visible state when EPT is enabled.
> Are all your failures on EPT-capable hosts? If so, what happens when
> EPT is disabled?
All Core i7 CPUs should support EPT, so it should be enabled on all
affected systems. However, ept=0 makes no difference on my box; it still
locks up.
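To confirm that such a test run really had EPT turned off, the current setting of the kvm_intel module parameter can be read back from sysfs. A minimal sketch in Python, assuming a Linux host with kvm_intel loaded; the function name is hypothetical:

```python
def ept_enabled(param_path: str = "/sys/module/kvm_intel/parameters/ept") -> bool:
    """Return True if the kvm_intel 'ept' module parameter reads as enabled."""
    with open(param_path) as f:
        # Boolean module parameters are reported as "Y" or "N".
        return f.read().strip() in ("Y", "1")


if __name__ == "__main__":
    print("EPT enabled:", ept_enabled())
```

After loading the module with ept=0, this is expected to report False.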
>
>> - if there could still be some KVM crash around host<->guest switching
>> that just happens to be triggered by the SMI noise and that affects
>> the whole system (including cores that do not host KVM threads).
>>
>> Any ideas warmly welcome!
>
> Besides trying with ept=0, I suggest looking for machines that have SMIs
> but do not crash. If we find them, this seems to indicate a badly
> written SMM handler. If not, then there may be a systemic problem with
> kvm (or perhaps all SMM handlers are badly written).
We are checking which BIOS vendors are involved. In my case, it is
Phoenix, but at least the Lenovo BIOSes have been re-branded.
Jan
--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux