public inbox for kvm@vger.kernel.org
* VMX: System lock-up in guest mode, BIOS under suspect
From: Jan Kiszka @ 2010-10-01 16:30 UTC (permalink / raw)
  To: kvm

Hi,

For the past few days I've been trying to understand a very strange hard
lock-up of some Intel Core i7 boxes when running our 16-bit guest OS under
KVM. After adding some instrumentation before and after the VM entry
(e.g. direct writes to VGA memory), it turned out that the system is
apparently stuck inside guest mode!

I double-checked that VM exits on external IRQs and NMIs are properly
enabled in the VMCS - they are. I also tried to capture any potential
last words via serial console and even via remote DMA over FireWire -
nothing. This likely means that not only the one core in guest mode is
stuck but all the others as well (note: the freeze is reproducible both
in UP and SMP mode). Very uncommon for an OS crash, I would say...
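
As a rough illustration of the check described above (not the code actually
used in this thread): per the Intel SDM, bit 0 of the pin-based VM-execution
controls is "external-interrupt exiting" and bit 3 is "NMI exiting". The
helper name below is hypothetical, and the raw control value is assumed to
have been read from the VMCS by some other means.

```python
# Bit positions in the pin-based VM-execution controls (Intel SDM, VMX chapter).
PIN_EXT_INTR_EXITING = 1 << 0   # external interrupts cause VM exits
PIN_NMI_EXITING = 1 << 3        # NMIs cause VM exits


def missing_exit_controls(pin_based_controls):
    """Return the names of the required exit controls that are NOT set.

    `pin_based_controls` is the raw 32-bit value of the pin-based
    VM-execution controls field (hypothetical input, obtained elsewhere).
    """
    required = {
        "external-interrupt exiting": PIN_EXT_INTR_EXITING,
        "NMI exiting": PIN_NMI_EXITING,
    }
    return [name for name, bit in required.items()
            if not pin_based_controls & bit]
```

An empty list means both exits are enabled, which matches what was observed
here; the lock-up therefore has to be explained some other way.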

So I decided to go for a nice conspiracy theory and put SMIs and the
related BIOS code under suspicion. Interestingly, this worked out:

After disabling all SMIs on my box (Fujitsu Celsius H700) via the
chipset register, the hard freezes have not occurred again so far. My
customer was able to confirm this on a Lenovo notebook as well. We
are currently collecting data about the affected systems to correlate
it, and we are performing longer test runs.
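
The mail does not say which chipset register was used. A minimal sketch,
assuming an Intel ICH/PCH-style layout in which SMI_EN sits at PMBASE + 0x30
and bit 0 (GBL_SMI_EN) gates all SMIs: the helpers below only compute the
port and the new register value, they do not touch hardware. The function
names and the example values in the note are made up for illustration.

```python
SMI_EN_OFFSET = 0x30   # SMI_EN offset from PMBASE (assumed ICH/PCH layout)
GBL_SMI_EN = 1 << 0    # global SMI enable bit in SMI_EN (same assumption)


def smi_en_port(pmbase):
    """I/O port of SMI_EN for a given ACPI power-management I/O base."""
    return pmbase + SMI_EN_OFFSET


def with_smis_disabled(smi_en):
    """Return the 32-bit SMI_EN value with the global enable bit cleared."""
    return smi_en & ~GBL_SMI_EN & 0xFFFFFFFF
```

For a hypothetical PMBASE of 0x400, SMI_EN would be at port 0x430, and a
current value of 0x2033 would be written back as 0x2032. On boards where the
BIOS has locked the bit, such a write may have no effect.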

Nevertheless, I would like to collect some first comments on this. I'm
specifically wondering...

 - if there is anything the host OS can mess up to make VM exits crash
   on the way into SMM or out again (I cannot imagine how, as the SMM
   monitor should always be able to run, at least in the absence of CPU
   errata).

 - what the SMM monitor could do wrong to cause such a crash,
   especially as it looks like the hardware does all the switching for
   it.

 - if there could still be some KVM crash around host<->guest switching
   that just happens to be triggered by the SMI noise and that affects
   the whole system (including cores that do not host KVM threads).

Any ideas warmly welcome!

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux


* Re: VMX: System lock-up in guest mode, BIOS under suspect
From: Avi Kivity @ 2010-10-02 17:25 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: kvm

  On 10/01/2010 06:30 PM, Jan Kiszka wrote:
> Hi,
>
> For the past few days I've been trying to understand a very strange hard
> lock-up of some Intel Core i7 boxes when running our 16-bit guest OS under
> KVM. After adding some instrumentation before and after the VM entry
> (e.g. direct writes to VGA memory), it turned out that the system is
> apparently stuck inside guest mode!

Strictly speaking, it could also be a crash in the small window between 
vmexit and your writes.  However it's likely to be as you say.

> I double-checked that VM exits on external IRQs and NMIs are properly
> enabled in the VMCS - they are. I also tried to capture any potential
> last words via serial console and even via remote DMA over FireWire -
> nothing. This likely means that not only the one core in guest mode is
> stuck but all the others as well (note: the freeze is reproducible both
> in UP and SMP mode). Very uncommon for an OS crash, I would say...
>
> So I decided to go for a nice conspiracy theory and put SMIs and the
> related BIOS code under suspicion. Interestingly, this worked out:
>
> After disabling all SMIs on my box (Fujitsu Celsius H700) via the
> chipset register, the hard freezes have not occurred again so far. My
> customer was able to confirm this on a Lenovo notebook as well. We
> are currently collecting data about the affected systems to correlate
> it, and we are performing longer test runs.
>
> Nevertheless, I would like to collect some first comments on this. I'm
> specifically wondering...
>
>   - if there is anything the host OS can mess up to make VM exits crash
>     on the way into SMM or out again (I cannot imagine how, as the SMM
>     monitor should always be able to run, at least in the absence of CPU
>     errata).

Yes.  It's basically a small hypervisor, and the host OS is its guest.  
So a well written SMM handler should not depend on any OS setting.  
Whether they're actually tested this way is another matter.

>   - what the SMM monitor could do wrong to cause such a crash,
>     especially as it looks like the hardware does all the switching for
>     it.

Looks like SMM saves some handler-visible state when EPT is enabled.  
Are all your failures on EPT-capable hosts?  If so, what happens when 
EPT is disabled?

>   - if there could still be some KVM crash around host<->guest switching
>     that just happens to be triggered by the SMI noise and that affects
>     the whole system (including cores that do not host KVM threads).
>
> Any ideas warmly welcome!

Besides trying with ept=0, I suggest looking for machines that have SMIs 
but do not crash.  If we find them, this seems to indicate a badly 
written SMM handler.  If not, then there may be a systemic problem with 
kvm (or perhaps all SMM handlers are badly written).
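
One way to tell whether a given machine receives SMIs at all, sketched under
assumptions: Nehalem and later Intel CPUs expose an SMI counter in
MSR_SMI_COUNT (0x34), which Linux makes readable through /dev/cpu/N/msr once
the msr module is loaded (root required). The function name is hypothetical;
comparing the counter before and after a guest run would show whether SMIs
arrived while the guest was running.

```python
import os
import struct

MSR_SMI_COUNT = 0x34  # SMI counter MSR on Nehalem and later Intel CPUs


def read_smi_count(msr_path="/dev/cpu/0/msr"):
    """Read the SMI count of one CPU through the Linux msr device.

    The device is addressed by MSR number: an 8-byte read at offset
    0x34 returns the raw 64-bit value of MSR_SMI_COUNT.
    """
    fd = os.open(msr_path, os.O_RDONLY)
    try:
        raw = os.pread(fd, 8, MSR_SMI_COUNT)
    finally:
        os.close(fd)
    if len(raw) != 8:
        raise OSError("short read from " + msr_path)
    (value,) = struct.unpack("<Q", raw)
    return value & 0xFFFFFFFF  # the count lives in the low 32 bits
```

A machine whose counter keeps increasing under load but which never locks up
would be a candidate for the "has SMIs but does not crash" group.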

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.



* Re: VMX: System lock-up in guest mode, BIOS under suspect
From: Jan Kiszka @ 2010-10-04  8:41 UTC (permalink / raw)
  To: Avi Kivity; +Cc: kvm

On 02.10.2010 19:25, Avi Kivity wrote:
>   On 10/01/2010 06:30 PM, Jan Kiszka wrote:
>> Hi,
>>
>> For the past few days I've been trying to understand a very strange hard
>> lock-up of some Intel Core i7 boxes when running our 16-bit guest OS under
>> KVM. After adding some instrumentation before and after the VM entry
>> (e.g. direct writes to VGA memory), it turned out that the system is
>> apparently stuck inside guest mode!
> 
> Strictly speaking, it could also be a crash in the small window between 
> vmexit and your writes.  However it's likely to be as you say.
> 
>> I double-checked that VM exits on external IRQs and NMIs are properly
>> enabled in the VMCS - they are. I also tried to capture any potential
>> last words via serial console and even via remote DMA over FireWire -
>> nothing. This likely means that not only the one core in guest mode is
>> stuck but all the others as well (note: the freeze is reproducible both
>> in UP and SMP mode). Very uncommon for an OS crash, I would say...
>>
>> So I decided to go for a nice conspiracy theory and put SMIs and the
>> related BIOS code under suspicion. Interestingly, this worked out:
>>
>> After disabling all SMIs on my box (Fujitsu Celsius H700) via the
>> chipset register, the hard freezes have not occurred again so far. My
>> customer was able to confirm this on a Lenovo notebook as well. We
>> are currently collecting data about the affected systems to correlate
>> it, and we are performing longer test runs.
>>
>> Nevertheless, I would like to collect some first comments on this. I'm
>> specifically wondering...
>>
>>   - if there is anything the host OS can mess up to make VM exits crash
>>     on the way into SMM or out again (I cannot imagine how, as the SMM
>>     monitor should always be able to run, at least in the absence of CPU
>>     errata).
> 
> Yes.  It's basically a small hypervisor, and the host OS is its guest.  
> So a well written SMM handler should not depend on any OS setting.  
> Whether they're actually tested this way is another matter.
> 
>>   - what the SMM monitor could do wrong to cause such a crash,
>>     especially as it looks like the hardware does all the switching for
>>     it.
> 
> Looks like SMM saves some handler-visible state when EPT is enabled.  
> Are all your failures on EPT-capable hosts?  If so, what happens when 
> EPT is disabled?

All Core i7 CPUs should support EPT, so it should be enabled on all
affected systems. However, ept=0 makes no difference on my box; it still
locks up.
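
When repeating the ept=0 test, it can help to confirm that the setting really
took effect after reloading kvm_intel. A minimal sketch, assuming the module
exports its `ept` parameter under /sys/module/kvm_intel/parameters/ept and
prints it as Y/N (or 1/0); the function name is hypothetical.

```python
def ept_enabled(param_path="/sys/module/kvm_intel/parameters/ept"):
    """Return True if the kvm_intel 'ept' parameter reads as enabled."""
    with open(param_path) as f:
        # Boolean module parameters are usually shown as "Y" or "N".
        return f.read().strip() in ("Y", "1")
```

If this still returns True after loading the module with ept=0, the module
was probably not reloaded, and the test run says nothing about EPT.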

> 
>>   - if there could still be some KVM crash around host<->guest switching
>>     that just happens to be triggered by the SMI noise and that affects
>>     the whole system (including cores that do not host KVM threads).
>>
>> Any ideas warmly welcome!
> 
> Besides trying with ept=0, I suggest looking for machines that have SMIs 
> but do not crash.  If we find them, this seems to indicate a badly 
> written SMM handler.  If not, then there may be a systemic problem with 
> kvm (or perhaps all SMM handlers are badly written).

We are trying to identify the BIOS vendors. In my case, it is Phoenix, but
at least the Lenovo BIOSes have been re-branded.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
