From mboxrd@z Thu Jan  1 00:00:00 1970
From: Avi Kivity <avi@redhat.com>
Subject: Re: VMX: System lock-up in guest mode, BIOS under suspect
Date: Sat, 02 Oct 2010 19:25:00 +0200
Message-ID: <4CA76AEC.7010604@redhat.com>
References: <4CA60CB8.8020506@siemens.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm <kvm@vger.kernel.org>
To: Jan Kiszka <jan.kiszka@siemens.com>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:29379 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1750931Ab0JBRZH (ORCPT <rfc822;kvm@vger.kernel.org>);
	Sat, 2 Oct 2010 13:25:07 -0400
In-Reply-To: <4CA60CB8.8020506@siemens.com>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

  On 10/01/2010 06:30 PM, Jan Kiszka wrote:
> Hi,
>
> for the past few days I've been trying to understand a very strange hard
> lock-up of some Intel i7 boxes when running our 16-bit guest OS under
> KVM. After applying some instrumentation before and after the VM entry
> (e.g. direct write to VGA memory), it turned out that the system is
> apparently stuck inside guest mode!

Strictly speaking, it could also be a crash in the small window between 
vmexit and your writes.  However it's likely to be as you say.

> I double-checked that VM exits on external IRQs and NMIs are properly
> enabled in the VMCS - they are. I also tried to capture any potential
> last words via serial console and even via remote DMA over FireWire -
> nothing. This likely means that not only the one core in guest mode is
> stuck but all the others as well (note: the freeze is reproducible both
> in UP and SMP mode). Very uncommon for an OS crash I would say...
>
> So I decided to go for some nice conspiracy theory and put SMIs and
> related BIOS code under suspect. Interestingly, this worked out:
>
> After disabling all SMIs on my box (Fujitsu Celsius H700) via the
> chipset register, the hard freezes have not recurred so far. My
> customer was able to confirm this on a Lenovo notebook as well. We
> are currently collecting data about the affected systems to correlate
> it, and we are performing longer test runs.
>
> Nevertheless, I would like to collect some first comments on this. I'm
> specifically wondering...
>
>   - if there is anything the host OS can mess up to make VM exits crash
>     on the way into SMM or out again (I cannot imagine as the SMM monitor
>     should always be able to run, at least in the absence of CPU
>     errata).

Agreed.  The SMM handler is basically a small hypervisor, and the host 
OS is its guest.  So a well-written SMM handler should not depend on 
any OS setting.  Whether handlers are actually tested this way is 
another matter.

>   - what the SMM monitor could do wrong to cause such a crash,
>     especially as it looks like the hardware does all the switching for
>     it.

Looks like SMM saves some handler-visible state when EPT is enabled.  
Are all your failures on EPT-capable hosts?  If so, what happens when 
EPT is disabled?
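
To record whether a given host currently runs with EPT, the kvm_intel 
module parameter can be read from sysfs.  A minimal sketch, assuming 
the parameter is exposed at /sys/module/kvm_intel/parameters/ept and 
reads as Y/N (or 1/0); the helper names are made up for illustration:

```python
from pathlib import Path

# Assumed sysfs location of the kvm_intel "ept" module parameter.
EPT_PARAM = Path("/sys/module/kvm_intel/parameters/ept")


def parse_bool_param(text):
    """Convert the sysfs text of a boolean module parameter to a bool."""
    value = text.strip()
    if value in ("Y", "y", "1"):
        return True
    if value in ("N", "n", "0"):
        return False
    raise ValueError("unexpected parameter value: %r" % value)


def ept_enabled(path=EPT_PARAM):
    """Return True if kvm_intel reports EPT as enabled."""
    return parse_bool_param(path.read_text())


if __name__ == "__main__":
    try:
        print("EPT enabled:", ept_enabled())
    except (OSError, ValueError) as exc:
        # kvm_intel not loaded, or the file has an unexpected format.
        print("cannot determine EPT state:", exc)
```

To test without EPT, reload the module with the parameter cleared, 
e.g. "modprobe kvm_intel ept=0" after unloading it.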

>   - if there could still be some KVM crash around host<->guest switching
>     that just happens to be triggered by the SMI noise and that affects
>     the whole system (including cores that do not host KVM threads).
>
> Any ideas warmly welcome!

Besides trying with ept=0, I suggest looking for machines that do take 
SMIs but do not crash.  If we find any, that points to a badly written 
SMM handler on the affected machines.  If not, then there may be a 
systemic problem with kvm (or perhaps all SMM handlers are badly 
written).
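
One way to tell whether a machine takes SMIs at all is to sample the 
SMI counter on CPUs that have one.  The sketch below assumes a 
Nehalem-or-later Intel CPU (where MSR 0x34 is MSR_SMI_COUNT, count in 
the low 32 bits), the msr driver loaded so that /dev/cpu/N/msr exists, 
and root privileges; the function names are only illustrative:

```python
import os
import struct

MSR_SMI_COUNT = 0x34  # assumption: CPU implements this MSR (Nehalem+)


def decode_smi_count(raw):
    """Extract the 32-bit SMI count from the 8-byte MSR value."""
    (value,) = struct.unpack("<Q", raw)
    return value & 0xFFFFFFFF


def smi_delta(before, after):
    """SMIs seen between two samples, tolerating 32-bit wraparound."""
    return (after - before) & 0xFFFFFFFF


def read_smi_count(cpu=0):
    """Read MSR_SMI_COUNT on the given CPU via the msr driver."""
    fd = os.open("/dev/cpu/%d/msr" % cpu, os.O_RDONLY)
    try:
        # The msr device uses the file offset as the MSR index.
        raw = os.pread(fd, 8, MSR_SMI_COUNT)
    finally:
        os.close(fd)
    return decode_smi_count(raw)


if __name__ == "__main__":
    try:
        print("SMI count on cpu0:", read_smi_count(0))
    except (OSError, struct.error) as exc:
        # No msr driver, insufficient privileges, or MSR not present.
        print("cannot read MSR_SMI_COUNT:", exc)
```

Sampling the counter before and after a guest run and feeding both 
values to smi_delta() gives the number of SMIs taken in between, which 
can then be compared between hosts that freeze and hosts that don't.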

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.