From mboxrd@z Thu Jan 1 00:00:00 1970
From: Avi Kivity
Subject: Re: VMX: System lock-up in guest mode, BIOS under suspect
Date: Sat, 02 Oct 2010 19:25:00 +0200
Message-ID: <4CA76AEC.7010604@redhat.com>
References: <4CA60CB8.8020506@siemens.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit
Cc: kvm
To: Jan Kiszka
Return-path:
Received: from mx1.redhat.com ([209.132.183.28]:29379 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750931Ab0JBRZH (ORCPT ); Sat, 2 Oct 2010 13:25:07 -0400
In-Reply-To: <4CA60CB8.8020506@siemens.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 10/01/2010 06:30 PM, Jan Kiszka wrote:
> Hi,
>
> for the past days I've been trying to understand a very strange hard
> lock-up of some Intel i7 boxes when running our 16-bit guest OS under
> KVM. After applying some instrumentation before and after the VM entry
> (e.g. direct write to VGA memory), it turned out that the system is
> apparently stuck inside guest mode!

Strictly speaking, it could also be a crash in the small window between
vmexit and your writes. However it's likely to be as you say.

> I double-checked that VM exits on external IRQs and NMIs are properly
> enabled in the VMCS - they are. I also tried to capture any potential
> last words via serial console and even via remote DMA over Firewire) -
> nothing. This likely means that not only the one core in guest mode is
> stuck but all the others as well (note: the freeze is reproducible both
> in UP and SMP mode). Very uncommon for an OS crash I would say...
>
> So I decided to go for some nice conspiracy theory and put SMIs and
> related BIOS code under suspect. Interestingly, this worked out:
>
> After disabling all SMIs on my box (Fujitsu Celsius H700) via the
> chipset register, the hard freezes no longer occurred up to now. My
> customer was able to confirm this on some Lenovo Notebook as well. We
> are currently collecting data about the affected systems to correlate
> it, and we are performing longer test runs.
>
> Nevertheless, I would like to collect some first comments on this. I'm
> specifically wondering...
>
> - if there is anything the host OS can mess up to make VM exits crash
>   on the way into SMM or out again (I cannot imagine as the SMM monitor
>   should always be able to run, at least in the absence of CPU
>   erratas).

Yes. It's basically a small hypervisor, and the host OS is its guest.
So a well written SMM handler should not depend on any OS setting.
Whether they're actually tested this way is another matter.

> - what the SMM monitor could do wrong to cause such a crash,
>   especially as it looks like the hardware does all the switching for
>   it.

Looks like SMM saves some handler-visible state when EPT is enabled.
Are all your failures on EPT-capable hosts? If so, what happens when
EPT is disabled?

> - if there could still be some KVM crash around host<->guest switching
>   that just happens to be triggered by the SMI noise and that affects
>   the whole system (including cores that do not host KVM threads).
>
> Any ideas warmly welcome!

Besides trying with ept=0, I suggest looking for machines that have SMIs
but do not crash. If we find them, this seems to indicate a badly
written SMM handler. If not, then there may be a systemic problem with
kvm (or perhaps all SMM handlers are badly written).

-- 
I have a truly marvellous patch that fixes the bug which this signature is too
narrow to contain.
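
A minimal sketch (not part of the original message) of how the two checks suggested above might be scripted on a Linux host: whether kvm_intel currently has EPT enabled, and whether the machine is taking SMIs at all. It assumes the kvm_intel module is loaded (its `ept` parameter is exposed under /sys/module), that the `msr` driver is loaded and the script runs as root, and that the CPU implements MSR_SMI_COUNT (0x34), which Intel documents for Nehalem-class parts such as the i7. The helper names are hypothetical.

```python
import os
import struct

# MSR_SMI_COUNT per the Intel SDM (Nehalem and later); assumed present here.
MSR_SMI_COUNT = 0x34


def parse_ept_param(text):
    """Interpret the text of kvm_intel's 'ept' module parameter file."""
    return text.strip() in ("Y", "1")


def read_ept_enabled(path="/sys/module/kvm_intel/parameters/ept"):
    """Return True/False for EPT, or None if kvm_intel is not loaded."""
    try:
        with open(path) as f:
            return parse_ept_param(f.read())
    except OSError:
        return None


def decode_smi_count(raw):
    """An MSR read yields 8 little-endian bytes; the count is bits 31:0."""
    (value,) = struct.unpack("<Q", raw)
    return value & 0xFFFFFFFF


def read_smi_count(cpu=0):
    """Read the SMI counter via /dev/cpu/N/msr, or None if unavailable."""
    try:
        fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    except OSError:
        return None
    try:
        # The msr device uses the file offset as the MSR index.
        return decode_smi_count(os.pread(fd, 8, MSR_SMI_COUNT))
    except OSError:
        return None
    finally:
        os.close(fd)


if __name__ == "__main__":
    print("EPT enabled:", read_ept_enabled())
    print("SMI count on CPU 0:", read_smi_count(0))
```

Sampling the counter before and after a long guest run gives a rough answer to "does this machine take SMIs while the guest runs": a rising count on a box that never locks up would be one of the "SMIs but no crash" data points. To test without EPT, the module can be reloaded with `modprobe kvm_intel ept=0`.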