From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kiszka
Subject: Re: VMX: System lock-up in guest mode, BIOS under suspect
Date: Mon, 04 Oct 2010 10:41:01 +0200
Message-ID: <4CA9931D.1080708@siemens.com>
References: <4CA60CB8.8020506@siemens.com> <4CA76AEC.7010604@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
Cc: kvm
To: Avi Kivity
Return-path:
Received: from thoth.sbs.de ([192.35.17.2]:18087 "EHLO thoth.sbs.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753238Ab0JDIlV (ORCPT ); Mon, 4 Oct 2010 04:41:21 -0400
In-Reply-To: <4CA76AEC.7010604@redhat.com>
Sender: kvm-owner@vger.kernel.org
List-ID:

On 02.10.2010 19:25, Avi Kivity wrote:
> On 10/01/2010 06:30 PM, Jan Kiszka wrote:
>> Hi,
>>
>> for the past days I've been trying to understand a very strange hard
>> lock-up of some Intel i7 boxes when running our 16-bit guest OS under
>> KVM. After applying some instrumentation before and after the VM entry
>> (e.g. direct write to VGA memory), it turned out that the system is
>> apparently stuck inside guest mode!
>
> Strictly speaking, it could also be a crash in the small window between
> vmexit and your writes. However it's likely to be as you say.
>
>> I double-checked that VM exits on external IRQs and NMIs are properly
>> enabled in the VMCS - they are. I also tried to capture any potential
>> last words via serial console and even via remote DMA over Firewire -
>> nothing. This likely means that not only the one core in guest mode is
>> stuck but all the others as well (note: the freeze is reproducible both
>> in UP and SMP mode). Very uncommon for an OS crash I would say...
>>
>> So I decided to go for some nice conspiracy theory and put SMIs and
>> related BIOS code under suspect. Interestingly, this worked out:
>>
>> After disabling all SMIs on my box (Fujitsu Celsius H700) via the
>> chipset register, the hard freezes no longer occurred up to now. My
>> customer was able to confirm this on some Lenovo Notebook as well. We
>> are currently collecting data about the affected systems to correlate
>> it, and we are performing longer test runs.
>>
>> Nevertheless, I would like to collect some first comments on this. I'm
>> specifically wondering...
>>
>> - if there is anything the host OS can mess up to make VM exits crash
>>   on the way into SMM or out again (I cannot imagine as the SMM monitor
>>   should always be able to run, at least in the absence of CPU
>>   errata).
>
> Yes. It's basically a small hypervisor, and the host OS is its guest.
> So a well written SMM handler should not depend on any OS setting.
> Whether they're actually tested this way is another matter.
>
>> - what the SMM monitor could do wrong to cause such a crash,
>>   especially as it looks like the hardware does all the switching for
>>   it.
>
> Looks like SMM saves some handler-visible state when EPT is enabled.
> Are all your failures on EPT-capable hosts? If so, what happens when
> EPT is disabled?

All Core i7 should support EPT, so we should have this enabled on all
affected systems. However, ept=0 makes no difference on my box, it still
locks up.

>
>> - if there could still be some KVM crash around host<->guest switching
>>   that just happens to be triggered by the SMI noise and that affects
>>   the whole system (including cores that do not host KVM threads).
>>
>> Any ideas warmly welcome!
>
> Besides trying with ept=0, I suggest looking for machines that have SMIs
> but do not crash. If we find them, this seems to indicate a badly
> written SMM handler. If not, then there may be a systemic problem with
> kvm (or perhaps all SMM handlers are badly written).

We are looking for the BIOS vendors. In my case, it is Phoenix, but at
least the Lenovos have been re-branded.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
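
PS: For anyone who wants to repeat the VMCS check mentioned in the quoted
report, here is a minimal sketch of what "VM exits on external IRQs and
NMIs are enabled" could look like as a test on the pin-based VM-execution
controls. The bit positions (bit 0 for external-interrupt exiting, bit 3
for NMI exiting) are taken from the Intel SDM; the macro and function
names are made up for illustration, and how the control value is obtained
(e.g. a vmcs read inside the hypervisor) is left out on purpose.

```c
#include <stdbool.h>
#include <stdint.h>

/* Pin-based VM-execution control bits as defined in the Intel SDM.
 * The macro names are illustrative, not copied from any kernel header. */
#define PIN_CTL_EXT_INTR_EXITING (1u << 0) /* external interrupts cause VM exits */
#define PIN_CTL_NMI_EXITING      (1u << 3) /* NMIs cause VM exits */

/* Hypothetical helper: returns true only if both external interrupts
 * and NMIs are configured to force a VM exit. */
static bool pin_ctl_exits_on_irq_and_nmi(uint32_t pin_controls)
{
    const uint32_t required = PIN_CTL_EXT_INTR_EXITING | PIN_CTL_NMI_EXITING;

    return (pin_controls & required) == required;
}
```

Note that this only covers IRQs and NMIs. SMIs are not governed by these
two bits, which is why a passing check does not rule out the SMM theory.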