From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jan Kiszka
Subject: VMX: System lock-up in guest mode, BIOS under suspect
Date: Fri, 01 Oct 2010 18:30:48 +0200
Message-ID: <4CA60CB8.8020506@siemens.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-15
Content-Transfer-Encoding: 7bit
To: kvm
Return-path:
Received: from goliath.siemens.de ([192.35.17.28]:19371 "EHLO
	goliath.siemens.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751156Ab0JAQau (ORCPT );
	Fri, 1 Oct 2010 12:30:50 -0400
Received: from mail1.siemens.de (localhost [127.0.0.1])
	by goliath.siemens.de (8.12.11.20060308/8.12.11) with ESMTP id o91GUm9c027060
	for ; Fri, 1 Oct 2010 18:30:49 +0200
Received: from mchn199C.mchp.siemens.de ([139.25.109.49])
	by mail1.siemens.de (8.12.11.20060308/8.12.11) with ESMTP id o91GUmQE015654
	for ; Fri, 1 Oct 2010 18:30:48 +0200
Sender: kvm-owner@vger.kernel.org
List-ID:

Hi,

for the past few days I've been trying to understand a very strange hard
lock-up of some Intel i7 boxes when running our 16-bit guest OS under
KVM. After adding some instrumentation before and after the VM entry
(e.g. direct writes to VGA memory), it turned out that the system is
apparently stuck inside guest mode!

I double-checked that VM exits on external IRQs and NMIs are properly
enabled in the VMCS - they are. I also tried to capture any potential
last words (via serial console and even via remote DMA over FireWire) -
nothing. This likely means that not only the one core in guest mode is
stuck but all the others as well (note: the freeze is reproducible both
in UP and SMP mode). Very uncommon for an OS crash, I would say...

So I decided to go for some nice conspiracy theory and put SMIs and the
related BIOS code under suspect. Interestingly, this worked out: after
disabling all SMIs on my box (Fujitsu Celsius H700) via the chipset
register, the hard freezes have not occurred again so far. My customer
was able to confirm this on a Lenovo notebook as well.
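
For readers who want to repeat the VMCS check mentioned above, here is a
minimal sketch of what "exits on external IRQs and NMIs are enabled"
means at the bit level. It assumes you already have the 32-bit value of
the pin-based VM-execution controls field (for example from a debug
dump). The function name and the sample values are hypothetical and not
taken from KVM; only the two bit positions come from the Intel SDM.

```python
# Bit positions in the pin-based VM-execution controls (Intel SDM, Vol. 3).
PIN_BASED_EXT_INTR_EXITING = 1 << 0  # external interrupts cause VM exits
PIN_BASED_NMI_EXITING = 1 << 3       # NMIs cause VM exits


def missing_exit_controls(pin_based_ctls):
    """Return the names of the required exiting controls that are NOT set.

    Hypothetical helper for illustration; an empty list means both
    external-interrupt exiting and NMI exiting are enabled.
    """
    required = {
        "external-interrupt exiting": PIN_BASED_EXT_INTR_EXITING,
        "NMI exiting": PIN_BASED_NMI_EXITING,
    }
    return [name for name, bit in required.items()
            if not pin_based_ctls & bit]


if __name__ == "__main__":
    # Made-up sample values, not dumps from the affected machines.
    for value in (0x1F, 0x16):
        missing = missing_exit_controls(value)
        print(hex(value), "OK" if not missing else "missing: " + ", ".join(missing))
```

Note that these two controls only govern external interrupts and NMIs.
Under the default SMM treatment, an SMI is not turned into a VM exit by
either of them, which is consistent with the check passing while the
SMI-related freeze still occurs.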

We are currently collecting data about the affected systems in order to
correlate it, and we are performing longer test runs. Nevertheless, I
would like to collect some first comments on this. I'm specifically
wondering...

 - if there is anything the host OS can mess up to make VM exits crash
   on the way into SMM or out again (I cannot imagine this, as the SMM
   monitor should always be able to run, at least in the absence of CPU
   errata).

 - what the SMM monitor could do wrong to cause such a crash,
   especially as it looks like the hardware does all the switching for
   it.

 - if there could still be some KVM crash around host<->guest switching
   that just happens to be triggered by the SMI noise and that affects
   the whole system (including cores that do not host KVM threads).

Any ideas are warmly welcome!

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux