From: Marcelo Tosatti
Subject: kvm guest loops_per_jiffy miscalibration under host load
Date: Wed, 2 Jul 2008 13:40:21 -0300
Message-ID: <20080702164021.GA31751@dmt.cnet>
To: kvm-devel
Cc: gcosta@redhat.com, kraxel@redhat.com, chrisw@redhat.com, aliguori@us.ibm.com

Hello,

I have been discussing with Glauber and Gerd the problem where KVM guests
miscalibrate loops_per_jiffy if there's sufficient load on the host:

    calibrate_delay_direct() failed to get a good estimate for
    loops_per_jiffy. Probably due to long platform interrupts.
    Consider using "lpj=" boot option.
    Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)

while this particular host calculates lpj=1597041.

This means that udelay() can delay for less than what was asked for, with
fatal results such as:

    ..MP-BIOS bug: 8254 timer not connected to IO-APIC
    Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using
    the 'noapic' kernel parameter

This bug is easily triggered by running a CPU-hungry task at nice -20 only
during guest calibration (so that the timer check code in io_apic_{32,64}.c
fails to wait long enough for PIT interrupts to fire).

The problem is that the calibration routines assume a stable relation
between the timer interrupt frequency (the PIT at this boot stage) and the
TSC/execution frequency. The emulated timer frequency is based on host
system time and is therefore virtually immune to heavy load, while the
execution of these routines in the guest is susceptible to scheduling of
the QEMU process.
To fix this in a transparent way (without direct "lpj=" boot parameter
assignment or a paravirt equivalent), it would be necessary to base the
emulated timer frequency on guest execution time instead of host system
time. But this can introduce timekeeping issues (recent Linux guests seem
to handle lost/late interrupts fine as long as the clocksource is reliable)
and just sounds scary.

Possible solutions:

- Require the admin to preset "lpj=". Nasty, not user friendly.

- Pass the proper lpj value via a paravirt interface. Won't cover
  fullvirt guests.

- Have the management app guarantee the minimum amount of CPU required
  for proper calibration during guest initialization.

Comments, ideas?