From mboxrd@z Thu Jan 1 00:00:00 1970
From: Anthony Liguori
Subject: Re: kvm guest loops_per_jiffy miscalibration under host load
Date: Sun, 06 Jul 2008 20:56:27 -0500
Message-ID: <487177CB.60104@us.ibm.com>
References: <20080702164021.GA31751@dmt.cnet> <486CD151.8020004@redhat.com>
In-Reply-To: <486CD151.8020004@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Marcelo Tosatti, kvm-devel, kraxel@redhat.com, chrisw@redhat.com
To: Glauber Costa
Sender: kvm-owner@vger.kernel.org

Glauber Costa wrote:
> Marcelo Tosatti wrote:
>> Hello,
>>
>> I have been discussing with Glauber and Gerd the problem where KVM
>> guests miscalibrate loops_per_jiffy if there's sufficient load on the
>> host.
>>
>> calibrate_delay_direct() failed to get a good estimate for loops_per_jiffy.
>> Probably due to long platform interrupts. Consider using "lpj=" boot option.
>> Calibrating delay loop... <3>107.00 BogoMIPS (lpj=214016)
>>
>> While this particular host calculates lpj=1597041.
>>
>> This means that udelay() can delay for less than what was asked for,
>> with fatal results such as:
>>
>> ..MP-BIOS bug: 8254 timer not connected to IO-APIC
>> Kernel panic - not syncing: IO-APIC + timer doesn't work! Try using the
>> 'noapic' kernel parameter
>>
>> This bug is easily triggered with a CPU hungry task on nice -20
>> running only during guest calibration (so that the timer check code on
>> io_apic_{32,64}.c fails to wait long enough for PIT interrupts to fire).
>>
>> The problem is that the calibration routines assume a stable relation
>> between timer interrupt frequency (PIT at this boot stage) and
>> TSC/execution frequency.
>>
>> The emulated timer frequency is based on the host system time and
>> therefore virtually resistant against heavy load, while the execution
>> of these routines on the guest is susceptible to scheduling of the QEMU
>> process.
>>
>> To fix this in a transparent way (without direct "lpj=" boot parameter
>> assignment or a paravirt equivalent), it would be necessary to base the
>> emulated timer frequency on guest execution time instead of host system
>> time. But this can introduce timekeeping issues (recent Linux guests
>> seem to handle lost/late interrupts fine as long as the clocksource is
>> reliable) and just sounds scary.
>>
>> Possible solutions:
>>
>> - Require the admin to preset "lpj=". Nasty, not user friendly.
>> - Pass the proper lpj value via a paravirt interface. Won't cover
>>   fullvirt guests.
>> - Have the management app guarantee a minimum amount of CPU required
>>   for proper calibration during guest initialization.
>
> I don't like any of these solutions, and won't defend any of "the
> one". So no hard feelings. But I think the "less worse" among them
> IMHO is the paravirt one. At least it goes in the general direction
> of "paravirt if you need to scale over xyz".

I agree. A paravirt solution solves the problem.
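
To put rough numbers on the miscalibration quoted above, here is a minimal Python sketch. It uses only the two lpj values from the thread. Everything else is an assumption for illustration: HZ=250 is not stated anywhere in the thread (it is simply the value that makes lpj=214016 print as 107.00 BogoMIPS), the host's lpj is taken as the guest's "true" rate, and the helper names are made up.

```python
# Values quoted in the thread's boot log and problem description.
GUEST_LPJ = 214016    # loops_per_jiffy the loaded guest calibrated
HOST_LPJ = 1597041    # loops_per_jiffy the host itself calculates

# Assumption: HZ is not stated in the thread; 250 is the value that makes
# lpj=214016 print as 107.00 BogoMIPS.
ASSUMED_HZ = 250


def bogomips(lpj, hz=ASSUMED_HZ):
    """Format BogoMIPS as the boot message does (two decimals, truncated)."""
    whole = lpj // (500000 // hz)
    frac = (lpj // (5000 // hz)) % 100
    return f"{whole}.{frac:02d}"


def effective_delay_us(requested_us, calibrated_lpj, true_lpj):
    """Approximate wall-clock time a udelay(requested_us) really takes.

    Assumption: the delay loop count scales linearly with the calibrated
    lpj, while the CPU actually executes true_lpj loops per jiffy.
    """
    return requested_us * calibrated_lpj / true_lpj


if __name__ == "__main__":
    print("guest BogoMIPS:", bogomips(GUEST_LPJ))
    print("udelay(1000) lasts roughly",
          round(effective_delay_us(1000, GUEST_LPJ, HOST_LPJ)), "us")
```

Under these assumptions a requested delay shrinks to a bit over an eighth of its intended length, which is consistent with the timer check giving up before the PIT interrupts arrive.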

> I think passing lpj is out of question, and giving the cpu resources
> for that time is kind of a kludge.

It's all heuristics unfortunately.

> Or maybe we could put the timer expiration alone in a separate thread,
> with maximum priority (maybe rt priority)? dunno...

But then if you have high-load because of a lot of guests running, you
defeat yourself. Any attempt to guarantee time to a guest will be
defeated by lots of guests all attempting calibration at the same time.

Regards,

Anthony Liguori
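
The point about many guests calibrating at once can be illustrated with a toy model in Python. This is a sketch under loud assumptions, not a description of how KVM or the host scheduler behaves: it assumes every calibrating guest gets an equal share of the host CPUs, that emulated timer ticks follow host wall-clock time, and that the measured lpj scales linearly with that share. The function name and the 4-CPU host are hypothetical; only the lpj value comes from the thread.

```python
HOST_LPJ = 1597041  # value the host calculates, as quoted in the thread


def measured_lpj(true_lpj, host_cpus, calibrating_guests):
    """Toy estimate of the lpj a guest measures under contention.

    Assumption: calibrating guests share the host CPUs equally, so each
    one executes only a fraction of its loops between two timer ticks.
    """
    if calibrating_guests < 1 or host_cpus < 1:
        raise ValueError("need at least one guest and one host CPU")
    share = min(1.0, host_cpus / calibrating_guests)
    return int(true_lpj * share)


if __name__ == "__main__":
    # Hypothetical 4-CPU host with a growing number of booting guests.
    for guests in (1, 4, 8, 16):
        print(guests, "guests ->", measured_lpj(HOST_LPJ, 4, guests))
```

In this model the measured value is only correct while there are enough host CPUs for every calibrating guest. Once guests outnumber CPUs, each guest's estimate drops no matter which one was promised time.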