From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juergen Gross
Subject: Re: Dom0 losing interrupts???
Date: Mon, 14 Feb 2011 12:46:56 +0100
Message-ID: <4D591630.1090302@ts.fujitsu.com>
References: <4D58D2D7.9010803@ts.fujitsu.com> <4D59034A0200007800031B7A@vpn.id2.novell.com> <4D58F820.80401@ts.fujitsu.com> <4D590AE70200007800031BC1@vpn.id2.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: George Dunlap
Cc: "xen-devel@lists.xensource.com", Jan Beulich
List-Id: xen-devel@lists.xenproject.org

On 02/14/11 12:21, George Dunlap wrote:
> My sense is that:
> * Pinning N vcpus to N-M pcpus (where M is a significant fraction of
> N) is just a really bad idea; it would be better just not to do that.

I just wanted to make sure the interrupts are not lost due to the cpupool
operation itself. So I tried with an extreme configuration and was proved
right :-)

> It would be ideal if somehow when dom0's cpu pool shrinks, it
> automatically offlines an appropriate number of vcpus; but it
> shouldn't be difficult for an administrator to do that themselves.

I've sent a patch for the cpupool-numa-split case, which will always remove
a significant number of physical cpus for dom0.

> * On average, a vcpu shouldn't have to wait more than 60ms or so for
> an interrupt. It seems like there's a non-negligible possibility that
> there's some kind of bug in the interrupt delivery and handling,
> either on the Xen side or the Linux side (or as Jan pointed out, a bug
> in the driver). In that case, doing something in the scheduler isn't
> actually fixing the problem, it's just making it less likely to
> happen.
> (NB that we've had intermittent failures in the xen.org
> testing infrastructure with what looks like might be missed interrupts
> as well -- and those weren't on heavily loaded boxes.)

Any idea what I could do to help? Our larger test machines are not just
idling, but I could use one from time to time without much trouble. It's
rather easy for me to reproduce the problem; OTOH it should be easy for
others with a reasonably large machine, too.

> * Even if it is ultimately a scheduler bug, understanding exactly what
> the scheduler is doing and why is key to making a proper fix. It's
> possible that there's just a simple quirk in the algorithm, such that
> a general fix will make everything work better without needing to
> introduce a special case for hardware interrupts.
> * I'm not opposed in principle to a mechanism which will prioritize
> vcpus awaiting hardware interrupts. But I am wary of guessing what
> the problem is and then introducing a patch without proper root-cause
> analysis. Even if it seems to fix the immediate problem, it may
> simply be masking the real problem, and may also cause problems of its
> own. Behavior of the scheduler is hard enough to understand already,
> and every special case makes it even harder.

I absolutely agree!


Juergen

--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html