From: Steve Traugott
Subject: Re: blocking Xen 3.X production use: soft lockup bugs
Date: Fri, 4 Aug 2006 13:21:21 -0700
To: Keir Fraser
Cc: Ian Pratt, xen-devel

You nailed it, Keir.

On Thu, Aug 03, 2006 at 09:03:18AM +0100, Keir Fraser wrote:
> Also older versions using sedf scheduler (which has now been patched to
> avoid this) could end up with domain0 consuming all CPU and starving
> other guests, leading to softlockup errors. We haven't seen any such
> errors on our own test machines since this was fixed. Of course, that
> doesn't mean there aren't problems with other test scenarios!

That is exactly what was happening. I did more testing yesterday and
last night (-testing changeset 9732), and realized that I was only
seeing soft lockups on the second of two domU guests, and only when
running a heavy load in dom0. According to 'xm vcpu-list', the second
guest was on CPU 0, as was the workload in dom0. I then added more
workload processes to consume both CPUs in dom0, and, sure enough, the
first guest ground to a halt and started showing soft lockups as well.

I was usually able to trigger the soft lockups within a few seconds
simply by running one or more of these in dom0:

    cat /dev/zero > /dev/null

Variants of 'nc -ub 255.255.255.255 10000 < /dev/zero' and
'nc -u -l -p 10000 > /dev/null' in dom0 or domU also made things
interesting, though I'm not sure that the network traffic is a factor.
(Kids, don't do this on a production net...)
So I built -unstable changeset 10868, which uses the credit scheduler,
and ran an even heavier workload (the above, plus 'bonnie' in the
guests) on dom0 and two guests overnight. They experienced no soft
lockups. The same workload would have caused soft lockups within
seconds on -testing changeset 9732 with the sedf scheduler; I might not
even have been able to get it started there. Response time stayed
under a second on -unstable, whereas -testing would have been on its
knees.

Steve

-- 
Stephen G. Traugott (KG6HDQ)
UNIX/Linux Infrastructure Architect, TerraLuna LLC
stevegt@TerraLuna.Org
http://www.stevegt.com -- http://Infrastructures.Org