From: Juergen Gross
Subject: Re: [PATCH] Avoid race when moving cpu between cpupools
Date: Mon, 28 Feb 2011 10:29:28 +0100
Message-ID: <4D6B6AF8.1040305@ts.fujitsu.com>
References: <5485071c8b0a6a49f65b.1298541625@nehalem1> <4D666678.1000301@amd.com> <4D67BBDA.5070603@amd.com>
In-Reply-To: <4D67BBDA.5070603@amd.com>
To: Andre Przywara
Cc: George Dunlap, "xen-devel@lists.xensource.com", Keir Fraser, "Diestelhorst, Stephan"
List-Id: xen-devel@lists.xenproject.org

On 02/25/11 15:25, Andre Przywara wrote:
> George Dunlap wrote:
>> Looks good -- thanks Juergen.
>>
>> Acked-by: George Dunlap
>>
>> -George
>>
>> On Thu, Feb 24, 2011 at 2:08 PM, Andre Przywara wrote:
>>> Juergen Gross wrote:
>>>> Moving cpus between cpupools is done under the schedule lock of the
>>>> moved cpu.
>>>> When checking a cpu being member of a cpupool this must be done with
>>>> the lock of that cpu being held.
>>> I have reviewed and tested the patch. It fixes my problem. My script has
>>> been running for several hundred iterations without any Xen crash,
>>> whereas without the patch the hypervisor crashed mostly at the second
>>> iteration.
>
> Juergen,
>
> can you rule out that this code will be triggered on two CPUs trying to
> switch to each other? As Stephan pointed out: the code looks as if it
> could trigger a possible dead-lock condition, where:
> 1) CPU A grabs lock (a) while CPU B grabs lock (b)
> 2) CPU A tries to grab (b) and CPU B tries to grab (a)
> 3) both fail and loop to 1)

Good point.
Not quite a dead-lock, but a possible live-lock :-)

> A possible fix would be to introduce some ordering for the locks (just
> the pointer address) and let the "bigger" pointer yield to the "smaller"
> one.

I have done this and sent a patch.

> I am not sure if this is really necessary, but I now see strange
> hangs after running the script for a while (30min to 1hr).
> Sometimes Dom0 hangs for a while, losing interrupts (sda or eth0) or
> getting spurious ones; on two occasions the machine totally locked up.
>
> I am not 100% sure whether this is CPUpools related, but I put some load
> on Dom0 (without messing with CPUpools) for the whole night and it ran
> fine.

Did you try to do this with all Dom0-vcpus pinned to 6 physical cpus?
I had the same problems when using only a few physical cpus for many vcpus.
And I'm pretty sure this was NOT the possible live-lock, as it happened
already without this change when I tried to reproduce your problem.

>
> Sorry for this :-(
> I will try to further isolate this.
>
> Anyway, it works much better with the fix than without and I will try to
> trigger this with the "reduce number of Dom0 vCPUs" patch.

Thanks,

Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6              Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions  e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                 Internet: ts.fujitsu.com
D-80807 Muenchen              Company details: ts.fujitsu.com/imprint.html
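
The address ordering Andre suggests above can be sketched roughly as
follows. This is only a minimal illustration, not the patch that was sent:
toy_lock_t, toy_lock, toy_trylock, toy_unlock, lock_pair and unlock_pair
are hypothetical names, and a C11 atomic_flag spinlock stands in for the
per-cpu schedule lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy spinlock; an assumption standing in for the per-cpu schedule lock. */
typedef struct { atomic_flag held; } toy_lock_t;

static void toy_lock(toy_lock_t *l)
{
    /* Spin until the flag was previously clear, i.e. we now own the lock. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;
}

/* Returns true if the lock was free and is now held by the caller. */
static bool toy_trylock(toy_lock_t *l)
{
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

static void toy_unlock(toy_lock_t *l)
{
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

/*
 * Take both locks, lower address first. Two CPUs locking the same pair
 * then always agree on the order, whichever way round they pass them.
 */
static void lock_pair(toy_lock_t *a, toy_lock_t *b)
{
    if (a == b) {
        toy_lock(a);            /* same lock: take it only once */
        return;
    }
    if ((uintptr_t)a < (uintptr_t)b) {
        toy_lock(a);
        toy_lock(b);
    } else {
        toy_lock(b);
        toy_lock(a);
    }
}

static void unlock_pair(toy_lock_t *a, toy_lock_t *b)
{
    toy_unlock(a);
    if (a != b)
        toy_unlock(b);
}
```

Because both CPUs take the lower-addressed lock first, the situation where
A holds (a) and waits for (b) while B holds (b) and waits for (a) cannot
arise, so no trylock-and-retry loop is needed in this variant.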