From mboxrd@z Thu Jan 1 00:00:00 1970 From: Andre Przywara Subject: Re: [PATCH] Avoid race when moving cpu between cpupools Date: Mon, 28 Feb 2011 11:00:04 +0100 Message-ID: <4D6B7224.7030803@amd.com> References: <5485071c8b0a6a49f65b.1298541625@nehalem1> <4D666678.1000301@amd.com> <4D67BBDA.5070603@amd.com> <4D6B6AF8.1040305@ts.fujitsu.com> Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <4D6B6AF8.1040305@ts.fujitsu.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xensource.com Errors-To: xen-devel-bounces@lists.xensource.com To: Juergen Gross Cc: George Dunlap , "xen-devel@lists.xensource.com" , Keir Fraser , "Diestelhorst, Stephan" List-Id: xen-devel@lists.xenproject.org Juergen Gross wrote: > On 02/25/11 15:25, Andre Przywara wrote: >> George Dunlap wrote: >>> Looks good -- thanks Juergen. >>> >>> Acked-by: George Dunlap >>> >>> -George >>> >>> On Thu, Feb 24, 2011 at 2:08 PM, Andre Przywara >>> wrote: >>>> Juergen Gross wrote: >>>>> Moving cpus between cpupools is done under the schedule lock of the >>>>> moved >>>>> cpu. >>>>> When checking a cpu being member of a cpupool this must be done with >>>>> the >>>>> lock >>>>> of that cpu being held. >>>> I have reviewed and tested the patch. It fixes my problem. My script has >>>> been running for several hundred iterations without any Xen crash, >>>> whereas >>>> without the patch the hypervisor crashed mostly at the second iteration. >> Juergen, >> >> can you rule out that this code will be triggered on two CPUs trying to >> switch to each other? As Stephan pointed out: the code looks like as >> this could trigger a possible dead-lock condition, where: >> 1) CPU A grabs lock (a) while CPU B grabs lock (b) >> 2) CPU A tries to grab (b) and CPU B tries to grab (a) >> 3) both fail and loop to 1) > > Good point. Not quite a dead-lock, but a possible live-lock :-) Yeah, sorry. That was the wrong wording. Kudos go to Stephan for pointing this out. > >> A possible fix would be to introduce some ordering for the locks (just >> the pointer address) and let the "bigger" pointer yield to the "smaller" >> one. > > Done this and sent a patch. Thanks, it looks good on the first glance. Not yet tested, though. > >> I am not sure if this is really necessary, but I now see strange >> hangs after running the script for a while (30min to 1hr). >> Sometimes Dom0 hangs for a while, loosing interrupts (sda or eth0) or >> getting spurious ones, on two occasions the machine totally locked up. >> >> I am not 100% sure whether this is CPUpools related, but I put some load >> on Dom0 (without messing with CPUpools) for the whole night and it ran >> fine. > > Did you try to do this with all Dom0-vcpus pinned to 6 physical cpus? > I had the same problems when using only few physical cpus for many vcpus. > And I'm pretty sure this was NOT the possible live-lock, as it happened > already without this change when I tried to reproduce your problem. That is my current theory, too. I inserted counters in the try-loop for the locks to detect possible lock-ups, but they didn't went over 99, so this is not the reason. The high overcommit (48 vCPUs on 6 pCPUs) is probably responsible. The new reduction of Dom0 vCPUs should avoid this situation in the future. > >> Sorry for this :-( >> I will try to further isolate this. >> >> Anyway, it works much better with the fix than without and I will try to >> trigger this with the "reduce number of Dom0 vCPUs" patch. Unfortunately I got a Dom0 crash with the new patch. Reverting 22934 worked fine. I will investigate this now. root@dosorca:/data/images# xl cpupool-numa-split (XEN) Domain 0 crashed: rebooting machine in 5 seconds. (XEN) Resetting with ACPI MEMORY or I/O RESET_REG. Regards, Andre. -- Andre Przywara AMD-Operating System Research Center (OSRC), Dresden, Germany