From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andre Przywara <andre.przywara@amd.com>
Subject: Re: [PATCH] Avoid race when moving cpu between cpupools
Date: Mon, 28 Feb 2011 11:00:04 +0100
Message-ID: <4D6B7224.7030803@amd.com>
References: <5485071c8b0a6a49f65b.1298541625@nehalem1>	<4D666678.1000301@amd.com>	<AANLkTikSiJKLH=ginoEgO4Tx0-Z1AC2bwP4qBDjVSfAg@mail.gmail.com>
	<4D67BBDA.5070603@amd.com> <4D6B6AF8.1040305@ts.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="ISO-8859-1"; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <4D6B6AF8.1040305@ts.fujitsu.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Juergen Gross <juergen.gross@ts.fujitsu.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>, "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, Keir Fraser <keir@xen.org>, "Diestelhorst,
	Stephan" <Stephan.Diestelhorst@amd.com>
List-Id: xen-devel@lists.xenproject.org

Juergen Gross wrote:
> On 02/25/11 15:25, Andre Przywara wrote:
>> George Dunlap wrote:
>>> Looks good -- thanks Juergen.
>>>
>>> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
>>>
>>> -George
>>>
>>> On Thu, Feb 24, 2011 at 2:08 PM, Andre Przywara
>>> <andre.przywara@amd.com> wrote:
>>>> Juergen Gross wrote:
>>>>> Moving cpus between cpupools is done under the schedule lock of the
>>>>> moved
>>>>> cpu.
>>>>> When checking a cpu being member of a cpupool this must be done with
>>>>> the
>>>>> lock
>>>>> of that cpu being held.
>>>> I have reviewed and tested the patch. It fixes my problem. My script has
>>>> been running for several hundred iterations without any Xen crash,
>>>> whereas
>>>> without the patch the hypervisor crashed mostly at the second iteration.
>> Juergen,
>>
>> can you rule out that this code will be triggered on two CPUs trying to
>> switch to each other? As Stephan pointed out: the code looks like as
>> this could trigger a possible dead-lock condition, where:
>> 1) CPU A grabs lock (a) while CPU B grabs lock (b)
>> 2) CPU A tries to grab (b) and CPU B tries to grab (a)
>> 3) both fail and loop to 1)
> 
> Good point. Not quite a dead-lock, but a possible live-lock :-)

Yeah, sorry. That was the wrong wording. Kudos go to Stephan for 
pointing this out.
> 
>> A possible fix would be to introduce some ordering for the locks (just
>> the pointer address) and let the "bigger" pointer yield to the "smaller"
>> one.
> 
> Done this and sent a patch.

Thanks, it looks good on the first glance. Not yet tested, though.

> 
>> I am not sure if this is really necessary, but I now see strange
>> hangs after running the script for a while (30min to 1hr).
>> Sometimes Dom0 hangs for a while, loosing interrupts (sda or eth0) or
>> getting spurious ones, on two occasions the machine totally locked up.
>>
>> I am not 100% sure whether this is CPUpools related, but I put some load
>> on Dom0 (without messing with CPUpools) for the whole night and it ran
>> fine.
> 
> Did you try to do this with all Dom0-vcpus pinned to 6 physical cpus?
> I had the same problems when using only few physical cpus for many vcpus.
> And I'm pretty sure this was NOT the possible live-lock, as it happened
> already without this change when I tried to reproduce your problem.

That is my current theory, too. I inserted counters in the try-loop for 
the locks to detect possible lock-ups, but they didn't went over 99, so 
this is not the reason.
The high overcommit (48 vCPUs on 6 pCPUs) is probably responsible. The 
new reduction of Dom0 vCPUs should avoid this situation in the future.

> 
>> Sorry for this :-(
>> I will try to further isolate this.
>>
>> Anyway, it works much better with the fix than without and I will try to
>> trigger this with the "reduce number of Dom0 vCPUs" patch.

Unfortunately I got a Dom0 crash with the new patch. Reverting 22934 
worked fine. I will investigate this now.

root@dosorca:/data/images# xl cpupool-numa-split
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.
(XEN) Resetting with ACPI MEMORY or I/O RESET_REG.

Regards,
Andre.

-- 
Andre Przywara
AMD-Operating System Research Center (OSRC), Dresden, Germany