From mboxrd@z Thu Jan  1 00:00:00 1970
From: Juergen Gross <juergen.gross@ts.fujitsu.com>
Subject: Re: [PATCH] Avoid race when moving cpu between cpupools
Date: Mon, 28 Feb 2011 10:29:28 +0100
Message-ID: <4D6B6AF8.1040305@ts.fujitsu.com>
References: <5485071c8b0a6a49f65b.1298541625@nehalem1>	<4D666678.1000301@amd.com>	<AANLkTikSiJKLH=ginoEgO4Tx0-Z1AC2bwP4qBDjVSfAg@mail.gmail.com>
	<4D67BBDA.5070603@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <4D67BBDA.5070603@amd.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Andre Przywara <andre.przywara@amd.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>, "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, Keir Fraser <keir@xen.org>, "Diestelhorst,
	Stephan" <Stephan.Diestelhorst@amd.com>
List-Id: xen-devel@lists.xenproject.org

On 02/25/11 15:25, Andre Przywara wrote:
> George Dunlap wrote:
>> Looks good -- thanks Juergen.
>>
>> Acked-by: George Dunlap <george.dunlap@eu.citrix.com>
>>
>> -George
>>
>> On Thu, Feb 24, 2011 at 2:08 PM, Andre Przywara
>> <andre.przywara@amd.com> wrote:
>>> Juergen Gross wrote:
>>>> Moving cpus between cpupools is done under the schedule lock of the
>>>> moved cpu.
>>>> When checking whether a cpu is a member of a cpupool, this must be done
>>>> with the lock of that cpu held.
>>> I have reviewed and tested the patch. It fixes my problem. My script has
>>> been running for several hundred iterations without any Xen crash,
>>> whereas without the patch the hypervisor mostly crashed at the second
>>> iteration.
>
> Juergen,
>
> can you rule out that this code will be triggered on two CPUs trying to
> switch to each other? As Stephan pointed out: the code looks as though
> it could trigger a dead-lock condition, where:
> 1) CPU A grabs lock (a) while CPU B grabs lock (b)
> 2) CPU A tries to grab (b) and CPU B tries to grab (a)
> 3) both fail and loop to 1)

Good point. Not quite a dead-lock, but a possible live-lock :-)
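
The three quoted steps can be modeled with a small toy simulation. This is
only a sketch under stated assumptions: it is not Xen code, the names
toy_lock, toy_trylock and simulate_lockstep are hypothetical, and both CPUs
are assumed to run in perfect lockstep, which is the worst case for this
pattern. Under that assumption neither CPU ever holds both locks, although
neither ever blocks either, which is what makes it a live-lock rather than
a dead-lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy lock: a plain flag, good enough for a
 * single-threaded model of two CPUs stepping in lockstep. */
struct toy_lock { bool held; };

/* Try to take the lock; return false instead of waiting. */
bool toy_trylock(struct toy_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

/* Run the quoted steps 1) to 3) for the given number of rounds and
 * return how often either CPU managed to hold both locks. */
int simulate_lockstep(int rounds)
{
    struct toy_lock a = { false }, b = { false };
    int successes = 0;

    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++) {
        /* 1) CPU A grabs lock (a) while CPU B grabs lock (b) */
        bool a_got_a = toy_trylock(&a);
        bool b_got_b = toy_trylock(&b);

        /* 2) CPU A tries to grab (b) and CPU B tries to grab (a) */
        bool a_got_b = toy_trylock(&b);
        bool b_got_a = toy_trylock(&a);

        if (a_got_a && a_got_b)
            successes++;
        if (b_got_b && b_got_a)
            successes++;

        /* 3) both back off, drop what they hold, and loop to 1) */
        a.held = false;
        b.held = false;
    }
    return successes;
}
```

In this model simulate_lockstep() reports no successful round, however
many rounds are run. On real hardware the two CPUs are rarely that
symmetric, so the loop would usually resolve itself after some retries,
but nothing guarantees that.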

> A possible fix would be to introduce some ordering for the locks (just
> the pointer address) and let the "bigger" pointer yield to the "smaller"
> one.

I have done this and sent a patch.
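
The general idea of ordering locks by pointer address can be sketched as
follows. This is an illustration only and not the patch itself: it uses
POSIX mutexes instead of Xen's spinlocks, and the helper names lock_first,
lock_pair and unlock_pair are hypothetical. It shows one common way to
apply such an ordering, namely always taking the lock with the lower
address first, so that two CPUs can never each hold one lock while
waiting for the other.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: of two locks, return the one that has to be
 * taken first, i.e. the one with the lower address. */
pthread_mutex_t *lock_first(pthread_mutex_t *a, pthread_mutex_t *b)
{
    assert(a != NULL && b != NULL);
    return ((uintptr_t)a <= (uintptr_t)b) ? a : b;
}

/* Take both locks in address order, whatever the argument order.
 * If both arguments name the same lock it is taken only once. */
void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_t *first = lock_first(a, b);
    pthread_mutex_t *second = (first == a) ? b : a;

    pthread_mutex_lock(first);
    if (second != first)
        pthread_mutex_lock(second);
}

/* Release both locks; the release order does not matter for
 * correctness. */
void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
    pthread_mutex_unlock(a);
    if (b != a)
        pthread_mutex_unlock(b);
}
```

With this scheme lock_pair(x, y) on one CPU and lock_pair(y, x) on another
both start with the same lock, so one of them simply waits until the other
has released both.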

> I am not sure if this is really necessary, but I now see strange
> hangs after running the script for a while (30min to 1hr).
> Sometimes Dom0 hangs for a while, losing interrupts (sda or eth0) or
> getting spurious ones; on two occasions the machine totally locked up.
>
> I am not 100% sure whether this is CPUpools related, but I put some load
> on Dom0 (without messing with CPUpools) for the whole night and it ran
> fine.

Did you try to do this with all Dom0-vcpus pinned to 6 physical cpus?
I had the same problems when using only a few physical cpus for many vcpus.
And I'm pretty sure this was NOT the possible live-lock, as it already
happened without this change when I tried to reproduce your problem.

>
> Sorry for this :-(
> I will try to further isolate this.
>
> Anyway, it works much better with the fix than without and I will try to
> trigger this with the "reduce number of Dom0 vCPUs" patch.


Thanks, Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html