From mboxrd@z Thu Jan  1 00:00:00 1970
From: Juergen Gross <juergen.gross@ts.fujitsu.com>
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Mon, 21 Feb 2011 15:50:14 +0100
Message-ID: <4D627BA6.4020406@ts.fujitsu.com>
References: <4D41FD3A.5090506@amd.com>	<4D4C08B6.30600@amd.com>	<4D4FE7E2.9070605@amd.com>	<4D4FF452.6060508@ts.fujitsu.com>	<AANLkTinoRUQC_suVYFM9-x3D00KvYofq3R=XkCQUj6RP@mail.gmail.com>	<4D50D80F.9000007@ts.fujitsu.com>	<AANLkTinKJUAXhiXpKui_XX8XCD6T5fmzNARwHE6Fjafv@mail.gmail.com>	<AANLkTinP0z9GynF1RFd8RwzWuqvxYdb+UBE+7xKpX6D4@mail.gmail.com>	<4D517051.10402@amd.com>	<AANLkTi=MiELBnPFvb6-jzVth+T7aKxP5JMFhVh3Crdmo@mail.gmail.com>	<AANLkTikgGNz=imS1xRVVjntY5P=+MuT_Qsb=-h3QHajY@mail.gmail.com>	<4D529BD9.5050200@amd.com>	<4D52A2CD.9090507@ts.fujitsu.com>	<4D5388DF.8040900@ts.fujitsu.com>	<4D53AF27.7030909@amd.com>	<4D53F3BC.4070807@amd.com>	<4D54D478.9000402@ts.fujitsu.com>	<4D54E79E.3000800@amd.com>	<AANLkTimkRAHtM4CoTskQ7w6B-8Pis4B2+k7=frxM3oyW@mail.gmail.com>	<4D5A29C0.4050702@ts.fujitsu.com>
	<4D5B9D2B.107@ts.fujitsu.com>	<AANLkTin+rE1=+vpmTg9xeQdYn7_hucSFkrz1qCtiKfkY@mail.gmail.com>
	<4D6237C6.1050206@amd.com> <4D62666C.6010608@ts.fujitsu.com>
	<4D627A6F.5070105@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <4D627A6F.5070105@amd.com>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Andre Przywara <andre.przywara@amd.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>, "xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>, "Diestelhorst, Stephan" <Stephan.Diestelhorst@amd.com>
List-Id: xen-devel@lists.xenproject.org

On 02/21/11 15:45, Andre Przywara wrote:
> Juergen Gross wrote:
>> On 02/21/11 11:00, Andre Przywara wrote:
>>> George Dunlap wrote:
>>>> Andre (and Juergen), can you try again with the attached patch?
>>> I applied this patch on top of 22931 and it did _not_ work.
>>> The crash occurred almost immediately after I started my script, so the
>>> same behaviour as without the patch.
>>
>> Did you try my patch addressing races in the scheduler when moving cpus
>> between cpupools?
> Sorry, I tried yours first, but it didn't apply cleanly on my particular
> tree (sched_jg_fix ;-). So I tested George's first.
>
>> I've attached it again. For me it works quite well, while George's patch
>> seems not to be enough (machine hanging after some tests with cpupools).
> OK, it now applied after a rebase.
> And yes, I didn't see a crash! At least until the script stopped while
> a lot of these messages appeared:
> (XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)
>
> That is what I reported before and is most probably totally unrelated to
> this issue.
> So I consider this fix working!
> I will try to match my recent theories and debug results with your patch
> to see whether this fits.
>
>> OTOH I can't reproduce an error as fast as you even without any patch :-)
>>
>>> (attached my script for reference, though it will most likely only make
>>> sense on bigger NUMA machines)
>>
>> Yeah, on my 2-node system I need several hundred tries to get an error.
>> But it seems to be more effective than George's script.
> I consider the large over-provisioning the reason. With Dom0 having 48
> VCPUs finally squashed together to 6 pCPUs, my script triggered at the
> second run the latest.
> With your patch it made 24 iterations before the other bug kicked in.

Okay, I'll prepare an official patch. It might take a few days, as I'm not in
the office until Thursday.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html