From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juergen Gross
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Mon, 21 Feb 2011 15:50:14 +0100
Message-ID: <4D627BA6.4020406@ts.fujitsu.com>
References: <4D41FD3A.5090506@amd.com> <4D4C08B6.30600@amd.com>
 <4D4FE7E2.9070605@amd.com> <4D4FF452.6060508@ts.fujitsu.com>
 <4D50D80F.9000007@ts.fujitsu.com> <4D517051.10402@amd.com>
 <4D529BD9.5050200@amd.com> <4D52A2CD.9090507@ts.fujitsu.com>
 <4D5388DF.8040900@ts.fujitsu.com> <4D53AF27.7030909@amd.com>
 <4D53F3BC.4070807@amd.com> <4D54D478.9000402@ts.fujitsu.com>
 <4D54E79E.3000800@amd.com> <4D5A29C0.4050702@ts.fujitsu.com>
 <4D5B9D2B.107@ts.fujitsu.com> <4D6237C6.1050206@amd.com>
 <4D62666C.6010608@ts.fujitsu.com> <4D627A6F.5070105@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <4D627A6F.5070105@amd.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Andre Przywara
Cc: George Dunlap, "xen-devel@lists.xensource.com", "Diestelhorst, Stephan"
List-Id: xen-devel@lists.xenproject.org

On 02/21/11 15:45, Andre Przywara wrote:
> Juergen Gross wrote:
>> On 02/21/11 11:00, Andre Przywara wrote:
>>> George Dunlap wrote:
>>>> Andre (and Juergen), can you try again with the attached patch?
>>> I applied this patch on top of 22931 and it did _not_ work.
>>> The crash occurred almost immediately after I started my script, so the
>>> same behaviour as without the patch.
>>
>> Did you try my patch addressing races in the scheduler when moving cpus
>> between cpupools?
> Sorry, I tried yours first, but it didn't apply cleanly on my particular
> tree (sched_jg_fix ;-). So I tested George's first.
>
>> I've attached it again. For me it works quite well, while George's patch
>> seems not to be enough (machine hanging after some tests with cpupools).
> OK, it now applied after a rebase.
> And yes, I didn't see a crash! At least until the script stopped while
> a lot of these messages appeared:
> (XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)
>
> That is what I reported before and is most probably totally unrelated to
> this issue.
> So I consider this fix working!
> I will try to match my recent theories and debug results with your patch
> to see whether this fits.
>
>> OTOH I can't reproduce an error as fast as you even without any patch :-)
>>
>>> (attached my script for reference, though it will most likely only make
>>> sense on bigger NUMA machines)
>>
>> Yeah, on my 2-node system I need several hundred tries to get an error.
>> But it seems to be more effective than George's script.
> I consider the large over-provisioning the reason. With Dom0 having 48
> VCPUs finally squashed together to 6 pCPUs, my script triggered at the
> second run at the latest.
> With your patch it made 24 iterations before the other bug kicked in.

Okay, I'll prepare an official patch. It might take a few days, as I'm not
in the office until Thursday.


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6              Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions  e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                 Internet: ts.fujitsu.com
D-80807 Muenchen              Company details: ts.fujitsu.com/imprint.html