From mboxrd@z Thu Jan  1 00:00:00 1970
From: Juergen Gross
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Fri, 11 Feb 2011 07:17:28 +0100
Message-ID: <4D54D478.9000402@ts.fujitsu.com>
References: <4D41FD3A.5090506@amd.com>
 <201102021539.06664.stephan.diestelhorst@amd.com>
 <4D4974D1.1080503@ts.fujitsu.com>
 <201102021701.05665.stephan.diestelhorst@amd.com>
 <4D4A43B7.5040707@ts.fujitsu.com>
 <4D4A72D8.3020502@ts.fujitsu.com>
 <4D4C08B6.30600@amd.com>
 <4D4FE7E2.9070605@amd.com>
 <4D4FF452.6060508@ts.fujitsu.com>
 <4D50D80F.9000007@ts.fujitsu.com>
 <4D517051.10402@amd.com>
 <4D529BD9.5050200@amd.com>
 <4D52A2CD.9090507@ts.fujitsu.com>
 <4D5388DF.8040900@ts.fujitsu.com>
 <4D53AF27.7030909@amd.com>
 <4D53F3BC.4070807@amd.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <4D53F3BC.4070807@amd.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Andre Przywara
Cc: George Dunlap, "xen-devel@lists.xensource.com", "Diestelhorst, Stephan"
List-Id: xen-devel@lists.xenproject.org

On 02/10/11 15:18, Andre Przywara wrote:
> Andre Przywara wrote:
>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>> Andre, George,
>>>>
>>>>
>>>> What seems to be interesting: I think the problem did always occur when
>>>> a new cpupool was created and the first cpu was moved to it.
>>>>
>>>> I think my previous assumption regarding the master_ticker was not
>>>> too bad.
>>>> I think somehow the master_ticker of the new cpupool is becoming active
>>>> before the scheduler is really initialized properly. This could
>>>> happen, if enough time is spent between alloc_pdata for the cpu to be
>>>> moved and the critical section in schedule_cpu_switch().
>>>>
>>>> The solution should be to activate the timers only if the scheduler is
>>>> ready for them.
>>>>
>>>> George, do you think the master_ticker should be stopped in
>>>> suspend_ticker as well? I still see potential problems for entering
>>>> deep C-States. I think I'll prepare a patch which will keep the
>>>> master_ticker active for the C-State case and migrate it for the
>>>> schedule_cpu_switch() case.
>>> Okay, here is a patch for this. It ran on my 4-core machine without any
>>> problems.
>>> Andre, could you give it a try?
>> Did, but unfortunately it crashed as always. Tried twice and made sure
>> I booted the right kernel. Sorry.
>> The idea with the race between the timer and the state changing
>> sounded very appealing, actually that was suspicious to me from the
>> beginning.
>>
>> I will add some code to dump the state of all cpupools to the BUG_ON
>> to see in which situation we are when the bug triggers.
> OK, here is a first try of this, the patch iterates over all CPU pools
> and outputs some data if the BUG_ON
> ((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
> triggers:
> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
> (XEN) Xen BUG at sched_credit.c:1010
> ....
> The masks look proper (6 cores per node), the bug triggers when the
> first CPU is about to be(?) inserted.

Sure? I'm missing the cpu with mask 2000.
I'll try to reproduce the problem on a larger machine here (24 cores, 4 numa
nodes).
Andre, can you give me your xen boot parameters? Which xen changeset are you
running, and do you have any additional patches in use?


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
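
For reference, the point about the missing mask 2000 can be checked by decoding the three pool masks from the quoted (XEN) output into CPU numbers. The following is only a small sketch, not part of any patch in this thread: the helper names are made up for illustration, and the total of 48 CPUs is an assumption inferred from the width of the pool #0 mask.

```python
# Masks copied from the (XEN) debug output quoted in this mail.
POOL_MASKS = {
    0: 0xfffffffc003f,
    1: 0xfc0,
    2: 0x1000,
}

# Assumption: 48 CPUs in total, inferred from the width of the pool 0 mask.
NR_CPUS = 48


def cpus_in_mask(mask):
    """Return the CPU numbers whose bit is set in mask (hypothetical helper)."""
    return [cpu for cpu in range(mask.bit_length()) if mask & (1 << cpu)]


def unassigned_cpus(pool_masks, nr_cpus):
    """Return the CPUs below nr_cpus that belong to no pool (hypothetical helper)."""
    assigned = 0
    for mask in pool_masks.values():
        assigned |= mask
    return [cpu for cpu in range(nr_cpus) if not assigned & (1 << cpu)]


if __name__ == "__main__":
    for pool, mask in sorted(POOL_MASKS.items()):
        print("pool %d: %s" % (pool, cpus_in_mask(mask)))
    print("unassigned:", unassigned_cpus(POOL_MASKS, NR_CPUS))
```

Under these assumptions pool #0 holds CPUs 0-5 and 18-47, pool #1 holds CPUs 6-11, pool #2 holds only CPU 12, and CPUs 13-17 are in no pool at all. CPU 13 is the one that would carry mask 2000.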