From: Andre Przywara
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Mon, 21 Feb 2011 15:45:03 +0100
Message-ID: <4D627A6F.5070105@amd.com>
In-Reply-To: <4D62666C.6010608@ts.fujitsu.com>
To: Juergen Gross
Cc: George Dunlap, xen-devel@lists.xensource.com, "Diestelhorst, Stephan"
List-Id: xen-devel@lists.xenproject.org

Juergen Gross wrote:
> On 02/21/11 11:00, Andre Przywara wrote:
>> George Dunlap wrote:
>>> Andre (and Juergen), can you try again with the attached patch?
>> I applied this patch on top of 22931 and it did _not_ work.
>> The crash occurred almost immediately after I started my script, so the
>> same behaviour as without the patch.
>
> Did you try my patch addressing races in the scheduler when moving cpus
> between cpupools?
Sorry, I tried yours first, but it didn't apply cleanly on my particular
tree (sched_jg_fix ;-). So I tested George's first.

> I've attached it again. For me it works quite well, while George's patch
> seems not to be enough (machine hanging after some tests with cpupools).
OK, it now applied after a rebase. And yes, I didn't see a crash!
At least not until the script stopped while a lot of these messages appeared:
(XEN) do_IRQ: 0.89 No irq handler for vector (irq -1)
That is what I reported before and is most probably totally unrelated to
this issue. So I consider this fix working!
I will try to match my recent theories and debug results with your patch
to see whether this fits.

> OTOH I can't reproduce an error as fast as you even without any patch :-)
>
>> (attached my script for reference, though it will most likely only make
>> sense on bigger NUMA machines)
>
> Yeah, on my 2-node system I need several hundred tries to get an error.
> But it seems to be more effective than George's script.
I consider the large over-provisioning the reason. With Dom0's 48 VCPUs
eventually squashed together onto 6 pCPUs, my script triggered the crash
by the second run at the latest. With your patch it made 24 iterations
before the other bug kicked in.

Thanks very much!
Andre.

>
> Juergen
>
>> Regards,
>> Andre.
>>
>>> What the patch basically does is try to make "cpu_disable_scheduler()"
>>> do what it seems to say it does. :-) Namely, the various
>>> scheduler-related interrupts (both the per-cpu ticks and the master
>>> tick) are part of the scheduler, so disable them before doing anything,
>>> and don't enable them until the cpu is really ready to go again.
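If I read that right, the intended ordering is roughly the following. This
is only an illustrative sketch in plain C, not the actual patch; the
structure and helper names are invented for the example, and the precise
list of changes follows below:

/* Sketch only (not Xen code, names made up): every scheduler timer of a
 * CPU is stopped before its per-CPU scheduler data is touched, and is
 * restarted only after the receiving pool's scheduler has its idle vcpu
 * in place, so no tick can fire while the data is inconsistent. */
#include <stdbool.h>
#include <stdio.h>

struct cpu_ticks {
    int cpu;
    bool percpu_tick;   /* per-CPU accounting tick           */
    bool master_tick;   /* pool-wide master ticker (credit1) */
};

static void ticks_suspend(struct cpu_ticks *t)
{
    t->percpu_tick = false;
    t->master_tick = false;
    printf("cpu%d: all tickers stopped\n", t->cpu);
}

static void ticks_resume(struct cpu_ticks *t)
{
    t->percpu_tick = true;
    t->master_tick = true;
    printf("cpu%d: tickers running again\n", t->cpu);
}

/* Move a CPU from one pool to another with the timers quiescent. */
static void move_cpu_between_pools(struct cpu_ticks *t)
{
    ticks_suspend(t);   /* 1. nothing may tick from here on          */
    /* ... free the old pool's per-CPU data, allocate the new        */
    /*     pool's, insert the idle vcpu for this CPU ...             */
    ticks_resume(t);    /* 2. scheduler state is consistent again    */
}

int main(void)
{
    struct cpu_ticks t = { .cpu = 4, .percpu_tick = true, .master_tick = true };
    move_cpu_between_pools(&t);
    return 0;
}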
>>>
>>> To be precise:
>>> * cpu_disable_scheduler() disables ticks
>>> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
>>>   and does it after inserting the idle vcpu
>>> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
>>>   stop tickers
>>>   + Call tick_{resume,suspend} in cpu_{up,down}, respectively
>>> * Modify credit1's tick_{suspend,resume} to handle the master ticker
>>>   as well.
>>>
>>> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
>>> on one pcpu), I can perform thousands of operations successfully.
>>>
>>> (NB this is not ready for application yet, I just wanted to check to
>>> see if it fixes Andre's problem)
>>>
>>> -George
>>>
>>> On Wed, Feb 16, 2011 at 9:47 AM, Juergen Gross wrote:
>>>> Okay, I have some more data.
>>>>
>>>> I activated cpupool_dprintk() and included checks in sched_credit.c to
>>>> test for weight inconsistencies. To reduce race possibilities I've added
>>>> my patch to execute cpu assigning/unassigning always in a tasklet on the
>>>> cpu to be moved.
>>>>
>>>> Here is the result:
>>>>
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6)
>>>> (XEN) cpupool_unassign_cpu(pool=0,cpu=6) ret -16
>>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1)
>>>> (XEN) cpupool_assign_cpu(pool=0,cpu=1) ffff83083fff74c0
>>>> (XEN) cpupool_assign_cpu(cpu=1) ret 0
>>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4)
>>>> (XEN) cpupool_assign_cpu(pool=1,cpu=4) ffff831002ad5e40
>>>> (XEN) cpupool_assign_cpu(cpu=4) ret 0
>>>> (XEN) cpu 4, weight 0, prv ffff831002ad5e40, dom 0:
>>>> (XEN) sdom->weight: 256, sdom->active_vcpu_count: 1
>>>> (XEN) Xen BUG at sched_credit.c:570
>>>> (XEN) ----[ Xen-4.1.0-rc5-pre  x86_64  debug=y  Tainted: C ]----
>>>> (XEN) CPU:    4
>>>> (XEN) RIP:    e008:[] csched_tick+0x186/0x37f
>>>> (XEN) RFLAGS: 0000000000010086   CONTEXT: hypervisor
>>>> (XEN) rax: 0000000000000000   rbx: ffff830839d3ec30   rcx: 0000000000000000
>>>> (XEN) rdx: ffff830839dcff18   rsi: 000000000000000a   rdi: ffff82c4802542e8
>>>> (XEN) rbp: ffff830839dcfe38   rsp: ffff830839dcfde8   r8:  0000000000000004
>>>> (XEN) r9:  ffff82c480213520   r10: 00000000fffffffc   r11: 0000000000000001
>>>> (XEN) r12: 0000000000000004   r13: ffff830839d3ec40   r14: ffff831002ad5e40
>>>> (XEN) r15: ffff830839d66f90   cr0: 000000008005003b   cr4: 00000000000026f0
>>>> (XEN) cr3: 0000001020a98000   cr2: 00007fc5e9b79d98
>>>> (XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e010   cs: e008
>>>> (XEN) Xen stack trace from rsp=ffff830839dcfde8:
>>>> (XEN)  ffff83083ffa3ba0 ffff831002ad5e40 0000000000000246 ffff830839d6c000
>>>> (XEN)  0000000000000000 ffff830839dd1100 0000000000000004 ffff82c480119651
>>>> (XEN)  ffff831002b28018 ffff831002b28010 ffff830839dcfe68 ffff82c480126204
>>>> (XEN)  0000000000000002 ffff83083ffa3bb8 ffff830839dd1100 000000cae439ea7e
>>>> (XEN)  ffff830839dcfeb8 ffff82c480126539 00007fc5e9fa5b20 ffff830839dd1100
>>>> (XEN)  ffff831002b28010 0000000000000004 0000000000000004 ffff82c4802b0880
>>>> (XEN)  ffff830839dcff18 ffffffffffffffff ffff830839dcfef8 ffff82c480123647
>>>> (XEN)  ffff830839dcfed8 ffff830077eee000 00007fc5e9b79d98 00007fc5e9fa5b20
>>>> (XEN)  0000000000000002 00007fff46826f20 ffff830839dcff08 ffff82c4801236c2
>>>> (XEN)  00007cf7c62300c7 ffff82c480206ad6 00007fff46826f20 0000000000000002
>>>> (XEN)  00007fc5e9fa5b20 00007fc5e9b79d98 00007fff46827260 00007fff46826f50
>>>> (XEN)  0000000000000246 0000000000000032 0000000000000000 00000000ffffffff
>>>> (XEN)  0000000000000009 00007fc5e9d9de1a 0000000000000003 0000000000004848
>>>> (XEN)  00007fc5e9b7a000 0000010000000000 ffffffff800073f0 000000000000e033
>>>> (XEN)  0000000000000246 ffff880f97b51fc8 000000000000e02b 0000000000000000
>>>> (XEN)  0000000000000000 0000000000000000 0000000000000000 0000000000000004
>>>> (XEN)  ffff830077eee000 00000043b9afd180 0000000000000000
>>>> (XEN) Xen call trace:
>>>> (XEN)    [] csched_tick+0x186/0x37f
>>>> (XEN)    [] execute_timer+0x4e/0x6c
>>>> (XEN)    [] timer_softirq_action+0xf6/0x239
>>>> (XEN)    [] __do_softirq+0x88/0x99
>>>> (XEN)    [] do_softirq+0x6a/0x7a
>>>> (XEN)
>>>> (XEN)
>>>> (XEN) ****************************************
>>>> (XEN) Panic on CPU 4:
>>>> (XEN) Xen BUG at sched_credit.c:570
>>>> (XEN) ****************************************
>>>>
>>>> As you can see, a Dom0 vcpu is becoming active on a pool 1 cpu. The
>>>> BUG_ON triggered in csched_acct() is a logical result of this.
>>>>
>>>> How this can happen I don't know yet.
>>>> Anyone any idea? I'll keep searching...
>>>>
>>>> Juergen
>>>>
>>>> On 02/15/11 08:22, Juergen Gross wrote:
>>>>> On 02/14/11 18:57, George Dunlap wrote:
>>>>>> The good news is, I've managed to reproduce this on my local test
>>>>>> hardware with 1x4x2 (1 socket, 4 cores, 2 threads per core) using the
>>>>>> attached script. It's time to go home now, but I should be able to
>>>>>> dig something up tomorrow.
>>>>>>
>>>>>> To use the script:
>>>>>> * Rename cpupool0 to "p0", and create an empty second pool, "p1"
>>>>>> * You can modify elements by adding "arg=val" as arguments.
>>>>>> * Arguments are:
>>>>>>   + dryrun={true,false} Do the work, but don't actually execute any xl
>>>>>>     arguments. Default false.
>>>>>>   + left: Number of commands to execute. Default 10.
>>>>>>   + maxcpus: highest numerical value for a cpu. Default 7 (i.e., 0-7 is
>>>>>>     8 cpus).
>>>>>>   + verbose={true,false} Print what you're doing. Default is true.
>>>>>>
>>>>>> The script sometimes attempts to remove the last cpu from cpupool0; in
>>>>>> this case, libxl will print an error. If the script gets an error
>>>>>> under that condition, it will ignore it; under any other condition, it
>>>>>> will print diagnostic information.
>>>>>>
>>>>>> What finally crashed it for me was this command:
>>>>>> # ./cpupool-test.sh verbose=false left=1000
>>>>> Nice!
>>>>> With your script I finally managed to get the error, too. On my box
>>>>> (2 sockets with 6 cores each) I had to use
>>>>>
>>>>> ./cpupool-test.sh verbose=false left=10000 maxcpus=11
>>>>>
>>>>> to trigger it.
>>>>> Looking for more data now...
>>>>>
>>>>> Juergen
>>>>>
>>>>>> -George
>>>>>>
>>>>>> On Fri, Feb 11, 2011 at 7:39 AM, Andre Przywara wrote:
>>>>>>> Juergen Gross wrote:
>>>>>>>> On 02/10/11 15:18, Andre Przywara wrote:
>>>>>>>>> Andre Przywara wrote:
>>>>>>>>>> On 02/10/2011 07:42 AM, Juergen Gross wrote:
>>>>>>>>>>> On 02/09/11 15:21, Juergen Gross wrote:
>>>>>>>>>>>> Andre, George,
>>>>>>>>>>>>
>>>>>>>>>>>> What seems to be interesting: I think the problem did always
>>>>>>>>>>>> occur when a new cpupool was created and the first cpu was
>>>>>>>>>>>> moved to it.
>>>>>>>>>>>>
>>>>>>>>>>>> I think my previous assumption regarding the master_ticker
>>>>>>>>>>>> was not too bad.
>>>>>>>>>>>> I think somehow the master_ticker of the new cpupool is
>>>>>>>>>>>> becoming active before the scheduler is really initialized
>>>>>>>>>>>> properly.
>>>>>>>>>>>> This could happen if enough time is spent between alloc_pdata
>>>>>>>>>>>> for the cpu to be moved and the critical section in
>>>>>>>>>>>> schedule_cpu_switch().
>>>>>>>>>>>>
>>>>>>>>>>>> The solution should be to activate the timers only if the
>>>>>>>>>>>> scheduler is ready for them.
>>>>>>>>>>>>
>>>>>>>>>>>> George, do you think the master_ticker should be stopped in
>>>>>>>>>>>> suspend_ticker as well? I still see potential problems for
>>>>>>>>>>>> entering deep C-States. I think I'll prepare a patch which will
>>>>>>>>>>>> keep the master_ticker active for the C-State case and migrate
>>>>>>>>>>>> it for the schedule_cpu_switch() case.
>>>>>>>>>>> Okay, here is a patch for this. It ran on my 4-core machine
>>>>>>>>>>> without any problems.
>>>>>>>>>>> Andre, could you give it a try?
>>>>>>>>>> Did, but unfortunately it crashed as always. Tried twice and made
>>>>>>>>>> sure I booted the right kernel. Sorry.
>>>>>>>>>> The idea with the race between the timer and the state changing
>>>>>>>>>> sounded very appealing; actually that was suspicious to me from
>>>>>>>>>> the beginning.
>>>>>>>>>>
>>>>>>>>>> I will add some code to dump the state of all cpupools to the
>>>>>>>>>> BUG_ON to see in which situation we are when the bug triggers.
>>>>>>>>> OK, here is a first try of this. The patch iterates over all CPU
>>>>>>>>> pools and outputs some data if the BUG_ON
>>>>>>>>> ((sdom->weight * sdom->active_vcpu_count) > weight_left) condition
>>>>>>>>> triggers:
>>>>>>>>> (XEN) CPU pool #0: 1 domains (SMP Credit Scheduler), mask: fffffffc003f
>>>>>>>>> (XEN) CPU pool #1: 0 domains (SMP Credit Scheduler), mask: fc0
>>>>>>>>> (XEN) CPU pool #2: 0 domains (SMP Credit Scheduler), mask: 1000
>>>>>>>>> (XEN) Xen BUG at sched_credit.c:1010
>>>>>>>>> ....
>>>>>>>>> The masks look proper (6 cores per node), the bug triggers when the
>>>>>>>>> first CPU is about to be(?) inserted.
>>>>>>>> Sure? I'm missing the cpu with mask 2000.
>>>>>>>> I'll try to reproduce the problem on a larger machine here (24
>>>>>>>> cores, 4 numa nodes).
>>>>>>>> Andre, can you give me your xen boot parameters? Which xen changeset
>>>>>>>> are you running, and do you have any additional patches in use?
>>>>>>> The grub lines:
>>>>>>> kernel (hd1,0)/boot/xen-22858_debug_04.gz console=com1,vga com1=115200
>>>>>>> module (hd1,0)/boot/vmlinuz-2.6.32.27_pvops console=tty0
>>>>>>> console=ttyS0,115200 ro root=/dev/sdb1 xencons=hvc0
>>>>>>>
>>>>>>> All of my experiments use c/s 22858 as a base.
>>>>>>> If you use an AMD Magny-Cours box for your experiments (socket C32 or
>>>>>>> G34), you should add the following patch (removing the line):
>>>>>>> --- a/xen/arch/x86/traps.c
>>>>>>> +++ b/xen/arch/x86/traps.c
>>>>>>> @@ -803,7 +803,6 @@ static void pv_cpuid(struct cpu_user_regs *regs)
>>>>>>>          __clear_bit(X86_FEATURE_SKINIT % 32, &c);
>>>>>>>          __clear_bit(X86_FEATURE_WDT % 32, &c);
>>>>>>>          __clear_bit(X86_FEATURE_LWP % 32, &c);
>>>>>>> -        __clear_bit(X86_FEATURE_NODEID_MSR % 32, &c);
>>>>>>>          __clear_bit(X86_FEATURE_TOPOEXT % 32, &c);
>>>>>>>          break;
>>>>>>>      case 5: /* MONITOR/MWAIT */
>>>>>>>
>>>>>>> This is not necessary (in fact it reverts my patch c/s 22815), but it
>>>>>>> raises the probability to trigger the bug, probably because it
>>>>>>> increases the pressure on the Dom0 scheduler.
>>>>>>> If you cannot trigger it with Dom0, try to create a guest with many
>>>>>>> VCPUs and squeeze it into a small CPU-pool.
>>>>>>>
>>>>>>> Good luck ;-)
>>>>>>> Andre.
>>>>>>>
>>>>>>> --
>>>>>>> Andre Przywara
>>>>>>> AMD-OSRC (Dresden)
>>>>>>> Tel: x29712
>
> --
> Juergen Gross                 Principal Developer Operating Systems
> TSP ES&S SWE OS6              Telephone: +49 (0) 89 3222 2967
> Fujitsu Technology Solutions  e-mail: juergen.gross@ts.fujitsu.com
> Domagkstr. 28                 Internet: ts.fujitsu.com
> D-80807 Muenchen              Company details: ts.fujitsu.com/imprint.html

--
Andre Przywara
AMD-OSRC (Dresden)
Tel: x29712