From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juergen Gross
Subject: Re: Hypervisor crash(!) on xl cpupool-numa-split
Date: Thu, 17 Feb 2011 10:11:25 +0100
Message-ID: <4D5CE63D.5040704@ts.fujitsu.com>
References: <4D41FD3A.5090506@amd.com> <4D4A72D8.3020502@ts.fujitsu.com>
 <4D4C08B6.30600@amd.com> <4D4FE7E2.9070605@amd.com>
 <4D4FF452.6060508@ts.fujitsu.com> <4D50D80F.9000007@ts.fujitsu.com>
 <4D517051.10402@amd.com> <4D529BD9.5050200@amd.com>
 <4D52A2CD.9090507@ts.fujitsu.com> <4D5388DF.8040900@ts.fujitsu.com>
 <4D53AF27.7030909@amd.com> <4D53F3BC.4070807@amd.com>
 <4D54D478.9000402@ts.fujitsu.com> <4D54E79E.3000800@amd.com>
 <4D5A29C0.4050702@ts.fujitsu.com> <4D5B9D2B.107@ts.fujitsu.com>
 <4D5CC89C.7020306@ts.fujitsu.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
In-Reply-To: <4D5CC89C.7020306@ts.fujitsu.com>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: George Dunlap
Cc: Andre Przywara, "xen-devel@lists.xensource.com", "Diestelhorst, Stephan"
List-Id: xen-devel@lists.xenproject.org

On 02/17/11 08:05, Juergen Gross wrote:
> On 02/16/11 14:54, George Dunlap wrote:
>> Andre (and Juergen), can you try again with the attached patch?
>>
>> What the patch basically does is try to make "cpu_disable_scheduler()"
>> do what it seems to say it does. :-) Namely, the various
>> scheduler-related interrupts (both per-cpu ticks and the master tick)
>> are part of the scheduler, so disable them before doing anything, and
>> don't enable them until the cpu is really ready to go again.
>>
>> To be precise:
>> * cpu_disable_scheduler() disables ticks
>> * scheduler_cpu_switch() only enables ticks if adding a cpu to a pool,
>> and does it after inserting the idle vcpu
>> * Modify semantics, s.t., {alloc,free}_pdata() don't actually start or
>> stop tickers
>> + Call tick_{resume,suspend} in cpu_{up,down}, respectively
>> * Modify credit1's tick_{suspend,resume} to handle the master ticker
>> as well.
>>
>> With this patch (if dom0 doesn't get wedged due to all 8 vcpus being
>> on one pcpu), I can perform thousands of operations successfully.
>>
>> (NB this is not ready for application yet, I just wanted to check to
>> see if it fixes Andre's problem)

Tried again, this time with the following patch:

diff -r 72470de157ce xen/common/sched_credit.c
--- a/xen/common/sched_credit.c Wed Feb 16 09:49:33 2011 +0000
+++ b/xen/common/sched_credit.c Wed Feb 16 15:09:54 2011 +0100
@@ -1268,7 +1268,8 @@ csched_load_balance(struct csched_privat
         /*
          * Any work over there to steal?
          */
-        speer = csched_runq_steal(peer_cpu, cpu, snext->pri);
+        speer = cpu_isset(peer_cpu, *online) ?
+            csched_runq_steal(peer_cpu, cpu, snext->pri) : NULL;
         pcpu_schedule_unlock(peer_cpu);
         if ( speer != NULL )
         {

Worked without any flaw for 30000 iterations.
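
For readers following the thread, the idea behind the one-line change in the diff can be sketched outside of Xen: only try to steal work from a peer cpu if that cpu is still set in the pool's online mask. Everything below is an illustrative stand-in, not Xen code. The toy_* types and functions are hypothetical, the run queue holds at most one entry, and the "steal only strictly higher priority" rule is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cpu mask: one bit per cpu, up to 64 cpus. */
typedef uint64_t toy_cpumask_t;

/* Hypothetical stand-in for a runnable vcpu. */
struct toy_vcpu {
    int id;
    int pri;
};

/* Hypothetical per-cpu run queue holding at most one waiting vcpu. */
struct toy_runq {
    struct toy_vcpu *head;
};

/* Analogue of the cpu_isset() test used in the patch. */
static bool toy_cpu_isset(unsigned int cpu, toy_cpumask_t mask)
{
    return cpu < 64 && ((mask >> cpu) & 1ULL) != 0;
}

/*
 * Take the queued vcpu from rq if its priority is strictly higher than
 * pri (an assumption of this sketch); otherwise leave the queue alone.
 */
static struct toy_vcpu *toy_runq_steal(struct toy_runq *rq, int pri)
{
    struct toy_vcpu *v = rq->head;

    if (v != NULL && v->pri > pri) {
        rq->head = NULL;
        return v;
    }
    return NULL;
}

/*
 * The guarded form: a peer that is not in the online mask is never
 * touched, so its run queue is left exactly as it was.
 */
static struct toy_vcpu *toy_try_steal(struct toy_runq *runqs,
                                      unsigned int peer_cpu,
                                      toy_cpumask_t online, int pri)
{
    assert(runqs != NULL);
    return toy_cpu_isset(peer_cpu, online)
        ? toy_runq_steal(&runqs[peer_cpu], pri)
        : NULL;
}
```

With a mask that excludes the peer, toy_try_steal() returns NULL and the peer's queue stays untouched; once the peer's bit is set, the same call hands back the queued vcpu.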
Juergen

>
> After some thousand iterations the machine hung, and after dumping Dom0
> registers to console it continued running and crashed about a second later:
>
> (XEN) cpupool_unassign_cpu(pool=0,cpu=9)
> (XEN) cpupool_unassign_cpu(pool=0,cpu=9) ffff83083fff74c0
> (XEN) cpupool_unassign_cpu ret=0
> (XEN) cpupool_unassign_cpu(pool=0,cpu=4)
> (XEN) cpupool_unassign_cpu(pool=0,cpu=4) ffff83083fff74c0
> (XEN) cpupool_unassign_cpu ret=0
> (XEN) cpupool_assign_cpu(pool=1,cpu=9)
> (XEN) cpupool_assign_cpu(pool=1,cpu=9) ffff83083002de40
> (XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:279
> (XEN) ----[ Xen-4.1.0-rc5-pre x86_64 debug=y Tainted: C ]----
> (XEN) CPU: 9
> (XEN) RIP: e008:[] active_timer+0xc/0x37
> (XEN) RFLAGS: 0000000000010046 CONTEXT: hypervisor
> (XEN) rax: 0000000000000000 rbx: 0000000000000000 rcx: 0000000000000000
> (XEN) rdx: ffff830839d8ff18 rsi: 0000010dbb628a80 rdi: ffff83083ffbcf98
> (XEN) rbp: ffff830839d8fd50 rsp: ffff830839d8fd50 r8: ffff83083ffbcf90
> (XEN) r9: ffff82c480213680 r10: 00000000ffffffff r11: 0000000000000010
> (XEN) r12: ffff82c4802d3f80 r13: ffff82c4802d3f80 r14: ffff83083ffbcf98
> (XEN) r15: ffff83083ffbcfc0 cr0: 000000008005003b cr4: 00000000000026f0
> (XEN) cr3: 000000007809c000 cr2: 0000000000620048
> (XEN) ds: 002b es: 002b fs: 0000 gs: 0000 ss: e010 cs: e008
> (XEN) Xen stack trace from rsp=ffff830839d8fd50:
> (XEN) ffff830839d8fda0 ffff82c480126ef9 0000000000000000 0000010dbb628a80
> (XEN) 0000000000000086 0000000000000009 ffff83083002de40 ffff83083002dd50
> (XEN) 0000000000000009 0000000000000009 ffff830839d8fdc0 ffff82c480117906
> (XEN) ffff83083ffa3b40 ffff83083ffa5d70 ffff830839d8fe30 ffff82c4801214fa
> (XEN) ffff83083002dd00 0000000900000100 0000000000000286 ffff8300780da000
> (XEN) ffff83083ffbcf80 ffff83083ffbcf90 ffff82c480247e00 0000000000000009
> (XEN) 00000000fffffff0 ffff83083002dd00 0000000000000000 ffff8300781cc198
> (XEN) ffff830839d8fe60 ffff82c4801019ff 0000000000000009 0000000000000009
> (XEN) ffff8300781cc198 ffff830839d990d0 ffff830839d8fe80 ffff82c480101bd9
> (XEN) ffff83107e80c5b0 ffff8300781cc000 ffff830839d8fea0 ffff82c480104f21
> (XEN) 0000000000000009 ffff830839d990e0 ffff830839d8fee0 ffff82c480125b6c
> (XEN) ffff82c48024a020 ffff830839d8ff18 ffff82c48024a020 ffff830839d8ff18
> (XEN) ffff830839d99060 ffff830839d99040 ffff830839d8ff10 ffff82c48015645a
> (XEN) 0000000000000000 ffff8300780da000 ffff8300780da000 ffffffffffffffff
> (XEN) ffff830839d8fe00 0000000000000000 0000000000000000 0000000000000000
> (XEN) 0000000000000000 ffffffff8062bda0 ffff880fbb1e5fd8 0000000000000246
> (XEN) 0000000000000000 000000010003347d 0000000000000000 0000000000000000
> (XEN) ffffffff800033aa 00000000deadbeef 00000000deadbeef 00000000deadbeef
> (XEN) 0000010000000000 ffffffff800033aa 000000000000e033 0000000000000246
> (XEN) ffff880fbb1e5f08 000000000000e02b 0000000000000000 0000000000000000
> (XEN) Xen call trace:
> (XEN) [] active_timer+0xc/0x37
> (XEN) [] set_timer+0x102/0x218
> (XEN) [] csched_tick_resume+0x53/0x75
> (XEN) [] schedule_cpu_switch+0x1f1/0x25c
> (XEN) [] cpupool_assign_cpu_locked+0x61/0xd6
> (XEN) [] cpupool_assign_cpu_helper+0x9f/0xcd
> (XEN) [] continue_hypercall_tasklet_handler+0x51/0xc3
> (XEN) [] do_tasklet+0xe1/0x155
> (XEN) [] idle_loop+0x5f/0x67
> (XEN)
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 9:
> (XEN) Assertion 'timer->status >= TIMER_STATUS_inactive' failed at timer.c:279
> (XEN) ****************************************
>
>
> Juergen
>

--
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6              Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions  e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                 Internet: ts.fujitsu.com
D-80807 Muenchen              Company details: ts.fujitsu.com/imprint.html