From mboxrd@z Thu Jan 1 00:00:00 1970
From: Juergen Gross
Subject: Re: [xen-unstable test] 6374: regressions - FAIL
Date: Mon, 14 Mar 2011 15:40:27 +0100
Message-ID: <4D7E28DB.6080005@ts.fujitsu.com>
References: <19834.24888.630582.491364@mariner.uk.xensource.com> <4D7DFD130200007800036344@vpn.id2.novell.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <4D7DFD130200007800036344@vpn.id2.novell.com>
List-Unsubscribe: ,
List-Post:
List-Help:
List-Subscribe: ,
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Jan Beulich
Cc: xen-devel@lists.xensource.com, Ian Jackson
List-Id: xen-devel@lists.xenproject.org

On 03/14/11 11:33, Jan Beulich wrote:
>>>> On 11.03.11 at 18:51, Ian Jackson wrote:
>> xen.org writes ("[Xen-devel] [xen-unstable test] 6374: regressions - FAIL"):
>>> flight 6374 xen-unstable real [real]
>>> Tests which did not succeed and are blocking:
>>> test-amd64-i386-pv 5 xen-boot fail REGR. vs. 6369
>>
>> Xen crash in scheduler (non-credit2).
>>
>> Mar 11 13:46:53.646796 (XEN) Watchdog timer detects that CPU1 is stuck!
>> Mar 11 13:46:57.922794 (XEN) ----[ Xen-4.1.0-rc7-pre x86_64 debug=y Not tainted ]----
>> Mar 11 13:46:57.931763 (XEN) CPU: 1
>> Mar 11 13:46:57.931784 (XEN) RIP: e008:[] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:57.931817 (XEN) RFLAGS: 0000000000000047 CONTEXT: hypervisor
>> Mar 11 13:46:57.946773 (XEN) rax: ffff82c4802d1ac0 rbx: ffff8301a7fafc78 rcx: 0000000000000002
>> Mar 11 13:46:57.946813 (XEN) rdx: ffff82c4802d0cc0 rsi: 0000000000000080 rdi: ffff8301a7fafc78
>> Mar 11 13:46:57.954780 (XEN) rbp: ffff8301a7fafcb8 rsp: ffff8301a7fafc00 r8: 0000000000000002
>> Mar 11 13:46:57.966770 (XEN) r9: 0000ffff0000ffff r10: 00ff00ff00ff00ff r11: 0f0f0f0f0f0f0f0f
>> Mar 11 13:46:57.966805 (XEN) r12: ffff8301a7fafc68 r13: 0000000000000001 r14: 0000000000000001
>> Mar 11 13:46:57.975780 (XEN) r15: ffff82c4802d1ac0 cr0: 000000008005003b cr4: 00000000000006f0
>> Mar 11 13:46:57.987771 (XEN) cr3: 00000000d7c9c000 cr2: 00000000c45e5770
>> Mar 11 13:46:57.987800 (XEN) ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0000 cs: e008
>> Mar 11 13:46:57.998773 (XEN) Xen stack trace from rsp=ffff8301a7fafc00:
>> ...
>> Mar 11 13:46:58.154777 (XEN) Xen call trace:
>> Mar 11 13:46:58.154798 (XEN) [] __bitmap_empty+0x0/0x7f
>> Mar 11 13:46:58.163767 (XEN) [] csched_cpu_pick+0xe/0x10
>> Mar 11 13:46:58.163802 (XEN) [] vcpu_migrate+0xfb/0x230
>> Mar 11 13:46:58.178768 (XEN) [] context_saved+0x62/0x7b
>> Mar 11 13:46:58.178799 (XEN) [] context_switch+0xd98/0xdca
>> Mar 11 13:46:58.183766 (XEN) [] schedule+0x5fc/0x624
>> Mar 11 13:46:58.183795 (XEN) [] __do_softirq+0x88/0x99
>> Mar 11 13:46:58.198784 (XEN) [] do_softirq+0x6a/0x7a
>
> I suppose that's a result of 22957:c5c4688d5654 - as I understand it
> exiting the loop is only possible if two consecutive invocations of
> pick_cpu return the same result. This, however, is precisely what the
> pCPU's idle_bias is supposed to prevent on hyper-threaded/multi-core
> systems (so that it's not always the same entity that gets selected).
>
> But even beyond that particular aspect, relying on any form of
> "stability" of the returned value isn't correct.
>
> Plus running pick_cpu repeatedly without actually using its result
> is wrong wrt idle_bias updating too - that's why
> csched_vcpu_acct() calls _csched_cpu_pick() with the commit
> argument set to false (which will result in a subsequent call -
> through pick_cpu - with the argument set to true to be likely
> to return the same value, but there's no correctness dependency
> on that). So 22948:2d35823a86e7 already wasn't really correct
> in putting a loop around pick_cpu.
>
> It's also not clear to me what the surrounding
> if ( old_lock == per_cpu(schedule_data, old_cpu).schedule_lock )
> is supposed to filter, as the lock pointer gets set only when a
> CPU gets brought up.

Yeah, but the vcpu can change cpus while we don't hold the lock.
This means old_cpu can change between selecting the lock and actually
taking it...

> As I don't really understand what is being tried to achieve here,
> I also can't really suggest a possible fix other than reverting both
> offending changesets.

I'll send a patch as a suggestion :-)


Juergen

-- 
Juergen Gross                 Principal Developer Operating Systems
TSP ES&S SWE OS6                       Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions              e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28                           Internet: ts.fujitsu.com
D-80807 Muenchen                 Company details: ts.fujitsu.com/imprint.html
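
For illustration only: the race mentioned in the reply (old_cpu changing between selecting the lock and actually taking it) is commonly handled by re-checking the CPU once the lock is held and retrying on a mismatch. The following is a minimal sketch in plain C11, not the Xen code and not the patch referred to above. The names cpu_lock[], struct vcpu, spin_lock(), spin_unlock(), vcpu_lock_current_cpu() and vcpu_unlock_cpu() are hypothetical stand-ins, and the sketch assumes that a vcpu's processor field is only ever changed by someone holding the lock of the CPU it currently names.

```c
#include <stdatomic.h>

#define NR_CPUS 4

/* Hypothetical stand-in for the per-CPU scheduler lock. */
static atomic_flag cpu_lock[NR_CPUS] = {
    ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT, ATOMIC_FLAG_INIT
};

/* Hypothetical, heavily reduced vcpu: only the CPU it is assigned to. */
struct vcpu {
    atomic_int processor;
};

/* Simple test-and-set spinlock helpers (illustrative only). */
static void spin_lock(atomic_flag *l)
{
    while ( atomic_flag_test_and_set_explicit(l, memory_order_acquire) )
        ; /* spin until the previous holder clears the flag */
}

static void spin_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/*
 * Take the lock of the CPU the vcpu is currently assigned to.
 * The CPU is sampled without any lock held, so it may be stale by the
 * time the lock is acquired; re-check and retry in that case.
 * Returns the CPU whose lock is held on return.
 */
static int vcpu_lock_current_cpu(struct vcpu *v)
{
    for ( ; ; )
    {
        int cpu = atomic_load(&v->processor);  /* unlocked sample */

        spin_lock(&cpu_lock[cpu]);
        if ( cpu == atomic_load(&v->processor) )
            return cpu;                        /* still valid: keep lock */
        spin_unlock(&cpu_lock[cpu]);           /* vcpu moved: try again */
    }
}

static void vcpu_unlock_cpu(int cpu)
{
    spin_unlock(&cpu_lock[cpu]);
}
```

Under the stated assumption, once the check inside the loop succeeds the vcpu cannot move to another CPU until vcpu_unlock_cpu() is called, so the retry loop terminates as soon as one sample survives the lock acquisition. Unlike a loop around pick_cpu, it does not depend on a placement decision returning the same value twice.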