From mboxrd@z Thu Jan 1 00:00:00 1970 From: Juergen Gross Subject: Re: Race condition with scheduler runqueues Date: Wed, 27 Feb 2013 07:28:11 +0100 Message-ID: <512DA77B.90001@ts.fujitsu.com> References: <51226EBB.3030503@citrix.com> <512353E502000078000BF521@nat28.tlf.novell.com> <512DA58C.4060604@ts.fujitsu.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="------------000006070305000104040700" Return-path: In-Reply-To: <512DA58C.4060604@ts.fujitsu.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: Jan Beulich Cc: George Dunlap , Andrew Cooper , Keir Fraser , Xen-devel List List-Id: xen-devel@lists.xenproject.org This is a multi-part message in MIME format. --------------000006070305000104040700 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: quoted-printable On 27.02.2013 07:19, Juergen Gross wrote: > On 19.02.2013 10:28, Jan Beulich wrote: >>>>> On 18.02.13 at 19:11, Andrew Cooper wrot= e: >>> Hello, >>> >>> Our testing has discovered a crash (pagefault at 0x0000000000000008) >>> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep= () >>> in sched_credit.c (because a static function of the same name also >>> exists in sched_credit2.c, which confused matters to start with) >>> >>> The test case was a loop of localhost migrate of a 1vcpu HVM win8 >>> domain. The test case itself has passed many times in the past on the >>> same Xen codebase (Xen-4.1.3), indicating that it is very rare. There >>> does not appear to be any relevant changes between the version of Xen= in >>> the test and xen-4.1-testing. >>> >>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace bel= ow) >>> from dom0, targeting the VCPU of the migrating domain. >>> >>> (XEN) Xen call trace: >>> (XEN) [] csched_vcpu_sleep+0x44/0x70 >>> (XEN) 0[] vcpu_sleep_nosync+0xe7/0x3b0 >>> (XEN) 12[] vcpu_sleep_sync+0x9/0x50 >>> (XEN) 14[] sched_adjust+0xac/0x230 >>> (XEN) 24[] do_domctl+0x731/0x1130 >>> (XEN) 64[] compat_hypercall+0x74/0x80 >>> >>> The relevant part of csched_vcpu_sleep() is >>> >>> else if ( __vcpu_on_runq(svc) ) >>> __runq_remove(svc); >>> >>> which disassembles to >>> >>> ffff82c480116a01: 49 8b 10 mov (%r8),%rdx >>> ffff82c480116a04: 4c 39 c2 cmp %r8,%rdx >>> ffff82c480116a07: 75 07 jne ffff82c480116a10 >>> >>> ffff82c480116a09: f3 c3 repz retq >>> ffff82c480116a0b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1) >>> ffff82c480116a10: 49 8b 40 08 mov 0x8(%r8),%rax >>> ffff82c480116a14: 48 89 42 08 mov %rax,0x8(%rdx) # >>> <- Pagefault here >>> ffff82c480116a18: 48 89 10 mov %rdx,(%rax) >>> ffff82c480116a1b: 4d 89 40 08 mov %r8,0x8(%r8) >>> ffff82c480116a1f: 4d 89 00 mov %r8,(%r8) >>> >>> The relevant crash registers from the pagefault are: >>> rax: 0000000000000000 >>> rdx: 0000000000000000 >>> r8: ffff83080c89ed90 >>> >>> If I am reading the code correctly, this means that runq->next is NUL= L, >>> so we fail list_empty() and erroneously pass __vcpu_on_runq(). We the= n >>> fail with a fault when trying to update runq->prev, which is also NUL= L. >>> >>> The only place I can spot in the code where the runq->{next,prev} cou= ld >>> conceivably be NULL is in csched_alloc_vdata() between the memset() a= nd >>> INIT_LIST_HEAD(). This is logically sensible in combination with the >>> localhost migrate loop, and I cant immediately see anything to preven= t >>> this race happening. >> >> But that doesn't make sense: csched_alloc_vdata() doesn't store >> svc into vc->sched_priv; that's being done by the generic >> scheduler code once the actor returns. >> >> So I'd rather suspect a stale pointer being used, which is easily >> possible when racing with sched_move_domain() (as opposed to >> schedule_cpu_switch(), where the new pointer gets stored >> _before_ de-allocating the old one). >> >> However, sched_move_domain() (as well as schedule_cpu_switch()) >> get called only from CPU pools code, and I would guess CPU pools >> aren't involved here, and you don't in parallel soft offline/online >> pCPU-s (as I'm sure you otherwise would have mentioned it). >> >> But wait - libxl__domain_make() _unconditionally_ calls >> xc_cpupool_movedomain(), as does XendDomainInfo's >> _constructDomain(). The reason for this escapes me - J=FCrgen? Yet >> I'd expect the pool ID matching check to short cut the resulting >> sysctl, i.e. sched_move_domain() ought to not be reached anyway >> (worth verifying of course). >> >> The race there nevertheless ought to be fixed. > > Something like the attached patch? > > Not tested thoroughly yet. Argh. Sent an old version, sorry. This one should be better. Juergen --=20 Juergen Gross Principal Developer Operating Systems PBG PDG ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967 Fujitsu Technology Solutions e-mail: juergen.gross@ts.fujits= u.com Domagkstr. 28 Internet: ts.fujitsu.com D-80807 Muenchen Company details: ts.fujitsu.com/imprint.= html --------------000006070305000104040700 Content-Type: text/x-patch; name="movedom.patch" Content-Transfer-Encoding: 7bit Content-Disposition: attachment; filename="movedom.patch" Avoid stale pointer when moving domain to another cpupool When a domain is moved to another cpupool the scheduler private data pointers in vcpu and domain structures must never point to an already freed memory area. Signed-off-by: Juergen Gross diff -r 1d8c65aee03e xen/common/schedule.c --- a/xen/common/schedule.c Tue Feb 26 10:12:46 2013 +0000 +++ b/xen/common/schedule.c Wed Feb 27 07:16:05 2013 +0100 @@ -231,12 +231,14 @@ int sched_move_domain(struct domain *d, unsigned int new_p; void **vcpu_priv; void *domdata; + struct scheduler *old_ops; + void *old_domdata; domdata = SCHED_OP(c->sched, alloc_domdata, d); if ( domdata == NULL ) return -ENOMEM; - vcpu_priv = xzalloc_array(void *, d->max_vcpus); + vcpu_priv = xzalloc_array(void *, d->max_vcpus * 2); if ( vcpu_priv == NULL ) { SCHED_OP(c->sched, free_domdata, domdata); @@ -257,18 +259,18 @@ int sched_move_domain(struct domain *d, SCHED_OP(c->sched, free_domdata, domdata); return -ENOMEM; } + vcpu_priv[d->max_vcpus + v->vcpu_id] = v->sched_priv; } domain_pause(d); + old_ops = DOM2OP(d); + old_domdata = d->sched_priv; + for_each_vcpu ( d, v ) { SCHED_OP(VCPU2OP(v), remove_vcpu, v); - SCHED_OP(VCPU2OP(v), free_vdata, v->sched_priv); - v->sched_priv = NULL; } - - SCHED_OP(DOM2OP(d), free_domdata, d->sched_priv); d->cpupool = c; d->sched_priv = domdata; @@ -289,6 +291,13 @@ int sched_move_domain(struct domain *d, SCHED_OP(c->sched, insert_vcpu, v); } + + for_each_vcpu ( d, v ) + { + SCHED_OP(old_ops, free_vdata, vcpu_priv[d->max_vcpus + v->vcpu_id]); + } + + SCHED_OP(old_ops, free_domdata, old_domdata); domain_update_node_affinity(d); --------------000006070305000104040700 Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --------------000006070305000104040700--