From: Juergen Gross <juergen.gross@ts.fujitsu.com>
To: Jan Beulich <JBeulich@suse.com>
Cc: George Dunlap <George.Dunlap@eu.citrix.com>,
Andrew Cooper <andrew.cooper3@citrix.com>,
Keir Fraser <keir@xen.org>,
Xen-devel List <xen-devel@lists.xen.org>
Subject: Re: Race condition with scheduler runqueues
Date: Wed, 27 Feb 2013 07:28:11 +0100 [thread overview]
Message-ID: <512DA77B.90001@ts.fujitsu.com> (raw)
In-Reply-To: <512DA58C.4060604@ts.fujitsu.com>
[-- Attachment #1: Type: text/plain, Size: 4393 bytes --]
On 27.02.2013 07:19, Juergen Gross wrote:
> On 19.02.2013 10:28, Jan Beulich wrote:
>>>>> On 18.02.13 at 19:11, Andrew Cooper<andrew.cooper3@citrix.com> wrote:
>>> Hello,
>>>
>>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>>> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
>>> in sched_credit.c (because a static function of the same name also
>>> exists in sched_credit2.c, which confused matters to start with)
>>>
>>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>>> domain. The test case itself has passed many times in the past on the
>>> same Xen codebase (Xen-4.1.3), indicating that it is very rare. There
>>> does not appear to be any relevant changes between the version of Xen in
>>> the test and xen-4.1-testing.
>>>
>>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>>> from dom0, targeting the VCPU of the migrating domain.
>>>
>>> (XEN) Xen call trace:
>>> (XEN) [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>>> (XEN) 0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>>> (XEN) 12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>>> (XEN) 14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>>> (XEN) 24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>>> (XEN) 64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>>
>>> The relevant part of csched_vcpu_sleep() is
>>>
>>> else if ( __vcpu_on_runq(svc) )
>>> __runq_remove(svc);
>>>
>>> which disassembles to
>>>
>>> ffff82c480116a01: 49 8b 10 mov (%r8),%rdx
>>> ffff82c480116a04: 4c 39 c2 cmp %r8,%rdx
>>> ffff82c480116a07: 75 07 jne ffff82c480116a10
>>> <csched_vcpu_sleep+0x40>
>>> ffff82c480116a09: f3 c3 repz retq
>>> ffff82c480116a0b: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
>>> ffff82c480116a10: 49 8b 40 08 mov 0x8(%r8),%rax
>>> ffff82c480116a14: 48 89 42 08 mov %rax,0x8(%rdx) #
>>> <- Pagefault here
>>> ffff82c480116a18: 48 89 10 mov %rdx,(%rax)
>>> ffff82c480116a1b: 4d 89 40 08 mov %r8,0x8(%r8)
>>> ffff82c480116a1f: 4d 89 00 mov %r8,(%r8)
>>>
>>> The relevant crash registers from the pagefault are:
>>> rax: 0000000000000000
>>> rdx: 0000000000000000
>>> r8: ffff83080c89ed90
>>>
>>> If I am reading the code correctly, this means that runq->next is NULL,
>>> so we fail list_empty() and erroneously pass __vcpu_on_runq(). We then
>>> fail with a fault when trying to update runq->prev, which is also NULL.
>>>
>>> The only place I can spot in the code where the runq->{next,prev} could
>>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>>> INIT_LIST_HEAD(). This is logically sensible in combination with the
>>> localhost migrate loop, and I cant immediately see anything to prevent
>>> this race happening.
>>
>> But that doesn't make sense: csched_alloc_vdata() doesn't store
>> svc into vc->sched_priv; that's being done by the generic
>> scheduler code once the actor returns.
>>
>> So I'd rather suspect a stale pointer being used, which is easily
>> possible when racing with sched_move_domain() (as opposed to
>> schedule_cpu_switch(), where the new pointer gets stored
>> _before_ de-allocating the old one).
>>
>> However, sched_move_domain() (as well as schedule_cpu_switch())
>> get called only from CPU pools code, and I would guess CPU pools
>> aren't involved here, and you don't in parallel soft offline/online
>> pCPU-s (as I'm sure you otherwise would have mentioned it).
>>
>> But wait - libxl__domain_make() _unconditionally_ calls
>> xc_cpupool_movedomain(), as does XendDomainInfo's
>> _constructDomain(). The reason for this escapes me - Jürgen? Yet
>> I'd expect the pool ID matching check to short cut the resulting
>> sysctl, i.e. sched_move_domain() ought to not be reached anyway
>> (worth verifying of course).
>>
>> The race there nevertheless ought to be fixed.
>
> Something like the attached patch?
>
> Not tested thoroughly yet.
Argh. Sent an old version, sorry. This one should be better.
Juergen
--
Juergen Gross Principal Developer Operating Systems
PBG PDG ES&S SWE OS6 Telephone: +49 (0) 89 3222 2967
Fujitsu Technology Solutions e-mail: juergen.gross@ts.fujitsu.com
Domagkstr. 28 Internet: ts.fujitsu.com
D-80807 Muenchen Company details: ts.fujitsu.com/imprint.html
[-- Attachment #2: movedom.patch --]
[-- Type: text/x-patch, Size: 1882 bytes --]
Avoid stale pointer when moving domain to another cpupool
When a domain is moved to another cpupool the scheduler private data pointers
in vcpu and domain structures must never point to an already freed memory
area.
Signed-off-by: Juergen Gross <juergen.gross@ts.fujitsu.com>
diff -r 1d8c65aee03e xen/common/schedule.c
--- a/xen/common/schedule.c Tue Feb 26 10:12:46 2013 +0000
+++ b/xen/common/schedule.c Wed Feb 27 07:16:05 2013 +0100
@@ -231,12 +231,14 @@ int sched_move_domain(struct domain *d,
unsigned int new_p;
void **vcpu_priv;
void *domdata;
+ struct scheduler *old_ops;
+ void *old_domdata;
domdata = SCHED_OP(c->sched, alloc_domdata, d);
if ( domdata == NULL )
return -ENOMEM;
- vcpu_priv = xzalloc_array(void *, d->max_vcpus);
+ vcpu_priv = xzalloc_array(void *, d->max_vcpus * 2);
if ( vcpu_priv == NULL )
{
SCHED_OP(c->sched, free_domdata, domdata);
@@ -257,18 +259,18 @@ int sched_move_domain(struct domain *d,
SCHED_OP(c->sched, free_domdata, domdata);
return -ENOMEM;
}
+ vcpu_priv[d->max_vcpus + v->vcpu_id] = v->sched_priv;
}
domain_pause(d);
+ old_ops = DOM2OP(d);
+ old_domdata = d->sched_priv;
+
for_each_vcpu ( d, v )
{
SCHED_OP(VCPU2OP(v), remove_vcpu, v);
- SCHED_OP(VCPU2OP(v), free_vdata, v->sched_priv);
- v->sched_priv = NULL;
}
-
- SCHED_OP(DOM2OP(d), free_domdata, d->sched_priv);
d->cpupool = c;
d->sched_priv = domdata;
@@ -289,6 +291,13 @@ int sched_move_domain(struct domain *d,
SCHED_OP(c->sched, insert_vcpu, v);
}
+
+ for_each_vcpu ( d, v )
+ {
+ SCHED_OP(old_ops, free_vdata, vcpu_priv[d->max_vcpus + v->vcpu_id]);
+ }
+
+ SCHED_OP(old_ops, free_domdata, old_domdata);
domain_update_node_affinity(d);
[-- Attachment #3: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
next prev parent reply other threads:[~2013-02-27 6:28 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-02-18 18:11 Race condition with scheduler runqueues Andrew Cooper
2013-02-19 9:28 ` Jan Beulich
2013-02-19 11:47 ` Andrew Cooper
2013-02-26 12:08 ` George Dunlap
2013-02-26 13:44 ` Andrew Cooper
2013-02-27 6:19 ` Juergen Gross
2013-02-27 6:28 ` Juergen Gross [this message]
2013-02-27 9:12 ` Jan Beulich
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=512DA77B.90001@ts.fujitsu.com \
--to=juergen.gross@ts.fujitsu.com \
--cc=George.Dunlap@eu.citrix.com \
--cc=JBeulich@suse.com \
--cc=andrew.cooper3@citrix.com \
--cc=keir@xen.org \
--cc=xen-devel@lists.xen.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.