Re: Race condition with scheduler runqueues

All of lore.kernel.org
 help / color / mirror / Atom feed

From: George Dunlap <george.dunlap@eu.citrix.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Juergen Gross <juergen.gross@ts.fujitsu.com>,
	"Keir (Xen.org)" <keir@xen.org>, Jan Beulich <JBeulich@suse.com>,
	Xen-devel List <xen-devel@lists.xen.org>
Subject: Re: Race condition with scheduler runqueues
Date: Tue, 26 Feb 2013 12:08:14 +0000	[thread overview]
Message-ID: <512CA5AE.4090002@eu.citrix.com> (raw)
In-Reply-To: <51236654.3050709@citrix.com>

On 02/19/2013 11:47 AM, Andrew Cooper wrote:
> On 19/02/13 09:28, Jan Beulich wrote:
>>>>> On 18.02.13 at 19:11, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
>>> Hello,
>>>
>>> Our testing has discovered a crash (pagefault at 0x0000000000000008)
>>> which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
>>> in sched_credit.c (because a static function of the same name also
>>> exists in sched_credit2.c, which confused matters to start with)
>>>
>>> The test case was a loop of localhost migrate of a 1vcpu HVM win8
>>> domain.  The test case itself has passed many times in the past on the
>>> same Xen codebase (Xen-4.1.3), indicating that it is very rare.  There
>>> does not appear to be any relevant changes between the version of Xen in
>>> the test and xen-4.1-testing.
>>>
>>> The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
>>> from dom0, targeting the VCPU of the migrating domain.
>>>
>>> (XEN) Xen call trace:
>>> (XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
>>> (XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
>>> (XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
>>> (XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
>>> (XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
>>> (XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80
>>>
>>> The relevant part of csched_vcpu_sleep() is
>>>
>>>      else if ( __vcpu_on_runq(svc) )
>>>          __runq_remove(svc);
>>>
>>> which disassembles to
>>>
>>> ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
>>> ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
>>> ffff82c480116a07:       75 07                   jne    ffff82c480116a10
>>> <csched_vcpu_sleep+0x40>
>>> ffff82c480116a09:       f3 c3                   repz retq
>>> ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
>>> ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
>>> ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx) #
>>> <- Pagefault here
>>> ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
>>> ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
>>> ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)
>>>
>>> The relevant crash registers from the pagefault are:
>>> rax: 0000000000000000
>>> rdx: 0000000000000000
>>>   r8: ffff83080c89ed90
>>>
>>> If I am reading the code correctly, this means that runq->next is NULL,
>>> so we fail list_empty() and erroneously pass __vcpu_on_runq().  We then
>>> fail with a fault when trying to update runq->prev, which is also NULL.
>>>
>>> The only place I can spot in the code where the runq->{next,prev} could
>>> conceivably be NULL is in csched_alloc_vdata() between the memset() and
>>> INIT_LIST_HEAD().  This is logically sensible in combination with the
>>> localhost migrate loop, and I cant immediately see anything to prevent
>>> this race happening.
>> But that doesn't make sense: csched_alloc_vdata() doesn't store
>> svc into vc->sched_priv; that's being done by the generic
>> scheduler code once the actor returns.
>
> D'oh yes - I overlooked that.
>
>>
>> So I'd rather suspect a stale pointer being used, which is easily
>> possible when racing with sched_move_domain() (as opposed to
>> schedule_cpu_switch(), where the new pointer gets stored
>> _before_ de-allocating the old one).
>
>>
>> However, sched_move_domain() (as well as schedule_cpu_switch())
>> get called only from CPU pools code, and I would guess CPU pools
>> aren't involved here, and you don't in parallel soft offline/online
>> pCPU-s (as I'm sure you otherwise would have mentioned it).
>>
>> But wait - libxl__domain_make() _unconditionally_ calls
>> xc_cpupool_movedomain(), as does XendDomainInfo's
>> _constructDomain(). The reason for this escapes me - Jürgen? Yet
>> I'd expect the pool ID matching check to short cut the resulting
>> sysctl, i.e. sched_move_domain() ought to not be reached anyway
>> (worth verifying of course).
>>
>> The race there nevertheless ought to be fixed.
>>
>> Jan
>
> Our toolstack hooks directly into libxc and is completely ignorant of
> cpupools.  Looking at the crash more closely, it might be a race elsewhere
>
> Another dom0 vcpu is in an HVMOP_track_dirty_vram hypercall, and the
> associated Xen stack trace is
>
> [ffff82c4801777b2] time_calibration_std_rendezvous+0xb2/0xc0
>   ffff82c480172d12  __smp_call_function_interrupt+0x62/0xb0
>   ffff82c48017339e  smp_call_function_interrupt+0x4e/0x90
>   ffff82c48014a65a  call_function_interrupt+0x2a/0x30
>   ffff82c4801223b2  _spin_lock+0x12/0x20
>   ffff82c4801734ab  flush_area_mask+0xcb/0x1c0
>   ffff82c4801c862a  paging_log_dirty_range+0x3a/0x780
>   ffff82c480121ea8  cpumask_raise_softirq+0x78/0x80
>   ffff82c480117ed3  csched_vcpu_wake+0x193/0x420
>   ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
>   ffff82c4801f21c7  hap_track_dirty_vram+0x137/0x1c0
>   ffff82c4801ad3fd  do_hvm_op+0x16dd/0x1f50
>   ffff82c480106251  evtchn_send+0xa1/0x160
>   ffff82c480106d36  do_event_channel_op+0x876/0xfd0
>   ffff82c4801f9027  compat_update_descriptor+0x27/0x30
>   ffff82c4801354f8  compat_multicall+0x198/0x380
>   ffff82c48014a5fa  event_check_interrupt+0x2a/0x30
>   ffff82c4802013c4  compat_hypercall+0x74/0x80
>
> the hap_track_dirty_vram() and paging_log_dirty_range() are part of the
> same logical call trace, but it appears that we have taken an
> event_check_interrupt() in the middle and called schedule() off the
> bottom of it, calling csched_vcpu_wake().
>
> I am currently trying to reason as to whether it is possible for a race
> between csched_vcpu_{sleep,wake}() could result in the seen crash, but
> it certainly looks like a smoking gun.

Any more progress on this one?

In theory all of those should be made mutually exclusive by holding the 
lock of the runqueue on which the VCPU is running.

Any chance there's a race with the assignment of the vcpu -- that is, a 
race in vcpu_schedule_lock() such that someone ends up grabbing the 
wrong lock?

I think that in theory once you call INIT_LIST_HEAD, none of those 
pointers should ever be set to zero; if one ever were it might get 
passed around a bit before actually being followed.  Any chance there's 
something uninitialized somewhere?

And of course, if all else fails, there's good old-fashioned memory 
clobbering as a possibility...

  -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

next prev parent reply	other threads:[~2013-02-26 12:08 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-02-18 18:11 Race condition with scheduler runqueues Andrew Cooper
2013-02-19  9:28 ` Jan Beulich
2013-02-19 11:47   ` Andrew Cooper
2013-02-26 12:08     ` George Dunlap [this message]
2013-02-26 13:44       ` Andrew Cooper
2013-02-27  6:19   ` Juergen Gross
2013-02-27  6:28     ` Juergen Gross
2013-02-27  9:12       ` Jan Beulich

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=512CA5AE.4090002@eu.citrix.com \
    --to=george.dunlap@eu.citrix.com \
    --cc=JBeulich@suse.com \
    --cc=andrew.cooper3@citrix.com \
    --cc=juergen.gross@ts.fujitsu.com \
    --cc=keir@xen.org \
    --cc=xen-devel@lists.xen.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.