From mboxrd@z Thu Jan  1 00:00:00 1970
From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: Race condition with scheduler runqueues
Date: Mon, 18 Feb 2013 18:11:07 +0000
Message-ID: <51226EBB.3030503@citrix.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xen.org>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: Xen-devel List <xen-devel@lists.xen.org>, George Dunlap <George.Dunlap@eu.citrix.com>, Keir Fraser <keir@xen.org>, Jan Beulich <JBeulich@suse.com>
List-Id: xen-devel@lists.xenproject.org

Hello,

Our testing has discovered a crash (pagefault at 0x0000000000000008)
which I have tracked down to bad __runq_remove() in csched_vcpu_sleep()
in sched_credit.c (because a static function of the same name also
exists in sched_credit2.c, which confused matters to start with)

The test case was a loop of localhost migrate of a 1vcpu HVM win8
domain.  The test case itself has passed many times in the past on the
same Xen codebase (Xen-4.1.3), indicating that it is very rare.  There
does not appear to be any relevant changes between the version of Xen in
the test and xen-4.1-testing.

The failure itself is because of a XEN_DOMCTL_scheduler_op (trace below)
from dom0, targeting the VCPU of the migrating domain.

(XEN) Xen call trace:
(XEN)       [<ffff82c480116a14>] csched_vcpu_sleep+0x44/0x70
(XEN)      0[<ffff82c480120117>] vcpu_sleep_nosync+0xe7/0x3b0
(XEN)     12[<ffff82c4801203e9>] vcpu_sleep_sync+0x9/0x50
(XEN)     14[<ffff82c48011fd4c>] sched_adjust+0xac/0x230
(XEN)     24[<ffff82c480102bc1>] do_domctl+0x731/0x1130
(XEN)     64[<ffff82c4802013c4>] compat_hypercall+0x74/0x80

The relevant part of csched_vcpu_sleep() is

    else if ( __vcpu_on_runq(svc) )
        __runq_remove(svc);

which disassembles to

ffff82c480116a01:       49 8b 10                mov    (%r8),%rdx
ffff82c480116a04:       4c 39 c2                cmp    %r8,%rdx
ffff82c480116a07:       75 07                   jne    ffff82c480116a10
<csched_vcpu_sleep+0x40>
ffff82c480116a09:       f3 c3                   repz retq
ffff82c480116a0b:       0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
ffff82c480116a10:       49 8b 40 08             mov    0x8(%r8),%rax
ffff82c480116a14:       48 89 42 08             mov    %rax,0x8(%rdx) #
<- Pagefault here
ffff82c480116a18:       48 89 10                mov    %rdx,(%rax)
ffff82c480116a1b:       4d 89 40 08             mov    %r8,0x8(%r8)
ffff82c480116a1f:       4d 89 00                mov    %r8,(%r8)

The relevant crash registers from the pagefault are:
rax: 0000000000000000
rdx: 0000000000000000
 r8: ffff83080c89ed90

If I am reading the code correctly, this means that runq->next is NULL,
so we fail list_empty() and erroneously pass __vcpu_on_runq().  We then
fail with a fault when trying to update runq->prev, which is also NULL.

The only place I can spot in the code where the runq->{next,prev} could
conceivably be NULL is in csched_alloc_vdata() between the memset() and
INIT_LIST_HEAD().  This is logically sensible in combination with the
localhost migrate loop, and I cant immediately see anything to prevent
this race happening.

Can anyone with more knowledge than me in this area of code comment on
the plausibility of my analysis?  Unfortunately, I cant see an easy
solution.

~Andrew