From mboxrd@z Thu Jan  1 00:00:00 1970
From: Stefan Bader <stefan.bader@canonical.com>
Subject: Strange PVM spinlock case revisited (nailed down)
Date: Thu, 14 Feb 2013 12:04:56 +0100
Message-ID: <511CC4D8.3060203@canonical.com>
References: <51151ABF.1070007@canonical.com>
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="===============4915402263337268652=="
Return-path: <xen-devel-bounces@lists.xen.org>
In-Reply-To: <51151ABF.1070007@canonical.com>
List-Unsubscribe: <http://lists.xen.org/cgi-bin/mailman/options/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xen.org>
List-Help: <mailto:xen-devel-request@lists.xen.org?subject=help>
List-Subscribe: <http://lists.xen.org/cgi-bin/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xen.org?subject=subscribe>
Sender: xen-devel-bounces@lists.xen.org
Errors-To: xen-devel-bounces@lists.xen.org
To: "xen-devel@lists.xen.org" <xen-devel@lists.xen.org>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>, Ian Campbell <ian.campbell@citrix.com>, Jan Beulich <jbeulich@suse.com>
List-Id: xen-devel@lists.xenproject.org

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--===============4915402263337268652==
Content-Type: multipart/signed; micalg=pgp-sha512;
 protocol="application/pgp-signature";
 boundary="------------enigD23FB554A653276F059D5719"

This is an OpenPGP/MIME signed message (RFC 2440 and 3156)
--------------enigD23FB554A653276F059D5719
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

I think I finally can proof why allowing interrupts for the duration of p=
oll_irq
in xen_spin_lock_slow is bad! \o/

So with another run/dump:

VCPUs 1,2,5 and 6 are spinning on the runq lock of VCPU 1. The number of
spinners is 4 (ok) and the lock released.
According to dom0 debug info VCPUs 2,5 and 6 are polling, VCPU 1 is not
but seems running (has=3DT). The event channel for spinlock1 is pending.

Checking the interrupt stack of VCPU 1 in the guest dump shows:

hypercall_page+938
xen_poll_irq+16
xen_spin_lock_slow+151
xen_spin_lock_flags+99
_raw_spin_lock_irqsave+46
update_shares+156
rebalance_domains+72
run_rebalance_domains+72
__do_softirq+168
call_softirq+99
do_softirq+101
irq_exit+142
xen_evtchn_do_upcall+53
xen_do_hypervisor_callback+101

So right before finishing a previous callback, irq_exit triggers softirq
processing (interrupts enabled) and while updating the shares this tries =
to grab
the runq lock which we see in lock_spinners.

Since irq_count is 2 there is one more interruption executing right now (=
though
irq_regs wont show this). So I just proceeded upwards in the interrupt st=
ack and
found:
try_to_wake_up+57
default_wake_function+18
autoremove_wake_function+22
wake_bit_function+59
__wake_up_common+88
__wake_up+72
__wake_up_bit+49
end_page_writeback+64
put_io_page+36
ext4_end_bio+338
bio_endio+29
req_bio_endio.isra.45+163
blk_update_request+245
blk_update_bidi_request+49
__blk_end_bidi_request+32
__blk_end_request_all+31
blkif_interrupt+460
handle_irq_event_percpu+85
handle_irq_event+78
handle_edge_irq+132
__xen_evtchn_do_upcall+409
xen_evtchn_do_upcall+47
xen_do_hypervisor_callback+101

There was a bit more on the stack but try_to_wake_up seemed a believable =
current
path. Even more so after looking into the function:

#ifdef CONFIG_SMP
        /*
         * If the owning (remote) cpu is still in the middle of schedule(=
) with
         * this task as prev, wait until its done referencing the task.
         */
        while (p->on_cpu) {
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
                /*
                 * In case the architecture enables interrupts in
                 * context_switch(), we cannot busy wait, since that
                 * would lead to deadlocks when an interrupt hits and
                 * tries to wake up @prev. So bail and do a complete
                 * remote wakeup.
                 */
                if (ttwu_activate_remote(p, wake_flags))
                        goto stat;
#else
                cpu_relax();
#endif

And prying out the task in question from the stack, it was one currently
being accounted for VCPU 6 and on_cpu. VCPU 6 is one of the other waiters=
 for
the runq lock of VCPU 1. Which would get picked up as soon as this callba=
ck is
done. Unfortunately we "stay awhile... stay forever!".

Ok, as an experiment I defined  __ARCH_WANT_INTERRUPTS_ON_CTXSW for x86 a=
nd
running that kernel in the PVM guest no more locks up.
Problem there is that this define is gone since v3.7 (f3e9478674). And th=
e
same testcase is not able to cause the issue since v3.6 (611ae8e3f5). The=

change of TLB flushes unlikely is directly related (rather changing the n=
umber
of IPIs and by that the likelihood to observe this).

So for v3.6+ kernels, the question is whether it can be ruled out that du=
ring
softirq handling rebalance_domains->update_blocked_averages (was update_s=
hares)
is called which then can get into spin_lock_slow and enable interrupts.
I have not seen it or am aware of reports about it but its likely one of =
those
rare occurrences.

As for a solution, the simplest one would be never to re-enable interrupt=
s
before poll_irq... Anything else feels atm right complicated (like trying=
 to
make the scheduler use disable interrupts and spin_lock variants instead =
of
spin_lock_irq ones... weird hidden implications in common code "just" for=
 PVM).

-Stefan

Note: comparing the number of db records created on a 15min run of the te=
stcase
seems to yield similar results for interrupts enabled (with that arch def=
ine in
v3.2) or disabled. Though I do not have a huge number base and that PVM i=
s the
only one running on the host.


--------------enigD23FB554A653276F059D5719
Content-Type: application/pgp-signature; name="signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="signature.asc"

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with undefined - http://www.enigmail.net/

iQIcBAEBCgAGBQJRHMThAAoJEOhnXe7L7s6jiUoP/2QUWD3TdNlR53z4Mo4ZObEM
rvyMLW291XaSYVVx94d2tW5yKNdpbKFM9kOV0rTfOJ2o/j7/w3O4uZoOncDX5q+l
I88wG0fVV43qfwzCtl5UKh7Pgqk97AOO1k7qcZA7o6RntW1fJSOL7TDrUq3SV4Ec
uuTMKJXhtJ9WGw+FSeHWCYdElT6QZb54BxRvNUzDuAtJ0jqvy8Lh6gIR1lmwaSGN
hpbCirFJkKxFLnJukEWPaCHrrzTvlHCjRSefdPXxZuivZCBDMRSLc2ROYLTAVZxA
UNqJYX83/Aj4kDbGR/54PcWkc1KyPRzDB2kySfLorfJhkK48usjGsKK07eRSZXdk
M+P5X/M6CtVnwgZ2kY5XFap7ihVOnN8ibbOYSshE/r407L6p9OYipAF6puBZ/tmx
bi/r+61LH54+R1YdPOy4xisPM9XajIu85uyrc75aUY4bYzsvbMHLl5RuwYfN7omu
RHwhszPbNdThmHcp/p3SzeG5z2P14rVtunNRJRe28JXnRiksNeNEf5KuvS9offKb
bqLjy5dHNaIS2k/uEO634M1EUGnHeLYhg8EShu0XDHhnk76ASbHS6UHiRi8yPwVp
ybSOqwjmh/A8uS0HqMI+MF8/MgSZl+CkiFWDZ7NUW+wLOUuihjErkM3YBeYmPnD8
XNOyEWnICM4FbtWPRaR6
=mT37
-----END PGP SIGNATURE-----

--------------enigD23FB554A653276F059D5719--


--===============4915402263337268652==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

--===============4915402263337268652==--