From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stefan Bader Subject: Strange PVM spinlock case revisited (nailed down) Date: Thu, 14 Feb 2013 12:04:56 +0100 Message-ID: <511CC4D8.3060203@canonical.com> References: <51151ABF.1070007@canonical.com> Mime-Version: 1.0 Content-Type: multipart/mixed; boundary="===============4915402263337268652==" Return-path: In-Reply-To: <51151ABF.1070007@canonical.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: "xen-devel@lists.xen.org" Cc: Andrew Cooper , Ian Campbell , Jan Beulich List-Id: xen-devel@lists.xenproject.org This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --===============4915402263337268652== Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="------------enigD23FB554A653276F059D5719" This is an OpenPGP/MIME signed message (RFC 2440 and 3156) --------------enigD23FB554A653276F059D5719 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable I think I finally can proof why allowing interrupts for the duration of p= oll_irq in xen_spin_lock_slow is bad! \o/ So with another run/dump: VCPUs 1,2,5 and 6 are spinning on the runq lock of VCPU 1. The number of spinners is 4 (ok) and the lock released. According to dom0 debug info VCPUs 2,5 and 6 are polling, VCPU 1 is not but seems running (has=3DT). The event channel for spinlock1 is pending. Checking the interrupt stack of VCPU 1 in the guest dump shows: hypercall_page+938 xen_poll_irq+16 xen_spin_lock_slow+151 xen_spin_lock_flags+99 _raw_spin_lock_irqsave+46 update_shares+156 rebalance_domains+72 run_rebalance_domains+72 __do_softirq+168 call_softirq+99 do_softirq+101 irq_exit+142 xen_evtchn_do_upcall+53 xen_do_hypervisor_callback+101 So right before finishing a previous callback, irq_exit triggers softirq processing (interrupts enabled) and while updating the shares this tries = to grab the runq lock which we see in lock_spinners. Since irq_count is 2 there is one more interruption executing right now (= though irq_regs wont show this). So I just proceeded upwards in the interrupt st= ack and found: try_to_wake_up+57 default_wake_function+18 autoremove_wake_function+22 wake_bit_function+59 __wake_up_common+88 __wake_up+72 __wake_up_bit+49 end_page_writeback+64 put_io_page+36 ext4_end_bio+338 bio_endio+29 req_bio_endio.isra.45+163 blk_update_request+245 blk_update_bidi_request+49 __blk_end_bidi_request+32 __blk_end_request_all+31 blkif_interrupt+460 handle_irq_event_percpu+85 handle_irq_event+78 handle_edge_irq+132 __xen_evtchn_do_upcall+409 xen_evtchn_do_upcall+47 xen_do_hypervisor_callback+101 There was a bit more on the stack but try_to_wake_up seemed a believable = current path. Even more so after looking into the function: #ifdef CONFIG_SMP /* * If the owning (remote) cpu is still in the middle of schedule(= ) with * this task as prev, wait until its done referencing the task. */ while (p->on_cpu) { #ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW /* * In case the architecture enables interrupts in * context_switch(), we cannot busy wait, since that * would lead to deadlocks when an interrupt hits and * tries to wake up @prev. So bail and do a complete * remote wakeup. */ if (ttwu_activate_remote(p, wake_flags)) goto stat; #else cpu_relax(); #endif And prying out the task in question from the stack, it was one currently being accounted for VCPU 6 and on_cpu. VCPU 6 is one of the other waiters= for the runq lock of VCPU 1. Which would get picked up as soon as this callba= ck is done. Unfortunately we "stay awhile... stay forever!". Ok, as an experiment I defined __ARCH_WANT_INTERRUPTS_ON_CTXSW for x86 a= nd running that kernel in the PVM guest no more locks up. Problem there is that this define is gone since v3.7 (f3e9478674). And th= e same testcase is not able to cause the issue since v3.6 (611ae8e3f5). The= change of TLB flushes unlikely is directly related (rather changing the n= umber of IPIs and by that the likelihood to observe this). So for v3.6+ kernels, the question is whether it can be ruled out that du= ring softirq handling rebalance_domains->update_blocked_averages (was update_s= hares) is called which then can get into spin_lock_slow and enable interrupts. I have not seen it or am aware of reports about it but its likely one of = those rare occurrences. As for a solution, the simplest one would be never to re-enable interrupt= s before poll_irq... Anything else feels atm right complicated (like trying= to make the scheduler use disable interrupts and spin_lock variants instead = of spin_lock_irq ones... weird hidden implications in common code "just" for= PVM). -Stefan Note: comparing the number of db records created on a 15min run of the te= stcase seems to yield similar results for interrupts enabled (with that arch def= ine in v3.2) or disabled. Though I do not have a huge number base and that PVM i= s the only one running on the host. --------------enigD23FB554A653276F059D5719 Content-Type: application/pgp-signature; name="signature.asc" Content-Description: OpenPGP digital signature Content-Disposition: attachment; filename="signature.asc" -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.11 (GNU/Linux) Comment: Using GnuPG with undefined - http://www.enigmail.net/ iQIcBAEBCgAGBQJRHMThAAoJEOhnXe7L7s6jiUoP/2QUWD3TdNlR53z4Mo4ZObEM rvyMLW291XaSYVVx94d2tW5yKNdpbKFM9kOV0rTfOJ2o/j7/w3O4uZoOncDX5q+l I88wG0fVV43qfwzCtl5UKh7Pgqk97AOO1k7qcZA7o6RntW1fJSOL7TDrUq3SV4Ec uuTMKJXhtJ9WGw+FSeHWCYdElT6QZb54BxRvNUzDuAtJ0jqvy8Lh6gIR1lmwaSGN hpbCirFJkKxFLnJukEWPaCHrrzTvlHCjRSefdPXxZuivZCBDMRSLc2ROYLTAVZxA UNqJYX83/Aj4kDbGR/54PcWkc1KyPRzDB2kySfLorfJhkK48usjGsKK07eRSZXdk M+P5X/M6CtVnwgZ2kY5XFap7ihVOnN8ibbOYSshE/r407L6p9OYipAF6puBZ/tmx bi/r+61LH54+R1YdPOy4xisPM9XajIu85uyrc75aUY4bYzsvbMHLl5RuwYfN7omu RHwhszPbNdThmHcp/p3SzeG5z2P14rVtunNRJRe28JXnRiksNeNEf5KuvS9offKb bqLjy5dHNaIS2k/uEO634M1EUGnHeLYhg8EShu0XDHhnk76ASbHS6UHiRi8yPwVp ybSOqwjmh/A8uS0HqMI+MF8/MgSZl+CkiFWDZ7NUW+wLOUuihjErkM3YBeYmPnD8 XNOyEWnICM4FbtWPRaR6 =mT37 -----END PGP SIGNATURE----- --------------enigD23FB554A653276F059D5719-- --===============4915402263337268652== Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel --===============4915402263337268652==--