From: Stefan Bader
Subject: Re: Xen PVM: Strange lockups when running PostgreSQL load
Date: Thu, 18 Oct 2012 12:20:14 +0200
To: Ian Campbell
Cc: Andrew Cooper, Konrad Rzeszutek Wilk, Jan Beulich, xen-devel@lists.xen.org

On 18.10.2012 09:48, Ian Campbell wrote:
> On Thu, 2012-10-18 at 08:38 +0100, Stefan Bader wrote:
>> On 18.10.2012 09:08, Jan Beulich wrote:
>>>>>> On 18.10.12 at 09:00, "Jan Beulich" wrote:
>>>>>>> On 17.10.12 at 17:35, Andrew Cooper wrote:
>>>>> In each case, the event channels are masked (no surprise given the
>>>>> conversation so far on this thread), and have no pending events.
>>>>> Therefore, I believe we are looking at the same bug.
>>>>
>>>> That seems very unlikely (albeit not impossible) to me, given that
>>>> the non-pvops kernel uses ticket locks while the pvops one doesn't.
>>>
>>> And in fact we had a similar problem with our original ticket lock
>>> implementation, exposed by an open coded lock in the scheduler's
>>> run queue management. But that was really ticket lock specific, in
>>> that a CPU could passively become the owner of a lock while polling
>>> - that's impossible with pvops' byte locks afaict.
>>
>> One of the trains of thought I had was whether it could happen that a
>> cpu is in polling and the task gets moved. But I don't think that can
>> happen, as the hypercall is unlikely to be a place where any scheduling
>> happens (preempt is none). And it would be much more common...
>>
>> One detail which I hope someone can fill in is the whole "interrupted
>> spinlock" thing: saving the last lock pointer stored in the per-cpu
>> lock_spinners and so on. Is that really only for spinlocks taken
>> without interrupts disabled, or do I miss something there?
>
> spinning_lock() returns the old lock which the caller is expected to
> remember and replace via unspinning_lock() -- it effectively implements
> a stack of locks which are being waited on. xen_spin_lock_slow (the only
> caller) appears to do this correctly from a brief inspection.

Yes, but just *when* can there be a stack of (spin)locks? The poll_irq
hypercall seems to be an active process (in the sense of not preempting to
another task). How could there be a situation where another lock is taken
on the same CPU?
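For reference, below is roughly how I read that slow path. This is a
simplified sketch from memory of the pvops byte-lock code (arch/x86/xen/
spinlock.c in the 3.x kernels), not the exact source: the stat counters,
the lock_kicker_irq setup check and the spinner count handling are left
out, and helper names may not match exactly.

/* Per-cpu pointer to the lock this CPU is currently polling on (sketch). */
static DEFINE_PER_CPU(struct xen_spinlock *, lock_spinners);

/* "Push": announce which lock we are about to poll on, return the old one. */
static struct xen_spinlock *spinning_lock(struct xen_spinlock *xl)
{
	struct xen_spinlock *prev = __this_cpu_read(lock_spinners);

	__this_cpu_write(lock_spinners, xl);
	return prev;			/* caller must hand this back below */
}

/* "Pop": restore whatever lock (if any) we were polling on before. */
static void unspinning_lock(struct xen_spinlock *xl, struct xen_spinlock *prev)
{
	__this_cpu_write(lock_spinners, prev);
}

static noinline int xen_spin_lock_slow(struct arch_spinlock *lock,
				       bool irq_enable)
{
	struct xen_spinlock *xl = (struct xen_spinlock *)lock;
	struct xen_spinlock *prev;
	int irq = __this_cpu_read(lock_kicker_irq);
	int ret;

	prev = spinning_lock(xl);	/* push onto the per-cpu "stack" */
	do {
		unsigned long flags;

		xen_clear_irq_pending(irq);

		/* the lock may have been released while we set up */
		ret = xen_spin_trylock(lock);
		if (ret)
			break;

		flags = arch_local_save_flags();
		if (irq_enable)
			raw_local_irq_enable();	/* interrupts - and with them a
						 * nested slow-path entry - are
						 * possible from here ...     */

		xen_poll_irq(irq);		/* SCHEDOP_poll: block until the
						 * holder kicks our evtchn    */

		raw_local_irq_restore(flags);	/* ... to here                 */
	} while (!xen_test_irq_pending(irq));	/* loop on spurious wakeups    */

	unspinning_lock(xl, prev);	/* pop, restoring the outer entry */
	return ret;			/* 0 = kicked but lock not taken;
					 * the fast-path caller retries   */
}

If that reading is correct, the only way the stack can get deeper than one
entry is an interrupt arriving in the window where interrupts are re-enabled
around xen_poll_irq(), with the handler then contending on a different lock,
i.e. exactly the "taken without interrupts disabled" case asked about above.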
>
> Is there any chance this is just a simple AB-BA or similar type
> deadlock? Do we have data which suggests all vCPUs are waiting on the
> same lock or just that they are waiting on some lock? I suppose lockdep
> (which I think you mentioned before?) would have caught this, unless pv
> locks somehow confound it?

In the one case where I looked more deeply at the tasks that appeared to be
on a CPU, one of them was waiting to signal a task that looked as if it had
just been scheduled out, and the CPU that task had been running on was doing
an idle balance, waiting on the lock of cpu#0's runqueue. cpu#0 itself seemed
to be waiting in the slow path (its lock pointer was in lock_spinners[0]),
but the lock itself was 0.

Though there is a chance that this is just a coincidental state where the
lock had only just been released, and is more related to how the Xen stack
takes a guest dump. So the task would be to find out who holds the other
lock.

Unfortunately, a kernel with full lock debugging enabled is sufficiently
different in timing that I cannot reproduce the issue on a test machine. And
from the crashes reported in production I have no data.

>
> Ian.