From mboxrd@z Thu Jan 1 00:00:00 1970 From: Zhigang Wang Subject: Re: [PATCH] tools/hotplug: fix locking Date: Thu, 21 Jun 2012 09:07:00 -0400 Message-ID: <4FE31C74.7050100@oracle.com> References: <650b03f412143c3057ab.1339608555@zhigang.us.oracle.com> <1340206463.4906.72.camel@zakaz.uk.xensource.com> <4FE20E27.8000108@oracle.com> <1340278976.21872.114.camel@zakaz.uk.xensource.com> <1340279364.21872.118.camel@zakaz.uk.xensource.com> <4FE30ED3.6040805@oracle.com> <20120621122337.GP21124@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Return-path: In-Reply-To: <20120621122337.GP21124@redhat.com> List-Unsubscribe: , List-Post: List-Help: List-Subscribe: , Sender: xen-devel-bounces@lists.xen.org Errors-To: xen-devel-bounces@lists.xen.org To: "Daniel P. Berrange" Cc: Andrew Jones , Ian Campbell , Ian Jackson , xen-devel , Don Dutile , Kouya Shimura List-Id: xen-devel@lists.xenproject.org On 06/21/2012 08:23 AM, Daniel P. Berrange wrote: > On Thu, Jun 21, 2012 at 08:08:51AM -0400, Zhigang Wang wrote: >> On 06/21/2012 07:49 AM, Ian Campbell wrote: >>> On Thu, 2012-06-21 at 12:42 +0100, Ian Campbell wrote: >>>> On Wed, 2012-06-20 at 18:53 +0100, Zhigang Wang wrote: >>>>> On 06/20/2012 11:34 AM, Ian Campbell wrote: >>>>>> On Wed, 2012-06-13 at 18:29 +0100, Zhigang Wang wrote: >>>>>>> # HG changeset patch >>>>>>> # User Zhigang Wang >>>>>>> # Date 1339608534 14400 >>>>>>> # Node ID 650b03f412143c3057abbd0ae0e256e6b3fd2ba8 >>>>>>> # Parent 32034d1914a607d7b6f1f060352b4cac973600f8 >>>>>>> tools/hotplug: fix locking >>>>>> A better title would be "tools/hotplug: Switch to locking with flock" >>>>>> since "fix" is not very descriptive. >>>>> Agree. >>>>>> The commit message should also state why changing this scheme is >>>>>> preferable to fixing the current one. >>>>> I have two points: >>>>> >>>>> 1. No timeout: in the old one, if one process holding the lock more than 100 >>>>> seconds, other processes could steal it. >>>>> 2. No leftovers: if a process holding this lock is dead, it will close the lock >>>>> file, so it will not affect other processes. >>>>> >>>>>> Is this flock tool widely available? Does it need to be added to the >>>>>> dependencies in the README? >>>>> It is widely distributed: it's part of the util-linux package. So I think no >>>>> need to document it. >>>>>>> The current locking implementation would allow two processes get the lock >>>>>>> simultaneously: >>>>>> [...snip shell trace...] >>>>>> >>>>>> Can you add a line or two of analysis to explain where in that shell >>>>>> spew things are actually going wrong and why? >>>>> I didn't spent much time on this complicated implement. But it fails for me and >>>>> also for others (from response). >>>>>>> This patch is ported from Red Hat Enterprise Linux 5.8. >>>>>> I think we need a Signed-off-by from the original patch author. (Unless >>>>>> DCO clause b somehow applies?) >>>>> I'm not sure how to handle this. There's no signed-off-by on the original Red >>>>> Hat patch. Could anyone in Red Hat help to get it signed off? >>>> Perhaps Andrew Jones or Don Dutile can point us in the right direction >>>> (both CCd) if you identify the precise source of the patch (i.e. the >>>> name of the .src.rpm and the name of the patch within it)? >>>> >>>> I looked in kernel-2.6.18-308.8.2.el5.src.rpm (which seems fairly >>>> recent) but maybe I'm looking in the wrong place. >>> Nevermind, I found it in xen-3.0.3-135.el5_8.2.src.rpm (I'd have sworn >>> that RH shipped kernel+xen in a single source package, oh well). >>> >>> %changelog says: >>> * Fri Sep 14 2007 Daniel P. Berrange - 3.0.3-40.el5 >>> - Rewrite locking in hotplug scripts to fix timeouts (rhbz #267861) >>> >>> Daniel CCd. >>> >>> I wonder if there were any followup fixes or changes, especially wrt to >>> your point 1 above (the lack of a timeout now) requiring xend (or now >>> libxl) changes? >> We have another timeout in xend/xl: if device backend is not setup properly, the >> device creation will fail. >> >> Without any timeout for this locking, all backend uevent scripts will hang if >> one goes wrong. Just stealing the lock upon timeoutis not a good idea: it may >> cause error or damage. What about adding a long (10minutes) timeout? (by adding >> -w/--wait/--timeout 600) > Timeouts are a really bad idea for this area of code. If you make the > timeout really long (10 mins as you suggest), then no other guest can > be started during this entire 10 minute window if something goes wrong. > IMHO this is unacceptable. If you make the timeout short enough that > you don't unduly delay other VM startup operations, then you will often > hit bogus timeouts under high load. > > In practice the scripts should not hang unless something is seriously > wrong, in which case timing out & pretending every thing is still fine > for the future is bogus. If you have real world examples of hangs > then you should focus efforts on fixing the hangs, rather than papering > over the problem with a timeout. > > Daniel Thanks Daniel. I agree with your points on timeout. Zhigang