From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Teigland <teigland@redhat.com>
Date: Tue, 23 Nov 2010 12:15:08 -0500
Subject: [Cluster-devel] "->ls_in_recovery" not released
In-Reply-To: <4CEBD6A2.8090005@bull.net>
References: <4CEA9ADD.2050109@bull.net> <20101122173442.GA21879@redhat.com>
	<4CEBD6A2.8090005@bull.net>
Message-ID: <20101123171508.GC30147@redhat.com>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Tue, Nov 23, 2010 at 03:58:42PM +0100, Menyhart Zoltan wrote:
> David Teigland wrote:
> >On Mon, Nov 22, 2010 at 05:31:25PM +0100, Menyhart Zoltan wrote:
> >>We have got a two-node OCFS2 file system controlled by the pacemaker.
> >
> >Are you using dlm_controld.pcmk?
> 
> Yes.
> 
> >If so, please try the latest versions of
> >pacemaker that use the standard dlm_controld.
> 
> Actually we have dlm-pcmk-3.0.12-23.el6.x86_64.
> 
> I downloaded git://git.fedorahosted.org/dlm.git
> We shall try it soon.

I'd suggest getting it from the STABLE3 or RHEL6 branch of cluster.git instead.

> >>"ls_recover()" includes several other cases when it simply goes
> >>to the "fail:" branch without setting free "->ls_in_recovery" and
> >>without cleaning up the inconsistent data left behind.
> >>
> >>I think some error handling code is missing in "ls_recover()".
> >>Have you modified the DLM since the RHEL 6.0?
> >
> >No, in_recovery is supposed to remain locked until recovery completes.
> >Any number of ls_recover() calls can fail due to more member changes
> >during recovery, but one of them should eventually succeed (complete
> >recovery), once the membership stops changing.  Then in_recovery will be
> >unlocked.
> >
> >Look at the specific errors causing ls_recover() to fail, and check if
> >it's a confchg-related failure (like above), or another kind of error.
> 
> Assume the "other" node is lost, possibly forever.
> "dlm_wait_function()" can return only if "dlm_ls_stop()" gets called
> in the meantime.
> 
> I suppose the user-land can do something like this:
> 
> echo 0 > /sys/kernel/dlm/14E8093BB71D447EBEE691622CF86B9C/control
> 
> Actually I tried it by hand: it did not unblock the situation.
> I guess that the next time, it was "ping_members()" that returned
> with error==1.  The dead "other" node was still on the list.
> Again, "ls_recover()" returned without setting free "->ls_in_recovery".
> 
> How can "ls_recover()" be reentered so that it can carry out the
> recovery and set "->ls_in_recovery" free?
> (Assuming the "other" node is lost, possibly forever.)

dlm_controld manages all that.  You're either having a problem with the
pacemaker version, or you're missing something really basic, like loss of
quorum.  You're probably way off base looking in the kernel.
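
To illustrate the behaviour described in the quoted text (in_recovery stays
locked across failed ls_recover() calls and is only unlocked by the attempt
that completes), here is a rough toy model in Python.  It is not the kernel
code: the Lockspace class, the membership_stable flag and run_recovery() are
hypothetical names made up for this sketch, and a plain threading.Lock
stands in for whatever the kernel really uses.

```python
import threading


class Lockspace:
    """Toy stand-in for a lockspace; not the kernel's data structure."""

    def __init__(self):
        self.in_recovery = threading.Lock()
        # Assumption: the lock is already held when recovery starts.
        self.in_recovery.acquire()


def ls_recover(ls, membership_stable):
    """One recovery attempt. Returns 0 on success, nonzero on failure."""
    if not membership_stable:
        # Failure path: in_recovery is deliberately left locked,
        # because a later attempt is expected to finish the job.
        return 1
    ls.in_recovery.release()
    return 0


def run_recovery(ls, events):
    """Retry once per membership event; return the number of attempts."""
    attempts = 0
    for stable in events:
        attempts += 1
        if ls_recover(ls, stable) == 0:
            break
    return attempts


if __name__ == "__main__":
    ls = Lockspace()
    tries = run_recovery(ls, [False, False, True])
    print(tries, ls.in_recovery.locked())
```

In this model a lockspace whose membership never settles keeps in_recovery
locked indefinitely, which is the symptom reported; the thing to look at is
then whatever is supposed to deliver the next membership event, not the
failure path itself.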

Dave