From mboxrd@z Thu Jan 1 00:00:00 1970
From: Menyhart Zoltan
Date: Tue, 23 Nov 2010 15:58:42 +0100
Subject: [Cluster-devel] "->ls_in_recovery" not released
In-Reply-To: <20101122173442.GA21879@redhat.com>
References: <4CEA9ADD.2050109@bull.net> <20101122173442.GA21879@redhat.com>
Message-ID: <4CEBD6A2.8090005@bull.net>
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

David Teigland wrote:
> On Mon, Nov 22, 2010 at 05:31:25PM +0100, Menyhart Zoltan wrote:
>> We have got a two-node OCFS2 file system controlled by the pacemaker.
>
> Are you using dlm_controld.pcmk?

Yes.

> If so, please try the latest versions of
> pacemaker that use the standard dlm_controld.

Actually we have dlm-pcmk-3.0.12-23.el6.x86_64.
I downloaded git://git.fedorahosted.org/dlm.git
We shall try it soon.

>> "ls_recover()" includes several other cases when it simply goes
>> to the "fail:" branch without setting free "->ls_in_recovery" and
>> without cleaning up the inconsistent data left behind.
>>
>> I think some error handling code is missing in "ls_recover()".
>> Have you modified the DLM since the RHEL 6.0?
>
> No, in_recovery is supposed to remain locked until recovery completes.
> Any number of ls_recover() calls can fail due to more member changes
> during recovery, but one of them should eventually succeed (complete
> recovery), once the membership stops changing. Then in_recovery will be
> unlocked.
>
> Look at the specific errors causing ls_recover() to fail, and check if
> it's a confchg-related failure (like above), or another kind of error.

Assume the "other" node is lost, possibly forever.
"dlm_wait_function()" can return only if "dlm_ls_stop()" gets called
in the meantime.

I suppose userland can do something like this:

    echo 0 > /sys/kernel/dlm/14E8093BB71D447EBEE691622CF86B9C/control

Actually I tried it by hand: it did not unblock the situation.
I guess that the next time, it was "ping_members()" that returned with
error==1. The dead "other" node was still on the list.
Again, "ls_recover()" returned without setting free "->ls_in_recovery".

How can "ls_recover()" be re-entered so that it can carry out the recovery
and set "->ls_in_recovery" free?
(Assuming the "other" node is lost, possibly forever.)

Thanks for your response.

Zoltan Menyhart
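
For illustration, the retry behaviour David describes in the quoted text
(in_recovery stays locked across failed ls_recover() calls and is unlocked
only once an attempt completes) can be modelled in user space. This is only
a sketch under assumptions: the names lockspace_model, try_recover and
run_recovery, and the pending_changes counter, are hypothetical and are not
taken from the DLM sources. It does not model the case asked about here,
where a dead node is never removed from the member list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct lockspace_model {
    bool in_recovery;     /* stands in for "->ls_in_recovery" being held */
    int  pending_changes; /* assumed: membership changes still to arrive */
};

/* One recovery attempt: fails while the membership is still changing. */
static int try_recover(struct lockspace_model *ls)
{
    if (ls->pending_changes > 0) {
        ls->pending_changes--; /* another membership change interrupted us */
        return -1;             /* attempt aborted */
    }
    return 0;                  /* membership was stable, recovery completed */
}

/* Retries until one attempt succeeds; returns the number of attempts made. */
static int run_recovery(struct lockspace_model *ls)
{
    int attempts = 0;

    assert(ls != NULL);
    ls->in_recovery = true;    /* taken once, before the first attempt */

    for (;;) {
        attempts++;
        if (try_recover(ls) == 0)
            break;
        /* Failed attempt: in_recovery deliberately remains set. */
    }

    ls->in_recovery = false;   /* released only after a successful attempt */
    return attempts;
}
```

With pending_changes set to 2, run_recovery() makes three attempts and
returns with in_recovery cleared. If the counter never reached zero, the
loop would never release in_recovery.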