From: David Teigland <teigland@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] "->ls_in_recovery" not released
Date: Tue, 23 Nov 2010 12:15:08 -0500
Message-ID: <20101123171508.GC30147@redhat.com>
In-Reply-To: <4CEBD6A2.8090005@bull.net>

On Tue, Nov 23, 2010 at 03:58:42PM +0100, Menyhart Zoltan wrote:
> David Teigland wrote:
> >On Mon, Nov 22, 2010 at 05:31:25PM +0100, Menyhart Zoltan wrote:
> >>We have got a two-node OCFS2 file system controlled by the pacemaker.
> >
> >Are you using dlm_controld.pcmk?
>
> Yes.
>
> >If so, please try the latest versions of
> >pacemaker that use the standard dlm_controld.
>
> Actually we have dlm-pcmk-3.0.12-23.el6.x86_64.
>
> I downloaded git://git.fedorahosted.org/dlm.git
> We shall try it soon.

I'd suggest getting it from cluster.git STABLE3 or RHEL6 branches instead.

> >>"ls_recover()" includes several other cases when it simply goes
> >>to the "fail:" branch without setting free "->ls_in_recovery" and
> >>without cleaning up the inconsistent data left behind.
> >>
> >>I think some error handling code is missing in "ls_recover()".
> >>Have you modified the DLM since the RHEL 6.0?
> >
> >No, in_recovery is supposed to remain locked until recovery completes.
> >Any number of ls_recover() calls can fail due to more member changes
> >during recovery, but one of them should eventually succeed (complete
> >recovery), once the membership stops changing. Then in_recovery will be
> >unlocked.
> >
> >Look at the specific errors causing ls_recover() to fail, and check if
> >it's a confchg-related failure (like above), or another kind of error.
>
> Assume the "other" node is lost, possibly forever.
> "dlm_wait_function()" can return only if "dlm_ls_stop()" gets called
> in the meantime.
>
> I suppose the user-land can do something like this:
>
> echo 0 > /sys/kernel/dlm/14E8093BB71D447EBEE691622CF86B9C/control
>
> Actually I tried it by hand: it did not unblock the situation.
> I guess the next time, it was "ping_members()" that returned
> with error==1. The dead "other" node was still on the list.
> Again, "ls_recover()" returned without setting free "->ls_in_recovery".
>
> How can "ls_recover()" be reentered to be able to carry out the
> recovery and to set "->ls_in_recovery" free?
> (Assuming the "other" node is lost, possibly forever.)

dlm_controld manages all that. You're either having a problem with the
pacemaker version, or you're missing something really basic, like loss of
quorum. You're probably way off base looking in the kernel.

Dave
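
Editorial sketch, not part of the original message: the behaviour described
in the quoted text (in_recovery stays locked across any number of failed
ls_recover() attempts and is unlocked only by the attempt that completes
once membership stops changing) can be modelled with a toy example. This is
not the kernel code. The Lockspace class, the boolean standing in for the
lock, and the list of membership events are all assumptions made purely for
illustration.

```python
class Lockspace:
    """Toy stand-in for a DLM lockspace (hypothetical, not struct dlm_ls)."""

    def __init__(self):
        # Assumption: recovery has already been started, so the
        # "in_recovery" lock is held from the outset.
        self.in_recovery_locked = True


def ls_recover(ls, membership_stable):
    """One recovery attempt in the toy model.

    Returns 0 on success and -1 on failure. A failure deliberately leaves
    in_recovery locked, mirroring the "fail:" path discussed in the thread.
    """
    if not membership_stable:
        return -1
    ls.in_recovery_locked = False  # recovery completed: unlock
    return 0


def run_recovery(ls, membership_events):
    """Retry once per membership event until an attempt succeeds.

    In the real system the retries are driven from user space; here they
    are simply the items of a list. Returns the number of attempts made,
    or None if no attempt succeeded.
    """
    for attempts, stable in enumerate(membership_events, start=1):
        if ls_recover(ls, stable) == 0:
            return attempts
    return None


ls = Lockspace()
attempts = run_recovery(ls, [False, False, True])
print(attempts, ls.in_recovery_locked)
```

In this model, a sequence of events that never becomes stable leaves the
lock held indefinitely, which corresponds to the symptom reported in the
thread: the thing to investigate is why no new, stable membership is being
delivered, rather than the failing attempts themselves.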