cluster-devel.redhat.com archive mirror
* [Cluster-devel] "->ls_in_recovery" not released
@ 2010-11-22 16:31 Menyhart Zoltan
  2010-11-22 17:34 ` David Teigland
  0 siblings, 1 reply; 10+ messages in thread
From: Menyhart Zoltan @ 2010-11-22 16:31 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

We have a two-node OCFS2 file system controlled by Pacemaker.
We do some robustness tests, e.g. blocking access to the "other" node.
The "local" machine is blocked:

  PID: 15617  TASK: ffff880c77572d90  CPU: 38  COMMAND: "dlm_recoverd"
  #0 [ffff880c7cb07c30] schedule at ffffffff81452830
  #1 [ffff880c7cb07cf8] dlm_wait_function at ffffffffa03aaffb
  #2 [ffff880c7cb07d68] dlm_rcom_status at ffffffffa03aa3d9
                        ping_members
  #3 [ffff880c7cb07db8] dlm_recover_members at ffffffffa03a58a3
                        ls_recover
                        do_ls_recovery
  #4 [ffff880c7cb07e48] dlm_recoverd at ffffffffa03abc89
  #5 [ffff880c7cb07ee8] kthread at ffffffff810820f6
  #6 [ffff880c7cb07f48] kernel_thread at ffffffff8100d1aa

If either the monitor device is closed, or someone sends a "stop"
down to the control device, then "ls_recover()" takes the "fail:" branch
without releasing "->ls_in_recovery".
As a result, OCFS2 operations remain blocked, e.g.:

PID: 3385   TASK: ffff880876e69520  CPU: 1   COMMAND: "bash"
  #0 [ffff88087cb91980] schedule at ffffffff81452830
  #1 [ffff88087cb91a48] rwsem_down_failed_common at ffffffff81454c95
  #2 [ffff88087cb91a98] rwsem_down_read_failed at ffffffff81454e26
  #3 [ffff88087cb91ad8] call_rwsem_down_read_failed at ffffffff81248004
  #4 [ffff88087cb91b40] dlm_lock at ffffffffa03a17b2
  #5 [ffff88087cb91c00] user_dlm_lock at ffffffffa020d18e
  #6 [ffff88087cb91c30] ocfs2_dlm_lock at ffffffffa00683c2
  #7 [ffff88087cb91c40] __ocfs2_cluster_lock at ffffffffa04f951c
  #8 [ffff88087cb91d60] ocfs2_inode_lock_full_nested at ffffffffa04fd800
  #9 [ffff88087cb91df0] ocfs2_inode_revalidate at ffffffffa0507566
#10 [ffff88087cb91e20] ocfs2_getattr at ffffffffa050270b
#11 [ffff88087cb91e60] vfs_getattr at ffffffff8115cac1
#12 [ffff88087cb91ea0] vfs_fstatat at ffffffff8115cb50
#13 [ffff88087cb91ee0] vfs_stat at ffffffff8115cc9b
#14 [ffff88087cb91ef0] sys_newstat at ffffffff8115ccc4
#15 [ffff88087cb91f80] system_call_fastpath at ffffffff8100c172

"ls_recover()" includes several other cases when it simply goes
to the "fail:" branch without setting free "->ls_in_recovery" and
without cleaning up the inconsistent data left behind.

I think some error handling code is missing in "ls_recover()".
Have you modified the DLM since the RHEL 6.0?

Thanks,

Zoltan Menyhart



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Cluster-devel] "->ls_in_recovery" not released
  2010-11-22 16:31 [Cluster-devel] "->ls_in_recovery" not released Menyhart Zoltan
@ 2010-11-22 17:34 ` David Teigland
  2010-11-23 14:58   ` Menyhart Zoltan
  0 siblings, 1 reply; 10+ messages in thread
From: David Teigland @ 2010-11-22 17:34 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Nov 22, 2010 at 05:31:25PM +0100, Menyhart Zoltan wrote:
> We have a two-node OCFS2 file system controlled by Pacemaker.

Are you using dlm_controld.pcmk?  If so, please try the latest versions of
pacemaker that use the standard dlm_controld.  The problem may be related
to the lockspace membership events that are passed to the kernel from
dlm_controld.  'dlm_tool dump' from each node, correlated with the
corosync membership events, will probably reveal the problem.  Start by
looking at the sequence of confchg log messages,
e.g. "dlm:ls:g conf 3 1 0 memb 1 2 4 join 4 left"

"conf 3 1 0":
  3 = number of members
  1 = number of members that joined
  0 = number of members that left

"memb 1 2 4" - nodeids of the members
"join 4"     - nodeids of the members that joined
"left"       - nodeids of the members that left (none in this example)

> "ls_recover()" includes several other cases when it simply goes
> to the "fail:" branch without setting free "->ls_in_recovery" and
> without cleaning up the inconsistent data left behind.
> 
> I think some error handling code is missing in "ls_recover()".
> Have you modified the DLM since the RHEL 6.0?

No, in_recovery is supposed to remain locked until recovery completes.
Any number of ls_recover() calls can fail due to more member changes
during recovery, but one of them should eventually succeed (complete
recovery), once the membership stops changing.  Then in_recovery will be
unlocked.
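
Schematically (names modeled loosely on fs/dlm/recoverd.c; this is a
sketch, not the verbatim kernel source):

/* dlm_recoverd context: in_recovery stays write-locked across any
 * number of failed passes; only a pass that completes with a stable
 * membership unlocks it again. */
static void recovery_loop_sketch(struct dlm_ls *ls)
{
        struct dlm_recover *rv;

        for (;;) {
                rv = next_recovery_event(ls);   /* hypothetical helper */
                if (!rv)
                        break;
                if (ls_recover(ls, rv) == 0) {
                        /* success: up_write(&ls->ls_in_recovery) */
                        enable_locking(ls);
                        break;
                }
                /* e.g. -EINTR from another member change: keep
                 * in_recovery locked; the next event retries */
        }
}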

Look at the specific errors causing ls_recover() to fail, and check if
it's a confchg-related failure (like above), or another kind of error.

Dave



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Cluster-devel] "->ls_in_recovery" not released
  2010-11-22 17:34 ` David Teigland
@ 2010-11-23 14:58   ` Menyhart Zoltan
  2010-11-23 17:15     ` David Teigland
  0 siblings, 1 reply; 10+ messages in thread
From: Menyhart Zoltan @ 2010-11-23 14:58 UTC (permalink / raw)
  To: cluster-devel.redhat.com

David Teigland wrote:
> On Mon, Nov 22, 2010 at 05:31:25PM +0100, Menyhart Zoltan wrote:
>> We have a two-node OCFS2 file system controlled by Pacemaker.
>
> Are you using dlm_controld.pcmk?

Yes.

>If so, please try the latest versions of
> pacemaker that use the standard dlm_controld.

Actually we have dlm-pcmk-3.0.12-23.el6.x86_64.

I downloaded git://git.fedorahosted.org/dlm.git
We shall try it soon.

>> "ls_recover()" includes several other cases when it simply goes
>> to the "fail:" branch without setting free "->ls_in_recovery" and
>> without cleaning up the inconsistent data left behind.
>>
>> I think some error handling code is missing in "ls_recover()".
>> Have you modified the DLM since the RHEL 6.0?
>
> No, in_recovery is supposed to remain locked until recovery completes.
> Any number of ls_recover() calls can fail due to more member changes
> during recovery, but one of them should eventually succeed (complete
> recovery), once the membership stops changing.  Then in_recovery will be
> unlocked.
>
> Look at the specific errors causing ls_recover() to fail, and check if
> it's a confchg-related failure (like above), or another kind of error.

Assume the "other" node is lost, possibly forever.
"dlm_wait_function()" can return only if "dlm_ls_stop()" gets called
in the mean time.
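
For reference, the wait in question looks roughly like this (a simplified
sketch of "dlm_wait_function()" from fs/dlm/recover.c, from memory, not
verbatim):

static int dlm_wait_function(struct dlm_ls *ls,
                             int (*testfn)(struct dlm_ls *ls))
{
        int error = 0;

        /* No timeout: sleeps until the test succeeds or recovery
         * is stopped by dlm_ls_stop() */
        wait_event(ls->ls_wait_general,
                   testfn(ls) || dlm_recovery_stopped(ls));

        if (dlm_recovery_stopped(ls))
                error = -EINTR;   /* sends ls_recover() to "fail:" */
        return error;
}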

I suppose the user-land can do something like this:

echo 0 > /sys/kernel/dlm/14E8093BB71D447EBEE691622CF86B9C/control

Actually I tried it by hand: it did not unblock the situation.
I guess that the next time around, it was "ping_members()" that returned
with error == 1.  The dead "other" node was still on the list.
Again, "ls_recover()" returned without releasing "->ls_in_recovery".

How can "ls_recover()" be reentered so that it can carry out the
recovery and release "->ls_in_recovery"?
(Assuming the "other" node is lost, possibly forever.)

Thanks for your response.

Zoltan Menyhart



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Cluster-devel] "->ls_in_recovery" not released
  2010-11-23 14:58   ` Menyhart Zoltan
@ 2010-11-23 17:15     ` David Teigland
  2010-11-24 16:13       ` Menyhart Zoltan
  2010-11-30 16:57       ` [Cluster-devel] Patch: making DLM more robust Menyhart Zoltan
  0 siblings, 2 replies; 10+ messages in thread
From: David Teigland @ 2010-11-23 17:15 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Nov 23, 2010 at 03:58:42PM +0100, Menyhart Zoltan wrote:
> David Teigland wrote:
> >On Mon, Nov 22, 2010 at 05:31:25PM +0100, Menyhart Zoltan wrote:
> >>We have a two-node OCFS2 file system controlled by Pacemaker.
> >
> >Are you using dlm_controld.pcmk?
> 
> Yes.
> 
> >If so, please try the latest versions of
> >pacemaker that use the standard dlm_controld.
> 
> Actually we have dlm-pcmk-3.0.12-23.el6.x86_64.
> 
> I downloaded git://git.fedorahosted.org/dlm.git
> We shall try it soon.

I'd suggest getting it from cluster.git STABLE3 or RHEL6 branches instead.

> >>"ls_recover()" includes several other cases when it simply goes
> >>to the "fail:" branch without setting free "->ls_in_recovery" and
> >>without cleaning up the inconsistent data left behind.
> >>
> >>I think some error handling code is missing in "ls_recover()".
> >>Have you modified the DLM since the RHEL 6.0?
> >
> >No, in_recovery is supposed to remain locked until recovery completes.
> >Any number of ls_recover() calls can fail due to more member changes
> >during recovery, but one of them should eventually succeed (complete
> >recovery), once the membership stops changing.  Then in_recovery will be
> >unlocked.
> >
> >Look at the specific errors causing ls_recover() to fail, and check if
> >it's a confchg-related failure (like above), or another kind of error.
> 
> Assume the "other" node is lost, possibly forever.
> "dlm_wait_function()" can return only if "dlm_ls_stop()" gets called
> in the mean time.
> 
> I suppose the user-land can do something like this:
> 
> echo 0 > /sys/kernel/dlm/14E8093BB71D447EBEE691622CF86B9C/control
> 
> Actually I tried it by hand: it did not unblock the situation.
> I guess that the next time around, it was "ping_members()" that returned
> with error == 1.  The dead "other" node was still on the list.
> Again, "ls_recover()" returned without releasing "->ls_in_recovery".
> 
> How can "ls_recover()" be reentered so that it can carry out the
> recovery and release "->ls_in_recovery"?
> (Assuming the "other" node is lost, possibly forever.)

dlm_controld manages all that.  You're either having a problem with the
pacemaker version, or you're missing something really basic, like loss of
quorum.  You're probably way off base looking in the kernel.

Dave



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Cluster-devel] "->ls_in_recovery" not released
  2010-11-23 17:15     ` David Teigland
@ 2010-11-24 16:13       ` Menyhart Zoltan
  2010-11-24 20:29         ` David Teigland
  2010-11-30 16:57       ` [Cluster-devel] Patch: making DLM more robust Menyhart Zoltan
  1 sibling, 1 reply; 10+ messages in thread
From: Menyhart Zoltan @ 2010-11-24 16:13 UTC (permalink / raw)
  To: cluster-devel.redhat.com

> I'd suggest getting it from cluster.git STABLE3 or RHEL6 branches instead.

Could you please indicate the exact URL?


I have got a concern about the robustness of the DLM.

The Linux rules say: one should not return to user mode while holding a lock.
This is because one cannot trust user mode programs to eventually re-enter
the kernel in order to release the lock.

For the very same reason (one should not trust user mode programs),
I think the DLM kernel module is not sufficiently robust.

If you take a closer look, the situation of the "dlm_recoverd" kernel thread
is quite similar to waiting for a user mode program to trigger the release
of a lock.

I can agree: it does not return to user mode.
Yet it holds the lock and goes to sleep, uninterruptibly, waiting
for a user action: it trusts 100% a user mode program that can be killed,
can be swapped out with no room to swap it back in, etc.

Instead, the DLM should always return within a few seconds, saying that
the caller cannot be granted a given "dlm_lock" for a given reason.

E.g. OCFS2 is able to handle a refused lock request.  It is up to the
caller to decide whether to wait longer.

I think that whatever user land does, the DLM kernel module should give
a response to a "dlm_lock()" request within a time that is short for a
human operator.


Thanks for your response,

Zoltan Menyhart



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Cluster-devel] "->ls_in_recovery" not released
  2010-11-24 16:13       ` Menyhart Zoltan
@ 2010-11-24 20:29         ` David Teigland
  0 siblings, 0 replies; 10+ messages in thread
From: David Teigland @ 2010-11-24 20:29 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Nov 24, 2010 at 05:13:40PM +0100, Menyhart Zoltan wrote:
> Could you please indicate the exact URL?

The current fedora packages,
or
https://www.redhat.com/archives/cluster-devel/2010-October/msg00008.html
or
http://git.fedorahosted.org/git/?p=cluster.git;a=shortlog;h=refs/heads/STABLE31

> The Linux rules say: one should not return to user mode while holding a lock.
> This is because one cannot trust user mode programs to eventually re-enter
> the kernel in order to release the lock.
> 
> For the very same reason (one should not trust user mode programs),
> I think the DLM kernel module is not sufficiently robust.
> 
> If you take a closer look, the situation of the "dlm_recoverd" kernel thread
> is quite similar to waiting for a user mode program to trigger the release
> of a lock.
> 
> I can agree: it does not return to user mode.
> Yet it holds the lock and goes to sleep, uninterruptibly, waiting
> for a user action: it trusts 100% a user mode program that can be killed,
> can be swapped out with no room to swap it back in, etc.
> 
> Instead, the DLM should always return within a few seconds, saying that
> the caller cannot be granted a given "dlm_lock" for a given reason.
> 
> E.g. OCFS2 is able to handle a refused lock request.  It is up to the
> caller to decide whether to wait longer.
> 
> I think that whatever user land does, the DLM kernel module should give
> a response to a "dlm_lock()" request within a time that is short for a
> human operator.

You have identified one of the obvious downsides to implementing
clustering partly in the kernel and partly in userland.  In my experience
this has not proven to be a problem.

Dave



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Cluster-devel] Patch: making DLM more robust
  2010-11-23 17:15     ` David Teigland
  2010-11-24 16:13       ` Menyhart Zoltan
@ 2010-11-30 16:57       ` Menyhart Zoltan
  2010-11-30 17:30         ` David Teigland
  1 sibling, 1 reply; 10+ messages in thread
From: Menyhart Zoltan @ 2010-11-30 16:57 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

An easy first step to make the DLM more robust is adding timeout protection
to the lockspace creation operation while it waits for a "dlm_controld" action.
A new member, "ci_dlm_controld_secs", is added to "dlm_config" to set the
timeout in seconds; DEFAULT_DLM_CTRL_SECS is 5 seconds.

At the same time, signals can be enabled and handled, too.

DLM_USER_CREATE_LOCKSPACE will be able to return new error codes:
-EINTR or -ETIMEDOUT.
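
Schematically, the change looks like this (the actual patch is attached
below; "ls_uevent_wait", "ls_uevent_result" and LSFL_UEVENT_WAIT follow
fs/dlm/dlm_internal.h, the rest is a sketch):

/* Instead of an effectively uninterruptible wait for dlm_controld's
 * answer, sleep interruptibly with the configured timeout. */
static int wait_for_controld(struct dlm_ls *ls)
{
        long ret;

        ret = wait_event_interruptible_timeout(ls->ls_uevent_wait,
                        test_and_clear_bit(LSFL_UEVENT_WAIT, &ls->ls_flags),
                        dlm_config.ci_dlm_controld_secs * HZ);
        if (ret == 0)
                return -ETIMEDOUT;      /* dlm_controld never answered */
        if (ret < 0)
                return -EINTR;          /* interrupted by a signal */
        return ls->ls_uevent_result;    /* dlm_controld's answer */
}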

Could you please tell me why signals are blocked within "device_write()"?
I think it is safe to allow signals, especially in your original code
sequences that wait in an uninterruptible way.

BTW "sigprocmask()" already contains "recalc_sigpending()".

  out_sig:
     sigprocmask(SIG_SETMASK, &tmpsig, NULL);
     recalc_sigpending();


Thanks,

Zoltan Menyhart
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: diff2
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20101130/d22fbbe9/attachment.ksh>

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Cluster-devel] Patch: making DLM more robust
  2010-11-30 16:57       ` [Cluster-devel] Patch: making DLM more robust Menyhart Zoltan
@ 2010-11-30 17:30         ` David Teigland
  2010-12-01  9:23           ` Menyhart Zoltan
  0 siblings, 1 reply; 10+ messages in thread
From: David Teigland @ 2010-11-30 17:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Nov 30, 2010 at 05:57:50PM +0100, Menyhart Zoltan wrote:
> Hi,
> 
> An easy first step to make the DLM more robust is adding timeout protection
> to the lockspace creation operation while it waits for a "dlm_controld" action.
> A new member, "ci_dlm_controld_secs", is added to "dlm_config" to set the
> timeout in seconds; DEFAULT_DLM_CTRL_SECS is 5 seconds.
> 
> At the same time, signals can be enabled and handled, too.
> 
> DLM_USER_CREATE_LOCKSPACE will be able to return new error codes:
> -EINTR or -ETIMEDOUT.
> 
> Could you please tell me why signals are blocked within "device_write()"?
> I think it is safe to allow signals, especially in your original code
> sequences that wait in an uninterruptible way.

Thanks, I'll take a look; as long as it's disabled by default I don't
expect I'd object much.  There are two main problems with this idea,
though, that need to be handled before it's generally usable:

1. The kernel can wait on user space indefinitely during completely normal
situations, e.g. the loss of quorum or fencing failures can delay
completion indefinitely.  This means you can easily introduce false
failures when using a timeout.  EINTR, since it's driven by user
intervention, is a better idea, e.g. killing a mount process.

2. The difficulty, even with EINTR, is correctly and cleanly unwinding the
dlm_controld state.

Dave



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Cluster-devel] Patch: making DLM more robust
  2010-11-30 17:30         ` David Teigland
@ 2010-12-01  9:23           ` Menyhart Zoltan
  2010-12-01 17:27             ` David Teigland
  0 siblings, 1 reply; 10+ messages in thread
From: Menyhart Zoltan @ 2010-12-01  9:23 UTC (permalink / raw)
  To: cluster-devel.redhat.com

David Teigland wrote:

> Thanks, I'll take a look; as long as it's disabled by default I don't
> expect I'd object much.  There are two main problems with this idea,
> though, that need to be handled before it's generally usable:
>
> 1. The kernel can wait on user space indefinitely during completely normal
> situations, e.g. the loss of quorum or fencing failures can delay
> completion indefinitely.

In my eyes, a networked application should indicate a failure within a
humanly acceptable time.  E.g.:
- You can try a DLM_USER_CREATE_LOCKSPACE for 5 seconds
- If it times out, you can log it and display some status telling the user
  that it has already been retried for H hours, M minutes and S seconds
- And retry (if configured to do so by itself) if there is no intervention

> This means you can easily introduce false
> failures when using a timeout.

If we cannot obtain a given resource within a limited time frame,
then it is a real error for the customer: s/he cannot mount an OCFS2
volume, cannot issue a cluster command, etc.

> EINTR, since it's driven by user
> intervention, is a better idea, e.g. killing a mount process.
>
> 2. The difficulty, even with EINTR, is correctly and cleanly unwinding the
> dlm_controld state.

Let's take this example from dlm/libdlm/libdlm.c:

int create_lockspace_v6(const char *name, uint32_t flags)
{
         char reqbuf[sizeof(struct dlm_write_request) + DLM_LOCKSPACE_LEN];
         struct dlm_write_request *req = (struct dlm_write_request *)reqbuf;
         int namelen = strlen(name);

         memset(reqbuf, 0, sizeof(reqbuf));
         set_version_v6(req);
         req->cmd = DLM_USER_CREATE_LOCKSPACE;
         req->i.lspace.flags = flags;
         if (namelen > DLM_LOCKSPACE_LEN) {
                 errno = EINVAL;
                 return -1;
         }
         memcpy(req->i.lspace.name, name, namelen);
         return write(control_fd, req, sizeof(*req) + namelen);
}

The caller should already be prepared to unwind everything in case
EINVAL is returned due to a name length error.
"write()" can also return several errors.

We will have two more error codes:

EINTR: there is not much difference whether the signal arrives just before
we call "write()" or inside the system call...
If you already ignore it... If you already handle it...

ETIMEDOUT: see above
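
A caller might then look like this (a hypothetical illustration;
"create_lockspace_v6()" as quoted above returns write()'s result and
sets errno on failure):

#include <errno.h>
#include <stdio.h>
#include <stdint.h>

int create_lockspace_retry(const char *name, uint32_t flags)
{
        int tries;

        for (tries = 0; tries < 3; tries++) {
                int ret = create_lockspace_v6(name, flags);

                if (ret >= 0)
                        return ret;             /* success */
                if (errno == EINTR)
                        continue;               /* interrupted: retry */
                if (errno == ETIMEDOUT) {
                        fprintf(stderr, "dlm_controld not answering "
                                "(attempt %d)\n", tries + 1);
                        continue;               /* or report and give up */
                }
                return -1;                      /* EINVAL, EBADF, ... */
        }
        return -1;
}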

There should be a smooth way out of errors, other than hard resetting the
machine :-)

Thanks,

Zoltan Menyhart





^ permalink raw reply	[flat|nested] 10+ messages in thread

* [Cluster-devel] Patch: making DLM more robust
  2010-12-01  9:23           ` Menyhart Zoltan
@ 2010-12-01 17:27             ` David Teigland
  0 siblings, 0 replies; 10+ messages in thread
From: David Teigland @ 2010-12-01 17:27 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Dec 01, 2010 at 10:23:25AM +0100, Menyhart Zoltan wrote:
> If we cannot obtain a given resource within a limited time frame,
> then it is a real error for the customer: s/he cannot mount an OCFS2
> volume, cannot issue a cluster command, etc.

Matter of opinion and preference I suppose.

> >2. The difficulty, even with EINTR, is correctly and cleanly unwinding the
> >dlm_controld state.
> 
> Let's take this example from dlm/libdlm/libdlm.c:

The problem is not backing out of libdlm, it's leaving the cpg group, etc.
in dlm_controld (when the join itself is not even complete).  It should
all be possible, but I've never viewed this as a problem worth fixing
given the effort required.

Dave



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2010-12-01 17:27 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-22 16:31 [Cluster-devel] "->ls_in_recovery" not released Menyhart Zoltan
2010-11-22 17:34 ` David Teigland
2010-11-23 14:58   ` Menyhart Zoltan
2010-11-23 17:15     ` David Teigland
2010-11-24 16:13       ` Menyhart Zoltan
2010-11-24 20:29         ` David Teigland
2010-11-30 16:57       ` [Cluster-devel] Patch: making DLM more robust Menyhart Zoltan
2010-11-30 17:30         ` David Teigland
2010-12-01  9:23           ` Menyhart Zoltan
2010-12-01 17:27             ` David Teigland
