[Cluster-devel] Question on LVB when the node that held EX lock crash

All of lore.kernel.org
 help / color / mirror / Atom feed

* [Cluster-devel] Question on LVB when the node that held EX lock crash
@ 2016-11-16  8:29 Eric Ren
  2016-11-16  8:42 ` Eric Ren
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Ren @ 2016-11-16  8:29 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi David and all,

I am debugging an issue of ocfs2 that relates to LVB value. I will try to make it a pure DLM 
question:

Two nodes (N1, N2) try to truncate the same file(R1) concurrently.

N1                                         N2
lock(R1, EX)
changing LVB: x
                                               lock(R1, EX)
convert(R1, NULL)
flush LVB
                                               changing LVB: x -> y
                                               crash
convert(R1, EX)
get LVB
Qustion: what is the LVB then? x or y?
======

Is this a valid question? or am I missing something?

Looking forward your help;-)

Thanks,
Eric



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Cluster-devel] Question on LVB when the node that held EX lock crash
  2016-11-16  8:29 [Cluster-devel] Question on LVB when the node that held EX lock crash Eric Ren
@ 2016-11-16  8:42 ` Eric Ren
  2016-11-16 15:08   ` David Teigland
  0 siblings, 1 reply; 7+ messages in thread
From: Eric Ren @ 2016-11-16  8:42 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On 11/16/2016 04:29 PM, Eric Ren wrote:
> Hi David and all,
>
> I am debugging an issue of ocfs2 that relates to LVB value. I will try to make it a pure 
> DLM question:
>
> Two nodes (N1, N2) try to truncate the same file(R1) concurrently.
>
> N1                                         N2
> lock(R1, EX)
> changing LVB: x
>                                               lock(R1, EX)
> convert(R1, NULL)
> flush LVB
>                                               changing LVB: x -> y
>                                               crash

Hmm, here should be a long story of DLM recovery;-)

Eric
> convert(R1, EX)
> get LVB
> Qustion: what is the LVB then? x or y?
> ======
>
> Is this a valid question? or am I missing something?
>
> Looking forward your help;-)
>
> Thanks,
> Eric




^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Cluster-devel] Question on LVB when the node that held EX lock crash
  2016-11-16  8:42 ` Eric Ren
@ 2016-11-16 15:08   ` David Teigland
  2016-11-17  9:30     ` Eric Ren
  2016-11-30  9:07     ` Eric Ren
  0 siblings, 2 replies; 7+ messages in thread
From: David Teigland @ 2016-11-16 15:08 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Nov 16, 2016 at 04:42:09PM +0800, Eric Ren wrote:
> On 11/16/2016 04:29 PM, Eric Ren wrote:
> > Hi David and all,
> > 
> > I am debugging an issue of ocfs2 that relates to LVB value. I will try
> > to make it a pure DLM question:
> > 
> > Two nodes (N1, N2) try to truncate the same file(R1) concurrently.
> > 
> > N1                                         N2
> > lock(R1, EX)
> > changing LVB: x
> >                                               lock(R1, EX)
> > convert(R1, NULL)
> > flush LVB
> >                                               changing LVB: x -> y
> >                                               crash
> 
> Hmm, here should be a long story of DLM recovery;-)
> 
> Eric
> > convert(R1, EX)
> > get LVB
> > Qustion: what is the LVB then? x or y?
> > ======
> > 
> > Is this a valid question? or am I missing something?

It's a good question, and it's been enough years that the details are now
hazy.  I think the current behavior emulates the original VMS dlm model
fairly well, so any documentation you can find on that may help.

If you look at the recover_lvb() comment, you'll see a little information
about this.  The LVB can be lost in a crash in some fairly common cases,
and in those cases, the dlm should set the VALNOTVALID flag to tell the
application that the LVB may be lost/stale.  So, an application cannot
rely entirely on the LVB, and must be able to go without it, or
reconstruct the value, i.e. the LVB data is usually used as part of an
optimization (e.g. caching).

The two cases mentioned in that comment are:

1. if N1 was R1 master when N2 crashed: N1 would purge the EX lock from
N2, and set VALNOTVALID on R1, because the latest LVB from N2 was never
seen by N1.

2. if N2 was R1 master when N2 crashed: N1 would become the new R1 master
(if it kept a NL lock on it), and would set VALNOTVALID because it doesn't
know if N2 had any EX locks from other nodes that might have also crashed,
or an LVB that had been updated since N1 last saw it.

Dave

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Cluster-devel] Question on LVB when the node that held EX lock crash
  2016-11-16 15:08   ` David Teigland
@ 2016-11-17  9:30     ` Eric Ren
  2016-11-30  9:07     ` Eric Ren
  1 sibling, 0 replies; 7+ messages in thread
From: Eric Ren @ 2016-11-17  9:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi!

On 11/16/2016 11:08 PM, David Teigland wrote:
> ....
>>> convert(R1, EX)
>>> get LVB
>>> Qustion: what is the LVB then? x or y?
>>> ======
>>>
>>> Is this a valid question? or am I missing something?
> It's a good question, and it's been enough years that the details are now
> hazy.  I think the current behavior emulates the original VMS dlm model
> fairly well, so any documentation you can find on that may help.
>
> If you look at the recover_lvb() comment, you'll see a little information
> about this.  The LVB can be lost in a crash in some fairly common cases,
> and in those cases, the dlm should set the VALNOTVALID flag to tell the
> application that the LVB may be lost/stale.  So, an application cannot
> rely entirely on the LVB, and must be able to go without it, or
> reconstruct the value, i.e. the LVB data is usually used as part of an
> optimization (e.g. caching).
>
> The two cases mentioned in that comment are:
>
> 1. if N1 was R1 master when N2 crashed: N1 would purge the EX lock from
> N2, and set VALNOTVALID on R1, because the latest LVB from N2 was never
> seen by N1.
>
> 2. if N2 was R1 master when N2 crashed: N1 would become the new R1 master
> (if it kept a NL lock on it), and would set VALNOTVALID because it doesn't
> know if N2 had any EX locks from other nodes that might have also crashed,
> or an LVB that had been updated since N1 last saw it.
Thanks a lot! Yes, the ocfs2 developer had handled this situation by this patch:

            "[Ocfs2-devel] [PATCH] ocfs2: Provide the ocfs2_dlm_lvb_valid() stack API"
from
https://oss.oracle.com/pipermail/ocfs2-devel/2009-June/004752.html

Comments of this patch:
"""

The Lock Value Block (LVB) of a DLM lock can be lost when nodes die and
the DLM cannot reconstruct its state.  Clients of the DLM need to know
this.

ocfs2's internal DLM, o2dlm, explicitly zeroes out the LVB when it loses
track of the state.  This is not a standard behavior, but ocfs2 has
always relied on it.  Thus, an o2dlm LVB is always "valid".

ocfs2 now supports both o2dlm and fs/dlm via the stack glue.  When
fs/dlm loses track of an LVBs state, it sets a flag
(DLM_SBF_VALNOTVALID) on the Lock Status Block (LKSB).  The contents of
the LVB may be garbage or merely stale.

ocfs2 doesn't want to try to guess at the validity of the stale LVB.
Instead, it should be checking the VALNOTVALID flag.  As this is the
'standard' way of treating LVBs, we will promote this behavior.

We add a stack glue API ocfs2_dlm_lvb_valid().  It returns non-zero when
the LVB is valid.  o2dlm will always return valid, while fs/dlm will
check VALNOTVALID.

Signed-off-by: Joel Becker <joel.becker@oracle.com <http://oss.oracle.com/mailman/listinfo/ocfs2-devel>>

"""

Thanks!
Eric
>
> Dave
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Cluster-devel] Question on LVB when the node that held EX lock crash
  2016-11-16 15:08   ` David Teigland
  2016-11-17  9:30     ` Eric Ren
@ 2016-11-30  9:07     ` Eric Ren
  2016-11-30 16:16       ` David Teigland
  1 sibling, 1 reply; 7+ messages in thread
From: Eric Ren @ 2016-11-30  9:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi David,

On 11/16/2016 11:08 PM, David Teigland wrote:
>>> convert(R1, EX)
>>> get LVB
>>> Qustion: what is the LVB then? x or y?
>>> ======
>>>
>>> Is this a valid question? or am I missing something?
> It's a good question, and it's been enough years that the details are now
> hazy.  I think the current behavior emulates the original VMS dlm model
> fairly well, so any documentation you can find on that may help.
>
> If you look at the recover_lvb() comment, you'll see a little information
> about this.  The LVB can be lost in a crash in some fairly common cases,
> and in those cases, the dlm should set the VALNOTVALID flag to tell the
> application that the LVB may be lost/stale.  So, an application cannot
> rely entirely on the LVB, and must be able to go without it, or
> reconstruct the value, i.e. the LVB data is usually used as part of an
> optimization (e.g. caching).
>
> The two cases mentioned in that comment are:
>
> 1. if N1 was R1 master when N2 crashed: N1 would purge the EX lock from
> N2, and set VALNOTVALID on R1, because the latest LVB from N2 was never
> seen by N1.
>
> 2. if N2 was R1 master when N2 crashed: N1 would become the new R1 master
> (if it kept a NL lock on it), and would set VALNOTVALID because it doesn't
> know if N2 had any EX locks from other nodes that might have also crashed,
> or an LVB that had been updated since N1 last saw it.
I have little doubts about the patch below and hope for your clarifications;-)

1) The commit:
"""
commit da8c66638ae684c99abcb30e89d2803402e7ca20
Author: David Teigland <teigland@redhat.com>
Date:   Thu Nov 15 15:01:51 2012 -0600

     dlm: fix lvb invalidation conditions

     When a node is removed that held a PW/EX lock, the
     existing master node should invalidate the lvb on the
     resource due to the purged lock.

     Previously, the existing master node was invalidating
     the lvb if it found only NL/CR locks on the resource
     during recovery for the removed node.  This could lead
     to cases where it invalidated the lvb and shouldn't
     have, or cases where it should have invalidated and
     didn't.

     When recovery selects a *new* master node for a
     resource, and that new master finds only NL/CR locks
     on the resource after lock recovery, it should
     invalidate the lvb.  This case was handled correctly
     (but was incorrectly applied to the existing master
     case also.)

     When a process exits while holding a PW/EX lock,
     the lvb on the resource should be invalidated.
     This was not happening.

     The lvb contents and VALNOTVALID flag should be
     recovered before granting locks in recovery so that
     the recovered lvb state is provided in the callback.
     The lvb was being recovered after the lock was granted.

     Signed-off-by: David Teigland <teigland@redhat.com>
"""

2) Snippet code that I cannot understand:
"""
@@ -852,12 +868,19 @@ void dlm_recover_rsbs(struct dlm_ls *ls)
                 if (is_master(r)) {
                         if (rsb_flag(r, RSB_RECOVER_CONVERT))
                                 recover_conversion(r);
+
+                       /* recover lvb before granting locks so the updated
+                          lvb/VALNOTVALID is presented in the completion */
+                       recover_lvb(r);
+
                         if (rsb_flag(r, RSB_NEW_MASTER2))
                                 recover_grant(r);
-                       recover_lvb(r);
                         count++;
+               } else {
+                       rsb_clear_flag(r, RSB_VALNOTVALID);
                 }
"""

3) Questions:

a. Should we put recover_lvb() even before recover_conversion()? if not, why?
b. Why should we clear fag RSB_VALNOTVALID in the else branch?

Best Regards,
Eric

>
> Dave
>



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Cluster-devel] Question on LVB when the node that held EX lock crash
  2016-11-30  9:07     ` Eric Ren
@ 2016-11-30 16:16       ` David Teigland
  2016-12-01  9:51         ` Eric Ren
  0 siblings, 1 reply; 7+ messages in thread
From: David Teigland @ 2016-11-30 16:16 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Nov 30, 2016 at 05:07:22PM +0800, Eric Ren wrote:
> @@ -852,12 +868,19 @@ void dlm_recover_rsbs(struct dlm_ls *ls)
>                 if (is_master(r)) {
>                         if (rsb_flag(r, RSB_RECOVER_CONVERT))
>                                 recover_conversion(r);
> +
> +                       /* recover lvb before granting locks so the updated
> +                          lvb/VALNOTVALID is presented in the completion */
> +                       recover_lvb(r);
> +
>                         if (rsb_flag(r, RSB_NEW_MASTER2))
>                                 recover_grant(r);
> -                       recover_lvb(r);
>                         count++;
> +               } else {
> +                       rsb_clear_flag(r, RSB_VALNOTVALID);
>                 }
> """
> 
> 3) Questions:
> 
> a. Should we put recover_lvb() even before recover_conversion()? if not, why?

Yes, I think you're right.  The lvb decision should be made using the
original lock modes, not the modified lock modes from recover_conversion.

> b. Why should we clear fag RSB_VALNOTVALID in the else branch?

That looks incorrect also.  I think VALNOTVALID should only be cleared
when the lvb is written by the application.  The else condition should
probably just be removed.  That does raise the question of whether it
could be masking another problem (e.g. some case where the flag is not
being cleared when it should be, or a case where it's being set and
shouldn't be.)

Dave



^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Cluster-devel] Question on LVB when the node that held EX lock crash
  2016-11-30 16:16       ` David Teigland
@ 2016-12-01  9:51         ` Eric Ren
  0 siblings, 0 replies; 7+ messages in thread
From: Eric Ren @ 2016-12-01  9:51 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi David,

On 12/01/2016 12:16 AM, David Teigland wrote:
> On Wed, Nov 30, 2016 at 05:07:22PM +0800, Eric Ren wrote:
>> a. Should we put recover_lvb() even before recover_conversion()? if not, why?
> Yes, I think you're right.  The lvb decision should be made using the
> original lock modes, not the modified lock modes from recover_conversion.
>
>> b. Why should we clear fag RSB_VALNOTVALID in the else branch?
> That looks incorrect also.  I think VALNOTVALID should only be cleared
> when the lvb is written by the application.  The else condition should
> probably just be removed.  That does raise the question of whether it
> could be masking another problem (e.g. some case where the flag is not
> being cleared when it should be, or a case where it's being set and
> shouldn't be.)

Thanks a lot for your help!

The ocfs2 panic (mlog_bug_on_msg(le64_to_cpu(fe->i_size) != i_size_read(inode),...) still can
be triggered by our testing that a *same* file is truncated randomly from 3 nodes, and 
meanwhile,  reset one of the nodes.

Here are some observations that worth to mention, i think:
1) with o2cb cluster stack (ocfs2 internal DLM), no panic so far. I think it is because o2cb 
simply discards LVB during the process of
failure & recovery, and make ocfs2 get metadata block from disk.

2) with fs/dlm, ocfs2 code decide whether going to disk by checking VALNOTVALID flag.

3) without commit (dlm: fix lvb invalidation conditions),  we can reproduce the panic very 
often.

4) with commit (dlm: fix lvb invalidation conditions), and with changes
      a) move recover_lvb() before recover_conversion();
      b) remove the else condition;
the panic only happens when we reset the master node of the rsb for ocfs2 inode.

Will look into more to see if I can find anything solid;-)

Thanks,
Eric
>
> Dave
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-12-01  9:51 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2016-11-16  8:29 [Cluster-devel] Question on LVB when the node that held EX lock crash Eric Ren
2016-11-16  8:42 ` Eric Ren
2016-11-16 15:08   ` David Teigland
2016-11-17  9:30     ` Eric Ren
2016-11-30  9:07     ` Eric Ren
2016-11-30 16:16       ` David Teigland
2016-12-01  9:51         ` Eric Ren

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.