cluster-devel.redhat.com archive mirror
* [Cluster-devel] [RFC][PATCH] dlm: Reset fs_notified when check_fs_done
@ 2010-02-28 22:06 Jiaju Zhang
  2010-11-08 15:05 ` [Cluster-devel] [PATCH] " Jiaju Zhang
  0 siblings, 1 reply; 5+ messages in thread
From: Jiaju Zhang @ 2010-02-28 22:06 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

About the issue where dlm_controld and fs_controld sit spinning,
retrying and replying for the fs_notified check, I have a suspicion
that another scenario may also hit that logic:

If node->fs_notified has been set to 1 by a previous change, then when
a new change comes and needs to check node->fs_notified, check_fs_done
will succeed even though dlm_controld has not received the
notification from fs_controld this time, because the flag has not
been reset to 0.
For example, given the following membership changes n, n+1 and n+2,
consider what happens on node X:
Step 1: cg n: node Y leaves with reason CPG_REASON_NODEDOWN;
        eventually, in node X's ls->node_history, node Y's
        fs_notified = 1.
Step 2: cg n+1: node Y joins ...
Step 3: cg n+2: node Y leaves with reason CPG_REASON_NODEDOWN. One
        possible scenario: before fs_controld's notification arrives,
        dlm_controld has already learned from the CPG message that
        node Y is down and has done a lot of work, and it saw node
        Y's fs_notified = 1 (set in Step 1), so it wrongly passed the
        fs check and reset node Y's check_fs to 0.
Step 4: fs_controld's notification arrives; it sees node Y's check_fs
        = 0, assumes dlm_controld has not yet learned that node Y is
        down, and retries sending the notification. But in fact
        dlm_controld already knows this and has finished all the
        work, which results in the spinning ...
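
To make the race easier to see, here is a minimal sketch of the
handshake as I read it. The structure and helper names below are my
own simplified assumptions, not the actual cpg.c code:

/* Sketch of the dlm_controld / fs_controld handshake for one node.
 * Names here are simplified assumptions, not the real cpg.c code. */

struct node {
	int nodeid;
	int check_fs;    /* set when dlm_controld sees the node go down;
	                    cleared once the fs check passes */
	int fs_notified; /* set when fs_controld's notification arrives */
};

/* dlm_controld side: called for each down node in ls->node_history */
static int node_fs_done(struct node *node)
{
	if (!node->check_fs)
		return 1;               /* nothing pending for this node */

	if (node->fs_notified) {
		node->check_fs = 0;
		node->fs_notified = 0;  /* the fix: without this reset, the
		                           value left over from cg n lets
		                           cg n+2 pass before this change's
		                           notification has arrived */
		return 1;
	}
	return 0;                       /* keep waiting for fs_controld */
}

/* fs_controld keeps resending its nodedown notification until it sees
 * check_fs set on the dlm_controld side; if the stale fs_notified has
 * already cleared check_fs, the retry never succeeds and both daemons
 * spin. */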

I'm not sure if I read the code correctly :-) Below is the patch
which resets node->fs_notified. Review and comments are highly
appreciated!

Thanks,
Jiaju

Signed-off-by: Jiaju Zhang <jjzhang.linux@gmail.com>
---
 group/dlm_controld/cpg.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/group/dlm_controld/cpg.c b/group/dlm_controld/cpg.c
index d5245ce..b257595 100644
--- a/group/dlm_controld/cpg.c
+++ b/group/dlm_controld/cpg.c
@@ -636,6 +636,7 @@ static int check_fs_done(struct lockspace *ls)
 
 		if (node->fs_notified) {
 			node->check_fs = 0;
+			node->fs_notified = 0;
 		} else {
 			log_group(ls, "check_fs nodeid %d needs fs notify",
 				  node->nodeid);




* [Cluster-devel] [PATCH] dlm: Reset fs_notified when check_fs_done
  2010-02-28 22:06 [Cluster-devel] [RFC][PATCH] dlm: Reset fs_notified when check_fs_done Jiaju Zhang
@ 2010-11-08 15:05 ` Jiaju Zhang
  2010-11-08 22:06   ` David Teigland
  0 siblings, 1 reply; 5+ messages in thread
From: Jiaju Zhang @ 2010-11-08 15:05 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi list,

Sorry for reposting this mail if you have already received it; it
seems that my earlier attempt didn't go out, maybe because the
attachment is too large. I'll send the attachment separately if
anyone is interested in it :)

Hi David,

This patch was made about eight months ago; however, at that time I
had no way to reproduce the issue reliably and hadn't verified whether
the patch really works, so I set it aside for a long time ...

In the months since, I have heard of many users reporting this issue,
but unfortunately, when I asked them to test the patch, most of them
didn't try it at all :(

Luckily, things have changed now. One user hit this issue two months
ago and was also kind enough to test the patch. The result is that
the patch really works.

Attached is the log from before they applied the patch. This time the
log already includes the debugging messages added by commit
27b09badd40a2d1500500fa6945aeb532f75bd13, so we can see what really
happens at the user's site.
(The log is a bit large when uncompressed, because the spinning prints
many messages to the log.)

I have now rebased the patch against the current upstream code.
Thank you in advance for your review :)

Signed-off-by: Jiaju Zhang <jjzhang.linux@gmail.com>
---
 group/dlm_controld/cpg.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/group/dlm_controld/cpg.c b/group/dlm_controld/cpg.c
index 9b0d223..12cb202 100644
--- a/group/dlm_controld/cpg.c
+++ b/group/dlm_controld/cpg.c
@@ -651,6 +651,7 @@ static int check_fs_done(struct lockspace *ls)
 		if (node->fs_notified) {
 			log_group(ls, "check_fs nodeid %d clear", node->nodeid);
 			node->check_fs = 0;
+			node->fs_notified = 0;
 		} else {
 			log_group(ls, "check_fs nodeid %d needs fs notify",
 				  node->nodeid);


On Mon, Mar 01, 2010 at 06:06:32AM +0800, Jiaju Zhang wrote:
> Hi,
> 
> About the issue where dlm_controld and fs_controld sit spinning,
> retrying and replying for the fs_notified check, I have a suspicion
> that another scenario may also hit that logic:
> 
> If node->fs_notified has been set to 1 by a previous change, then when
> a new change comes and needs to check node->fs_notified, check_fs_done
> will succeed even though dlm_controld has not received the
> notification from fs_controld this time, because the flag has not
> been reset to 0.
> For example, given the following membership changes n, n+1 and n+2,
> consider what happens on node X:
> Step 1: cg n: node Y leaves with reason CPG_REASON_NODEDOWN;
>         eventually, in node X's ls->node_history, node Y's
>         fs_notified = 1.
> Step 2: cg n+1: node Y joins ...
> Step 3: cg n+2: node Y leaves with reason CPG_REASON_NODEDOWN. One
>         possible scenario: before fs_controld's notification arrives,
>         dlm_controld has already learned from the CPG message that
>         node Y is down and has done a lot of work, and it saw node
>         Y's fs_notified = 1 (set in Step 1), so it wrongly passed the
>         fs check and reset node Y's check_fs to 0.
> Step 4: fs_controld's notification arrives; it sees node Y's check_fs
>         = 0, assumes dlm_controld has not yet learned that node Y is
>         down, and retries sending the notification. But in fact
>         dlm_controld already knows this and has finished all the
>         work, which results in the spinning ...
> 
> I'm not sure if I read the code correctly :-) Below is the patch
> which resets node->fs_notified. Review and comments are highly
> appreciated!
> 
> Thanks,
> Jiaju
> 
> Signed-off-by: Jiaju Zhang <jjzhang.linux@gmail.com>
> ---
>  group/dlm_controld/cpg.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/group/dlm_controld/cpg.c b/group/dlm_controld/cpg.c
> index d5245ce..b257595 100644
> --- a/group/dlm_controld/cpg.c
> +++ b/group/dlm_controld/cpg.c
> @@ -636,6 +636,7 @@ static int check_fs_done(struct lockspace *ls)
>  
>  		if (node->fs_notified) {
>  			node->check_fs = 0;
> +			node->fs_notified = 0;
>  		} else {
>  			log_group(ls, "check_fs nodeid %d needs fs notify",
>  				  node->nodeid);




* [Cluster-devel] [PATCH] dlm: Reset fs_notified when check_fs_done
  2010-11-08 15:05 ` [Cluster-devel] [PATCH] " Jiaju Zhang
@ 2010-11-08 22:06   ` David Teigland
  2011-02-22  8:35     ` Jiaju Zhang
  0 siblings, 1 reply; 5+ messages in thread
From: David Teigland @ 2010-11-08 22:06 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, Nov 08, 2010 at 11:05:49PM +0800, Jiaju Zhang wrote:
> Luckily, things have changed now. One user hit this issue two months
> ago and was also kind enough to test the patch. The result is that
> the patch really works.
> 
> Attached is the log from before they applied the patch. This time the
> log already includes the debugging messages added by commit
> 27b09badd40a2d1500500fa6945aeb532f75bd13, so we can see what really
> happens at the user's site.
> (The log is a bit large when uncompressed, because the spinning prints
> many messages to the log.)
> 
> I have now rebased the patch against the current upstream code.
> Thank you in advance for your review :)

Thanks, the patch looks good, I'll push this out.

Dave




* [Cluster-devel] [PATCH] dlm: Reset fs_notified when check_fs_done
  2010-11-08 22:06   ` David Teigland
@ 2011-02-22  8:35     ` Jiaju Zhang
  2011-02-22 17:11       ` David Teigland
  0 siblings, 1 reply; 5+ messages in thread
From: Jiaju Zhang @ 2011-02-22  8:35 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Nov 9, 2010 at 6:06 AM, David Teigland <teigland@redhat.com> wrote:
> On Mon, Nov 08, 2010 at 11:05:49PM +0800, Jiaju Zhang wrote:
>> Luckily, things have changed now. One user hit this issue two months
>> ago and was also kind enough to test the patch. The result is that
>> the patch really works.
>>
>> Attached is the log from before they applied the patch. This time the
>> log already includes the debugging messages added by commit
>> 27b09badd40a2d1500500fa6945aeb532f75bd13, so we can see what really
>> happens at the user's site.
>> (The log is a bit large when uncompressed, because the spinning prints
>> many messages to the log.)
>>
>> I have now rebased the patch against the current upstream code.
>> Thank you in advance for your review :)
>
> Thanks, the patch looks good, I'll push this out.

Hi David, I haven't found this patch in
git://git.fedorahosted.org/dlm.git; was it missed?

Thanks ;)
Jiaju




* [Cluster-devel] [PATCH] dlm: Reset fs_notified when check_fs_done
  2011-02-22  8:35     ` Jiaju Zhang
@ 2011-02-22 17:11       ` David Teigland
  0 siblings, 0 replies; 5+ messages in thread
From: David Teigland @ 2011-02-22 17:11 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Feb 22, 2011 at 04:35:42PM +0800, Jiaju Zhang wrote:
> On Tue, Nov 9, 2010 at 6:06 AM, David Teigland <teigland@redhat.com> wrote:
> > On Mon, Nov 08, 2010 at 11:05:49PM +0800, Jiaju Zhang wrote:
> >> Luckily, things have changed now. One user hit this issue two months
> >> ago and was also kind enough to test the patch. The result is that
> >> the patch really works.
> >>
> >> Attached is the log from before they applied the patch. This time the
> >> log already includes the debugging messages added by commit
> >> 27b09badd40a2d1500500fa6945aeb532f75bd13, so we can see what really
> >> happens at the user's site.
> >> (The log is a bit large when uncompressed, because the spinning prints
> >> many messages to the log.)
> >>
> >> I have now rebased the patch against the current upstream code.
> >> Thank you in advance for your review :)
> >
> > Thanks, the patch looks good, I'll push this out.
> 
> Hi David, I haven't found this patch in
> git://git.fedorahosted.org/dlm.git; was it missed?

Sorry, it's in the cluster.git STABLE31 branch, but I forgot dlm.git;
I've just pushed it there too.
Dave



