From mboxrd@z Thu Jan  1 00:00:00 1970
From: teigland@sourceware.org <teigland@sourceware.org>
Date: 8 Aug 2006 21:19:18 -0000
Subject: [Cluster-devel] cluster/group/gfs_controld lock_dlm.h plock.c  ...
Message-ID: <20060808211918.17762.qmail@sourceware.org>
List-Id: <cluster-devel.redhat.com>
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

CVSROOT:	/cvs/cluster
Module name:	cluster
Changes by:	teigland at sourceware.org	2006-08-08 21:19:18

Modified files:
	group/gfs_controld: lock_dlm.h plock.c recover.c 

Log message:
	The idea of having the last node that wrote the checkpoint try to reuse
	it, even when it was no longer the low nodeid, doesn't work: the new
	mounter tries to read the ckpt as soon as it gets the journals message
	from the low nodeid, which can happen before the other node has written
	the ckpt.  Now the low nodeid is always the one to create a ckpt for a
	new mounter.  This means a node holding the last ckpt needs to give it
	up when it sees a new low nodeid join the group: it closes the ckpt, and
	the new low node unlinks it after reading it.
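
	As a rough illustration of the rule described above, the decisions on
	each side can be modeled as below.  This is only a sketch: the function
	and enum names are hypothetical and do not appear in gfs_controld, and
	the real code operates on struct mountgroup and the saCkpt API rather
	than on bare integers.

```c
#include <stdbool.h>

/* Hypothetical model of the ckpt ownership rule; not gfs_controld code. */

enum ckpt_action {
	CKPT_KEEP_OPEN,	/* we stay in charge of ckpts for later mounters */
	CKPT_CLOSE	/* the new node takes over, so drop our handle */
};

/* Return the lowest nodeid among all members, or -1 for an empty list. */
int find_low_nodeid(const int *nodeids, int count)
{
	int low = -1;

	for (int i = 0; i < count; i++) {
		if (low == -1 || nodeids[i] < low)
			low = nodeids[i];
	}
	return low;
}

/* Storing side: what to do with our ckpt handle once the ckpt has been
   written (or found up to date) for the mounter new_nodeid. */
enum ckpt_action after_store(int our_nodeid, int new_nodeid)
{
	return new_nodeid < our_nodeid ? CKPT_CLOSE : CKPT_KEEP_OPEN;
}

/* Reading side: true if the mounter that just read the ckpt has become the
   low nodeid and should unlink the old ckpt instead of only closing it. */
bool unlink_after_retrieve(int our_nodeid, int low_nodeid)
{
	return low_nodeid == our_nodeid;
}
```

	For example, if node 3 stores a ckpt for a new node 2, node 3 closes
	its handle and node 2 unlinks the ckpt after reading it, so node 2 can
	create a fresh ckpt for the next mounter.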

Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/group/gfs_controld/lock_dlm.h.diff?cvsroot=cluster&r1=1.11&r2=1.12
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/group/gfs_controld/plock.c.diff?cvsroot=cluster&r1=1.10&r2=1.11
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/group/gfs_controld/recover.c.diff?cvsroot=cluster&r1=1.9&r2=1.10

--- cluster/group/gfs_controld/lock_dlm.h	2006/08/07 16:57:50	1.11
+++ cluster/group/gfs_controld/lock_dlm.h	2006/08/08 21:19:17	1.12
@@ -140,6 +140,7 @@
 	int			emulate_first_mounter;
 	int			wait_first_done;
 	int			low_finished_nodeid;
+	int			low_nodeid;
 	int			save_plocks;
 
 	uint64_t		cp_handle;
@@ -259,7 +260,7 @@
 
 int send_group_message(struct mountgroup *mg, int len, char *buf);
 
-void store_plocks(struct mountgroup *mg);
+void store_plocks(struct mountgroup *mg, int nodeid);
 void retrieve_plocks(struct mountgroup *mg);
 int dump_plocks(char *name, int fd);
 void process_saved_plocks(struct mountgroup *mg);
--- cluster/group/gfs_controld/plock.c	2006/08/08 19:37:33	1.10
+++ cluster/group/gfs_controld/plock.c	2006/08/08 21:19:17	1.11
@@ -1094,20 +1094,18 @@
 	return ret;
 }
 
-/* Copy all plock state into a checkpoint so new node can retrieve it.
+/* Copy all plock state into a checkpoint so new node can retrieve it.  The
+   node creating the ckpt for the mounter needs to be the same node that's
+   sending the mounter its journals message (i.e. the low nodeid).  The new
+   mounter knows the ckpt is ready to read only after it gets its journals
+   message.
+ 
+   If the mounter is becoming the new low nodeid in the group, the node doing
+   the store closes the ckpt and the new node unlinks the ckpt after reading
+   it.  The ckpt should then disappear and the new node can create a new ckpt
+   for the next mounter. */
 
-   The low node in the group and the previous node to create the ckpt (with
-   non-zero cp_handle) may be different if a new node joins with a lower nodeid
-   than the previous low node that created the ckpt.  In this case, the prev
-   node has the old ckpt open and will reuse it if no plock state has changed,
-   or will unlink it and create a new one.  The low node will also attempt to
-   create a new ckpt.  That open-create will either fail due to the prev node
-   reusing the old ckpt, or it will race with the open-create on the prev node
-   after the prev node unlinks the old ckpt.  Either way, when there are two
-   different nodes in the group calling store_plocks(), one of them will fail
-   at the Open(CREATE) step with ERR_EXIST due to the other. */
-
-void store_plocks(struct mountgroup *mg)
+void store_plocks(struct mountgroup *mg, int nodeid)
 {
 	SaCkptCheckpointCreationAttributesT attr;
 	SaCkptCheckpointHandleT h;
@@ -1128,8 +1126,8 @@
 
 	/* no change to plock state since we created the last checkpoint */
 	if (mg->last_checkpoint_time > mg->last_plock_time) {
-		log_group(mg, "store_plocks: ckpt uptodate");
-		return;
+		log_group(mg, "store_plocks: saved ckpt uptodate");
+		goto out;
 	}
 	mg->last_checkpoint_time = time(NULL);
 
@@ -1236,6 +1234,17 @@
 			break;
 		}
 	}
+
+ out:
+	/* If the new nodeid is becoming the low nodeid it will now be in
+	   charge of creating ckpt's for mounters instead of us. */
+
+	if (nodeid < our_nodeid) {
+		log_group(mg, "store_plocks: closing ckpt for new low node %d",
+			  nodeid);
+		saCkptCheckpointClose(h);
+		mg->cp_handle = 0;
+	}
 }
 
 /* called by a node that's just been added to the group to get existing plock
@@ -1336,7 +1345,11 @@
  out_it:
 	saCkptSectionIterationFinalize(itr);
  out:
-	saCkptCheckpointClose(h);
+	if (mg->low_nodeid == our_nodeid) {
+		log_group(mg, "retrieve_plocks: unlink ckpt from old low node");
+		unlink_checkpoint(mg, &name);
+	} else
+		saCkptCheckpointClose(h);
 }
 
 void purge_plocks(struct mountgroup *mg, int nodeid, int unmount)
--- cluster/group/gfs_controld/recover.c	2006/08/07 16:57:50	1.9
+++ cluster/group/gfs_controld/recover.c	2006/08/08 21:19:17	1.10
@@ -589,8 +589,8 @@
 	log_group(mg, "assign_journal: new member %d got jid %d",
 		  new->nodeid, new->jid);
 
-	if (mg->low_finished_nodeid == our_nodeid || mg->cp_handle)
-		store_plocks(mg);
+	if (mg->low_finished_nodeid == our_nodeid)
+		store_plocks(mg, new->nodeid);
 
 	/* if we're the first mounter and haven't gotten others_may_mount
 	   yet, then don't send journals until kernel_recovery_done_first
@@ -982,6 +982,8 @@
 	}
 
 	list_for_each_entry(memb, &mg->members, list) {
+		if (mg->low_nodeid == -1 || memb->nodeid < mg->low_nodeid)
+			mg->low_nodeid = memb->nodeid;
 		if (!memb->finished)
 			continue;
 		if (low == -1 || memb->nodeid < low)
@@ -1008,6 +1010,7 @@
 	INIT_LIST_HEAD(&mg->resources);
 	INIT_LIST_HEAD(&mg->saved_messages);
 	mg->init = 1;
+	mg->low_nodeid = -1;
 
 	strncpy(mg->name, name, MAXNAME);
 
@@ -1902,7 +1905,7 @@
 			continue;
 
 		if (!stored_plocks) {
-			store_plocks(mg);
+			store_plocks(mg, memb->nodeid);
 			stored_plocks = 1;
 		}