* [Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. ...
@ 2006-06-20 18:09 teigland
2006-06-20 18:56 ` Robert Peterson
0 siblings, 1 reply; 6+ messages in thread
From: teigland @ 2006-06-20 18:09 UTC (permalink / raw)
To: cluster-devel.redhat.com
CVSROOT: /cvs/cluster
Module name: cluster
Changes by: teigland at sourceware.org 2006-06-20 18:09:12
Modified files:
group/daemon : cman.c cpg.c gd_internal.h main.c
Log message:
Moving the cluster infrastructure to userland introduced a new problem
that we didn't need to worry about before. All cluster state now exists
in userland processes which can go away and then come back like new, i.e.
unaware of the previous state.
Typically, when the userland cluster infrastructure on a node
"disappears", another node recognizes this as a node failure and recovers.
There's no problem there. The problem is when the cluster infrastructure
disappears on all the cluster nodes and then comes back. The
infrastructure that disappeared may have abandoned control of gfs/dlm
instances in the kernel. When the infrastructure comes back, it's like a
brand new cluster, it knows nothing about the residual, uncontrolled
instances of gfs/dlm. New nodes would use gfs/dlm in this new cluster
independently of the unknown gfs/dlm users from before and there'd be
immediate corruption [1].
Eventually, the infrastructure may be able to reconstruct the global state
of abandoned instances of gfs/dlm when it comes back and reassert control
of them, but that's not realistic any time soon. For now, the
infrastructure needs to recognize nodes with residual gfs/dlm state as
failed nodes that need recovery (starting with fencing). That recognition
and recovery now happens as part of the startup initialization, before new
instances of gfs/dlm are created [2].
[1] This is trivial to demonstrate:
- start up a cluster on nodes A,B,C
- mount gfs on nodes A,B
- run 'cman_tool leave force' on A,B,C
- start up the cluster again on A,B,C
- mount gfs on node C
- nodes A,B are now using gfs independently of node C
[2] The previous example now works like this:
- start up a cluster on nodes A,B,C
- mount gfs on nodes A,B
- run 'cman_tool leave force' on A,B,C
- start up the cluster again on A,B,C
i) when groupd starts on A,B, it recognizes the uncontrolled instance
of gfs, kills cman locally and fences the local node [3]
ii) when C runs fence_tool join, a new fence domain is started which
fences nodes with an unknown state, which are A and B
- mount gfs on node C
[3] This self-fencing does nothing for node C which still needs to fence
both A and B itself. If A fences itself before C fences it, A will be
fenced twice. This self-fencing step is optional, but it can be
convenient when 1) all the nodes restarting the infrastructure find
residual gfs/dlm instances and 2) reboot fencing is used. The anticipated
situation is one where everyone has residual state so no one can start up
to fence anyone else; all are stuck. But, they all fence themselves,
reboot and resolve everything.
There's a different approach we could take that would be more convenient
when not all cluster nodes are likely to be mounting gfs or SAN fencing is
used. In this case, a node that finds residual gfs/dlm instances would
remain a cman member and not fence itself. This would contribute quorum
to help another node without residual state start up and fence it.
The solution to this confusing situation is simple:
- groupd now checks for residual gfs/dlm kernel state when it starts up
and if it finds any it kills cman and exec's fence_node <myname>.
- fenced can't bypass fencing of a node unless the node is both a cman
member and has fully started groupd (a node may need fencing if it's
joined the cluster but groupd isn't starting).
- the same consideration in fence_manual as fenced
Patches:
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/group/daemon/cman.c.diff?cvsroot=cluster&r1=1.22&r2=1.23
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/group/daemon/cpg.c.diff?cvsroot=cluster&r1=1.22&r2=1.23
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/group/daemon/gd_internal.h.diff?cvsroot=cluster&r1=1.30&r2=1.31
http://sourceware.org/cgi-bin/cvsweb.cgi/cluster/group/daemon/main.c.diff?cvsroot=cluster&r1=1.33&r2=1.34
--- cluster/group/daemon/cman.c 2006/05/02 16:25:11 1.22
+++ cluster/group/daemon/cman.c 2006/06/20 18:09:12 1.23
@@ -12,8 +12,24 @@
static int cman_node_count;
static int cman_cb;
static int cman_reason;
+static char name_buf[CMAN_MAX_NODENAME_LEN+1];
+int kill_cman(int nodeid)
+{
+ cman_handle_t ach;
+ int rv;
+
+ ach = cman_admin_init(NULL);
+ if (!ach) {
+ log_print("cman_admin_init error %d %d", (int) ch, errno);
+ return -ENOTCONN;
+ }
+ rv = cman_kill_node(ach, nodeid);
+ cman_finish(ach);
+ return rv;
+}
+
static int is_member(cman_node_t *node_list, int count, int nodeid)
{
int i;
@@ -181,8 +197,12 @@
cman_quorate = cman_is_quorate(ch);
+ memset(name_buf, 0, sizeof(name_buf));
+ strncpy(name_buf, node.cn_name, CMAN_MAX_NODENAME_LEN);
+ our_name = name_buf;
our_nodeid = node.cn_nodeid;
- log_debug("cman: our nodeid %d quorum %d", our_nodeid, cman_quorate);
+ log_debug("cman: our nodeid %d name %s quorum %d",
+ our_nodeid, our_name, cman_quorate);
fd = cman_get_fd(ch);
client_add(fd, process_cman, close_cman);
--- cluster/group/daemon/cpg.c 2006/05/25 13:41:27 1.22
+++ cluster/group/daemon/cpg.c 2006/06/20 18:09:12 1.23
@@ -10,6 +10,8 @@
static int groupd_ci;
static int got_confchg;
+static struct cpg_address groupd_cpg_member[MAX_GROUP_MEMBERS];
+static int groupd_cpg_member_count;
static struct cpg_address saved_member[MAX_GROUP_MEMBERS];
static struct cpg_address saved_joined[MAX_GROUP_MEMBERS];
static struct cpg_address saved_left[MAX_GROUP_MEMBERS];
@@ -149,6 +151,9 @@
log_print("process_groupd_confchg members %d -%d +%d",
saved_member_count, saved_left_count, saved_joined_count);
+ memcpy(&groupd_cpg_member, &saved_member, sizeof(saved_member));
+ groupd_cpg_member_count = saved_member_count;
+
for (i = 0; i < saved_member_count; i++) {
if (saved_member[i].nodeId == our_nodeid &&
saved_member[i].pid == (uint32_t) getpid()) {
@@ -162,12 +167,25 @@
log_print("we are not in groupd confchg: %u %u",
our_nodeid, (uint32_t) getpid());
+ /* FIXME: we probably want to do a cman_kill_node() on a node
+ where groupd exits but cman is still running. */
+
for (i = 0; i < saved_left_count; i++) {
if (saved_left[i].reason != CPG_REASON_LEAVE)
add_recovery_set(saved_left[i].nodeId);
}
}
+void copy_groupd_data(group_data_t *data)
+{
+ int i;
+
+ data->level = -1;
+ data->member_count = groupd_cpg_member_count;
+ for (i = 0; i < groupd_cpg_member_count; i++)
+ data->members[i] = groupd_cpg_member[i].nodeId;
+}
+
/* FIXME: also match name */
group_t *find_group_by_handle(cpg_handle_t h)
--- cluster/group/daemon/gd_internal.h 2006/05/02 16:25:11 1.30
+++ cluster/group/daemon/gd_internal.h 2006/06/20 18:09:12 1.31
@@ -23,6 +23,7 @@
#include <string.h>
#include <strings.h>
#include <ctype.h>
+#include <dirent.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <sys/types.h>
@@ -93,15 +94,12 @@
#define FALSE (0)
#endif
-#define test_bit(nr, addr) ((*addr) & (1 << nr))
-#define set_bit(nr, addr) ((*addr) |= (1 << nr))
-#define clear_bit(nr, addr) ((*addr) &= ~(1 << nr))
-
extern struct list_head gd_groups;
extern struct list_head gd_levels[MAX_LEVELS];
extern uint32_t gd_event_nr;
extern int cman_quorate;
extern int our_nodeid;
+extern char *our_name;
struct group;
struct app;
@@ -256,6 +254,7 @@
/* cman.c */
int setup_cman(void);
+int kill_cman(int nodeid);
/* cpg.c */
int setup_cpg(void);
@@ -263,6 +262,7 @@
int do_cpg_leave(group_t *g);
int send_message(group_t *g, void *buf, int len);
int send_message_groupd(group_t *g, void *buf, int len);
+void copy_groupd_data(group_data_t *data);
/* joinleave.c */
void remove_group(group_t *g);
--- cluster/group/daemon/main.c 2006/06/14 21:38:14 1.33
+++ cluster/group/daemon/main.c 2006/06/20 18:09:12 1.34
@@ -17,9 +17,10 @@
extern struct list_head recovery_sets;
-struct list_head gd_groups;
-struct list_head gd_levels[MAX_LEVELS];
+struct list_head gd_groups;
+struct list_head gd_levels[MAX_LEVELS];
uint32_t gd_event_nr;
+char *our_name;
int our_nodeid;
int cman_quorate;
@@ -37,6 +38,92 @@
void *deadfn;
};
+/* Look for any instances of gfs or dlm in the kernel, if we find any, it
+ means they're uncontrolled by us (via gfs_controld/dlm_controld/groupd).
+ We need to be rebooted to clear out this uncontrolled kernel state. Most
+ importantly, other nodes must not be allowed to form groups that might
+ correspond to these same instances of gfs/dlm. If they did, then we'd
+ be accessing gfs/dlm independently from them and corrupt stuff. */
+
+/* If we detect any local gfs/dlm state, fence ourself via fence_node.
+ This may not be strictly necessary since other nodes should fence us
+ when they form a new fence domain. If they're not forming a new domain,
+ that means there is a domain member that has a record of previous cluster
+ state when we were a member; it will have recognized that we left the
+ cluster and need fencing. The case where we need groupd to fence ourself
+ is when all cluster nodes are starting up and have residual gfs/dlm kernel
+ state. None would be able to start groupd/fenced and fence anyone. */
+
+/* - we've rejoined the cman cluster with residual gfs/dlm state
+ - there is a previous cman/domain member that saw us fail
+ - when we failed it lost quorum
+ - our current rejoin has given the cluster quorum
+ - the old member that saw we needed fencing can now begin fencing
+ - the old member sees we're now a cman member, might bypass fencing us...
+ - only bypasses fencing us if we're also in groupd cpg
+ - we won't be in groupd cpg until after we've verified there's no
+ local residual gfs/dlm state */
+
+static int kernel_instance_count(char *sysfs_dir)
+{
+ char path[PATH_MAX];
+ DIR *d;
+ struct dirent *de;
+ int rv = 0;
+
+ memset(path, 0, PATH_MAX);
+ snprintf(path, PATH_MAX, "%s", sysfs_dir);
+
+ d = opendir(path);
+ if (!d)
+ return 0;
+
+ while ((de = readdir(d))) {
+ if (de->d_name[0] == '.')
+ continue;
+
+ log_print("found uncontrolled kernel object %s in %s",
+ de->d_name, sysfs_dir);
+ rv++;
+ }
+ closedir(d);
+ return rv;
+}
+
+int check_uncontrolled_groups(void)
+{
+ pid_t pid;
+ char *argv[4];
+ int status, rv = 0;
+
+ rv += kernel_instance_count("/sys/kernel/dlm");
+ rv += kernel_instance_count("/sys/fs/gfs");
+ rv += kernel_instance_count("/sys/fs/gfs2");
+
+ if (!rv)
+ return 0;
+
+ /* FIXME: make sure this is going into syslog */
+ log_print("local node must be reset to clear %d uncontrolled "
+ "instances of gfs and/or dlm", rv);
+
+ kill_cman(our_nodeid);
+
+ argv[0] = "fence_node";
+ argv[1] = "-O";
+ argv[2] = our_name;
+ argv[3] = NULL;
+
+ pid = fork();
+ if (pid)
+ waitpid(pid, &status, 0);
+ else {
+ execvp(argv[0], argv);
+ log_print("failed to exec fence_node");
+ }
+
+ return -1;
+}
static void app_action(app_t *a, char *buf)
{
@@ -380,6 +467,12 @@
memset(&data, 0, sizeof(data));
+ /* special case to get members of groupd cpg */
+ if (atoi(argv[1]) == -1 && !strncmp(argv[2], "groupd", 6)) {
+ copy_groupd_data(&data);
+ goto out;
+ }
+
g = find_group_level(argv[2], atoi(argv[1]));
if (!g)
goto out;
@@ -580,6 +673,10 @@
if (rv < 0)
return rv;
+ rv = check_uncontrolled_groups();
+ if (rv < 0)
+ return rv;
+
rv = setup_cpg();
if (rv < 0)
return rv;
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. ...
2006-06-20 18:09 [Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. teigland
@ 2006-06-20 18:56 ` Robert Peterson
2006-06-20 19:19 ` David Teigland
0 siblings, 1 reply; 6+ messages in thread
From: Robert Peterson @ 2006-06-20 18:56 UTC (permalink / raw)
To: cluster-devel.redhat.com
teigland at sourceware.org wrote:
> Moving the cluster infrastructure to userland introduced a new problem
> that we didn't need to worry about before. All cluster state now exists
> in userland processes which can go away and then come back like new, i.e.
> unaware of the previous state.
Hi Dave,
You know this new development cman stuff and I really don't, but I was
just thinking:
If we used a shared memory segment, we could hold state information
there and
then cman would remember the cluster state after process termination and
restart,
possibly making this whole thing unnecessary. Just a thought.
Of course, one could also argue that if the process terminated, can we
really trust the state
information it had at the time?
Bob Peterson
Red Hat Cluster Suite
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. ...
2006-06-20 18:56 ` Robert Peterson
@ 2006-06-20 19:19 ` David Teigland
2006-06-20 19:43 ` Robert Peterson
0 siblings, 1 reply; 6+ messages in thread
From: David Teigland @ 2006-06-20 19:19 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Tue, Jun 20, 2006 at 01:56:27PM -0500, Robert Peterson wrote:
> teigland at sourceware.org wrote:
> > Moving the cluster infrastructure to userland introduced a new
> > problem that we didn't need to worry about before. All cluster
> > state now exists in userland processes which can go away and then
> > come back like new, i.e. unaware of the previous state.
> Hi Dave,
>
> You know this new development cman stuff and I really don't, but I was
> just thinking:
>
> If we used a shared memory segment, we could hold state information
> there and then cman would remember the cluster state after process
> termination and restart, possibly making this whole thing unnecessary.
> Just a thought. Of course, one could also argue that if the process
> terminated, can we really trust the state information it had at the
> time?
Might be a good idea, I don't really know. I'm not even sure we'd need to
save much or any additional state that couldn't be pulled from the gfs/dlm
instances themselves. It seems to me the challenge would be writing the
daemons so they could put all the pieces and interconnections back
together again.
If this ends up being a big enough problem to get more attention, I think
the first practical improvement we could make is something like
blocking/clearing i/o from the residual fs's (like we do in withdraw) and
adding the ability to fully purge instances of gfs/dlm from the kernel
without rebooting the node. Then the machines could all start from
scratch without rebooting or fencing.
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. ...
2006-06-20 19:19 ` David Teigland
@ 2006-06-20 19:43 ` Robert Peterson
2006-06-20 20:06 ` David Teigland
2006-06-20 20:13 ` Steven Dake
0 siblings, 2 replies; 6+ messages in thread
From: Robert Peterson @ 2006-06-20 19:43 UTC (permalink / raw)
To: cluster-devel.redhat.com
David Teigland wrote:
> Might be a good idea, I don't really know. I'm not even sure we'd need to
> save much or any additional state that couldn't be pulled from the gfs/dlm
> instances themselves. It seems to me the challenge would be writing the
> daemons so they could put all the pieces and interconnections back
> together again.
>
> If this ends up being a big enough problem to get more attention, I think
> the first practical improvement we could make is something like
> blocking/clearing i/o from the residual fs's (like we do in withdraw) and
> adding the ability to fully purge instances of gfs/dlm from the kernel
> without rebooting the node. Then the machines could all start from
> scratch without rebooting or fencing
Here's another idea that came to me:
For critical cluster processes like cman and fenced, maybe we could use
init's ability
to restart processes, i.e. the "respawn" option in /etc/inittab. Maybe
we can use
"respawn" or something similar to ensure that if a critical process like
fenced dies,
it gets restarted automatically and immediately. Of course, that might
cause problems
for shutdown, etc., and it would probably make it harder to test certain
things...
Bob Peterson
Red Hat Cluster Suite
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. ...
2006-06-20 19:43 ` Robert Peterson
@ 2006-06-20 20:06 ` David Teigland
2006-06-20 20:13 ` Steven Dake
1 sibling, 0 replies; 6+ messages in thread
From: David Teigland @ 2006-06-20 20:06 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Tue, Jun 20, 2006 at 02:43:50PM -0500, Robert Peterson wrote:
> David Teigland wrote:
> >Might be a good idea, I don't really know. I'm not even sure we'd need to
> >save much or any additional state that couldn't be pulled from the gfs/dlm
> >instances themselves. It seems to me the challenge would be writing the
> >daemons so they could put all the pieces and interconnections back
> >together again.
> >
> >If this ends up being a big enough problem to get more attention, I think
> >the first practical improvement we could make is something like
> >blocking/clearing i/o from the residual fs's (like we do in withdraw) and
> >adding the ability to fully purge instances of gfs/dlm from the kernel
> >without rebooting the node. Then the machines could all start from
> >scratch without rebooting or fencing
> Here's another idea that came to me:
>
> For critical cluster processes like cman and fenced, maybe we could use
> init's ability to restart processes, i.e. the "respawn" option in
> /etc/inittab. Maybe we can use "respawn" or something similar to ensure
> that if a critical process like fenced dies, it gets restarted
> automatically and immediately. Of course, that might cause problems for
> shutdown, etc., and it would probably make it harder to test certain
> things...
Assuming the daemon is managing something, then the failure amounts to a
full node failure and the node needs to be recovered by the other nodes.
Respawning the daemons in that case probably just gets in the way of the
other nodes doing recovery. Respawning would make sense if the daemons
failed when they weren't managing anything, but that's pretty unlikely.
Dave
^ permalink raw reply [flat|nested] 6+ messages in thread
* [Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. ...
2006-06-20 19:43 ` Robert Peterson
2006-06-20 20:06 ` David Teigland
@ 2006-06-20 20:13 ` Steven Dake
1 sibling, 0 replies; 6+ messages in thread
From: Steven Dake @ 2006-06-20 20:13 UTC (permalink / raw)
To: cluster-devel.redhat.com
On Tue, 2006-06-20 at 14:43 -0500, Robert Peterson wrote:
> David Teigland wrote:
> > Might be a good idea, I don't really know. I'm not even sure we'd need to
> > save much or any additional state that couldn't be pulled from the gfs/dlm
> > instances themselves. It seems to me the challenge would be writing the
> > daemons so they could put all the pieces and interconnections back
> > together again.
> >
> > If this ends up being a big enough problem to get more attention, I think
> > the first practical improvement we could make is something like
> > blocking/clearing i/o from the residual fs's (like we do in withdraw) and
> > adding the ability to fully purge instances of gfs/dlm from the kernel
> > without rebooting the node. Then the machines could all start from
> > scratch without rebooting or fencing
> Here's another idea that came to me:
>
> For critical cluster processes like cman and fenced, maybe we could use
> init's ability
> to restart processes, i.e. the "respawn" option in /etc/inittab. Maybe
> we can use
> "respawn" or something similar to ensure that if a critical process like
> fenced dies,
> it gets restarted automatically and immediately. Of course, that might
> cause problems
> for shutdown, etc., and it would probably make it harder to test certain
> things...
>
These daemons should have zero segv bugs and should never crash. If
they do crash, I believe we want to understand why by having community
complaints so they can be fixed rather then an init process restart
workaround.
The cman runs as a component within aisexec, and it is our goal with
that project to ensure there are zero crash bugs in the code. If a node
crashes, we will surely hear complaints from the community about "got
crash". Of course this is difficult with 55k lines of C code, but
achievable.
The one problem we tend to have is OOM, since handling OOM gracefully in
a distributed system is exceedingly complex.
Regards
-steve
> Bob Peterson
> Red Hat Cluster Suite
>
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2006-06-20 20:13 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-06-20 18:09 [Cluster-devel] cluster/group/daemon cman.c cpg.c gd_internal. teigland
2006-06-20 18:56 ` Robert Peterson
2006-06-20 19:19 ` David Teigland
2006-06-20 19:43 ` Robert Peterson
2006-06-20 20:06 ` David Teigland
2006-06-20 20:13 ` Steven Dake
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.