From mboxrd@z Thu Jan 1 00:00:00 1970
From: Andrew Beekhof
Date: Fri, 14 May 2010 12:15:12 +0200
Subject: [Cluster-devel] [PATCH] dlm_controld.pcmk: Fix membership change judging issue
In-Reply-To: <4BED4A430200000A00011FE2@novprvlin0050.provo.novell.com>
References: <20100513084926.GA30727@linux-jjzhang> <20100513095117.GM20952@suse.de> <20100513183215.GP20952@suse.de> <4BED4A430200000A00011FE2@novprvlin0050.provo.novell.com>
Message-ID:
List-Id:
To: cluster-devel.redhat.com
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

On Fri, May 14, 2010 at 5:04 AM, Tim Serong wrote:
> On 5/14/2010 at 06:19 AM, Andrew Beekhof wrote:
>>
>> Does the behavior still occur with pacemaker 1.1.2?
>>
>
> Yes.
>
> For the record, the most minimal testcase I've managed for this
> so far is as follows (substitute "/etc/init.d/corosync start" or
> whatever for "rcopenais start" if you're not on something SUSE-based):
>
> 1) Configure corosync/openais on two nodes.
>    Do not start the cluster yet.
>
> 2) On one node:
>
>     # rm /var/lib/heartbeat/crm/*
>     # rcopenais start
>     # while ! crm_mon -1 | grep -qi online; do \
>         echo -n "." ; sleep 5 ; done
>
> 3) Now we have one node online, configure Pacemaker:
>
>     # cat <
>     primitive dlm ocf:pacemaker:controld
>     primitive clvm ocf:lvm2:clvmd
>     group g dlm clvm
>     clone c g meta interleave="true"
>     property stonith-enabled="false"
>     property no-quorum-policy="ignore"
>     commit
>     CONF
>
>    Watch "crm_mon -r" until that clone comes online.
>    Should only take a few seconds.
>
> 4) On the other node:
>
>     # rm /var/lib/heartbeat/crm/*
>     # rcopenais start
>
> The first node will now either wedge up spectacularly, and/or
> dlm_recoverd and clvmd will be stuck in D state on both nodes.

Presumably each thinks the other node isn't a member?
Perhaps something like this will help:

diff -r b59c27dc114a lib/ais/plugin.c
--- a/lib/ais/plugin.c	Wed May 12 10:51:56 2010 +0200
+++ b/lib/ais/plugin.c	Fri May 14 12:12:33 2010 +0200
@@ -498,9 +498,8 @@ static void *pcmk_wait_dispatch (void *a
                     ais_notice("Respawning failed child process: %s",
                                pcmk_children[lpc].name);
                     spawn_child(&(pcmk_children[lpc]));
-                } else {
-                    send_cluster_id();
                 }
+                send_cluster_id();
             }
         }
         sched_yield ();
@@ -661,6 +660,7 @@ int pcmk_startup(struct corosync_api_v1
             }
         }
     }
+    send_cluster_id();
     return 0;
 }