[linux-lvm] cluster request failed: Host is down


* [linux-lvm] cluster request failed: Host is down
@ 2012-11-16 12:48 Jacek Konieczny
  2012-11-16 15:15 ` Zdenek Kabelac
  0 siblings, 1 reply; 4+ messages in thread
From: Jacek Konieczny @ 2012-11-16 12:48 UTC (permalink / raw)
  To: linux-lvm

Hi,

I have seen this problem already reported here, but with no useful
answer:

http://osdir.com/ml/linux-lvm/2011-01/msg00038.html

That post suggests it is a very old bug, caused by a change which can
easily be reverted... though that is a bit hard to believe. Such an easy
bug would already have been fixed, wouldn't it?

For me the problem is as follows:

I have a two-node cluster with a volume group on a DRBD device in
a Master-Master setup. When I shut one node down cleanly, I can no
longer manage the volumes properly.

LVs which are active on the surviving host remain active, but I am not
able to deactivate them or activate more volumes:

>  [root@dev1n1 ~]# lvs dev1_vg/4bwM2m7oVL
>    cluster request failed: Host is down
>    LV         VG        Attr      LSize Pool Origin Data%  Move Log Copy%  Convert
>    4bwM2m7oVL dev1_vg -wi------ 1.00g                                           
>  [root@dev1n1 ~]# lvchange -aey dev1_vg/XaMS0LyAq8 ; echo $?
>    cluster request failed: Host is down
>    cluster request failed: Host is down
>    cluster request failed: Host is down
>    cluster request failed: Host is down
>    cluster request failed: Host is down
>  5
>  [root@dev1n1 ~]# lvs dev1_vg/4bwM2m7oVL
>    cluster request failed: Host is down
>    LV         VG        Attr      LSize Pool Origin Data%  Move Log Copy%  Convert
>    4bwM2m7oVL dev1_vg -wi------ 1.00g                                           
>  [root@dev1n1 ~]# lvchange -aen dev1_vg/XaMS0LyAq8 ; echo $?
>    cluster request failed: Host is down
>    cluster request failed: Host is down
>  5
>  [root@dev1n1 ~]# lvs dev1_vg/XaMS0LyAq8
>    cluster request failed: Host is down
>    LV         VG        Attr      LSize Pool Origin Data%  Move Log Copy%  Convert
>    XaMS0LyAq8 dev1_vg -wi-a---- 1.00g                                           
>  
>  [root@dev1n1 ~]# dlm_tool ls
>  dlm lockspaces
>  name          clvmd
>  id            0x4104eefa
>  flags         0x00000000 
>  change        member 1 joined 0 remove 1 failed 0 seq 2,2
>  members       1 
>  
>  [root@dev1n1 ~]# dlm_tool status
>  cluster nodeid 1 quorate 1 ring seq 30648 30648
>  daemon now 1115 fence_pid 0 
>  node 1 M add 15 rem 0 fail 0 fence 0 at 0 0
>  node 2 X add 15 rem 184 fail 0 fence 0 at 0 0

The node has cleanly left the lockspace and the cluster. DLM is aware
of that, so clvmd should be too, right? And if all the remaining cluster
nodes (only one here) are clean, all LVM operations on the clustered VG
should work, right? Or am I missing something?

The behaviour is exactly the same when I power off a running node. It
is fenced by dlm_tool, as expected, and then the VG stays non-functional,
as above, until the dead node is up again and rejoins the cluster.

Is this the expected behaviour or is it a bug?

Greets,
        Jacek


* Re: [linux-lvm] cluster request failed: Host is down
  2012-11-16 12:48 [linux-lvm] cluster request failed: Host is down Jacek Konieczny
@ 2012-11-16 15:15 ` Zdenek Kabelac
  2012-11-16 15:33   ` Jacek Konieczny
  0 siblings, 1 reply; 4+ messages in thread
From: Zdenek Kabelac @ 2012-11-16 15:15 UTC (permalink / raw)
  To: LVM general discussion and development; +Cc: Jacek Konieczny

On 16.11.2012 13:48, Jacek Konieczny wrote:
> Hi,
>
> I have seen this problem already reported here, but with no useful
> answer:
>
> http://osdir.com/ml/linux-lvm/2011-01/msg00038.html
>
> This post suggest it is some very old bug, a change which can be easily
> reverted... though, it is a bit hard to believe. Such an easy bug, would
> be already fixed, wouldn't it?
>
> For me the problem is as follows:
>
> I have a two node cluster with a volume group running on a DRBD in
> Master-Master setup. When I shut one node down, cleanly, I am not able
> to properly manage the volumes.
>
> LVs which are active on the surviving host remain active, but I am not
> able to deactivate them or activate more volumes:
>
>>   [root@dev1n1 ~]# lvs dev1_vg/4bwM2m7oVL
>>     cluster request failed: Host is down
>>     LV         VG        Attr      LSize Pool Origin Data%  Move Log Copy%  Convert
>>     4bwM2m7oVL dev1_vg -wi------ 1.00g
>>   [root@dev1n1 ~]# lvchange -aey dev1_vg/XaMS0LyAq8 ; echo $?
>>     cluster request failed: Host is down
>>     cluster request failed: Host is down
>>     cluster request failed: Host is down
>>     cluster request failed: Host is down
>>     cluster request failed: Host is down
>>   5
>>   [root@dev1n1 ~]# lvs dev1_vg/4bwM2m7oVL
>>     cluster request failed: Host is down
>>     LV         VG        Attr      LSize Pool Origin Data%  Move Log Copy%  Convert
>>     4bwM2m7oVL dev1_vg -wi------ 1.00g
>>   [root@dev1n1 ~]# lvchange -aen dev1_vg/XaMS0LyAq8 ; echo $?
>>     cluster request failed: Host is down
>>     cluster request failed: Host is down
>>   5
>>   [root@dev1n1 ~]# lvs dev1_vg/XaMS0LyAq8
>>     cluster request failed: Host is down
>>     LV         VG        Attr      LSize Pool Origin Data%  Move Log Copy%  Convert
>>     XaMS0LyAq8 dev1_vg -wi-a---- 1.00g
>>
>>   [root@dev1n1 ~]# dlm_tool ls
>>   dlm lockspaces
>>   name          clvmd
>>   id            0x4104eefa
>>   flags         0x00000000
>>   change        member 1 joined 0 remove 1 failed 0 seq 2,2
>>   members       1
>>
>>   [root@dev1n1 ~]# dlm_tool status
>>   cluster nodeid 1 quorate 1 ring seq 30648 30648
>>   daemon now 1115 fence_pid 0
>>   node 1 M add 15 rem 0 fail 0 fence 0 at 0 0
>>   node 2 X add 15 rem 184 fail 0 fence 0 at 0 0
>
> The node has cleanly left the lockspace and the cluster. DLM is aware
> about that, so should be clvmd, right? And if all other cluster nodes
> (only one here) are clean, all LVM operations on the clustered VG should
> work, right? Or am I missing something?
>
> The behaviour is exactly the same when I power off a running node. It
> is fenced by dlm_tool, as expected and then the VG is non-functional as
> above, until the dead node is up again and joins the cluster.
>
> Is this the expected behaviour or is it a bug?


A cluster with just one node is not a cluster (no quorum).

So you may either drop locking with --config 'global {locking_type = 0}'
or fix the dropped node. Since you are the admin of the system, you
know what to do. The system itself unfortunately cannot determine
whether node A or node B is the master (both could be alive, with only
the network connection between them failing).
So it is the admin's responsibility to take the proper action.

Zdenek


* Re: [linux-lvm] cluster request failed: Host is down
  2012-11-16 15:15 ` Zdenek Kabelac
@ 2012-11-16 15:33   ` Jacek Konieczny
  2012-11-16 18:42     ` [PATCH] clvmd/corosync: Cluster nodes down ok if quorate Jacek Konieczny
  0 siblings, 1 reply; 4+ messages in thread
From: Jacek Konieczny @ 2012-11-16 15:33 UTC (permalink / raw)
  To: Zdenek Kabelac; +Cc: LVM general discussion and development

On Fri, Nov 16, 2012 at 04:15:27PM +0100, Zdenek Kabelac wrote:
> Cluster with just 1 node is not a cluster (no quorum)

It is a cluster, in a degraded state. That is an expected situation for
a two-node cluster.

I use the corosync stack, which provides the quorum information (there
are special settings that keep quorum in a two-node cluster when one node
fails), and I have fencing configured to enforce the quorum. Every other
piece of the cluster trusts the quorum information from corosync.
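
To make the quorum arithmetic concrete, here is a minimal C sketch. It is
not corosync code: the function names are made up for illustration, and it
assumes the usual votequorum rule of a simple majority of the expected
votes, with the two-node setting lowering the requirement to a single vote.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Votes required for quorum (hypothetical helper, not a corosync API).
 * Assumption: simple majority of expected votes, except that a two-node
 * cluster with the two-node setting enabled needs only one vote.
 */
unsigned votes_needed(unsigned expected_votes, bool two_node)
{
    assert(expected_votes > 0);
    if (two_node && expected_votes == 2)
        return 1;
    return expected_votes / 2 + 1;
}

/* True if the votes of the live nodes reach the required count. */
bool is_quorate(unsigned live_votes, unsigned expected_votes, bool two_node)
{
    return live_votes >= votes_needed(expected_votes, two_node);
}
```

With two expected votes, `is_quorate(1, 2, true)` holds while
`is_quorate(1, 2, false)` does not, which matches the surviving node still
reporting `quorate 1` in the `dlm_tool status` output.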

> So you may either drop locking --config 'global {locking_type = 0}'

That misses the point.

> or fix the dropped node.  Since you are admin of the system you
> know what to do - 

Yes. Make sure the cluster has proper quorum and fencing configured.

The cluster is supposed to provide high availability, which means
a graceful fail-over in case a node fails.

> system itself unfortunately cannot determine,
> whether the node A is master or node B is master (both could
> be alive, just Internet connection between them could be failing).

That is true if there is no proper fencing configured. But I have
working fencing. If there is no communication with the other node
(due to a hardware or software problem), its power is turned off. So at
that moment it is known that the other node is not 'master'.

As DLM has already made sure the other node is not running (it would
lock up on any lock operation otherwise), I see no reason for CLVMD not
to trust DLM (that locking is provided to all lockspace members) and
Corosync (that we have quorum).

And the message says 'Host is down', not 'no quorum' or anything else
that would suggest quorum is the problem.

Greets,
        Jacek


* [PATCH] clvmd/corosync: Cluster nodes down ok if quorate
  2012-11-16 15:33   ` Jacek Konieczny
@ 2012-11-16 18:42     ` Jacek Konieczny
  0 siblings, 0 replies; 4+ messages in thread
From: Jacek Konieczny @ 2012-11-16 18:42 UTC (permalink / raw)
  To: lvm-devel

clvmd would refuse to handle cluster commands if any corosync cluster
node is down. If we have quorum and all active cluster nodes are running
clvmd, then cluster operations are safe (provided proper fencing is in
place, but DLM takes care of that).

Problem noticed here:
   http://www.redhat.com/archives/linux-lvm/2011-January/msg00039.html
and here:
   http://www.redhat.com/archives/linux-lvm/2012-November/msg00023.html

Signed-off-by: Jacek Konieczny <jajcus@jajcus.net>
---
 daemons/clvmd/clvmd-corosync.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/daemons/clvmd/clvmd-corosync.c b/daemons/clvmd/clvmd-corosync.c
index d85ec1e..603ba4f 100644
--- a/daemons/clvmd/clvmd-corosync.c
+++ b/daemons/clvmd/clvmd-corosync.c
@@ -58,6 +58,7 @@ static void corosync_cpg_confchg_callback(cpg_handle_t handle,
 				 const struct cpg_address *left_list, size_t left_list_entries,
 				 const struct cpg_address *joined_list, size_t joined_list_entries);
 static void _cluster_closedown(void);
+static int _is_quorate(void);
 
 /* Hash list of nodes in the cluster */
 static struct dm_hash_table *node_hash;
@@ -455,7 +456,8 @@ static int _cluster_do_node_callback(struct local_client *master_client,
 		if (ninfo->state != NODE_DOWN)
 			callback(master_client, csid, ninfo->state == NODE_CLVMD);
 		if (ninfo->state != NODE_CLVMD)
-			somedown = -1;
+			if (ninfo->state != NODE_DOWN || !_is_quorate()) 
+				somedown = -1;
 	}
 	return somedown;
 }
@@ -528,7 +530,7 @@ static int _unlock_resource(const char *resource, int lockid)
 	return 0;
 }
 
-static int _is_quorate()
+static int _is_quorate(void)
 {
 	int quorate;
 	if (quorum_getquorate(quorum_handle, &quorate) == CS_OK)
-- 
1.7.7.4
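
The effect of the change to `_cluster_do_node_callback()` can be sketched
as a stand-alone function. This is an illustration only, not the clvmd
source: `some_node_blocks` and `NODE_OTHER` are hypothetical names, the
node table is a plain array rather than the hash table clvmd uses, and
quorum is passed in as a flag instead of being read via `_is_quorate()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Node states for the sketch. NODE_DOWN and NODE_CLVMD appear in the
 * patch; NODE_OTHER is a hypothetical placeholder for any state that is
 * neither "down" nor "running clvmd".
 */
enum node_state { NODE_DOWN, NODE_OTHER, NODE_CLVMD };

/*
 * Returns 0 if cluster-wide commands may proceed and -1 otherwise,
 * following the patched condition: a node that is not running clvmd
 * blocks the request unless it is down and the cluster is quorate.
 */
int some_node_blocks(const enum node_state *states, size_t n, bool quorate)
{
    int somedown = 0;

    assert(states != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (states[i] != NODE_CLVMD) {
            if (states[i] != NODE_DOWN || !quorate)
                somedown = -1;
        }
    }
    return somedown;
}
```

For the two-node case from this thread (one node running clvmd, one node
down), the function returns 0 when `quorate` is true and -1 when it is
false. Before the patch, the down node alone was enough to yield -1.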


