* [Cluster-devel] fence daemon problems
@ 2012-10-03  8:03 Dietmar Maurer
  2012-10-03  9:25 ` Dietmar Maurer
  0 siblings, 1 reply; 12+ messages in thread
From: Dietmar Maurer @ 2012-10-03  8:03 UTC (permalink / raw)
  To: cluster-devel.redhat.com

I observe strange problems with fencing when a cluster loses quorum for a short time.

After quorum is regained, fenced reports 'wait state    messages', and the whole
cluster is blocked waiting for fenced.

I can reproduce that bug here easily. It always happens with the following test:

Software: RHEL 6.3-based kernel, corosync 1.4.4, cluster-3.1.93
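
For reference, the whole test boils down to the rough sketch below. This is only an
approximation: I actually pull the cable physically, so simulating it with 'ip link'
and assuming eth0 is the cluster interface on hp1 may not behave identically, and the
sleep is just roughly the gap the dump timestamps below show (~5 minutes):

  # run on hp1 (hp4 is already powered off)
  ip link set eth0 down    # simulate unplugging the network cable
  sleep 300                # cluster loses quorum while hp1 is disconnected
  ip link set eth0 up      # reconnect, quorum comes back
  fence_tool ls            # fenced is now stuck in 'wait state    messages'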

I have 4 nodes. Node hp4 is turned off for this test:

hp2:~# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        hp4
   2   M   1232   2012-10-03 08:59:08  hp1
   3   M   1228   2012-10-03 08:58:58  hp3
   4   M   1220   2012-10-03 08:58:58  hp2

hp2:~# fence_tool ls
fence domain
member count  3
victim count  0
victim now    0
master nodeid 3
wait state    none
members       2 3 4

Everything runs fine so far (the fence_tool ls output matches on all nodes).
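
To compare that quickly across nodes I use a small loop like the following (this
assumes password-less ssh between the nodes; hp4 is powered off, so only the three
live nodes are queried):

  for n in hp1 hp2 hp3; do echo "== $n =="; ssh $n fence_tool ls; done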

Now I unplug the network cable on hp1:

hp2:~# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        hp4
   2   X   1232                        hp1
   3   M   1228   2012-10-03 08:58:58  hp3
   4   M   1220   2012-10-03 08:58:58  hp2

hp2:~# fence_tool ls
fence domain
member count  2
victim count  1
victim now    0
master nodeid 3
wait state    quorum
members       2 3 4

Same output on hp3 - so far, so good.
In the fenced log I can find the following entries:

hp2:~# cat /var/log/cluster/fenced.log
Oct 03 08:59:08 fenced fenced 1349169030 started
Oct 03 08:59:09 fenced fencing deferred to hp3

On hp3:

hp3:~# cat /var/log/cluster/fenced.log
Oct 03 08:57:12 fenced fencing node hp4
Oct 03 08:57:21 fenced fence hp4 success

The dlm lockspace is blocked as well (kern_stop), waiting for fencing:

hp2:~# dlm_tool ls
dlm lockspaces
name          rgmanager
id            0x5231f3eb
flags         0x00000004 kern_stop
change        member 3 joined 1 remove 0 failed 0 seq 2,2
members       2 3 4
new change    member 2 joined 0 remove 1 failed 1 seq 3,3
new status    wait_messages 0 wait_condition 1 fencing
new members   3 4

Same output on hp3.

Now I reconnect the network on hp1:

# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   X      0                        hp4
   2   M   1240   2012-10-03 09:07:41  hp1
   3   M   1228   2012-10-03 08:58:58  hp3
   4   M   1220   2012-10-03 08:58:58  hp2

So we have quorum again.
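
Quorum can also be double-checked directly, e.g. with something like the line below
(I have not pasted that output here, and the exact field names may vary by version):

  cman_tool status | grep -iE 'quorum|votes'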

hp2:~# fence_tool ls
fence domain
member count  3
victim count  1
victim now    0
master nodeid 3
wait state    messages
members       2 3 4

Same output on hp3; hp1 is different:

hp1:~# fence_tool ls
fence domain
member count  3
victim count  2
victim now    0
master nodeid 3
wait state    messages
members       2 3 4

Here are the fenced dumps - maybe someone can see what is wrong here?

hp2:~# fence_tool dump
...
1349247553 receive_complete 3:3 len 232
1349247751 cluster node 2 removed seq 1236
1349247751 fenced:daemon conf 2 0 1 memb 3 4 join left 2
1349247751 fenced:default conf 2 0 1 memb 3 4 join left 2
1349247751 add_change cg 3 remove nodeid 2 reason 3
1349247751 add_change cg 3 m 2 j 0 r 1 f 1
1349247751 add_victims node 2
1349247751 check_ringid cluster 1236 cpg 2:1232
1349247751 fenced:default ring 4:1236 2 memb 4 3
1349247751 check_ringid done cluster 1236 cpg 4:1236
1349247751 check_quorum not quorate
1349247751 fenced:daemon ring 4:1236 2 memb 4 3
1349248061 cluster node 2 added seq 1240
1349248061 check_ringid cluster 1240 cpg 4:1236
1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left
1349248061 cpg_mcast_joined retried 5 protocol
1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3
1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 2 left
1349248061 add_change cg 4 joined nodeid 2
1349248061 add_change cg 4 m 3 j 1 r 0 f 0
1349248061 check_ringid cluster 1240 cpg 4:1236
1349248061 fenced:default ring 2:1240 3 memb 2 4 3
1349248061 check_ringid done cluster 1240 cpg 2:1240
1349248061 check_quorum done
1349248061 send_start 4:4 flags 2 started 2 m 3 j 1 r 0 f 0
1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 join 1349247548 left 0 local quorum 1349248061
1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 join 1349247548 left 0 local quorum 1349248061
1349248061 receive_start 4:4 len 232
1349248061 match_change 4:4 skip cg 3 expect counts 2 0 1 1
1349248061 match_change 4:4 matches cg 4
1349248061 wait_messages cg 4 need 2 of 3
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 2 stateful merge
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 2 stateful merge
1349248061 receive_start 3:5 len 232
1349248061 match_change 3:5 skip cg 3 expect counts 2 0 1 1
1349248061 match_change 3:5 matches cg 4
1349248061 wait_messages cg 4 need 1 of 3
1349248061 receive_start 2:5 len 232
1349248061 match_change 2:5 skip cg 3 sender not member
1349248061 match_change 2:5 matches cg 4
1349248061 receive_start 2:5 add node with started_count 1
1349248061 wait_messages cg 4 need 1 of 3

hp3:~# fence_tool dump
...
1349247553 receive_complete 3:3 len 232
1349247751 cluster node 2 removed seq 1236
1349247751 fenced:daemon conf 2 0 1 memb 3 4 join left 2
1349247751 fenced:default conf 2 0 1 memb 3 4 join left 2
1349247751 add_change cg 4 remove nodeid 2 reason 3
1349247751 add_change cg 4 m 2 j 0 r 1 f 1
1349247751 add_victims node 2
1349247751 check_ringid cluster 1236 cpg 2:1232
1349247751 fenced:default ring 4:1236 2 memb 4 3
1349247751 check_ringid done cluster 1236 cpg 4:1236
1349247751 check_quorum not quorate
1349247751 fenced:daemon ring 4:1236 2 memb 4 3
1349248061 cluster node 2 added seq 1240
1349248061 check_ringid cluster 1240 cpg 4:1236
1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left
1349248061 cpg_mcast_joined retried 5 protocol
1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3
1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 join 1349247548 left 0 local quorum 1349248061
1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 2 left
1349248061 add_change cg 5 joined nodeid 2
1349248061 add_change cg 5 m 3 j 1 r 0 f 0
1349248061 check_ringid cluster 1240 cpg 4:1236
1349248061 fenced:default ring 2:1240 3 memb 2 4 3
1349248061 check_ringid done cluster 1240 cpg 2:1240
1349248061 check_quorum done
1349248061 send_start 3:5 flags 2 started 3 m 3 j 1 r 0 f 0
1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 join 1349247425 left 0 local quorum 1349248061
1349248061 receive_start 4:4 len 232
1349248061 match_change 4:4 skip cg 4 expect counts 2 0 1 1
1349248061 match_change 4:4 matches cg 5
1349248061 wait_messages cg 5 need 2 of 3
1349248061 receive_start 3:5 len 232
1349248061 match_change 3:5 skip cg 4 expect counts 2 0 1 1
1349248061 match_change 3:5 matches cg 5
1349248061 wait_messages cg 5 need 1 of 3
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 2 stateful merge
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 2 stateful merge
1349248061 receive_start 2:5 len 232
1349248061 match_change 2:5 skip cg 4 sender not member
1349248061 match_change 2:5 matches cg 5
1349248061 receive_start 2:5 add node with started_count 1
1349248061 wait_messages cg 5 need 1 of 3

hp1:~# fence_tool dump
...
1349247551 our_nodeid 2 our_name hp1
1349247552 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
1349247552 logfile cur mode 100644
1349247552 cpg_join fenced:daemon ...
1349247552 setup_cpg_daemon 10
1349247552 group_mode 3 compat 0
1349247552 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left
1349247552 fenced:daemon ring 2:1232 3 memb 2 4 3
1349247552 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1
1349247552 daemon node 4 max 0.0.0.0 run 0.0.0.0
1349247552 daemon node 4 join 1349247552 left 0 local quorum 1349247551
1349247552 run protocol from nodeid 4
1349247552 daemon run 1.1.1 max 1.1.1
1349247552 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1349247552 daemon node 3 max 0.0.0.0 run 0.0.0.0
1349247552 daemon node 3 join 1349247552 left 0 local quorum 1349247551
1349247552 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0
1349247552 daemon node 2 max 0.0.0.0 run 0.0.0.0
1349247552 daemon node 2 join 1349247552 left 0 local quorum 1349247551
1349247552 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0
1349247552 daemon node 2 max 1.1.1.0 run 0.0.0.0
1349247552 daemon node 2 join 1349247552 left 0 local quorum 1349247551
1349247553 client connection 3 fd 13
1349247553 added 4 nodes from ccs
1349247553 cpg_join fenced:default ...
1349247553 fenced:default conf 3 1 0 memb 2 3 4 join 2 left
1349247553 add_change cg 1 joined nodeid 2
1349247553 add_change cg 1 m 3 j 1 r 0 f 0
1349247553 add_victims_init nodeid 1
1349247553 check_ringid cluster 1232 cpg 0:0
1349247553 fenced:default ring 2:1232 3 memb 2 4 3
1349247553 check_ringid done cluster 1232 cpg 2:1232
1349247553 check_quorum done
1349247553 send_start 2:1 flags 1 started 0 m 3 j 1 r 0 f 0
1349247553 receive_start 3:3 len 232
1349247553 match_change 3:3 matches cg 1
1349247553 save_history 1 master 3 time 1349247441 how 1
1349247553 wait_messages cg 1 need 2 of 3
1349247553 receive_start 2:1 len 232
1349247553 match_change 2:1 matches cg 1
1349247553 wait_messages cg 1 need 1 of 3
1349247553 receive_start 4:2 len 232
1349247553 match_change 4:2 matches cg 1
1349247553 wait_messages cg 1 got all 3
1349247553 set_master from 0 to complete node 3
1349247553 fencing deferred to hp3
1349247553 receive_complete 3:3 len 232
1349247553 receive_complete clear victim nodeid 1 init 1
1349247750 cluster node 3 removed seq 1236
1349247750 cluster node 4 removed seq 1236
1349247751 fenced:daemon conf 2 0 1 memb 2 4 join left 3
1349247751 fenced:daemon conf 1 0 1 memb 2 join left 4
1349247751 fenced:daemon ring 2:1236 1 memb 2
1349247751 fenced:default conf 2 0 1 memb 2 4 join left 3
1349247751 add_change cg 2 remove nodeid 3 reason 3
1349247751 add_change cg 2 m 2 j 0 r 1 f 1
1349247751 add_victims node 3
1349247751 check_ringid cluster 1236 cpg 2:1232
1349247751 fenced:default conf 1 0 1 memb 2 join left 4
1349247751 add_change cg 3 remove nodeid 4 reason 3
1349247751 add_change cg 3 m 1 j 0 r 1 f 1
1349247751 add_victims node 4
1349247751 check_ringid cluster 1236 cpg 2:1232
1349247751 fenced:default ring 2:1236 1 memb 2
1349247751 check_ringid done cluster 1236 cpg 2:1236
1349247751 check_quorum not quorate
1349248061 cluster node 3 added seq 1240
1349248061 cluster node 4 added seq 1240
1349248061 check_ringid cluster 1240 cpg 2:1236
1349248061 fenced:daemon conf 2 1 0 memb 2 3 join 3 left
1349248061 cpg_mcast_joined retried 6 protocol
1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 4 left
1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3
1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 4 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 4 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 4 stateful merge
1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 3 max 0.0.0.0 run 0.0.0.0
1349248061 daemon node 3 join 1349248061 left 1349247751 local quorum 1349248061
1349248061 daemon node 3 stateful merge
1349248061 fenced:default conf 2 1 0 memb 2 3 join 3 left
1349248061 add_change cg 4 joined nodeid 3
1349248061 add_change cg 4 m 2 j 1 r 0 f 0
1349248061 check_ringid cluster 1240 cpg 2:1236
1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 4 left
1349248061 add_change cg 5 joined nodeid 4
1349248061 add_change cg 5 m 3 j 1 r 0 f 0
1349248061 check_ringid cluster 1240 cpg 2:1236
1349248061 fenced:default ring 2:1240 3 memb 2 4 3
1349248061 check_ringid done cluster 1240 cpg 2:1240
1349248061 check_quorum done
1349248061 send_start 2:5 flags 2 started 1 m 3 j 1 r 0 f 0
1349248061 receive_start 4:4 len 232
1349248061 match_change 4:4 skip cg 2 created 1349247751 cluster add 1349248061
1349248061 match_change 4:4 skip cg 3 sender not member
1349248061 match_change 4:4 skip cg 4 sender not member
1349248061 match_change 4:4 matches cg 5
1349248061 receive_start 4:4 add node with started_count 2
1349248061 wait_messages cg 5 need 3 of 3
1349248061 receive_start 3:5 len 232
1349248061 match_change 3:5 skip cg 2 sender not member
1349248061 match_change 3:5 skip cg 3 sender not member
1349248061 match_change 3:5 skip cg 4 expect counts 2 1 0 0
1349248061 match_change 3:5 matches cg 5
1349248061 receive_start 3:5 add node with started_count 3
1349248061 wait_messages cg 5 need 3 of 3
1349248061 receive_start 2:5 len 232
1349248061 match_change 2:5 skip cg 2 expect counts 2 0 1 1
1349248061 match_change 2:5 skip cg 3 expect counts 1 0 1 1
1349248061 match_change 2:5 skip cg 4 expect counts 2 1 0 0
1349248061 match_change 2:5 matches cg 5
1349248061 wait_messages cg 5 need 2 of 3
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 1.1.1.0 run 1.1.1.0
1349248061 daemon node 2 join 1349247552 left 0 local quorum 1349248061
1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 max 1.1.1.0 run 1.1.1.1
1349248061 daemon node 2 join 1349247552 left 0 local quorum 1349248061

