* [Ocfs2-devel] shutdown by o2net_idle_timer causes Xen to hang
@ 2009-03-18 11:17 David Winter
2009-03-18 12:05 ` Sunil Mushran
2009-03-18 17:07 ` Joel Becker
0 siblings, 2 replies; 3+ messages in thread
From: David Winter @ 2009-03-18 11:17 UTC
To: ocfs2-devel
Hello,
We've had some serious trouble with a two-node Xen-based OCFS2
cluster. In brief: we had two incidents in which one node detected an
idle timeout and shut the other node down, which caused the other node
and the Dom0 to hang. Both times, this could only be resolved by
rebooting the whole machine via the built-in IPMI card.
All machines (including the other DomUs) run CentOS 5.2, and the OCFS2
nodes use ocfs2-tools-1.4.1-1.el5 and
ocfs2-2.6.18-92.1.13.el5xen-1.4.1-1.el5.
Unfortunately, not much of relevance was logged, apart from the
/var/log/messages of the node that issued the shutdown (see below) and
a gap of nearly five hours in the logs of the other node.
Mar 15 14:39:47 ugc-1 kernel: o2net: connection to node cod-2 (num 3)
at 10.0.0.42:7777 has been idle for 30.0 seconds, shutting it down.
Mar 15 14:39:47 ugc-1 kernel: (0,0):o2net_idle_timer:1476 here are
some times that might help debug the situation: (tmr 1237124357.624587
now 1237124387.624394 dr 1237124357.624578 adv
1237124357.624588:1237124357.624589 func (be795f6d:507)
1237124191.594238:1237124191.594242)
Mar 15 14:39:47 ugc-1 kernel: o2net: no longer connected to node cod-2
(num 3) at 10.0.0.42:7777
Mar 15 14:39:47 ugc-1 kernel: (24452,0):dlm_do_master_request:1335
ERROR: link to 3 went down!
Mar 15 14:39:47 ugc-1 kernel: (24452,0):dlm_get_lock_resource:912
ERROR: status = -112
Mar 15 14:40:17 ugc-1 kernel: (1743,0):o2net_connect_expired:1637
ERROR: no connection established with node 3 after 30.0 seconds,
giving up and returning errors.
Mar 15 14:44:29 ugc-1 kernel: (16225,0):dlm_do_master_request:1335
ERROR: link to 3 went down!
Mar 15 14:44:29 ugc-1 kernel: (16225,0):dlm_get_lock_resource:912
ERROR: status = -107
Mar 15 14:44:47 ugc-1 kernel: (1743,0):o2net_connect_expired:1637
ERROR: no connection established with node 3 after 30.0 seconds,
giving up and returning errors.
Mar 15 14:46:59 ugc-1 kernel: (24456,0):dlm_do_master_request:1335
ERROR: link to 3 went down!
Mar 15 14:46:59 ugc-1 kernel: (24456,0):dlm_get_lock_resource:912
ERROR: status = -107
Mar 15 14:46:59 ugc-1 kernel: (27195,0):dlm_do_master_request:1335
ERROR: link to 3 went down!
Mar 15 14:46:59 ugc-1 kernel: (27195,0):dlm_get_lock_resource:912
ERROR: status = -107
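For readers of the log above, the two status values can be decoded as negated Linux errno numbers. The sketch below is only an illustration: it assumes the generic Linux errno numbering (107 and 112 as defined in asm-generic/errno.h), and the table and the helper name describe_status are made up for this example rather than taken from OCFS2.

```python
# Assumption: generic Linux errno numbering (asm-generic/errno.h).
# Only the two values that appear in the log above are listed.
LINUX_ERRNO = {
    107: ("ENOTCONN", "Transport endpoint is not connected"),
    112: ("EHOSTDOWN", "Host is down"),
}


def describe_status(status):
    """Translate a negative kernel status such as -112 into a symbolic name."""
    name, text = LINUX_ERRNO.get(-status, ("UNKNOWN", "not in this table"))
    return f"status = {status}: {name} ({text})"


for status in (-112, -107):
    print(describe_status(status))
```

Read this way, the first DLM request fails with "host is down" at the moment the link is closed, and the later ones fail with "not connected" because no connection to node 3 exists any more.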
Is this already a known issue and if so, is there a workaround or fix?
Thanks in advance.
Regards, David
* [Ocfs2-devel] shutdown by o2net_idle_timer causes Xen to hang
From: Sunil Mushran @ 2009-03-18 12:05 UTC
To: ocfs2-devel
Set up a netconsole server to capture the logs. There is not much to go
on in the information you have provided.
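netconsole sends kernel messages as UDP datagrams, so the capturing side can be very small. The following is a minimal receiver sketch, not a recommended production setup: the port 6666, the log file name and the function names are assumptions chosen for illustration, and the sending side still has to be configured through the netconsole kernel module, which is not shown here.

```python
import socket
import time


def format_record(sender_ip, payload, now=None):
    """Prefix one received datagram with a local timestamp and the sender address."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    return f"{stamp} {sender_ip} {text}"


def serve(bind_ip="0.0.0.0", port=6666, logfile="netconsole.log"):
    """Receive UDP datagrams forever and append them to logfile.

    The port and file name are illustrative defaults (assumptions).
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock, \
            open(logfile, "a") as out:
        sock.bind((bind_ip, port))
        while True:
            payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
            out.write(format_record(sender_ip, payload) + "\n")
            out.flush()  # keep the file current in case the sender hangs
```

Calling serve() on a machine outside the affected host keeps whatever the hanging node managed to print before it stopped responding.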
* [Ocfs2-devel] shutdown by o2net_idle_timer causes Xen to hang
From: Joel Becker @ 2009-03-18 17:07 UTC
To: ocfs2-devel
On Wed, Mar 18, 2009 at 12:17:36PM +0100, David Winter wrote:
> Hello,
>
> we've had some serious trouble with a two-node Xen-based OCFS2
> cluster. In brief: we had two incidents where one node detects an idle
> timeout and shuts the other node down which causes the other node and
> the Dom0 to hang. Both times this could only be resolved by rebooting
> the whole machine using the built-in IPMI card.
>
> All machines (including the other DomUs) run Centos 5.2 and the OCFS2
> nodes use ocfs2-tools-1.4.1-1.el5 and
> ocfs2-2.6.18-92.1.13.el5xen-1.4.1-1.el5.
>
> Unfortunately there wasn't logged much of relevance, except for the
> /var/log/messages of the node that issued the shutdown (see below) and
> the nearly five hour gap in the logs of the other node.
Just to clarify, the o2cb stack doesn't shut down other nodes.
Nodes can only self-fence. The 'shutting it down' message in the logs
is about the connection. In other words, cod-2 is already hanging.
ugc-1 notices and closes the network connection.
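The timestamps that o2net_idle_timer printed on ugc-1 illustrate this. The sketch below parses that line and computes how long before the timer fired each event happened. The field interpretation is an assumption based only on the labels (tmr as the time the idle timer was last armed, dr as the last data-ready event, adv and func as start:stop pairs for receive processing and the last message handler); the function names are made up for this example.

```python
import re

# The timing line from the ugc-1 log, joined onto one line.
LINE = (
    "(tmr 1237124357.624587 now 1237124387.624394 dr 1237124357.624578 "
    "adv 1237124357.624588:1237124357.624589 func (be795f6d:507) "
    "1237124191.594238:1237124191.594242)"
)

# Seconds and microseconds are captured separately and combined as integers,
# which avoids floating-point rounding on the large epoch values.
STAMP = r"(\d+)\.(\d+)"
PATTERN = re.compile(
    rf"tmr {STAMP} now {STAMP} dr {STAMP} adv {STAMP}:{STAMP} "
    rf"func \([0-9a-f]+:\d+\) {STAMP}:{STAMP}"
)
FIELDS = ("tmr", "now", "dr", "adv_start", "adv_stop", "func_start", "func_stop")


def parse_times(line):
    """Return each timestamp as integer microseconds since the epoch."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("no o2net_idle_timer timing data found")
    numbers = [int(group) for group in match.groups()]
    # Groups alternate between seconds and microseconds.
    return {
        name: numbers[2 * i] * 1_000_000 + numbers[2 * i + 1]
        for i, name in enumerate(FIELDS)
    }


def age_in_seconds(times):
    """Seconds between each event and 'now', the moment the timer fired."""
    return {
        name: (times["now"] - value) / 1_000_000
        for name, value in times.items()
        if name != "now"
    }


for name, age in age_in_seconds(parse_times(LINE)).items():
    print(f"{name:>10}: {age:.6f} s before the timer fired")
```

Under that reading, nothing arrived from cod-2 for just under 30 seconds before the timer fired, and the last message handler had finished about 196 seconds earlier, which fits a peer that had stopped responding rather than one that was shut down by ugc-1.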
So you want to figure out why cod-2 hung or crashed. Sunil is
right that you'll want netconsole for a better idea of what's going on.
We can't diagnose cod-2 from this information.
If your dom0 is hanging, that's a separate issue. A hanging
domU, no matter the cause, shouldn't hang dom0.
Joel
--
"Sometimes when reading Goethe I have the paralyzing suspicion
that he is trying to be funny."
- Guy Davenport
Joel Becker
Principal Software Developer
Oracle
E-mail: joel.becker at oracle.com
Phone: (650) 506-8127