ocfs2-devel.oss.oracle.com archive mirror
* [Ocfs2-devel] ocfs2: o2net: fix packets lost issue when reconnect
@ 2014-06-13  1:48 Junxiao Bi
  2014-06-13  1:48 ` [Ocfs2-devel] [PATCH 1/3] ocfs2: o2net: don't shutdown connection when idle timeout Junxiao Bi
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Junxiao Bi @ 2014-06-13  1:48 UTC
  To: ocfs2-devel



Hi,

This patch series fixes a possible message loss bug in ocfs2 when the
network goes bad. The bug can cause ocfs2 to hang forever, even after the
network becomes good again.
Messages may be lost in the following case. After the TCP connection is
established between two nodes, an idle timer is set to check its state
periodically. If no messages are received during this time, the idle timer
times out, shuts down the connection and tries to reconnect, so pending
messages in the TCP queues are lost. These messages may be from dlm, and dlm
may hang in this case, which may cause the whole ocfs2 cluster to hang.
This is very likely to happen when the network state goes bad. Reconnecting
is useless, since it will fail if the network state is still bad. Just
waiting for the network to recover may be a better idea: no messages are
lost, and a node is fenced only when the cluster goes into a split-brain
state. For this case, the TCP user timeout is used to override the TCP
retransmit timeout. It will time out after 25 days; the user should have
noticed the problem through the provided log and fixed the network by then.
If not, ocfs2 falls back to the original reconnect behavior.
This is a resend of the patches, with no changes since last time. Please help
review.
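
As a rough user-space illustration of the TCP user timeout idea (this is
not the code from the patches, which work on kernel sockets inside o2net),
the sketch below sets the Linux TCP_USER_TIMEOUT socket option with
setsockopt(). The helper names are hypothetical, and taking the "max value"
to be INT_MAX milliseconds is an assumption made here only because it comes
out to a little under 25 days.

```c
#include <limits.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to keep retransmitting unacknowledged
 * data for up to timeout_ms milliseconds before it gives up on the
 * connection. Returns 0 on success, -1 on error (errno set by setsockopt). */
static int set_tcp_user_timeout(int fd, unsigned int timeout_ms)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                      &timeout_ms, sizeof(timeout_ms));
}

/* Assumption: the maximum timeout is INT_MAX milliseconds.
 * Converts that value to days for comparison with the "25 days" figure. */
static double max_user_timeout_days(void)
{
    return (double)INT_MAX / (1000.0 * 60.0 * 60.0 * 24.0);
}
```

A caller would pass a connected TCP socket, for example
set_tcp_user_timeout(fd, INT_MAX), and check the return value. With the
option set, the connection is kept until unacknowledged data has been
outstanding for the whole timeout, so queued messages can still go out if
the network recovers in the meantime.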

Thanks,
Junxiao.

* [Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect
@ 2014-05-15  4:26 Junxiao Bi
  2014-05-15  4:26 ` [Ocfs2-devel] [PATCH 2/3] ocfs2: o2net: set tcp user timeout to max value Junxiao Bi
  0 siblings, 1 reply; 6+ messages in thread
From: Junxiao Bi @ 2014-05-15  4:26 UTC
  To: ocfs2-devel


Hi,

After the TCP connection is established between two ocfs2 nodes, an idle
timer is set to check its state periodically. If no messages are
received during this time, the idle timer times out, shuts down the
connection and tries to rebuild it, so pending messages in the TCP queues
are lost. This may cause the whole ocfs2 cluster to hang.
This is very likely to happen when the network state goes bad. Reconnecting
is useless, since it will fail if the network state doesn't recover.
Just waiting for the network to recover may be a better idea: no messages
are lost, and a node is fenced only when the cluster goes into a
split-brain state. For this case, the TCP user timeout is used to override
the TCP retransmit timeout. It will time out after 25 days; the user should
have noticed the problem through the provided log and fixed the network by
then. If not, ocfs2 falls back to the original reconnect behavior.
The following is the series of patches to fix the bug. Please help review.
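
To make the difference between the two idle-timeout behaviors concrete,
here is a toy model in plain C. It is only a sketch of the reasoning above,
not o2net code: the struct, the enum and the function name are all
hypothetical, and the TCP queues are reduced to a simple counter.

```c
#include <stdbool.h>

/* Toy model of one node-to-node connection; all names are hypothetical. */
struct conn_model {
    unsigned int pending_msgs; /* messages still sitting in the TCP queues */
    bool connected;            /* whether the connection is still up */
};

enum idle_policy {
    IDLE_SHUTDOWN_AND_RECONNECT, /* original behavior described above */
    IDLE_KEEP_WAITING            /* behavior proposed by this series */
};

/* Called when the idle timer fires without any message having been received.
 * Returns the number of queued messages dropped by the chosen policy. */
static unsigned int on_idle_timeout(struct conn_model *c, enum idle_policy p)
{
    unsigned int lost = 0;

    if (p == IDLE_SHUTDOWN_AND_RECONNECT) {
        /* Shutting the connection down discards whatever was queued. */
        lost = c->pending_msgs;
        c->pending_msgs = 0;
        c->connected = false;
    }
    /* IDLE_KEEP_WAITING: leave the connection alone, so queued messages
     * stay in place and can be retransmitted once the network recovers. */
    return lost;
}
```

With three messages queued, the shutdown policy reports three lost messages
and a closed connection, while the waiting policy reports none and leaves
the connection up.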

Thanks,
Junxiao.



Thread overview: 6+ messages
2014-06-13  1:48 [Ocfs2-devel] ocfs2: o2net: fix packets lost issue when reconnect Junxiao Bi
2014-06-13  1:48 ` [Ocfs2-devel] [PATCH 1/3] ocfs2: o2net: don't shutdown connection when idle timeout Junxiao Bi
2014-06-13  1:48 ` [Ocfs2-devel] [PATCH 2/3] ocfs2: o2net: set tcp user timeout to max value Junxiao Bi
2014-06-13  1:48 ` [Ocfs2-devel] [PATCH 3/3] ocfs2: quorum: add a log for node not fenced Junxiao Bi
2014-06-13  1:56 ` [Ocfs2-devel] ocfs2: o2net: fix packets lost issue when reconnect Junxiao Bi
  -- strict thread matches above, loose matches on Subject: below --
2014-05-15  4:26 [Ocfs2-devel] [PATCH 0/3] " Junxiao Bi
2014-05-15  4:26 ` [Ocfs2-devel] [PATCH 2/3] ocfs2: o2net: set tcp user timeout to max value Junxiao Bi
