From: Junxiao Bi <junxiao.bi@oracle.com>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] ocfs2: o2net: fix packets lost issue when reconnect
Date: Fri, 13 Jun 2014 09:56:54 +0800
Message-ID: <539A5A66.2080306@oracle.com>
In-Reply-To: <1402624104-17841-1-git-send-email-junxiao.bi@oracle.com>
Not sure why Joseph Qi was left off the cc list by git send-email.
Cc'ing him.
On 06/13/2014 09:48 AM, Junxiao Bi wrote:
>
> Hi,
>
> This patch series fixes a possible lost-message bug in ocfs2 when the
> network goes bad. The bug can cause ocfs2 to hang forever, even after the
> network becomes good again.
>
> Messages may be lost in the following case. After the TCP connection is
> established between two nodes, an idle timer is set to check its state
> periodically. If no messages are received during this time, the idle timer
> times out, shuts down the connection and tries to reconnect, so pending
> messages in the TCP queues are lost. These messages may be from dlm, so dlm
> may hang in this case, which may in turn hang the whole ocfs2 cluster.
>
> This is very likely to happen when the network state goes bad. Reconnecting
> is useless, since it will fail if the network is still bad. Just waiting for
> the network to recover may be a better idea: no messages are lost, and a
> node will be fenced only once the cluster goes into a split-brain state.
> For this case, the TCP user timeout is used to override the TCP retransmit
> timeout. It times out after 25 days; by then the user should have noticed
> the problem through the provided log and fixed the network. If they have
> not, ocfs2 falls back to the original reconnect behavior.
>
> This is a resend of the patches, with no changes since last time. Please
> help review.
>
> Thanks,
> Junxiao.
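
For readers unfamiliar with the TCP user timeout mentioned above, here is a
minimal user-space sketch of the idea. It is not taken from the patches: the
helper names are hypothetical, and the use of INT_MAX milliseconds is an
assumption, chosen because it works out to roughly 24.9 days, which is
consistent with the "25 days" in the cover letter. TCP_USER_TIMEOUT itself is
the Linux socket option that bounds how long transmitted data may stay
unacknowledged before the connection is dropped.

```c
#include <assert.h>
#include <limits.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper: convert a timeout in milliseconds to whole days
 * (truncating). INT_MAX ms is about 24.86 days, so this returns 24. */
long timeout_ms_to_days(unsigned int ms)
{
    return (long)(ms / (1000UL * 60 * 60 * 24));
}

/* Hypothetical helper: set TCP_USER_TIMEOUT (in milliseconds) on a TCP
 * socket. Returns 0 on success, -1 with errno set on failure. */
int set_user_timeout(int fd, unsigned int ms)
{
    assert(fd >= 0);
    return setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &ms, sizeof(ms));
}

/* Hypothetical helper: read back the current TCP_USER_TIMEOUT value. */
int get_user_timeout(int fd, unsigned int *ms)
{
    socklen_t len = sizeof(*ms);

    assert(fd >= 0);
    return getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, ms, &len);
}
```

A caller would create a TCP socket and call set_user_timeout(fd, INT_MAX)
before connecting. In-kernel code such as o2net would set the option through
the kernel's own socket interfaces rather than through setsockopt(2), so
treat this only as an illustration of the option's semantics.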
Thread overview: 5+ messages
2014-06-13 1:48 [Ocfs2-devel] ocfs2: o2net: fix packets lost issue when reconnect Junxiao Bi
2014-06-13 1:48 ` [Ocfs2-devel] [PATCH 1/3] ocfs2: o2net: don't shutdown connection when idle timeout Junxiao Bi
2014-06-13 1:48 ` [Ocfs2-devel] [PATCH 2/3] ocfs2: o2net: set tcp user timeout to max value Junxiao Bi
2014-06-13 1:48 ` [Ocfs2-devel] [PATCH 3/3] ocfs2: quorum: add a log for node not fenced Junxiao Bi
2014-06-13 1:56 ` Junxiao Bi [this message]