From: Junxiao Bi <junxiao.bi@oracle.com>
To: ocfs2-devel@oss.oracle.com
Subject: [Ocfs2-devel] ocfs2: o2net: fix packets lost issue when reconnect
Date: Fri, 13 Jun 2014 09:56:54 +0800
Message-ID: <539A5A66.2080306@oracle.com>
In-Reply-To: <1402624104-17841-1-git-send-email-junxiao.bi@oracle.com>
Not sure why Joseph Qi was left off the cc list by git send-email.
Cc'ing him now.
On 06/13/2014 09:48 AM, Junxiao Bi wrote:
>
> Hi,
>
> This patch series fixes a possible message-loss bug in ocfs2 that occurs when
> the network goes bad. The bug can leave ocfs2 hung forever, even after the
> network recovers.
>
> Messages can be lost as follows. After the tcp connection is established
> between two nodes, an idle timer is set to check the connection state
> periodically. If no messages are received during that interval, the idle
> timer fires, shuts down the connection and tries to reconnect, so pending
> messages in the tcp queues are lost. These messages may be from dlm, which
> may then hang, and that in turn may hang the whole ocfs2 cluster.
>
> This is quite likely to happen when the network goes bad. Reconnecting is
> useless, because it will fail while the network is still bad. Simply waiting
> for the network to recover may be a better idea: no messages are lost, and
> some node will be fenced once the cluster goes into a split-brain state. For
> this case, the tcp user timeout is used to override the tcp retransmit
> timeout. It expires after 25 days; the user should have noticed the problem
> through the provided log and fixed the network by then. If they have not,
> ocfs2 falls back to the original reconnect behavior.
>
> This is a resend of the patches, with no changes since last time. Please help
> review.
>
> Thanks,
> Junxiao.
Thread overview: 5+ messages
2014-06-13 1:48 [Ocfs2-devel] ocfs2: o2net: fix packets lost issue when reconnect Junxiao Bi
2014-06-13 1:48 ` [Ocfs2-devel] [PATCH 1/3] ocfs2: o2net: don't shutdown connection when idle timeout Junxiao Bi
2014-06-13 1:48 ` [Ocfs2-devel] [PATCH 2/3] ocfs2: o2net: set tcp user timeout to max value Junxiao Bi
2014-06-13 1:48 ` [Ocfs2-devel] [PATCH 3/3] ocfs2: quorum: add a log for node not fenced Junxiao Bi
2014-06-13 1:56 ` Junxiao Bi [this message]
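As a rough illustration of the second patch in the thread above (set tcp user timeout to max value), the sketch below sets TCP_USER_TIMEOUT on a userspace Linux socket and shows where a figure of about 25 days could come from. This is not the o2net kernel code. The names MAX_USER_TIMEOUT_MS, ms_to_days and set_user_timeout are hypothetical, and treating the maximum as a signed 32-bit millisecond value is an assumption that the cover letter does not state.

```python
import socket

# Assumption: "max value" means the largest signed 32-bit count of
# milliseconds. The cover letter only says the timeout is about 25 days.
MAX_USER_TIMEOUT_MS = 2**31 - 1

MS_PER_DAY = 1000 * 60 * 60 * 24


def ms_to_days(ms: int) -> float:
    """Convert a duration in milliseconds to days."""
    return ms / MS_PER_DAY


def set_user_timeout(sock: socket.socket, timeout_ms: int) -> None:
    """Set TCP_USER_TIMEOUT (Linux only) on a TCP socket.

    The option bounds how long transmitted data may remain unacknowledged
    before the kernel gives up on the connection.
    """
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, timeout_ms)


if __name__ == "__main__":
    days = ms_to_days(MAX_USER_TIMEOUT_MS)
    print(f"{MAX_USER_TIMEOUT_MS} ms is about {days:.2f} days")

    if hasattr(socket, "TCP_USER_TIMEOUT"):
        try:
            with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
                set_user_timeout(s, MAX_USER_TIMEOUT_MS)
                stored = s.getsockopt(socket.IPPROTO_TCP,
                                      socket.TCP_USER_TIMEOUT)
                print("kernel reports TCP_USER_TIMEOUT =", stored, "ms")
        except OSError as exc:
            print("could not set TCP_USER_TIMEOUT:", exc)
    else:
        print("TCP_USER_TIMEOUT is not available on this platform")
```

With a timeout this large, the connection is effectively left to wait for the network to recover instead of being torn down, which matches the intent described in the cover letter. The value the kernel reports back may be rounded depending on the kernel version, so the sketch prints it and does not assume it.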