From mboxrd@z Thu Jan  1 00:00:00 1970
From: Junxiao Bi
Date: Mon, 19 May 2014 09:36:08 +0800
Subject: [Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect
In-Reply-To: <5375D3DA.6030603@huawei.com>
References: <1400127983-9774-1-git-send-email-junxiao.bi@oracle.com>
 <53747A65.1000200@huawei.com> <537575A8.8080600@oracle.com>
 <5375C6C4.9040207@huawei.com> <5375CD34.60101@oracle.com>
 <5375D3DA.6030603@huawei.com>
Message-ID: <53796008.5040501@oracle.com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: ocfs2-devel@oss.oracle.com

On 05/16/2014 05:01 PM, Joseph Qi wrote:
> On 2014/5/16 16:32, Junxiao Bi wrote:
>> On 05/16/2014 04:05 PM, Joseph Qi wrote:
>>> Hi Junxiao,
>>>
>>> On 2014/5/16 10:19, Junxiao Bi wrote:
>>>> Hi Joseph,
>>>>
>>>> On 05/15/2014 04:27 PM, Joseph Qi wrote:
>>>>> On 2014/5/15 12:26, Junxiao Bi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> After the tcp connection is established between two ocfs2 nodes,
>>>>>> an idle timer will be set to check its state periodically. If no
>>>>>> message is received during this period, the idle timer will time
>>>>>> out; it will shut down the connection and try to rebuild it, so
>>>>>> pending messages in the tcp queues will be lost. This may cause
>>>>>> the whole ocfs2 cluster to hang.
>>>>>> This is very likely to happen when the network goes bad. Doing
>>>>>> the reconnect is useless; it will fail again if the network
>>>>>> doesn't recover. Just waiting there for the network to recover
>>>>>> may be a good idea: it will not lose messages, and some node will
>>>>>> be fenced unless the cluster goes into a split-brain state. For
>>>>>> that case, the TCP user timeout is used to override the tcp
>>>>>> retransmit timeout. It will time out after 25 days; users should
>>>>>> have noticed this through the provided log and fixed the network,
>>>>>> and if they don't, ocfs2 will fall back to the original reconnect
>>>>>> way.
>>>>>> The following is the series of patches to fix the bug. Please
>>>>>> help review.
>>>>> The TCP retransmission interval is auto-regressive, which means
>>>>> the following case may take place: suppose the current
>>>>> retransmission interval is ΔT (somewhat long), and the network
>>>>> recovers but goes down again before the next retransmission window
>>>>> comes (< ΔT); then the network recovery won't be detected and the
>>>>> ocfs2 cluster still hangs.
>>>> If the network recovers but goes down again, that means the network
>>>> is still down. An ocfs2 hang is expected behavior when the network
>>>> is down in the split-brain case. What we need to take care of is
>>>> how long it takes ocfs2 to recover from the hang after the network
>>>> recovers (and does not go down again). I don't know the TCP
>>>> internals of how packets are retransmitted; I just tested blocking
>>>> the network for half an hour, and it took only several seconds to
>>>> recover from the hang. Of course, how long the recovery takes may
>>>> also depend on how hard it hung in the dlm.
>>>>
>>> Yes, it is an expected behavior. But currently ocfs2 will make a
>>> quorum decision after the timeout, and the cluster won't hang long.
>> Not always; sometimes the quorum decision can't fence any node. For
>> example, in a three-node cluster (1, 2, 3), if the network between
>> node 2 and node 3 is down but the network from each of them to node 1
>> is good, no node will be fenced. This is what we call the split-brain
>> case. The cluster will hang.
> Yes, you are right. Currently ocfs2 cannot handle such a case. But if
> all nodes are connected to the same switch, I am curious how this
> happens.
I think we'd better not assume the network topology. We should support
all the user cases.

Thanks,
Junxiao.
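A minimal sketch of the TCP user timeout mentioned in the quoted cover
letter above, assuming an ordinary userspace client socket rather than
o2net's in-kernel sockets; the 30-second value is illustrative only,
not the ~25-day timeout the patches use.

/* Sketch: set TCP_USER_TIMEOUT on a plain client socket.
 * Illustrative only; o2net configures its own in-kernel sockets and
 * uses a much larger timeout value than the one shown here. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Time (in milliseconds) that transmitted data may remain
	 * unacknowledged before the connection is forcibly closed.
	 * This overrides the normal exponential-backoff retransmit
	 * behavior for an established connection. */
	unsigned int timeout_ms = 30 * 1000;	/* 30s, for illustration */
	if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
		       &timeout_ms, sizeof(timeout_ms)) < 0)
		perror("setsockopt(TCP_USER_TIMEOUT)");

	close(fd);
	return 0;
}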
>
>>> So would it be better to fence than to wait till recovery in this
>>> situation? After all, it widely affects cluster operations.
>> Yes, but making the fence decision is not that easy in the
>> split-brain case. This needs a node to know the status of every
>> connection in the cluster; then it can decide to cut off some nodes
>> to make the cluster work again. But now every node only knows its own
>> connection status; for example, node 1 doesn't know the connection
>> status between node 2 and node 3.
>>> Another thought is, could we retry the message? And to avoid a BUG
>>> when the same message is handled twice, we can add a unique message
>>> sequence number.
>> Retry is useless when the network is bad. It will fail again and
>> again until the network recovers.
> The thought is based on the quorum decision being made when the
> timeout occurs. And I suppose the network is down within the cluster
> range.
>
>> Thanks,
>> Junxiao.
>>>> Thanks,
>>>> Junxiao.
>>>>>> Thanks,
>>>>>> Junxiao.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Ocfs2-devel mailing list
>>>>>> Ocfs2-devel at oss.oracle.com
>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
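For illustration, the sequence-number idea raised in the quoted
discussion (dropping a message that has already been handled instead of
hitting a BUG) could look roughly like the sketch below. This is a
hypothetical userspace illustration, not o2net code and not part of
these patches; the struct and function names are made up.

/* Sketch of per-peer duplicate detection via a message sequence
 * number, as floated in the discussion above. Hypothetical only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct peer_state {
	uint64_t last_seq;	/* highest sequence number handled so far */
	bool seen_any;		/* whether any message has been handled yet */
};

/* Return true if the message should be handled, false if it duplicates
 * one already processed (e.g. resent after a reconnect) and must be
 * dropped instead of being handled a second time. */
static bool should_handle(struct peer_state *p, uint64_t seq)
{
	if (p->seen_any && seq <= p->last_seq)
		return false;	/* already handled: drop the duplicate */
	p->last_seq = seq;
	p->seen_any = true;
	return true;
}

int main(void)
{
	struct peer_state peer = { 0 };
	uint64_t incoming[] = { 1, 2, 2, 3 };	/* 2 is retransmitted */

	for (unsigned i = 0; i < sizeof(incoming) / sizeof(incoming[0]); i++)
		printf("seq %llu: %s\n", (unsigned long long)incoming[i],
		       should_handle(&peer, incoming[i]) ? "handle" : "drop");
	return 0;
}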