From mboxrd@z Thu Jan  1 00:00:00 1970
From: Junxiao Bi
Date: Mon, 19 May 2014 09:36:08 +0800
Subject: [Ocfs2-devel] [PATCH 0/3] ocfs2: o2net: fix packets lost issue when reconnect
In-Reply-To: <5375D3DA.6030603@huawei.com>
References: <1400127983-9774-1-git-send-email-junxiao.bi@oracle.com>
 <53747A65.1000200@huawei.com> <537575A8.8080600@oracle.com>
 <5375C6C4.9040207@huawei.com> <5375CD34.60101@oracle.com>
 <5375D3DA.6030603@huawei.com>
Message-ID: <53796008.5040501@oracle.com>
List-Id:
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
To: ocfs2-devel@oss.oracle.com

On 05/16/2014 05:01 PM, Joseph Qi wrote:
> On 2014/5/16 16:32, Junxiao Bi wrote:
>> On 05/16/2014 04:05 PM, Joseph Qi wrote:
>>> Hi Junxiao,
>>>
>>> On 2014/5/16 10:19, Junxiao Bi wrote:
>>>> Hi Joseph,
>>>>
>>>> On 05/15/2014 04:27 PM, Joseph Qi wrote:
>>>>> On 2014/5/15 12:26, Junxiao Bi wrote:
>>>>>> Hi,
>>>>>>
>>>>>> After the tcp connection is established between two ocfs2 nodes,
>>>>>> an idle timer will be set to check its state periodically. If no
>>>>>> message is received during this period, the idle timer will time
>>>>>> out; it will shut down the connection and try to rebuild it, so
>>>>>> pending messages in the tcp queues will be lost. This may cause
>>>>>> the whole ocfs2 cluster to hang.
>>>>>> This is very likely to happen when the network goes bad. Doing
>>>>>> the reconnect is useless; it will fail again if the network
>>>>>> doesn't recover. Just waiting there for the network to recover
>>>>>> may be a good idea: it will not lose messages, and some node will
>>>>>> be fenced unless the cluster goes into a split-brain state. For
>>>>>> that case, the TCP user timeout is used to override the tcp
>>>>>> retransmit timeout. It will time out after 25 days; users should
>>>>>> have noticed this through the provided log and fixed the network,
>>>>>> and if they don't, ocfs2 will fall back to the original reconnect
>>>>>> way.
>>>>>> The following is the series of patches to fix the bug. Please
>>>>>> help review.
>>>>> The TCP retransmission interval is auto-regressive, which means
>>>>> the following case may take place: suppose the current
>>>>> retransmission interval is ΔT (somewhat long), and the network
>>>>> recovers but goes down again before the next retransmission window
>>>>> comes (< ΔT); then the network recovery won't be detected and the
>>>>> ocfs2 cluster still hangs.
>>>> If the network recovers but goes down again, that means the network
>>>> is still down. An ocfs2 hang is expected behavior when the network
>>>> is down in the split-brain case. What we need to take care of is
>>>> how long it takes ocfs2 to recover from the hang after the network
>>>> recovers (and does not go down again). I don't know the TCP
>>>> internals of how packets are retransmitted; I just tested blocking
>>>> the network for half an hour, and it took only several seconds to
>>>> recover from the hang. Of course, how long the recovery takes may
>>>> also depend on how hard it hung in the dlm.
>>>>
>>> Yes, it is an expected behavior. But currently ocfs2 will make a
>>> quorum decision after the timeout, and the cluster won't hang long.
>> Not always; sometimes the quorum decision can't fence any node. For
>> example, in a three-node cluster (1, 2, 3), if the network between
>> node 2 and node 3 is down but the network from each of them to node 1
>> is good, no node will be fenced. This is what we call the split-brain
>> case. The cluster will hang.
> Yes, you are right. Currently ocfs2 cannot handle such a case. But if
> all nodes are connected to the same switch, I am curious how this
> happens.
I think we'd better not assume the network topology. We should support
all the user cases.

Thanks,
Junxiao.
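A minimal sketch of the TCP user timeout mentioned in the quoted cover
letter above, assuming an ordinary userspace client socket rather than
o2net's in-kernel sockets; the 30-second value is illustrative only,
not the ~25-day timeout the patches use.

/* Sketch: set TCP_USER_TIMEOUT on a plain client socket.
 * Illustrative only; o2net configures its own in-kernel sockets and
 * uses a much larger timeout value than the one shown here. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>

int main(void)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		perror("socket");
		return 1;
	}

	/* Time (in milliseconds) that transmitted data may remain
	 * unacknowledged before the connection is forcibly closed.
	 * This overrides the normal exponential-backoff retransmit
	 * behavior for an established connection. */
	unsigned int timeout_ms = 30 * 1000;	/* 30s, for illustration */
	if (setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
		       &timeout_ms, sizeof(timeout_ms)) < 0)
		perror("setsockopt(TCP_USER_TIMEOUT)");

	close(fd);
	return 0;
}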
>
>>> So would it be better to fence than to wait till recovery in this
>>> situation? After all, it widely affects cluster operations.
>> Yes, but making the fence decision is not that easy in the
>> split-brain case. This needs a node to know the status of every
>> connection in the cluster; then it can decide to cut off some nodes
>> to make the cluster work again. But now every node only knows its own
>> connection status; for example, node 1 doesn't know the connection
>> status between node 2 and node 3.
>>> Another thought is, could we retry the message? And to avoid a BUG
>>> when the same message is handled twice, we can add a unique message
>>> sequence number.
>> Retry is useless when the network is bad. It will fail again and
>> again until the network recovers.
> The thought is based on the quorum decision being made when the
> timeout occurs. And I suppose the network is down within the cluster
> range.
>
>> Thanks,
>> Junxiao.
>>>> Thanks,
>>>> Junxiao.
>>>>>> Thanks,
>>>>>> Junxiao.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Ocfs2-devel mailing list
>>>>>> Ocfs2-devel at oss.oracle.com
>>>>>> https://oss.oracle.com/mailman/listinfo/ocfs2-devel
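For illustration, the sequence-number idea raised in the quoted
discussion (dropping a message that has already been handled instead of
hitting a BUG) could look roughly like the sketch below. This is a
hypothetical userspace illustration, not o2net code and not part of
these patches; the struct and function names are made up.

/* Sketch of per-peer duplicate detection via a message sequence
 * number, as floated in the discussion above. Hypothetical only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct peer_state {
	uint64_t last_seq;	/* highest sequence number handled so far */
	bool seen_any;		/* whether any message has been handled yet */
};

/* Return true if the message should be handled, false if it duplicates
 * one already processed (e.g. resent after a reconnect) and must be
 * dropped instead of being handled a second time. */
static bool should_handle(struct peer_state *p, uint64_t seq)
{
	if (p->seen_any && seq <= p->last_seq)
		return false;	/* already handled: drop the duplicate */
	p->last_seq = seq;
	p->seen_any = true;
	return true;
}

int main(void)
{
	struct peer_state peer = { 0 };
	uint64_t incoming[] = { 1, 2, 2, 3 };	/* 2 is retransmitted */

	for (unsigned i = 0; i < sizeof(incoming) / sizeof(incoming[0]); i++)
		printf("seq %llu: %s\n", (unsigned long long)incoming[i],
		       should_handle(&peer, incoming[i]) ? "handle" : "drop");
	return 0;
}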