From mboxrd@z Thu Jan 1 00:00:00 1970
From: David Miller
Subject: Re: [net 1/1] tipc: fix failover problem
Date: Sat, 29 Sep 2018 11:46:45 -0700 (PDT)
Message-ID: <20180929.114645.1219366490102910355.davem@davemloft.net>
References: <1537988454-4210-1-git-send-email-jon.maloy@ericsson.com>
Mime-Version: 1.0
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Cc: netdev@vger.kernel.org, gordan.mihaljevic@dektech.com.au,
 tung.q.nguyen@dektech.com.au, hoang.h.le@dektech.com.au,
 canh.d.luu@dektech.com.au, ying.xue@windriver.com,
 tipc-discussion@lists.sourceforge.net
To: jon.maloy@ericsson.com
Return-path:
Received: from shards.monkeyblade.net ([23.128.96.9]:57084
 "EHLO shards.monkeyblade.net" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1727569AbeI3BQT (ORCPT );
 Sat, 29 Sep 2018 21:16:19 -0400
In-Reply-To: <1537988454-4210-1-git-send-email-jon.maloy@ericsson.com>
Sender: netdev-owner@vger.kernel.org
List-ID:

From: Jon Maloy
Date: Wed, 26 Sep 2018 21:00:54 +0200

> From: LUU Duc Canh
>
> We see the following scenario:
> 1) Link endpoint B on node 1 discovers that its peer endpoint is gone.
>    Since there is a second working link, the failover procedure is
>    started.
> 2) Link endpoint A on node 1 sends a FAILOVER message to peer endpoint
>    A on node 2. The node item 1->2 goes to state FAILINGOVER.
> 3) Link endpoint A/2 receives the failover, and is supposed to take
>    down its parallel link endpoint B/2, while producing a FAILOVER
>    message to send back to A/1.
> 4) However, B/2 has already been deleted, so no FAILOVER message can
>    be created.
> 5) Node 1->2 remains in state FAILINGOVER forever, refusing to receive
>    any messages that can bring B/1 up again. We are left with a
>    non-redundant link between node 1 and 2.
>
> We fix this by letting endpoint A/2 build a dummy FAILOVER message
> to send back to A/1, so that the situation can be resolved.
>
> Signed-off-by: LUU Duc Canh
> Signed-off-by: Jon Maloy

Applied.
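
For readers following the scenario, here is a rough model of steps 1) to 5)
and of the fix. It is only a sketch under stated assumptions, not the kernel
code: the names (Node, FailoverMsg, peer_receive_failover, send_dummy and so
on) are hypothetical, the state machine is reduced to two states, and the
tunneled packet count is an arbitrary illustrative value.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class NodeState(Enum):
    UP = auto()           # normal operation, messages are accepted
    FAILINGOVER = auto()  # waiting for the peer's FAILOVER reply


@dataclass
class Node:
    """Hypothetical, simplified view of one node's state toward its peer."""
    state: NodeState = NodeState.UP
    link_b_exists: bool = True  # False once endpoint B has been deleted


@dataclass
class FailoverMsg:
    tunneled_packets: int  # 0 marks a dummy message in this model


def start_failover(node: Node) -> FailoverMsg:
    """Steps 1-2: B/1 lost its peer, A/1 sends FAILOVER, the node waits."""
    node.state = NodeState.FAILINGOVER
    return FailoverMsg(tunneled_packets=0)


def peer_receive_failover(peer: Node, send_dummy: bool) -> Optional[FailoverMsg]:
    """Steps 3-4: A/2 takes down B/2 and answers with a FAILOVER message."""
    if peer.link_b_exists:
        peer.link_b_exists = False
        return FailoverMsg(tunneled_packets=3)  # arbitrary illustrative count
    if send_dummy:
        # The fix as described: B/2 is gone, so build a dummy reply instead.
        return FailoverMsg(tunneled_packets=0)
    return None  # Old behaviour in this model: nothing is sent back.


def receive_reply(node: Node, reply: Optional[FailoverMsg]) -> None:
    """Step 5: only a FAILOVER reply lets the node leave FAILINGOVER."""
    if node.state is NodeState.FAILINGOVER and reply is not None:
        node.state = NodeState.UP


def run_scenario(send_dummy: bool) -> NodeState:
    """Replay the reported case: B/2 is already deleted when FAILOVER arrives."""
    node1 = Node()
    node2 = Node(link_b_exists=False)
    start_failover(node1)
    reply = peer_receive_failover(node2, send_dummy)
    receive_reply(node1, reply)
    return node1.state
```

In this model run_scenario(False) leaves node 1 stuck in FAILINGOVER, while
run_scenario(True) lets it return to normal operation, which is the behaviour
the commit message describes.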