From mboxrd@z Thu Jan 1 00:00:00 1970 From: David Miller Subject: Re: [PATCH net] bonding: fix slave stuck in BOND_LINK_FAIL state Date: Wed, 08 Nov 2017 16:08:17 +0900 (KST) Message-ID: <20171108.160817.884757301746988730.davem@davemloft.net> References: <15116.1510051807@nyx> Mime-Version: 1.0 Content-Type: Text/Plain; charset=us-ascii Content-Transfer-Encoding: 7bit Cc: netdev@vger.kernel.org, alexandre.sidorenko@hpe.com, maheshb@google.com, jarod@redhat.com, vfalico@gmail.com, andy@greyhouse.net To: jay.vosburgh@canonical.com Return-path: Received: from shards.monkeyblade.net ([184.105.139.130]:36220 "EHLO shards.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755283AbdKHHIW (ORCPT ); Wed, 8 Nov 2017 02:08:22 -0500 In-Reply-To: <15116.1510051807@nyx> Sender: netdev-owner@vger.kernel.org List-ID: From: Jay Vosburgh Date: Tue, 07 Nov 2017 19:50:07 +0900 > The bonding miimon logic has a flaw, in that a failure of the > rtnl_trylock can cause a slave to become permanently stuck in > BOND_LINK_FAIL state. > > The sequence of events to cause this is as follows: > > 1) bond_miimon_inspect finds that a slave's link is down, and so > calls bond_propose_link_state, setting slave->new_link_state to > BOND_LINK_FAIL, then sets slave->new_link to BOND_LINK_DOWN and returns > non-zero. > > 2) In bond_mii_monitor, the rtnl_trylock fails, and the timer is > rescheduled. No change is committed. > > 3) bond_miimon_inspect is called again, but this time the slave > from step 1 has recovered. slave->new_link is reset to NOCHANGE, and, as > slave->link was never changed, the switch enters the BOND_LINK_UP case, > and does nothing. The pending BOND_LINK_FAIL state from step 1 remains > pending, as new_link_state is not reset. > > 4) The state from step 3 persists until another slave changes link > state and causes bond_miimon_inspect to return non-zero. At this point, > the BOND_LINK_FAIL state change on the slave from steps 1-3 is committed, > and the slave will remain stuck in BOND_LINK_FAIL state even though it > is actually link up. > > The remedy for this is to initialize new_link_state on each entry > to bond_miimon_inspect, as is already done with new_link. > > Reported-by: Alex Sidorenko > Reviewed-by: Jarod Wilson > Signed-off-by: Jay Vosburgh > Fixes: fb9eb899a6dc ("bonding: handle link transition from FAIL to UP correctly") Applied and queued up for -stable. As discussed with some others here at netdev... rtnl_trylock() really needs to be re-evaluated if not removed completely.