From: David Miller
Subject: Re: [PATCH] net: tipc: fix stall during bclink wakeup procedure
Date: Tue, 08 Sep 2015 22:51:11 -0700 (PDT)
Message-ID: <20150908.225111.1855452548113402714.davem@davemloft.net>
References: <55EBF2B2.901@windriver.com>
Cc: ying.xue@windriver.com, jon.maloy@ericsson.com, tipc-discussion@lists.sourceforge.net, netdev@vger.kernel.org, linux-kernel@vger.kernel.org
To: kolmakov.dmitriy@huawei.com

From: Kolmakov Dmitriy
Date: Mon, 7 Sep 2015 09:05:48 +0000

> If an attempt is made to wake up users of the broadcast link when there
> is not enough space in the send queue, the wakeup procedure may hang
> inside the tipc_sk_rcv() function, since the loop breaks only after the
> wakeup queue becomes empty. This can lead to a complete CPU stall with
> the following message generated by RCU:
...
> The issue occurs only when tipc_sk_rcv() is used to wake up postponed
> senders:
...
> After the sender thread is woken up, it can gain control and attempt to
> send a message. But if there is not enough space in the send queue, it
> will call the link_schedule_user() function, which puts a message of
> type SOCK_WAKEUP onto the wakeup queue and puts the sender to sleep.
> Thus the size of the queue does not actually change and the while()
> loop never exits.
>
> The approach I proposed is to wake up only the senders for which there
> is enough space in the send queue, so the described issue cannot occur.
> Moreover, the same approach is already used to wake up senders on
> unicast links.
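The stall can be illustrated with a small user-space model. This is a hypothetical sketch, not kernel code: `struct model`, `sender_retry()`, `wakeup_all()` and `wakeup_fitting()` are illustrative names, each message is assumed to occupy one send-queue slot, and TIPC's per-importance limits are collapsed into a single `SENDQ_LIMIT` (set to 40, the value quoted in the report). `sender_retry()` mimics a woken sender that, on congestion, re-queues itself the way link_schedule_user() adds a SOCK_WAKEUP message. `wakeup_all()` follows the pre-fix pattern of looping until the wakeup queue is empty, with an iteration cap so it can report non-termination instead of spinning; `wakeup_fitting()` follows the proposed approach of waking only as many senders as the send queue can accept.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model; not the kernel's data structures. */
#define SENDQ_LIMIT 40   /* send queue limit quoted in the report */
#define MAX_WAITERS 128  /* capacity of the model's wakeup queue */

struct model {
    int sendq_len;              /* messages currently in the send queue */
    int wakeupq[MAX_WAITERS];   /* ring buffer of postponed sender ids */
    int head, count;            /* ring buffer read index and size */
    int sent;                   /* messages accepted from woken senders */
};

void wakeupq_push(struct model *m, int id)
{
    assert(m->count < MAX_WAITERS);
    m->wakeupq[(m->head + m->count) % MAX_WAITERS] = id;
    m->count++;
}

int wakeupq_pop(struct model *m)
{
    int id = m->wakeupq[m->head];
    m->head = (m->head + 1) % MAX_WAITERS;
    m->count--;
    return id;
}

/* A woken sender retries. If the send queue is full it re-queues itself
 * (analogous to link_schedule_user() adding a SOCK_WAKEUP entry) and
 * goes back to sleep, so the wakeup queue length is unchanged. */
void sender_retry(struct model *m, int id)
{
    if (m->sendq_len < SENDQ_LIMIT) {
        m->sendq_len++;
        m->sent++;
    } else {
        wakeupq_push(m, id);
    }
}

/* Pre-fix pattern: loop until the wakeup queue is empty. Without the cap
 * this spins forever whenever more senders are waiting than the send
 * queue can accept. Returns false if max_iters was reached. */
bool wakeup_all(struct model *m, int max_iters)
{
    for (int i = 0; m->count > 0; i++) {
        if (i == max_iters)
            return false;
        sender_retry(m, wakeupq_pop(m));
    }
    return true;
}

/* Proposed pattern: wake only as many senders as the send queue can take;
 * the rest stay queued until space is freed. */
void wakeup_fitting(struct model *m)
{
    while (m->count > 0 && m->sendq_len < SENDQ_LIMIT)
        sender_retry(m, wakeupq_pop(m));
}
```

With the send queue full and 24 postponed senders, `wakeup_all()` hits its iteration cap with all 24 still queued, while `wakeup_fitting()` returns immediately; once the send queue length drops by 10, it wakes exactly 10 senders and leaves the other 14 queued for a later wakeup.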
>
> I ran into the issue in our product code, but to reproduce it I
> changed a benchmark test application (from tipcutils/demos/benchmark)
> to perform the following scenario:
> 1. Run 64 instances of the test application (nodes). This can be done
>    on one physical machine.
> 2. Each application connects to all the others using TIPC sockets in
>    RDM mode.
> 3. When setup is done, all nodes simultaneously start sending
>    broadcast messages.
> 4. Everything hangs.
>
> The issue is reproducible only when congestion occurs on the broadcast
> link. For example, with only 8 nodes it works fine, since congestion
> doesn't occur. The send queue limit is 40 in my case (I use the
> critical importance level), and when 64 nodes send a message at the
> same moment, congestion occurs every time.
>
> Signed-off-by: Dmitry S Kolmakov
> Reviewed-by: Jon Maloy
> Acked-by: Ying Xue
> ---
> v2: Updated after comments from Jon and Ying.

Applied, thanks.
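The congestion condition in the reproduction above can be checked with simple arithmetic under a deliberately simplified, hypothetical model: every sender enqueues exactly one message on the same broadcast send queue at the same instant, the queue starts empty, and it accepts at most `limit` messages. Real TIPC accounting (per-importance limits, fragmented messages, traffic already in flight) is more involved, and `postponed_senders()` is an illustrative helper, not a kernel function.

```c
#include <assert.h>

/* Hypothetical helper, not part of TIPC: number of senders that find the
 * send queue full and are postponed on the wakeup queue, assuming each of
 * n_senders enqueues one message on an initially empty queue holding at
 * most `limit` messages. The wakeup path is only exercised when this is
 * non-zero. */
int postponed_senders(int n_senders, int limit)
{
    assert(n_senders >= 0 && limit >= 0);
    return n_senders > limit ? n_senders - limit : 0;
}
```

With a limit of 40, 8 senders leave nobody postponed, so the wakeup path is never reached; 64 senders leave 24 postponed, which is consistent with the report that congestion occurs every time.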