From mboxrd@z Thu Jan 1 00:00:00 1970
From: Chris Friesen
Subject: Re: bonding: time limits too tight in bond_ab_arp_inspect
Date: Wed, 22 Aug 2012 12:58:24 -0600
Message-ID: <50352BD0.3060409@genband.com>
References: <20120822174534.GA20260@midget.suse.cz> <50351CC5.3030109@genband.com> <24655.1345660922@death.nxdomain>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: Jiri Bohac, Andy Gospodarek, netdev@vger.kernel.org, Petr Tesarik
To: Jay Vosburgh
Return-path:
Received: from exprod7og102.obsmtp.com ([64.18.2.157]:34917 "EHLO exprod7og102.obsmtp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755342Ab2HVTAO (ORCPT ); Wed, 22 Aug 2012 15:00:14 -0400
In-Reply-To: <24655.1345660922@death.nxdomain>
Sender: netdev-owner@vger.kernel.org
List-ID:

On 08/22/2012 12:42 PM, Jay Vosburgh wrote:
> Chris Friesen wrote:
>
>> On 08/22/2012 11:45 AM, Jiri Bohac wrote:
>>
>>> This code is run from bond_activebackup_arp_mon() about
>>> delta_in_ticks jiffies after the previous ARP probe has been
>>> sent. If the delayed work gets executed exactly in delta_in_ticks
>>> jiffies, there is a chance the slave will be brought up. If the
>>> delayed work runs one jiffy later, the slave will stay down.
>
> Presumably the ARP reply is coming back in less than one jiffy,
> then, so the slave_last_rx() value is the same jiffy as when the
> _inspect was previously called?
>
>>
>>
>>> Should they perhaps all be increased by, say, delta_in_ticks/2, to make this
>>> less dependent on the current scheduling latencies?
>>
>> We have been using a patch that tracks the arpmon requested sleep time vs
>> the actual sleep time and adds any scheduling latency to the allowed
>> delta. That way if we sleep too long due to scheduling latency it doesn't
>> affect the calculation.
>
> How much scheduling latency do you see?
>
> Is that really better than just permitting a bit more slack in
> the timing window?

We hit enough latency that it triggered arpmon to falsely mark multiple
links as lost. This triggered our system maintenance code to go into an
"oh no, we can't talk to the outside world" scenario, which does fairly
intrusive things to try and bring connectivity back up. Basically a bad
thing to happen just because of a random scheduler latency spike.

I should note that we added this some time back and are still running
older kernels, so I have no idea what latency on modern kernels is like.

Chris
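
For readers following the thread, the latency-compensation idea described in the quoted text above (compare the requested sleep time with the actual sleep time, then add the overshoot to the allowed delta) can be sketched roughly as below. This is not the actual patch, which is not shown in this thread. It is a userspace illustration in which plain unsigned long counters stand in for jiffies, and every name in it (struct arpmon_timing, arpmon_rearm, arpmon_note_wakeup, rx_within_window, tick_after_eq) is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping; not the bonding driver's real structures. */
struct arpmon_timing {
    unsigned long expected_wake; /* tick at which the monitor asked to run next */
    unsigned long latency;       /* overshoot observed on the current wakeup */
};

/* Wraparound-safe "a is at or after b" comparison on tick counters. */
static bool tick_after_eq(unsigned long a, unsigned long b)
{
    return (long)(a - b) >= 0;
}

/* When re-arming the periodic work: remember when we asked to be woken. */
static void arpmon_rearm(struct arpmon_timing *t, unsigned long now,
                         unsigned long delta_in_ticks)
{
    t->expected_wake = now + delta_in_ticks;
}

/* At the top of each monitor run: record how late the wakeup was. */
static void arpmon_note_wakeup(struct arpmon_timing *t, unsigned long now)
{
    if (tick_after_eq(now, t->expected_wake))
        t->latency = now - t->expected_wake;
    else
        t->latency = 0; /* woke early or on time: nothing to compensate */
}

/*
 * True if the last receive time is still inside the allowed window,
 * with the window widened by whatever scheduling latency was observed.
 */
static bool rx_within_window(const struct arpmon_timing *t, unsigned long now,
                             unsigned long last_rx,
                             unsigned long allowed_ticks)
{
    unsigned long deadline = last_rx + allowed_ticks + t->latency;

    return tick_after_eq(deadline, now);
}
```

With a probe sent and answered at tick 1000, an interval of 100 ticks, and the monitor running at tick 1101 instead of 1100, the uncompensated check fails by one tick, while the compensated check passes because the one tick of observed latency is added to the window. Compared with a fixed amount of extra slack such as delta_in_ticks/2, this widens the window only by the delay that was actually measured.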