From: Piotr Dałek
Subject: Re: ECONNREFUSED implies OSD definitely failed
Date: Fri, 29 Apr 2016 09:46:39 +0200
Message-ID: <20160429074639.GG26146@predictor>
In-Reply-To: <20160428143251.GA1541@suse.de>
To: Lars Marowsky-Bree
Cc: ceph-devel@vger.kernel.org

On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> On 2016-04-22T12:24:52, Sage Weil wrote:
>
> > Piotr has a PR at
> >
> > https://github.com/ceph/ceph/pull/8558
> >
> > that changes the messenger and OSD logic so that if we get an ECONNREFUSED
> > trying to talk to another OSD we can definitively conclude that the OSD is
> > down/failed, without waiting for the normal heartbeat timeout.
> >
> > I think this is true in normal networking environments.  My only concern
> > is that there might be cases where the OSD isn't actually down and some
> > transient network issue could cause ECONNREFUSED.  Like... some
> > firewally magic networky thing.  If a transient ECONNREFUSED was possible,
> > it could cause some ugly flapping.
> >
> > Can anyone think of something that might cause this?  Even if it is
> > something obscure, it means we should have a config option to disable this
> > new behavior (we probably should anyway).
>
> Exactly this - the system reconfiguring it's network interfaces and
> firewall rules (in a suboptimal fashion; it should drop, not reject, but
> ...).

I'm not convinced that we should care about this.
I think the probability of a (re)connect happening during a firewall
reconfiguration is quite low.

> Or a duplicate IP address (with a node that isn't running ceph-osd).
> Again, not supposed to happen.

That will cause a lot of other things to fail, and having ceph-osd marked
down sooner gives a better chance of getting someone's attention.

-- 
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl
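
For illustration only, the reject-versus-drop distinction discussed above can
be observed from userspace. The following is a minimal Python sketch, not the
code from the PR: the names classify_connect and unused_local_port are
hypothetical, and it assumes a host where connecting to a closed local TCP
port is answered with a reset. A refused connect fails immediately with
ECONNREFUSED, while a dropped one would only show up as a timeout.

```python
import errno
import socket


def classify_connect(host, port, timeout=1.0):
    """Attempt one TCP connect and report how it ended.

    Returns "connected", "refused" (ECONNREFUSED), "timeout",
    or "error:<ERRNO NAME>" for anything else.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
        return "connected"
    except ConnectionRefusedError:
        # The peer (or something answering for it) actively rejected us.
        return "refused"
    except socket.timeout:
        # No answer at all, e.g. packets silently dropped.
        return "timeout"
    except OSError as exc:
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        sock.close()


def unused_local_port():
    """Pick a loopback port that nothing is listening on (illustrative helper)."""
    probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    probe.bind(("127.0.0.1", 0))  # let the kernel choose a free port
    port = probe.getsockname()[1]
    probe.close()                 # never listened, so the port is closed again
    return port


if __name__ == "__main__":
    print(classify_connect("127.0.0.1", unused_local_port()))
```

Whether a "refused" result should be trusted as proof that the daemon is gone
is exactly the question in this thread; the sketch only shows how the two
failure modes differ for the caller.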