* ECONNREFUSED implies OSD definitely failed
@ 2016-04-22 16:24 Sage Weil
  2016-04-28 14:32 ` Lars Marowsky-Bree
  0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2016-04-22 16:24 UTC (permalink / raw)
  To: ceph-devel

Piotr has a PR at

	https://github.com/ceph/ceph/pull/8558

that changes the messenger and OSD logic so that if we get an ECONNREFUSED 
trying to talk to another OSD we can definitively conclude that the OSD is 
down/failed, without waiting for the normal heartbeat timeout.
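
To make the idea concrete, here is a minimal sketch of the signal being discussed. It is not taken from the PR; `probe_peer` and its return strings are made up for illustration. It attempts one TCP connect and treats only ECONNREFUSED as a definite answer, leaving timeouts and other errors inconclusive.

```python
import socket


def probe_peer(host: str, port: int, timeout: float = 1.0) -> str:
    """Try one TCP connect and classify the outcome.

    Returns "up" if the connect succeeds, "refused" on ECONNREFUSED,
    and "unknown" for timeouts and every other error.
    (Hypothetical helper, not Ceph's messenger code.)
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "up"
    except ConnectionRefusedError:
        # Something answered and actively refused the connection:
        # usually the host is up but nothing listens on the port.
        return "refused"
    except OSError:
        # Timeouts, unreachable hosts, etc. say nothing definite.
        return "unknown"
```

Only the "refused" result would be a candidate for skipping the heartbeat timeout; "unknown" would still have to go through the normal path.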

I think this is true in normal networking environments.  My only concern 
is that there might be cases where the OSD isn't actually down and some 
transient network issue could cause ECONNREFUSED.  Like... some 
firewally magic networky thing.  If a transient ECONNREFUSED was possible, 
it could cause some ugly flapping.

Can anyone think of something that might cause this?  Even if it is 
something obscure, it means we should have a config option to disable this 
new behavior (we probably should anyway).

sage


* Re: ECONNREFUSED implies OSD definitely failed
  2016-04-22 16:24 ECONNREFUSED implies OSD definitely failed Sage Weil
@ 2016-04-28 14:32 ` Lars Marowsky-Bree
  2016-04-29  7:46   ` Piotr Dałek
  0 siblings, 1 reply; 6+ messages in thread
From: Lars Marowsky-Bree @ 2016-04-28 14:32 UTC (permalink / raw)
  To: ceph-devel

On 2016-04-22T12:24:52, Sage Weil <sweil@redhat.com> wrote:

> Piotr has a PR at
> 
> 	https://github.com/ceph/ceph/pull/8558
> 
> that changes the messenger and OSD logic so that if we get an ECONNREFUSED 
> trying to talk to another OSD we can definitively conclude that the OSD is 
> down/failed, without waiting for the normal heartbeat timeout.
> 
> I think this is true in normal networking environments.  My only concern 
> is that there might be cases where the OSD isn't actually down and some 
> transient network issue could cause ECONNREFUSED.  Like... some 
> firewally magic networky thing.  If a transient ECONNREFUSED was possible, 
> it could cause some ugly flapping.
> 
> Can anyone think of something that might cause this?  Even if it is 
> something obscure, it means we should have a config option to disable this 
> new behavior (we probably should anyway).

Exactly this - the system reconfiguring its network interfaces and
firewall rules (in a suboptimal fashion; it should drop, not reject, but
...).
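
As a rough illustration of the drop-versus-reject point: from the connecting side, a rejected connection typically surfaces as ECONNREFUSED, while silently dropped packets typically surface as a timeout. The sketch below maps the errno of a failed connect to a likely cause; `likely_cause` and its wording are assumptions for illustration, and the mapping is a heuristic, not a guarantee.

```python
import errno


def likely_cause(err: OSError) -> str:
    """Map a failed connect's errno to a plausible cause (heuristic only)."""
    if err.errno == errno.ECONNREFUSED:
        # Same errno whether the daemon is gone or a firewall rejects.
        return "actively refused: no listener, or a reject-style firewall rule"
    if err.errno == errno.ETIMEDOUT:
        return "no answer: host down, or packets silently dropped"
    if err.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        return "no route to the peer"
    return "other"
```

The first branch is the ambiguity in question: the caller cannot tell a dead daemon from a rejecting firewall by errno alone.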

Or a duplicate IP address (with a node that isn't running ceph-osd).
Again, not supposed to happen.



-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: ECONNREFUSED implies OSD definitely failed
  2016-04-28 14:32 ` Lars Marowsky-Bree
@ 2016-04-29  7:46   ` Piotr Dałek
  2016-04-29 12:29     ` Sage Weil
  0 siblings, 1 reply; 6+ messages in thread
From: Piotr Dałek @ 2016-04-29  7:46 UTC (permalink / raw)
  To: Lars Marowsky-Bree; +Cc: ceph-devel

On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> On 2016-04-22T12:24:52, Sage Weil <sweil@redhat.com> wrote:
> 
> > Piotr has a PR at
> > 
> > 	https://github.com/ceph/ceph/pull/8558
> > 
> > that changes the messenger and OSD logic so that if we get an ECONNREFUSED 
> > trying to talk to another OSD we can definitively conclude that the OSD is 
> > down/failed, without waiting for the normal heartbeat timeout.
> > 
> > I think this is true in normal networking environments.  My only concern 
> > is that there might be cases where the OSD isn't actually down and some 
> > transient network issue could cause ECONNREFUSED.  Like... some 
> > firewally magic networky thing.  If a transient ECONNREFUSED was possible, 
> > it could cause some ugly flapping.
> > 
> > Can anyone think of something that might cause this?  Even if it is 
> > something obscure, it means we should have a config option to disable this 
> > new behavior (we probably should anyway).
> 
> Exactly this - the system reconfiguring its network interfaces and
> firewall rules (in a suboptimal fashion; it should drop, not reject, but
> ...).

I'm not convinced that we should care about this. I think that the
probability of a (re)connect event occurring during firewall
reconfiguration is quite low.
 
> Or a duplicate IP address (with a node that isn't running ceph-osd).
> Again, not supposed to happen.

That will cause a lot of other things to fail, and having ceph-osd get
downed faster gives a greater chance of getting someone's attention. 

-- 
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl


* Re: ECONNREFUSED implies OSD definitely failed
  2016-04-29  7:46   ` Piotr Dałek
@ 2016-04-29 12:29     ` Sage Weil
  2016-04-29 12:32       ` Lars Marowsky-Bree
  2016-04-29 19:02       ` Piotr Dałek
  0 siblings, 2 replies; 6+ messages in thread
From: Sage Weil @ 2016-04-29 12:29 UTC (permalink / raw)
  To: Piotr Dałek; +Cc: Lars Marowsky-Bree, ceph-devel

On Fri, 29 Apr 2016, Piotr Dałek wrote:
> On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> > On 2016-04-22T12:24:52, Sage Weil <sweil@redhat.com> wrote:
> > 
> > > Piotr has a PR at
> > > 
> > > 	https://github.com/ceph/ceph/pull/8558
> > > 
> > > that changes the messenger and OSD logic so that if we get an ECONNREFUSED 
> > > trying to talk to another OSD we can definitively conclude that the OSD is 
> > > down/failed, without waiting for the normal heartbeat timeout.
> > > 
> > > I think this is true in normal networking environments.  My only concern 
> > > is that there might be cases where the OSD isn't actually down and some 
> > > transient network issue could cause ECONNREFUSED.  Like... some 
> > > firewally magic networky thing.  If a transient ECONNREFUSED was possible, 
> > > it could cause some ugly flapping.
> > > 
> > > Can anyone think of something that might cause this?  Even if it is 
> > > something obscure, it means we should have a config option to disable this 
> > > new behavior (we probably should anyway).
> > 
> > Exactly this - the system reconfiguring its network interfaces and
> > firewall rules (in a suboptimal fashion; it should drop, not reject, but
> > ...).
> 
> I'm not convinced that we should care about this. I think that the
> probability of a (re)connect event occurring during firewall
> reconfiguration is quite low.

Yeah, I tend to agree.

Let's just add a config option to control the new behavior so that if, for 
some reason, there is an environment where this does happen the fast-fail 
can be disabled.
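
A minimal sketch of how such a switch could gate the decision. The option name `fast_fail_on_connection_refused`, the `heartbeat_grace` field and its 20-second value, and the function itself are all assumptions for illustration; nothing here is taken from the PR.

```python
from dataclasses import dataclass


@dataclass
class FailureDetectorConfig:
    # Hypothetical option name: the switch to disable the fast-fail.
    fast_fail_on_connection_refused: bool = True
    # Hypothetical stand-in for the normal heartbeat timeout, in seconds.
    heartbeat_grace: float = 20.0


def should_report_failed(cfg: FailureDetectorConfig,
                         connect_refused: bool,
                         seconds_since_last_reply: float) -> bool:
    """Decide whether to report a peer as failed."""
    # Fast path: a refused connection counts as proof of failure,
    # but only while the option is enabled.
    if cfg.fast_fail_on_connection_refused and connect_refused:
        return True
    # Normal path: wait out the heartbeat timeout.
    return seconds_since_last_reply >= cfg.heartbeat_grace
```

With the option off, a refused connection is treated like any other silence, so a transient ECONNREFUSED cannot mark a peer down before the heartbeat timeout expires.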

sage


* Re: ECONNREFUSED implies OSD definitely failed
  2016-04-29 12:29     ` Sage Weil
@ 2016-04-29 12:32       ` Lars Marowsky-Bree
  2016-04-29 19:02       ` Piotr Dałek
  1 sibling, 0 replies; 6+ messages in thread
From: Lars Marowsky-Bree @ 2016-04-29 12:32 UTC (permalink / raw)
  To: ceph-devel

On 2016-04-29T08:29:59, Sage Weil <sage@newdream.net> wrote:

> > I'm not convinced that we should care about this. I think that the
> > probability of a (re)connect event occurring during firewall
> > reconfiguration is quite low.
> Yeah, I tend to agree.
> 
> Let's just add a config option to control the new behavior so that if, for 
> some reason, there is an environment where this does happen the fast-fail 
> can be disabled.

Also agreed. Just wanted to add the cases where I've seen these happen
- and indeed, they are pretty obscure cases that shouldn't happen.


-- 
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde



* Re: ECONNREFUSED implies OSD definitely failed
  2016-04-29 12:29     ` Sage Weil
  2016-04-29 12:32       ` Lars Marowsky-Bree
@ 2016-04-29 19:02       ` Piotr Dałek
  1 sibling, 0 replies; 6+ messages in thread
From: Piotr Dałek @ 2016-04-29 19:02 UTC (permalink / raw)
  To: Sage Weil; +Cc: Lars Marowsky-Bree, ceph-devel

On Fri, Apr 29, 2016 at 08:29:59AM -0400, Sage Weil wrote:
> On Fri, 29 Apr 2016, Piotr Dałek wrote:
> > On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> > > Exactly this - the system reconfiguring its network interfaces and
> > > firewall rules (in a suboptimal fashion; it should drop, not reject, but
> > > ...).
> > 
> > I'm not convinced that we should care about this. I think that the
> > probability of a (re)connect event occurring during firewall
> > reconfiguration is quite low.
> 
> Yeah, I tend to agree.
> 
> Let's just add a config option to control the new behavior so that if, for 
> some reason, there is an environment where this does happen the fast-fail 
> can be disabled.

Sure, that's not a problem.

-- 
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl
