* ECONNREFUSED implies OSD definitely failed
@ 2016-04-22 16:24 Sage Weil
2016-04-28 14:32 ` Lars Marowsky-Bree
0 siblings, 1 reply; 6+ messages in thread
From: Sage Weil @ 2016-04-22 16:24 UTC (permalink / raw)
To: ceph-devel
Piotr has a PR at
https://github.com/ceph/ceph/pull/8558
that changes the messenger and OSD logic so that if we get an ECONNREFUSED
trying to talk to another OSD we can definitively conclude that the OSD is
down/failed, without waiting for the normal heartbeat timeout.
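
For readers who want to see the signal being discussed, here is a minimal
Python sketch (not code from the PR; Ceph's messenger is C++). It shows how
a TCP connect attempt can tell "refused" apart from other failures. The
function name `probe` and the three return labels are made up for
illustration.

```python
import errno
import socket


def probe(host, port, timeout=1.0):
    """Classify one TCP connect attempt as 'up', 'refused', or 'unknown'."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        # connect_ex() returns an errno value instead of raising for
        # connect-level errors; 0 means the connection was established.
        err = sock.connect_ex((host, port))
    except OSError:
        # e.g. name resolution failure: says nothing about the peer process
        return "unknown"
    finally:
        sock.close()

    if err == 0:
        return "up"
    if err == errno.ECONNREFUSED:
        # The peer's network stack answered, but nothing is listening on
        # that port (or something on the path sent a reject on its behalf).
        return "refused"
    # Timeouts, unreachable hosts, etc. remain ambiguous.
    return "unknown"
```

Only the "refused" case is a candidate for skipping the heartbeat timeout;
"unknown" would still have to go through the normal heartbeat path.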
I think this is true in normal networking environments. My only concern
is that there might be cases where the OSD isn't actually down and some
transient network issue could cause ECONNREFUSED. Like... some
firewally magic networky thing. If a transient ECONNREFUSED was possible,
it could cause some ugly flapping.
Can anyone think of something that might cause this? Even if it is
something obscure, it means we should have a config option to disable this
new behavior (we probably should anyway).
sage
* Re: ECONNREFUSED implies OSD definitely failed
2016-04-22 16:24 ECONNREFUSED implies OSD definitely failed Sage Weil
@ 2016-04-28 14:32 ` Lars Marowsky-Bree
2016-04-29 7:46 ` Piotr Dałek
0 siblings, 1 reply; 6+ messages in thread
From: Lars Marowsky-Bree @ 2016-04-28 14:32 UTC (permalink / raw)
To: ceph-devel
On 2016-04-22T12:24:52, Sage Weil <sweil@redhat.com> wrote:
> Piotr has a PR at
>
> https://github.com/ceph/ceph/pull/8558
>
> that changes the messenger and OSD logic so that if we get an ECONNREFUSED
> trying to talk to another OSD we can definitively conclude that the OSD is
> down/failed, without waiting for the normal heartbeat timeout.
>
> I think this is true in normal networking environments. My only concern
> is that there might be cases where the OSD isn't actually down and some
> transient network issue could cause ECONNREFUSED. Like... some
> firewally magic networky thing. If a transient ECONNREFUSED was possible,
> it could cause some ugly flapping.
>
> Can anyone think of something that might cause this? Even if it is
> something obscure, it means we should have a config option to disable this
> new behavior (we probably should anyway).
Exactly this - the system reconfiguring its network interfaces and
firewall rules (in a suboptimal fashion; it should drop, not reject, but
...).
Or a duplicate IP address (with a node that isn't running ceph-osd).
Again, not supposed to happen.
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: ECONNREFUSED implies OSD definitely failed
2016-04-28 14:32 ` Lars Marowsky-Bree
@ 2016-04-29 7:46 ` Piotr Dałek
2016-04-29 12:29 ` Sage Weil
0 siblings, 1 reply; 6+ messages in thread
From: Piotr Dałek @ 2016-04-29 7:46 UTC (permalink / raw)
To: Lars Marowsky-Bree; +Cc: ceph-devel
On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> On 2016-04-22T12:24:52, Sage Weil <sweil@redhat.com> wrote:
>
> > Piotr has a PR at
> >
> > https://github.com/ceph/ceph/pull/8558
> >
> > that changes the messenger and OSD logic so that if we get an ECONNREFUSED
> > trying to talk to another OSD we can definitively conclude that the OSD is
> > down/failed, without waiting for the normal heartbeat timeout.
> >
> > I think this is true in normal networking environments. My only concern
> > is that there might be cases where the OSD isn't actually down and some
> > transient network issue could cause ECONNREFUSED. Like... some
> > firewally magic networky thing. If a transient ECONNREFUSED was possible,
> > it could cause some ugly flapping.
> >
> > Can anyone think of something that might cause this? Even if it is
> > something obscure, it means we should have a config option to disable this
> > new behavior (we probably should anyway).
>
> Exactly this - the system reconfiguring its network interfaces and
> firewall rules (in a suboptimal fashion; it should drop, not reject, but
> ...).
I'm not convinced that we should care about this. I think the probability
of a (re)connect event occurring during firewall reconfiguration is quite
low.
> Or a duplicate IP address (with a node that isn't running ceph-osd).
> Again, not supposed to happen.
That will cause a lot of other things to fail, and having ceph-osd get
downed faster gives a greater chance of getting someone's attention.
--
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl
* Re: ECONNREFUSED implies OSD definitely failed
2016-04-29 7:46 ` Piotr Dałek
@ 2016-04-29 12:29 ` Sage Weil
2016-04-29 12:32 ` Lars Marowsky-Bree
2016-04-29 19:02 ` Piotr Dałek
0 siblings, 2 replies; 6+ messages in thread
From: Sage Weil @ 2016-04-29 12:29 UTC (permalink / raw)
To: Piotr Dałek; +Cc: Lars Marowsky-Bree, ceph-devel
On Fri, 29 Apr 2016, Piotr Dałek wrote:
> On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> > On 2016-04-22T12:24:52, Sage Weil <sweil@redhat.com> wrote:
> >
> > > Piotr has a PR at
> > >
> > > https://github.com/ceph/ceph/pull/8558
> > >
> > > that changes the messenger and OSD logic so that if we get an ECONNREFUSED
> > > trying to talk to another OSD we can definitively conclude that the OSD is
> > > down/failed, without waiting for the normal heartbeat timeout.
> > >
> > > I think this is true in normal networking environments. My only concern
> > > is that there might be cases where the OSD isn't actually down and some
> > > transient network issue could cause ECONNREFUSED. Like... some
> > > firewally magic networky thing. If a transient ECONNREFUSED was possible,
> > > it could cause some ugly flapping.
> > >
> > > Can anyone think of something that might cause this? Even if it is
> > > something obscure, it means we should have a config option to disable this
> > > new behavior (we probably should anyway).
> >
> > Exactly this - the system reconfiguring its network interfaces and
> > firewall rules (in a suboptimal fashion; it should drop, not reject, but
> > ...).
>
> I'm not convinced that we should care about this. I think the probability
> of a (re)connect event occurring during firewall reconfiguration is quite
> low.
Yeah, I tend to agree.
Let's just add a config option to control the new behavior so that if, for
some reason, there is an environment where this does happen the fast-fail
can be disabled.
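
A minimal Python sketch of what such a switch could look like, assuming a
boolean option that gates the fast-fail path and falls back to the heartbeat
timeout when it is off. The names `PeerMonitor`, `fast_fail_on_refused`, and
`heartbeat_grace`, and the 20-second value, are illustrative and are not
taken from the PR.

```python
import errno
from dataclasses import dataclass


@dataclass
class PeerMonitor:
    # Hypothetical option: when False, ECONNREFUSED gets no special handling.
    fast_fail_on_refused: bool = True
    # Illustrative heartbeat grace period, in seconds.
    heartbeat_grace: float = 20.0

    def should_report_down(self, connect_errno, secs_since_last_reply):
        """Decide whether a peer should be reported as down/failed."""
        # Fast path: a refused connection is treated as definitive,
        # but only if the operator has left the option enabled.
        if self.fast_fail_on_refused and connect_errno == errno.ECONNREFUSED:
            return True
        # Slow path: the normal heartbeat timeout.
        return secs_since_last_reply > self.heartbeat_grace
```

With the option disabled, a refused connection is reported only once the
grace period has also expired, so an environment that produces transient
ECONNREFUSED would be back to the pre-PR behavior.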
sage
* Re: ECONNREFUSED implies OSD definitely failed
2016-04-29 12:29 ` Sage Weil
@ 2016-04-29 12:32 ` Lars Marowsky-Bree
2016-04-29 19:02 ` Piotr Dałek
1 sibling, 0 replies; 6+ messages in thread
From: Lars Marowsky-Bree @ 2016-04-29 12:32 UTC (permalink / raw)
To: ceph-devel
On 2016-04-29T08:29:59, Sage Weil <sage@newdream.net> wrote:
> > I'm not convinced that we should care about this. I think the probability
> > of a (re)connect event occurring during firewall reconfiguration is quite
> > low.
> Yeah, I tend to agree.
>
> Let's just add a config option to control the new behavior so that if, for
> some reason, there is an environment where this does happen the fast-fail
> can be disabled.
Also agreed. Just wanted to add the cases where I've seen these happen
- and indeed, they are pretty obscure cases that shouldn't happen.
--
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
"Experience is the name everyone gives to their mistakes." -- Oscar Wilde
* Re: ECONNREFUSED implies OSD definitely failed
2016-04-29 12:29 ` Sage Weil
2016-04-29 12:32 ` Lars Marowsky-Bree
@ 2016-04-29 19:02 ` Piotr Dałek
1 sibling, 0 replies; 6+ messages in thread
From: Piotr Dałek @ 2016-04-29 19:02 UTC (permalink / raw)
To: Sage Weil; +Cc: Lars Marowsky-Bree, ceph-devel
On Fri, Apr 29, 2016 at 08:29:59AM -0400, Sage Weil wrote:
> On Fri, 29 Apr 2016, Piotr Dałek wrote:
> > On Thu, Apr 28, 2016 at 04:32:51PM +0200, Lars Marowsky-Bree wrote:
> > > Exactly this - the system reconfiguring its network interfaces and
> > > firewall rules (in a suboptimal fashion; it should drop, not reject, but
> > > ...).
> >
> > I'm not convinced that we should care about this. I think the probability
> > of a (re)connect event occurring during firewall reconfiguration is quite
> > low.
>
> Yeah, I tend to agree.
>
> Let's just add a config option to control the new behavior so that if, for
> some reason, there is an environment where this does happen the fast-fail
> can be disabled.
Sure, that's not a problem.
--
Piotr Dałek
branch@predictor.org.pl
http://blog.predictor.org.pl