OSD network failure


* OSD network failure
@ 2012-11-13 14:15 Gandalf Corvotempesta
  2012-11-15  8:40 ` Josh Durgin
  0 siblings, 1 reply; 6+ messages in thread
From: Gandalf Corvotempesta @ 2012-11-13 14:15 UTC (permalink / raw)
  To: ceph-devel

Hi,
what happens in case of an OSD network failure? Is Ceph smart enough to
isolate OSDs that are not in sync?
Should I use LACP on the OSD network, or would a single 10GbE link per server be OK?

LACP will need stackable switches and much more hardware investment.


* Re: OSD network failure
  2012-11-13 14:15 OSD network failure Gandalf Corvotempesta
@ 2012-11-15  8:40 ` Josh Durgin
  2012-11-15  9:51   ` Gandalf Corvotempesta
  0 siblings, 1 reply; 6+ messages in thread
From: Josh Durgin @ 2012-11-15  8:40 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: ceph-devel

On 11/13/2012 06:15 AM, Gandalf Corvotempesta wrote:
> Hi,
> what happens in case of an OSD network failure? Is Ceph smart enough to
> isolate OSDs that are not in sync?
> Should I use LACP on the OSD network, or would a single 10GbE link per server be OK?
>
> LACP will need stackable switches and much more hardware investment.

OSDs send heartbeats to each other and report failure to receive
a heartbeat in a certain interval to the monitor cluster.
When the monitor cluster receives enough of these reports,
it marks the OSD 'down' in the OSD map, and after a grace period
to allow for flapping or daemon restarts, marks the osd 'out'
as well. This makes the cluster rebalance any data that was on the
failed OSD, and places no new data there.

A lot of this is configurable, but that's the basic model.
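
As a rough illustration of the model described above, here is a minimal
Python sketch of a monitor that marks an OSD 'down' after enough distinct
peers report it, and 'out' after a grace period. This is a toy model, not
Ceph code: the class names, the min_reporters and down_out_interval
parameters, and their values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OsdState:
    up: bool = True
    in_cluster: bool = True
    reporters: set = field(default_factory=set)
    down_since: Optional[float] = None


class Monitor:
    """Toy model of the down/out logic (hypothetical, not Ceph code)."""

    def __init__(self, osd_ids, min_reporters=2, down_out_interval=300.0):
        # Both thresholds are illustrative values, not Ceph defaults.
        self.min_reporters = min_reporters
        self.down_out_interval = down_out_interval
        self.osds = {i: OsdState() for i in osd_ids}

    def report_failure(self, reporter, target, now):
        """A peer reports that it missed heartbeats from `target`."""
        state = self.osds[target]
        if not state.up:
            return
        state.reporters.add(reporter)
        # Mark the OSD down once enough distinct peers have reported it.
        if len(state.reporters) >= self.min_reporters:
            state.up = False
            state.down_since = now

    def tick(self, now):
        """Mark OSDs 'out' once they have been down for the grace period."""
        for state in self.osds.values():
            if (not state.up and state.in_cluster
                    and now - state.down_since >= self.down_out_interval):
                state.in_cluster = False


mon = Monitor(osd_ids=[0, 1, 2])
mon.report_failure(reporter=0, target=2, now=0.0)  # one report: not enough
mon.report_failure(reporter=1, target=2, now=1.0)  # second reporter: marked down
mon.tick(now=100.0)  # within the grace period: down but still in
mon.tick(now=301.0)  # grace period elapsed: marked out
```

Keeping 'down' and 'out' as separate flags is what lets a briefly
restarted daemon come back without triggering a rebalance.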

In this model, a network failure is equivalent to extreme slowness or a
crashed OSD - everything results in an updated map of the cluster
eventually, and the OSDs maintain strong consistency of the data
through the peering and recovery processes.

So basically you'd only need a single nic per storage node. Multiple
can be useful to separate frontend and backend traffic, but ceph
is designed to maintain strong consistency when failures occur.

Josh


* Re: OSD network failure
  2012-11-15  8:40 ` Josh Durgin
@ 2012-11-15  9:51   ` Gandalf Corvotempesta
  2012-11-17  1:56     ` Josh Durgin
  0 siblings, 1 reply; 6+ messages in thread
From: Gandalf Corvotempesta @ 2012-11-15  9:51 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

2012/11/15 Josh Durgin <josh.durgin@inktank.com>:
> So basically you'd only need a single nic per storage node. Multiple
> can be useful to separate frontend and backend traffic, but ceph
> is designed to maintain strong consistency when failures occur.

I probably didn't explain myself well.
I'll have multiple NICs: one for the frontend, and one for the backend
used as the OSD sync network.
What happens if the backend network fails? The frontend network
is still OK and the OSD is
still reachable, but it is not able to sync data.


* Re: OSD network failure
  2012-11-15  9:51   ` Gandalf Corvotempesta
@ 2012-11-17  1:56     ` Josh Durgin
  2012-11-19 14:22       ` Gandalf Corvotempesta
  2012-11-19 22:19       ` Gregory Farnum
  0 siblings, 2 replies; 6+ messages in thread
From: Josh Durgin @ 2012-11-17  1:56 UTC (permalink / raw)
  To: Gandalf Corvotempesta; +Cc: ceph-devel

On 11/15/2012 01:51 AM, Gandalf Corvotempesta wrote:
> 2012/11/15 Josh Durgin <josh.durgin@inktank.com>:
>> So basically you'd only need a single nic per storage node. Multiple
>> can be useful to separate frontend and backend traffic, but ceph
>> is designed to maintain strong consistency when failures occur.
>
> I probably didn't explain myself well.
> I'll have multiple NICs: one for the frontend, and one for the backend
> used as the OSD sync network.
> What happens if the backend network fails? The frontend network
> is still OK and the OSD is
> still reachable, but it is not able to sync data.

Ah, ok. By default, the OSDs use the backend network for heartbeats,
so if it fails, they will notice and report peers they can't reach as
failed to the monitors, and the normal failure handling takes care
of things.

If you're worried about consistency, remember that a write won't
complete until it's on disk on all replicas. If you're interested
in the gory details of maintaining consistency, check out the peering
process [1].
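
To make the "a write won't complete until it's on disk on all replicas"
point concrete, here is a minimal Python sketch. It is a simplification
under stated assumptions: Replica and replicated_write are hypothetical
names, and an unreachable replica simply makes the write fail here,
whereas a real cluster would wait for an updated map and re-peer.

```python
class Replica:
    """Hypothetical stand-in for one OSD holding a copy of the data."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.disk = {}

    def commit(self, key, value):
        """Persist the object; return True only if it reached this disk."""
        if not self.reachable:
            return False
        self.disk[key] = value
        return True


def replicated_write(replicas, key, value):
    """Acknowledge the write only when every replica has committed it."""
    acks = [r.commit(key, value) for r in replicas]
    return all(acks)


healthy = [Replica("a"), Replica("b")]
partitioned = [Replica("a"), Replica("b", reachable=False)]
print(replicated_write(healthy, "obj", b"data"))      # acknowledged
print(replicated_write(partitioned, "obj", b"data"))  # not acknowledged
```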

Josh

[1] http://ceph.com/docs/master/dev/peering/



* Re: OSD network failure
  2012-11-17  1:56     ` Josh Durgin
@ 2012-11-19 14:22       ` Gandalf Corvotempesta
  2012-11-19 22:19       ` Gregory Farnum
  1 sibling, 0 replies; 6+ messages in thread
From: Gandalf Corvotempesta @ 2012-11-19 14:22 UTC (permalink / raw)
  To: Josh Durgin; +Cc: ceph-devel

2012/11/17 Josh Durgin <josh.durgin@inktank.com>:
> Ah, ok. By default, the OSDs use the backend network for heartbeats,
> so if it fails, they will notice and report peers they can't reach as
> failed to the monitors, and the normal failure handling takes care
> of things.
>
> If you're worried about consistency, remember that a write won't
> complete until it's on disk on all replicas. If you're interested
> in the gory details of maintaining consistency, check out the peering
> process [1].

OK, now it's clear.
So it will be safe to use a single NIC for the backend network.


* Re: OSD network failure
  2012-11-17  1:56     ` Josh Durgin
  2012-11-19 14:22       ` Gandalf Corvotempesta
@ 2012-11-19 22:19       ` Gregory Farnum
  1 sibling, 0 replies; 6+ messages in thread
From: Gregory Farnum @ 2012-11-19 22:19 UTC (permalink / raw)
  To: Josh Durgin; +Cc: Gandalf Corvotempesta, ceph-devel@vger.kernel.org

On Fri, Nov 16, 2012 at 5:56 PM, Josh Durgin <josh.durgin@inktank.com> wrote:
> On 11/15/2012 01:51 AM, Gandalf Corvotempesta wrote:
>>
>> 2012/11/15 Josh Durgin <josh.durgin@inktank.com>:
>>>
>>> So basically you'd only need a single nic per storage node. Multiple
>>> can be useful to separate frontend and backend traffic, but ceph
>>> is designed to maintain strong consistency when failures occur.
>>
>>
>> I probably didn't explain myself well.
>> I'll have multiple NICs: one for the frontend, and one for the backend
>> used as the OSD sync network.
>> What happens if the backend network fails? The frontend network
>> is still OK and the OSD is
>> still reachable, but it is not able to sync data.
>
>
> Ah, ok. By default, the OSDs use the backend network for heartbeats,
> so if it fails, they will notice and report peers they can't reach as
> failed to the monitors, and the normal failure handling takes care
> of things.
>
> If you're worried about consistency, remember that a write won't
> complete until it's on disk on all replicas. If you're interested
> in the gory details of maintaining consistency, check out the peering
> process [1].
>
> Josh
>
> [1] http://ceph.com/docs/master/dev/peering/

Actually, right now a failed cluster network combined with a working
public network is something the OSDs do not handle well: they will mark
each other down on the monitor, then tell the monitor "hey, I'm not
dead!", and start flapping pretty horrendously. We first ran across it a
couple of weeks ago and have started to think about it, but I'm not sure
a fix for this is going to make it into the initial Bobtail release. :(
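
The flapping pattern can be sketched as a small Python simulation. This
is a deliberately simplified model, not Ceph's actual behaviour in
detail: simulate_flapping and its parameters are hypothetical names, and
each round collapses the heartbeat timeout and the OSD's "I'm not dead"
message into a single step.

```python
def simulate_flapping(rounds, cluster_net_up=False, public_net_up=True):
    """Return the up/down transitions the monitor records for one OSD."""
    history = []
    up = True
    for _ in range(rounds):
        # Peers miss heartbeats over the cluster network and report the OSD.
        if up and not cluster_net_up:
            up = False
            history.append("down")
        # The OSD still reaches the monitor over the public network and
        # protests that it is alive, so it is marked up again.
        if not up and public_net_up:
            up = True
            history.append("up")
    return history


print(simulate_flapping(3))
# prints ['down', 'up', 'down', 'up', 'down', 'up']
```

With both networks down, the same function records a single "down" and
stops, which is the ordinary failure case described earlier in the thread.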
-Greg

