Determining when it's safe to reboot a node?

* Determining when it's safe to reboot a node?
@ 2012-09-25 17:12 Nick Bartos
  2012-09-25 18:37 ` Wido den Hollander
  2012-09-25 19:19 ` Sage Weil
  0 siblings, 2 replies; 3+ messages in thread
From: Nick Bartos @ 2012-09-25 17:12 UTC
  To: ceph-devel

I need to figure out some way of determining when it's OK to safely
reboot a single node.  I believe this involves making sure that at
least one other monitor is running and up to date, and all the PGs on
the local OSDs have up to date copies somewhere else in the cluster.
We're not concerned about MDS at this time, since we're not currently
using the POSIX filesystem.

I recall having a verbal conversation with Sage on this topic, but
apparently I didn't take good notes or I can't find them.  I do
remember the solution was somewhat complicated.  Is there any sort of
straightforward 'ceph' command that can do this now?  If there isn't
one, I think it would be really great if something like that could be
implemented.  It would seem to be a common enough use case to have a
simple command which could tell the admin if rebooting the node would
render the cluster partially unusable.


* Re: Determining when it's safe to reboot a node?
  2012-09-25 17:12 Determining when it's safe to reboot a node? Nick Bartos
@ 2012-09-25 18:37 ` Wido den Hollander
  2012-09-25 19:19 ` Sage Weil
  1 sibling, 0 replies; 3+ messages in thread
From: Wido den Hollander @ 2012-09-25 18:37 UTC
  To: Nick Bartos; +Cc: ceph-devel



On 09/25/2012 07:12 PM, Nick Bartos wrote:
> I need to figure out some way of determining when it's OK to safely
> reboot a single node.  I believe this involves making sure that at
> least one other monitor is running and up to date, and all the PGs on
> the local OSDs have up to date copies somewhere else in the cluster.
> We're not concerned about MDS at this time, since we're not currently
> using the POSIX filesystem.
>
> I recall having a verbal conversation with Sage on this topic, but
> apparently I didn't take good notes or I can't find them.  I do
> remember the solution was somewhat complicated.  Is there any sort of
> straightforward 'ceph' command that can do this now?  If there isn't
> one, I think it would be really great if something like that could be
> implemented.  It would seem to be a common enough use case to have a
> simple command which could tell the admin if rebooting the node would
> render the cluster partially unusable.

Before rebooting that node you can mark the OSDs on that node as out.

For example, say you are planning a reboot of a node hosting OSDs 12 through 15:

$ ceph osd out 12
$ ceph osd out 13
$ ceph osd out 14
$ ceph osd out 15

Depending on how "mon osd auto mark in" is set, you may have
to mark those OSDs "in" again when the node is back.
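
The per-OSD commands above can also be generated rather than typed one by one. What follows is only a minimal sketch in Python: the helper names osd_commands and run_all are made up for illustration, and the only CLI calls it relies on are the 'ceph osd out', 'ceph osd in' and 'ceph osd down' subcommands. It defaults to a dry run that just prints what it would execute.

```python
import subprocess


def osd_commands(action, osd_ids):
    """Return one `ceph osd <action> <id>` argv list per OSD id.

    Hypothetical helper: it only assembles the command lines used in
    this thread and does not talk to the cluster itself.
    """
    if action not in ("out", "in", "down"):
        raise ValueError(f"unsupported action: {action}")
    return [["ceph", "osd", action, str(osd_id)] for osd_id in osd_ids]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print("$", " ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # OSDs 12 through 15, as in the example above.
    run_all(osd_commands("out", range(12, 16)))
```

After the node is back, the same helper with "in" instead of "out" would produce the matching commands, should the OSDs not be marked in automatically.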

The question is whether you want to mark them out at all, since that will
cause data to be re-replicated to meet your replication settings.

If you know the downtime will be short, for example just a kernel update,
you might want to consider just marking them down:

$ ceph osd down 12
$ ceph osd down 13
$ ceph osd down 14
$ ceph osd down 15

That won't trigger re-replication as long as the node is back before
"mon osd down out interval" expires, which is 300 seconds by default.
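
To make the timing constraint concrete, here is a small sketch of the arithmetic. The function name and the safety margin are assumptions added for illustration; only the 300 second default comes from the text above, and a real cluster may be configured differently.

```python
def fits_in_down_out_interval(expected_downtime_s,
                              down_out_interval_s=300,
                              margin_s=60):
    """Return True if the planned downtime, plus a safety margin,
    still ends before the OSDs would be marked out automatically.

    margin_s is a hypothetical buffer for slow boots; the default
    interval of 300 seconds is the value quoted in this thread.
    """
    return expected_downtime_s + margin_s <= down_out_interval_s


if __name__ == "__main__":
    for downtime in (120, 280):
        verdict = "ok" if fits_in_down_out_interval(downtime) else "too long"
        print(f"{downtime}s reboot: {verdict}")
```

If the reboot is expected to take longer than that, the OSDs will be marked out and data will start moving anyway.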

Wido


* Re: Determining when it's safe to reboot a node?
  2012-09-25 17:12 Determining when it's safe to reboot a node? Nick Bartos
  2012-09-25 18:37 ` Wido den Hollander
@ 2012-09-25 19:19 ` Sage Weil
  1 sibling, 0 replies; 3+ messages in thread
From: Sage Weil @ 2012-09-25 19:19 UTC
  To: Nick Bartos; +Cc: ceph-devel

Hi Nick,

On Tue, 25 Sep 2012, Nick Bartos wrote:
> I need to figure out some way of determining when it's OK to safely
> reboot a single node.  I believe this involves making sure that at
> least one other monitor is running and up to date, and all the PGs on
> the local OSDs have up to date copies somewhere else in the cluster.
> We're not concerned about MDS at this time, since we're not currently
> using the POSIX filesystem.
> 
> I recall having a verbal conversation with Sage on this topic, but
> apparently I didn't take good notes or I can't find them.  I do
> remember the solution was somewhat complicated.  Is there any sort of
> straightforward 'ceph' command that can do this now?  If there isn't
> one, I think it would be really great if something like that could be
> implemented.  It would seem to be a common enough use case to have a
> simple command which could tell the admin if rebooting the node would
> render the cluster partially unusable.

Making a conservative determination should be pretty straightforward.  
Something like:

 - make sure losing any local mon won't break quorum
 - make sure all PGs touching local osd(s) are active+clean and have other 
   osds in the acting set

should do the trick, as a first pass at least.  This can all be done by
analyzing the output of 'ceph pg dump --format=json',
'ceph osd dump --format=json', and 'ceph quorum_status'.  The annoying part
is just mapping IP addresses to osds and mons to figure out which ones are
local...
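
The two conservative checks listed above could be sketched roughly as follows. This is not an existing 'ceph' command and not a definitive implementation: the function names are made up, and the code assumes the JSON output has already been parsed and reduced to simple structures. In particular, the keys "state", "acting" and the address format "ip:port/nonce" are assumptions about what the dumps contain and should be checked against the output of your Ceph version.

```python
def local_ids(entries, local_ips, id_key, addr_key):
    """Ids of entries (osds or mons) whose address is on this host.

    Assumes each entry is a dict holding an address string such as
    "10.0.0.5:6800/1234"; the key names are passed in by the caller
    because they are not confirmed here.
    """
    found = set()
    for entry in entries:
        host = entry[addr_key].rsplit(":", 1)[0].strip("[]")
        if host in local_ips:
            found.add(entry[id_key])
    return found


def quorum_survives(all_mons, quorum, local_mons):
    """True if a strict majority of all monitors remains in quorum
    after the local monitor(s) go away."""
    remaining = set(quorum) - set(local_mons)
    return len(remaining) > len(all_mons) // 2


def pgs_safe(pg_stats, local_osds):
    """True if every PG touching a local OSD is active+clean and has
    at least one non-local OSD in its acting set.

    Deliberately conservative: any state other than exactly
    "active+clean" counts as unsafe.
    """
    local = set(local_osds)
    for pg in pg_stats:
        acting = set(pg["acting"])
        if not acting & local:
            continue  # this PG does not involve the node at all
        if pg["state"] != "active+clean" or not acting - local:
            return False
    return True


def safe_to_reboot(all_mons, quorum, local_mons, pg_stats, local_osds):
    """Combine both checks into a single yes/no answer."""
    return (quorum_survives(all_mons, quorum, local_mons)
            and pgs_safe(pg_stats, local_osds))
```

Treating anything other than a plain active+clean state as unsafe will sometimes refuse a reboot that would have been fine, which seems the right trade-off for a first pass.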

sage
