Re: A problem when restarting OSD

From: David Moreau Simard <dmsimard@iweb.com>
To: "Wang, Zhiqiang" <zhiqiang.wang@intel.com>, Sage Weil <sweil@redhat.com>
Cc: "'ceph-devel@vger.kernel.org'" <ceph-devel@vger.kernel.org>
Subject: Re: A problem when restarting OSD
Date: Fri, 22 Aug 2014 02:33:34 +0000
Message-ID: <D01C1AB1.1B410%dmsimard@iweb.com>
In-Reply-To: <06E7D85B3BA36C4DB207FEDE871C5348936F36@SHSMSX101.ccr.corp.intel.com>

I'm glad you mentioned this, because I've been running into the same
issue and it took me a while to figure out too.

Is this new behaviour? I don't remember running into this before...

Sage mentions multiple trees, but I've had this happen with a single
root.
It is definitely not my expectation that restarting an OSD would move
things around in the crush map.
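
If I read the explanation quoted further down correctly, the init script takes the location from ceph.conf and otherwise falls back to 'host=$(hostname -s) root=default'. Here is a minimal sketch of that fallback as I understand it. The function name and its arguments are my own, for illustration only; this is not the actual ceph-crush-location code.

```python
def resolve_crush_location(conf_location, fqdn):
    """Return the crush location an OSD would be given on start.

    conf_location: the crush location string from ceph.conf, or None/"" if unset.
    fqdn: the machine's hostname, possibly fully qualified.
    """
    if conf_location:
        # An explicit location in ceph.conf wins.
        return conf_location
    # Same idea as `hostname -s`: keep only the part before the first dot.
    short_host = fqdn.split(".")[0]
    return "host={} root=default".format(short_host)
```

Note that the fallback string names only a host and a root, so it carries no information about any intermediate bucket type.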

I'm in the process of developing a crush map that looks like this (note:
it is unfinished and does not make much sense as is):
http://pastebin.com/6vBUQTCk
It results in this tree:
# id	weight	type name	up/down	reweight
-1	18	root default
-2	9		host osd02
-4	2			disktype osd02_ssd
3	1				osd.3	up	1
9	1				osd.9	up	1
-5	7			disktype osd02_spinning
8	1				osd.8	up	1
17	1				osd.17	up	1
5	1				osd.5	up	1
11	1				osd.11	up	1
1	1				osd.1	up	1
13	1				osd.13	up	1
15	1				osd.15	up	1
-3	9		host osd01
-6	2			disktype osd01_ssd
2	1				osd.2	up	1
7	1				osd.7	up	1
-7	7			disktype osd01_spinning
0	1				osd.0	up	1
4	1				osd.4	up	1
12	1				osd.12	up	1
6	1				osd.6	up	1
14	1				osd.14	up	1
10	1				osd.10	up	1
16	1				osd.16	up	1

Merely restarting the OSDs on both hosts modifies the crush map:
http://pastebin.com/rP8Y8qcH
This is the resulting tree:
# id	weight	type name	up/down	reweight
-1	18	root default
-2	9		host osd02
-4	0			disktype osd02_ssd
-5	0			disktype osd02_spinning
13	1			osd.13	up	1
3	1			osd.3	up	1
5	1			osd.5	up	1
1	1			osd.1	up	1
11	1			osd.11	up	1
15	1			osd.15	up	1
17	1			osd.17	up	1
8	1			osd.8	up	1
9	1			osd.9	up	1
-3	9		host osd01
-6	0			disktype osd01_ssd
-7	0			disktype osd01_spinning
0	1			osd.0	up	1
10	1			osd.10	up	1
12	1			osd.12	up	1
14	1			osd.14	up	1
16	1			osd.16	up	1
2	1			osd.2	up	1
4	1			osd.4	up	1
7	1			osd.7	up	1
6	1			osd.6	up	1

Would a hook really be the solution I need?
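
For the sake of discussion, here is a rough sketch of what such a hook might look like for my tree, using the marker-file idea Sage suggests below. It is only a sketch under assumptions: the marker file name `crush_disktype` is something I made up, the OSD data path is the default `/var/lib/ceph/osd/$cluster-$id`, and the bucket names follow my own `<host>_<disktype>` convention. My understanding is that the hook is called with `--cluster`, `--id` and `--type`, and must print a single location line on stdout.

```python
import argparse
import os
import socket


def crush_location(osd_dir, host, marker="crush_disktype", default_type="spinning"):
    """Build a one-line crush location for the OSD whose data lives in osd_dir.

    The disk type is read from a marker file (hypothetical name) in the OSD
    data directory; if the file is missing or empty, default_type is used.
    """
    disktype = default_type
    marker_path = os.path.join(osd_dir, marker)
    if os.path.isfile(marker_path):
        with open(marker_path) as fh:
            value = fh.read().strip()
        if value:
            disktype = value
    # Bucket naming (<host>_<disktype>) is specific to my crush map.
    return "root=default host={0} disktype={0}_{1}".format(host, disktype)


def main(argv=None):
    """Entry point: parse the arguments passed to the hook and print the location."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster", default="ceph")
    parser.add_argument("--id", required=True)
    parser.add_argument("--type", default="osd")
    args = parser.parse_args(argv)
    # Default OSD data directory layout; adjust if 'osd data' is overridden.
    osd_dir = "/var/lib/ceph/osd/{}-{}".format(args.cluster, args.id)
    host = socket.gethostname().split(".")[0]
    print(crush_location(osd_dir, host))
```

To use it as a hook, the script would call main() when executed and be referenced from ceph.conf with 'osd crush location hook'. An OSD without a marker file would be reported as spinning, so only the SSD-backed OSDs would need one.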
-- 
David Moreau Simard

On 2014-08-21, 9:36 PM, "Wang, Zhiqiang" <zhiqiang.wang@intel.com>
wrote:

>Hi Sage,
>
>Yes, I understand that we can customize the crush location hook to put
>the OSD in the right location. But would a ceph user know about this if
>he/she has more than one root in the crush map? At least I didn't know
>this at the beginning. We need to either emphasize this or handle it for
>the user in some way.
>
>One question about the hot-swapping support for moving an OSD to another
>host: what if the journal is not located on the same disk as the OSD?
>Would the OSD still be available in the cluster?
>
>-----Original Message-----
>From: Sage Weil [mailto:sweil@redhat.com]
>Sent: Thursday, August 21, 2014 11:28 PM
>To: Wang, Zhiqiang
>Cc: 'ceph-devel@vger.kernel.org'
>Subject: Re: A problem when restarting OSD
>
>On Thu, 21 Aug 2014, Wang, Zhiqiang wrote:
>> Hi all,
>> 
>> I ran into a problem when restarting an OSD.
>> 
>> Here is my OSD tree before restarting the OSD:
>> 
>> # id    weight  type name       up/down reweight
>> -6      8       root ssd
>> -4      4               host zqw-s1-ssd
>> 16      1                       osd.16  up      1
>> 17      1                       osd.17  up      1
>> 18      1                       osd.18  up      1
>> 19      1                       osd.19  up      1
>> -5      4               host zqw-s2-ssd
>> 20      1                       osd.20  up      1
>> 21      1                       osd.21  up      1
>> 22      1                       osd.22  up      1
>> 23      1                       osd.23  up      1
>> -1      14.56   root default
>> -2      7.28            host zqw-s1
>> 0       0.91                    osd.0   up      1
>> 1       0.91                    osd.1   up      1
>> 2       0.91                    osd.2   up      1
>> 3       0.91                    osd.3   up      1
>> 4       0.91                    osd.4   up      1
>> 5       0.91                    osd.5   up      1
>> 6       0.91                    osd.6   up      1
>> 7       0.91                    osd.7   up      1
>> -3      7.28            host zqw-s2
>> 8       0.91                    osd.8   up      1
>> 9       0.91                    osd.9   up      1
>> 10      0.91                    osd.10  up      1
>> 11      0.91                    osd.11  up      1
>> 12      0.91                    osd.12  up      1
>> 13      0.91                    osd.13  up      1
>> 14      0.91                    osd.14  up      1
>> 15      0.91                    osd.15  up      1
>> 
>> After I restart one of the OSDs with an id from 16 to 23, say osd.16,
>> it goes to 'root default' and 'host zqw-s1', and the ceph cluster
>> begins to rebalance. This surely is not what I want.
>> 
>> # id    weight  type name       up/down reweight
>> -6      7       root ssd
>> -4      3               host zqw-s1-ssd
>> 17      1                       osd.17  up      1
>> 18      1                       osd.18  up      1
>> 19      1                       osd.19  up      1
>> -5      4               host zqw-s2-ssd
>> 20      1                       osd.20  up      1
>> 21      1                       osd.21  up      1
>> 22      1                       osd.22  up      1
>> 23      1                       osd.23  up      1
>> -1      15.56   root default
>> -2      8.28            host zqw-s1
>> 0       0.91                    osd.0   up      1
>> 1       0.91                    osd.1   up      1
>> 2       0.91                    osd.2   up      1
>> 3       0.91                    osd.3   up      1
>> 4       0.91                    osd.4   up      1
>> 5       0.91                    osd.5   up      1
>> 6       0.91                    osd.6   up      1
>> 7       0.91                    osd.7   up      1
>> 16      1                       osd.16  up      1
>> -3      7.28            host zqw-s2
>> 8       0.91                    osd.8   up      1
>> 9       0.91                    osd.9   up      1
>> 10      0.91                    osd.10  up      1
>> 11      0.91                    osd.11  up      1
>> 12      0.91                    osd.12  up      1
>> 13      0.91                    osd.13  up      1
>> 14      0.91                    osd.14  up      1
>> 15      0.91                    osd.15  up      1
>> 
>> After digging into the problem, I found that the ceph init script
>> changes the OSD's crush location. It uses the 'ceph-crush-location'
>> script to get the crush location of the restarting OSD from the
>> ceph.conf file. If there is no such entry in ceph.conf, it uses the
>> default, 'host=$(hostname -s) root=default'. Since I don't have a crush
>> location configured in my ceph.conf (I guess most people don't), when I
>> restart osd.16, it goes to 'root default' and 'host zqw-s1'.
>> 
>> Here is a fix for this:
>> When the ceph init script uses 'ceph osd crush create-or-move' to
>> change the OSD's crush location, do a check first: if this OSD
>> already exists in the crush map, return without changing its
>> location. This change is at:
>> https://github.com/wonzhq/ceph/commit/efdfa23664caa531390d141bd1539878761412fe
>> 
>> What do you think?
>
>The goal of this behavior is to allow hot-swapping of devices.  You can
>pull disks out of one host and put them in another and the udev machinery
>will start up the daemon, update the crush location, and the disk and
>data will become available.  It's not 'ideal' in the sense that there
>will be rebalancing, but it does make the data available to the cluster
>to preserve data safety.
>
>We haven't come up with a great scheme for managing multiple trees yet.
>The idea is that the ceph-crush-location hook can be customized to do
>whatever is necessary, for example by putting root=ssd if the device type
>appears to be an ssd (maybe look at the sysfs metadata, or put a marker
>file in the osd data directory?).  You can point to your own hook for
>your environment with
>
>  osd crush location hook = /path/to/my/script
>
>sage
>
>
>
>--
>To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>the body of a message to majordomo@vger.kernel.org
>More majordomo info at  http://vger.kernel.org/majordomo-info.html
