From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Moreau Simard <dmsimard@iweb.com>
Subject: Re: A problem when restarting OSD
Date: Fri, 22 Aug 2014 14:02:40 +0000
Message-ID: <D01CC247.1B4AD%dmsimard@iweb.com>
References: <06E7D85B3BA36C4DB207FEDE871C5348936F58@SHSMSX101.ccr.corp.intel.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1254
Content-Transfer-Encoding: QUOTED-PRINTABLE
Return-path: <ceph-devel-owner@vger.kernel.org>
Received: from bmx1.iweb.com ([70.38.127.135]:44556 "EHLO bmx1.iweb.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1756468AbaHVOCn convert rfc822-to-8bit (ORCPT
	<rfc822;ceph-devel@vger.kernel.org>); Fri, 22 Aug 2014 10:02:43 -0400
In-Reply-To: <06E7D85B3BA36C4DB207FEDE871C5348936F58@SHSMSX101.ccr.corp.intel.com>
Content-Language: fr-FR
Content-ID: <87F87352C17ED74FA216CA6DAF046404@corp.iweb.com>
Sender: ceph-devel-owner@vger.kernel.org
List-ID: <ceph-devel.vger.kernel.org>
To: "Wang, Zhiqiang" <zhiqiang.wang@intel.com>, Sage Weil <sweil@redhat.com>
Cc: "'ceph-devel@vger.kernel.org'" <ceph-devel@vger.kernel.org>

Hi Wang,

Thanks, I'll try that for the time being. This still raises a few
questions I'd like to discuss.

I'm convinced we can agree that the CRUSH map is ultimately the authority
on where the devices are currently located.
My understanding is that we are relying on another source for device
location when (in this case) restarting OSDs: the ceph.conf file.

1) Does this imply that we probably shouldn't specify device locations
directly in the CRUSH map but in our ceph.conf file instead?
2) If what is in the CRUSH map differs from what is configured in
ceph.conf, how does Ceph decide which is the authority? Shouldn't it be
the CRUSH map? In this case, it appears to be the ceph.conf file.
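
To make question 2 concrete, here is a minimal Python sketch of the
behaviour as it is described further down in this thread: the location
comes from ceph.conf if an entry exists, otherwise from the default
'host=$(hostname -s) root=default'. The function name and its arguments
are made up for illustration; this is not the actual init script or
ceph-crush-location code.

```python
import socket


def crush_location_on_start(conf_location=None, short_hostname=None):
    """Return the location an OSD would be moved to when it starts.

    conf_location: value of 'osd crush location' from ceph.conf, or None
                   when there is no such entry (hypothetical parameter).
    short_hostname: stands in for the output of $(hostname -s).
    """
    if conf_location:
        # An explicit ceph.conf entry is used as-is.
        return conf_location
    if short_hostname is None:
        short_hostname = socket.gethostname().split(".")[0]
    # No entry: fall back to the default described in the thread.
    return "host={} root=default".format(short_hostname)


# osd.16 without a ceph.conf entry falls back to the default location.
print(crush_location_on_start(None, "zqw-s1"))
# With an explicit entry, that entry is used instead.
print(crush_location_on_start("host=zqw-s1-ssd root=ssd", "zqw-s1"))
```

Note that the placement already recorded in the CRUSH map is not an
input to this logic at all, which is the part I find surprising.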

Just trying to wrap my head around the vision of how things should be
managed.
--
David Moreau Simard


On 2014-08-21, 10:57 PM, "Wang, Zhiqiang" <zhiqiang.wang@intel.com> wrote:

>Hi David,
>
>Yes, I think adding a hook in your ceph.conf can solve your problem. At
>least this is what I did, and it solves the problem.
>
>For example:
>
>[osd.3]
>osd crush location = "host=osd02 root=default disktype=osd02_ssd"
>
>You need to add this for every osd.
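
Since this has to be repeated for every OSD, the sections could be
generated rather than typed by hand. A minimal Python sketch, assuming
the placements are kept in a plain dict (the function name and the dict
layout are hypothetical; the two OSDs are the ones under osd02_ssd in
the tree below):

```python
def render_osd_sections(placements):
    """Render one [osd.N] section per OSD.

    placements maps an OSD id to a dict of CRUSH bucket type -> bucket
    name, in the order the pairs should appear on the line.
    """
    sections = []
    for osd_id in sorted(placements):
        location = " ".join(
            "{}={}".format(bucket_type, name)
            for bucket_type, name in placements[osd_id].items()
        )
        sections.append(
            '[osd.{}]\nosd crush location = "{}"\n'.format(osd_id, location)
        )
    return "\n".join(sections)


placements = {
    3: {"host": "osd02", "root": "default", "disktype": "osd02_ssd"},
    9: {"host": "osd02", "root": "default", "disktype": "osd02_ssd"},
}
print(render_osd_sections(placements))
```

The output would still need to be merged into ceph.conf by hand or by
whatever configuration management is in use.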
>
>-----Original Message-----
>From: David Moreau Simard [mailto:dmsimard@iweb.com]
>Sent: Friday, August 22, 2014 10:34 AM
>To: Wang, Zhiqiang; Sage Weil
>Cc: 'ceph-devel@vger.kernel.org'
>Subject: Re: A problem when restarting OSD
>
>I'm glad you mention this because I've also been running into the same
>issue and this took me a while to figure out too.
>
>Is this new behaviour? I don't remember running into this before...
>
>Sage does mention multiple trees, but I've had this happen with a single
>root.
>It is definitely not my expectation that restarting an OSD would move
>things around in the crush map.
>
>I'm in the process of developing a crush map; it looks like this (note:
>unfinished, and it does not make much sense as is):
>http://pastebin.com/6vBUQTCk
>This results in this tree:
># id	weight	type name	up/down	reweight
>-1	18	root default
>-2	9		host osd02
>-4	2			disktype osd02_ssd
>3	1				osd.3	up	1
>9	1				osd.9	up	1
>-5	7			disktype osd02_spinning
>8	1				osd.8	up	1
>17	1				osd.17	up	1
>5	1				osd.5	up	1
>11	1				osd.11	up	1
>1	1				osd.1	up	1
>13	1				osd.13	up	1
>15	1				osd.15	up	1
>-3	9		host osd01
>-6	2			disktype osd01_ssd
>2	1				osd.2	up	1
>7	1				osd.7	up	1
>-7	7			disktype osd01_spinning
>0	1				osd.0	up	1
>4	1				osd.4	up	1
>12	1				osd.12	up	1
>6	1				osd.6	up	1
>14	1				osd.14	up	1
>10	1				osd.10	up	1
>16	1				osd.16	up	1
>
>Simply restarting the OSDs on both hosts modifies the crush map:
>http://pastebin.com/rP8Y8qcH
>With the resulting tree:
># id	weight	type name	up/down	reweight
>-1	18	root default
>-2	9		host osd02
>-4	0			disktype osd02_ssd
>-5	0			disktype osd02_spinning
>13	1			osd.13	up	1
>3	1			osd.3	up	1
>5	1			osd.5	up	1
>1	1			osd.1	up	1
>11	1			osd.11	up	1
>15	1			osd.15	up	1
>17	1			osd.17	up	1
>8	1			osd.8	up	1
>9	1			osd.9	up	1
>-3	9		host osd01
>-6	0			disktype osd01_ssd
>-7	0			disktype osd01_spinning
>0	1			osd.0	up	1
>10	1			osd.10	up	1
>12	1			osd.12	up	1
>14	1			osd.14	up	1
>16	1			osd.16	up	1
>2	1			osd.2	up	1
>4	1			osd.4	up	1
>7	1			osd.7	up	1
>6	1			osd.6	up	1
>
>Would a hook really be the solution I need?
>--
>David Moreau Simard
>
>On 2014-08-21, 9:36 PM, "Wang, Zhiqiang" <zhiqiang.wang@intel.com> wrote:
>
>>Hi Sage,
>>
>>Yes, I understand that we can customize the crush location hook to let
>>the OSD go to the right location. But would a Ceph user know about
>>this if he/she has more than one root in the crush map? At least I
>>didn't know this at the beginning. We need to either emphasize this or
>>do it in some way for the user.
>>
>>One question about the hot-swapping support for moving an OSD to
>>another host: what if the journal is not located on the same disk as
>>the OSD? Will the OSD still be available in the cluster?
>>
>>-----Original Message-----
>>From: Sage Weil [mailto:sweil@redhat.com]
>>Sent: Thursday, August 21, 2014 11:28 PM
>>To: Wang, Zhiqiang
>>Cc: 'ceph-devel@vger.kernel.org'
>>Subject: Re: A problem when restarting OSD
>>
>>On Thu, 21 Aug 2014, Wang, Zhiqiang wrote:
>>> Hi all,
>>>
>>> I ran into a problem when restarting an OSD.
>>>
>>> Here is my OSD tree before restarting the OSD:
>>>
>>> # id    weight  type name       up/down reweight
>>> -6      8       root ssd
>>> -4      4               host zqw-s1-ssd
>>> 16      1                       osd.16  up      1
>>> 17      1                       osd.17  up      1
>>> 18      1                       osd.18  up      1
>>> 19      1                       osd.19  up      1
>>> -5      4               host zqw-s2-ssd
>>> 20      1                       osd.20  up      1
>>> 21      1                       osd.21  up      1
>>> 22      1                       osd.22  up      1
>>> 23      1                       osd.23  up      1
>>> -1      14.56   root default
>>> -2      7.28            host zqw-s1
>>> 0       0.91                    osd.0   up      1
>>> 1       0.91                    osd.1   up      1
>>> 2       0.91                    osd.2   up      1
>>> 3       0.91                    osd.3   up      1
>>> 4       0.91                    osd.4   up      1
>>> 5       0.91                    osd.5   up      1
>>> 6       0.91                    osd.6   up      1
>>> 7       0.91                    osd.7   up      1
>>> -3      7.28            host zqw-s2
>>> 8       0.91                    osd.8   up      1
>>> 9       0.91                    osd.9   up      1
>>> 10      0.91                    osd.10  up      1
>>> 11      0.91                    osd.11  up      1
>>> 12      0.91                    osd.12  up      1
>>> 13      0.91                    osd.13  up      1
>>> 14      0.91                    osd.14  up      1
>>> 15      0.91                    osd.15  up      1
>>>
>>> After I restart one of the OSDs with an id from 16 to 23, say
>>> osd.16, it moves to 'root default' and 'host zqw-s1', and the Ceph
>>> cluster begins to rebalance. This is surely not what I want.
>>>
>>> # id    weight  type name       up/down reweight
>>> -6      7       root ssd
>>> -4      3               host zqw-s1-ssd
>>> 17      1                       osd.17  up      1
>>> 18      1                       osd.18  up      1
>>> 19      1                       osd.19  up      1
>>> -5      4               host zqw-s2-ssd
>>> 20      1                       osd.20  up      1
>>> 21      1                       osd.21  up      1
>>> 22      1                       osd.22  up      1
>>> 23      1                       osd.23  up      1
>>> -1      15.56   root default
>>> -2      8.28            host zqw-s1
>>> 0       0.91                    osd.0   up      1
>>> 1       0.91                    osd.1   up      1
>>> 2       0.91                    osd.2   up      1
>>> 3       0.91                    osd.3   up      1
>>> 4       0.91                    osd.4   up      1
>>> 5       0.91                    osd.5   up      1
>>> 6       0.91                    osd.6   up      1
>>> 7       0.91                    osd.7   up      1
>>> 16      1                       osd.16  up      1
>>> -3      7.28            host zqw-s2
>>> 8       0.91                    osd.8   up      1
>>> 9       0.91                    osd.9   up      1
>>> 10      0.91                    osd.10  up      1
>>> 11      0.91                    osd.11  up      1
>>> 12      0.91                    osd.12  up      1
>>> 13      0.91                    osd.13  up      1
>>> 14      0.91                    osd.14  up      1
>>> 15      0.91                    osd.15  up      1
>>>
>>> After digging into the problem, I found that the ceph init script
>>> changes the OSD's crush location. It uses the script
>>> 'ceph-crush-location' to get the crush location from the ceph.conf
>>> file for the restarting OSD. If there is no such entry in ceph.conf,
>>> it uses the default, 'host=$(hostname -s) root=default'.
>>> Since I don't have the crush location configured in my ceph.conf (I
>>> guess most people don't have this in their ceph.conf), when I
>>> restart osd.16, it goes to 'root default' and 'host zqw-s1'.
>>>
>>> Here is a fix for this:
>>> When the ceph init script uses 'ceph osd crush create-or-move' to
>>> change the OSD's crush location, check first: if this OSD already
>>> exists in the crush map, return without changing its location.
>>> This change is at:
>>> https://github.com/wonzhq/ceph/commit/efdfa23664caa531390d141bd1539878761412fe
>>>
>>> What do you think?
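
The proposed check can be illustrated with a toy model. This is only a
sketch that treats the crush map as a Python dict from OSD name to
location string; the function names are hypothetical, and it is neither
the ceph CLI nor the code in the linked commit.

```python
def create_or_move(crush_map, osd, location):
    """Model of the current behaviour: always place the OSD at `location`."""
    crush_map[osd] = location


def create_if_absent(crush_map, osd, location):
    """Model of the proposed check: leave an already-placed OSD alone."""
    if osd in crush_map:
        return False  # already in the map, do not touch it
    crush_map[osd] = location
    return True


default_location = "host=zqw-s1 root=default"

current = {"osd.16": "host=zqw-s1-ssd root=ssd"}
create_or_move(current, "osd.16", default_location)
print(current["osd.16"])  # the OSD has left the ssd root

proposed = {"osd.16": "host=zqw-s1-ssd root=ssd"}
create_if_absent(proposed, "osd.16", default_location)
print(proposed["osd.16"])  # the OSD stays under the ssd root
```

The trade-off is the one Sage raises in his reply: with such a check, a
disk physically moved to another host would no longer be relocated in
the crush map automatically.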
>>
>>The goal of this behavior is to allow hot-swapping of devices.  You can
>>pull disks out of one host and put them in another and the udev
>>machinery will start up the daemon, update the crush location, and the
>>disk and data will become available.  It's not 'ideal' in the sense
>>that there will be rebalancing, but it does make the data available to
>>the cluster to preserve data safety.
>>
>>We haven't come up with a great scheme for managing multiple trees
>>yet.
>>The idea is that the ceph-crush-location hook can be customized to do
>>whatever is necessary, for example by putting root=ssd if the device
>>type appears to be an ssd (maybe look at the sysfs metadata, or put a
>>marker file in the osd data directory?).  You can point to your own
>>hook for your environment with
>>
>>  osd crush location hook = /path/to/my/script
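
A minimal sketch of what such a hook might look like in Python, using
the marker-file idea. Several things here are assumptions rather than
facts from this thread: that the hook is called with --cluster, --id
and --type and prints the location on one line, that the OSD data
directory follows the /var/lib/ceph/osd/<cluster>-<id> layout, the
marker file name 'ssd_marker', and the '<host>-ssd' bucket naming
(borrowed from the zqw-s1-ssd example).

```python
import argparse
import os
import socket


def crush_location(osd_data_dir, short_hostname, marker="ssd_marker"):
    """Pick root=ssd when the (hypothetical) marker file exists."""
    if os.path.exists(os.path.join(osd_data_dir, marker)):
        return "host={}-ssd root=ssd".format(short_hostname)
    return "host={} root=default".format(short_hostname)


def main(argv, data_root="/var/lib/ceph/osd", short_hostname=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--cluster", default="ceph")
    parser.add_argument("--id", default="0")
    parser.add_argument("--type", default="osd")
    # Ignore any arguments this sketch does not know about.
    args, _ignored = parser.parse_known_args(argv)
    if short_hostname is None:
        short_hostname = socket.gethostname().split(".")[0]
    data_dir = os.path.join(data_root, "{}-{}".format(args.cluster, args.id))
    return crush_location(data_dir, short_hostname)


# A real hook would pass sys.argv[1:]; fixed arguments keep the demo simple.
print(main(["--cluster", "ceph", "--id", "16", "--type", "osd"]))
```

Checking the sysfs metadata instead of a marker file would only change
the body of crush_location().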
>>
>>sage
>>
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html