From mboxrd@z Thu Jan  1 00:00:00 1970
From: David Moreau Simard
Subject: Re: A problem when restarting OSD
Date: Fri, 22 Aug 2014 14:14:11 +0000
Sender: ceph-devel-owner@vger.kernel.org
To: Sage Weil
Cc: "Wang, Zhiqiang", "'ceph-devel@vger.kernel.org'"

Ah, that does clear things up!

I didn't even know that there was a toggle for 'osd crush update on
start' - my bad.
I searched through the documentation and couldn't find anything on that
topic.
Perhaps we should add a bit about that here:
http://ceph.com/docs/master/rados/operations/crush-map/#crush-location

I'll open a pull request.
--
David Moreau Simard


On 2014-08-22, 10:06 AM, "Sage Weil" wrote:

>On Fri, 22 Aug 2014, David Moreau Simard wrote:
>> Hi Wang,
>>
>> Thanks, I'll try that for the time being. This still raises a few
>> questions I'd like to discuss.
>>
>> I'm convinced we can agree that the CRUSH map is ultimately the
>> authority as far as the current location of the devices is concerned.
>> My understanding is that we are relying on another source for device
>> location when (in this case) restarting OSDs: the ceph.conf file.
>>
>> 1) Does this imply that we probably shouldn't specify device locations
>> directly in the crush map but in our ceph.conf file instead?
>> 2) If what is in the crush map is different from what is configured in
>> ceph.conf, how does Ceph decide which is the authority? Shouldn't it be
>> the crush map? In this case, it appears to be the ceph.conf file.
>>
>> Just trying to wrap my head around the vision of how things should be
>> managed.
>
>Generally speaking, you have three options:
>
> - 'osd crush update on start = false' and do it all manually, like
>   you're used to.
> - set 'crush location = a=b c=d e=f' in ceph.conf.  The expectation is
>   that chef or puppet or whatever will fill this in with "host=foo
>   rack=bar dc=asdf".
> - customize ceph-crush-location to do something trickier (like multiple
>   trees)
>
>sage
>
>
>> --
>> David Moreau Simard
>>
>>
>> On 2014-08-21, 10:57 PM, "Wang, Zhiqiang" wrote:
>>
>> >Hi David,
>> >
>> >Yes, I think adding a hook in your ceph.conf can solve your problem. At
>> >least this is what I did, and it solves the problem.
>> >
>> >For example:
>> >
>> >[osd.3]
>> >osd crush location = "host=osd02 root=default disktype=osd02_ssd"
>> >
>> >You need to add this for every osd.
>> >
>> >-----Original Message-----
>> >From: David Moreau Simard [mailto:dmsimard@iweb.com]
>> >Sent: Friday, August 22, 2014 10:34 AM
>> >To: Wang, Zhiqiang; Sage Weil
>> >Cc: 'ceph-devel@vger.kernel.org'
>> >Subject: Re: A problem when restarting OSD
>> >
>> >I'm glad you mention this because I've also been running into the same
>> >issue and this took me a while to figure out too.
>> >
>> >Is this new behaviour? I don't remember running into this before...
>> >
>> >Sage does mention multiple trees but I've had this happen with a single
>> >root.
>> >It is definitely not my expectation that restarting an OSD would move
>> >things around in the crush map.
>> >
>> >I'm in the process of developing a crush map, which looks like this
>> >(note: unfinished and does not make much sense as is):
>> >http://pastebin.com/6vBUQTCk
>> >This results in this tree:
>> ># id  weight  type name                     up/down  reweight
>> >-1    18      root default
>> >-2    9         host osd02
>> >-4    2           disktype osd02_ssd
>> >3     1             osd.3                   up       1
>> >9     1             osd.9                   up       1
>> >-5    7           disktype osd02_spinning
>> >8     1             osd.8                   up       1
>> >17    1             osd.17                  up       1
>> >5     1             osd.5                   up       1
>> >11    1             osd.11                  up       1
>> >1     1             osd.1                   up       1
>> >13    1             osd.13                  up       1
>> >15    1             osd.15                  up       1
>> >-3    9         host osd01
>> >-6    2           disktype osd01_ssd
>> >2     1             osd.2                   up       1
>> >7     1             osd.7                   up       1
>> >-7    7           disktype osd01_spinning
>> >0     1             osd.0                   up       1
>> >4     1             osd.4                   up       1
>> >12    1             osd.12                  up       1
>> >6     1             osd.6                   up       1
>> >14    1             osd.14                  up       1
>> >10    1             osd.10                  up       1
>> >16    1             osd.16                  up       1
>> >
>> >Simply restarting the OSDs on both hosts modifies the crush map:
>> >http://pastebin.com/rP8Y8qcH
>> >With the resulting tree:
>> ># id  weight  type name                     up/down  reweight
>> >-1    18      root default
>> >-2    9         host osd02
>> >-4    0           disktype osd02_ssd
>> >-5    0           disktype osd02_spinning
>> >13    1           osd.13                    up       1
>> >3     1           osd.3                     up       1
>> >5     1           osd.5                     up       1
>> >1     1           osd.1                     up       1
>> >11    1           osd.11                    up       1
>> >15    1           osd.15                    up       1
>> >17    1           osd.17                    up       1
>> >8     1           osd.8                     up       1
>> >9     1           osd.9                     up       1
>> >-3    9         host osd01
>> >-6    0           disktype osd01_ssd
>> >-7    0           disktype osd01_spinning
>> >0     1           osd.0                     up       1
>> >10    1           osd.10                    up       1
>> >12    1           osd.12                    up       1
>> >14    1           osd.14                    up       1
>> >16    1           osd.16                    up       1
>> >2     1           osd.2                     up       1
>> >4     1           osd.4                     up       1
>> >7     1           osd.7                     up       1
>> >6     1           osd.6                     up       1
>> >
>> >Would a hook really be the solution I need?
>> >--
>> >David Moreau Simard
>> >
>> >On 2014-08-21, 9:36 PM, "Wang, Zhiqiang" wrote:
>> >
>> >>Hi Sage,
>> >>
>> >>Yes, I understand that we can customize the crush location hook to let
>> >>the OSD go to the right location. But would a ceph user be aware of
>> >>this if he/she has more than 1 root in the crush map? At least I
>> >>didn't know this at the beginning.
>> >>We need to either emphasize this or
>> >>do it in some way for the user.
>> >>
>> >>One question about the hot-swapping support for moving an OSD to
>> >>another host: what if the journal is not located on the same disk as
>> >>the OSD? Is the OSD still able to be available in the cluster?
>> >>
>> >>-----Original Message-----
>> >>From: Sage Weil [mailto:sweil@redhat.com]
>> >>Sent: Thursday, August 21, 2014 11:28 PM
>> >>To: Wang, Zhiqiang
>> >>Cc: 'ceph-devel@vger.kernel.org'
>> >>Subject: Re: A problem when restarting OSD
>> >>
>> >>On Thu, 21 Aug 2014, Wang, Zhiqiang wrote:
>> >>> Hi all,
>> >>>
>> >>> I ran into a problem when restarting an OSD.
>> >>>
>> >>> Here is my OSD tree before restarting the OSD:
>> >>>
>> >>> # id  weight  type name           up/down  reweight
>> >>> -6    8       root ssd
>> >>> -4    4         host zqw-s1-ssd
>> >>> 16    1           osd.16          up       1
>> >>> 17    1           osd.17          up       1
>> >>> 18    1           osd.18          up       1
>> >>> 19    1           osd.19          up       1
>> >>> -5    4         host zqw-s2-ssd
>> >>> 20    1           osd.20          up       1
>> >>> 21    1           osd.21          up       1
>> >>> 22    1           osd.22          up       1
>> >>> 23    1           osd.23          up       1
>> >>> -1    14.56   root default
>> >>> -2    7.28      host zqw-s1
>> >>> 0     0.91        osd.0           up       1
>> >>> 1     0.91        osd.1           up       1
>> >>> 2     0.91        osd.2           up       1
>> >>> 3     0.91        osd.3           up       1
>> >>> 4     0.91        osd.4           up       1
>> >>> 5     0.91        osd.5           up       1
>> >>> 6     0.91        osd.6           up       1
>> >>> 7     0.91        osd.7           up       1
>> >>> -3    7.28      host zqw-s2
>> >>> 8     0.91        osd.8           up       1
>> >>> 9     0.91        osd.9           up       1
>> >>> 10    0.91        osd.10          up       1
>> >>> 11    0.91        osd.11          up       1
>> >>> 12    0.91        osd.12          up       1
>> >>> 13    0.91        osd.13          up       1
>> >>> 14    0.91        osd.14          up       1
>> >>> 15    0.91        osd.15          up       1
>> >>>
>> >>> After I restart one of the OSDs with an id from 16 to 23, say
>> >>> osd.16, osd.16 goes to 'root default' and 'host zqw-s1', and the
>> >>> ceph cluster begins to rebalance. This surely is not what I want.
>> >>>
>> >>> # id  weight  type name           up/down  reweight
>> >>> -6    7       root ssd
>> >>> -4    3         host zqw-s1-ssd
>> >>> 17    1           osd.17          up       1
>> >>> 18    1           osd.18          up       1
>> >>> 19    1           osd.19          up       1
>> >>> -5    4         host zqw-s2-ssd
>> >>> 20    1           osd.20          up       1
>> >>> 21    1           osd.21          up       1
>> >>> 22    1           osd.22          up       1
>> >>> 23    1           osd.23          up       1
>> >>> -1    15.56   root default
>> >>> -2    8.28      host zqw-s1
>> >>> 0     0.91        osd.0           up       1
>> >>> 1     0.91        osd.1           up       1
>> >>> 2     0.91        osd.2           up       1
>> >>> 3     0.91        osd.3           up       1
>> >>> 4     0.91        osd.4           up       1
>> >>> 5     0.91        osd.5           up       1
>> >>> 6     0.91        osd.6           up       1
>> >>> 7     0.91        osd.7           up       1
>> >>> 16    1           osd.16          up       1
>> >>> -3    7.28      host zqw-s2
>> >>> 8     0.91        osd.8           up       1
>> >>> 9     0.91        osd.9           up       1
>> >>> 10    0.91        osd.10          up       1
>> >>> 11    0.91        osd.11          up       1
>> >>> 12    0.91        osd.12          up       1
>> >>> 13    0.91        osd.13          up       1
>> >>> 14    0.91        osd.14          up       1
>> >>> 15    0.91        osd.15          up       1
>> >>>
>> >>> After digging into the problem, I found it's because the ceph init
>> >>> script changes the OSD's crush location. It uses the script
>> >>> 'ceph-crush-location' to get the crush location from the ceph.conf
>> >>> file for the restarting OSD. If there isn't such an entry in
>> >>> ceph.conf, it uses the default one, 'host=$(hostname -s)
>> >>> root=default'. Since I don't have the crush location configuration
>> >>> in my ceph.conf (I guess most people don't have this in their
>> >>> ceph.conf), when I restart osd.16, it goes to 'root default' and
>> >>> 'host zqw-s1'.
>> >>>
>> >>> Here is a fix for this:
>> >>> When the ceph init script uses 'ceph osd crush create-or-move' to
>> >>> change the OSD's crush location, do a check first: if this OSD
>> >>> already exists in the crush map, return without making the location
>> >>> change. This change is at:
>> >>> https://github.com/wonzhq/ceph/commit/efdfa23664caa531390d141bd1539878761412fe
>> >>>
>> >>> What do you think?
>> >>
>> >>The goal of this behavior is to allow hot-swapping of devices.
>> >>You can
>> >>pull disks out of one host and put them in another and the udev
>> >>machinery will start up the daemon, update the crush location, and the
>> >>disk and data will become available.  It's not 'ideal' in the sense
>> >>that there will be rebalancing, but it does make the data available to
>> >>the cluster to preserve data safety.
>> >>
>> >>We haven't come up with a great scheme for managing multiple trees
>> >>yet.
>> >>The idea is that the ceph-crush-location hook can be customized to do
>> >>whatever is necessary, for example by putting root=ssd if the device
>> >>type appears to be an ssd (maybe look at the sysfs metadata, or put a
>> >>marker file in the osd data directory?).  You can point to your own
>> >>hook for your environment with
>> >>
>> >>  osd crush location hook = /path/to/my/script
>> >>
>> >>sage
>> >>
>> >>--
>> >>To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> >>in the body of a message to majordomo@vger.kernel.org More majordomo
>> >>info at http://vger.kernel.org/majordomo-info.html
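
For illustration, the marker-file idea Sage mentions for a custom
ceph-crush-location hook could be sketched roughly as below. This is only
a sketch under stated assumptions, not the stock script: the marker file
name (ssd_marker) is hypothetical, the OSD data directory is assumed to be
/var/lib/ceph/osd/<cluster>-<id>, the hook is assumed to be called with
--cluster, --id and --type and to print a single location line on stdout,
and the "<host>-ssd" bucket name simply mirrors the zqw-s1-ssd naming in
Wang's tree.

```python
#!/usr/bin/env python3
"""Sketch of a custom CRUSH location hook (illustrative, not the stock script)."""
import argparse
import os
import socket

# Assumption: an operator drops an empty file with this (hypothetical) name
# into the data directory of every SSD-backed OSD.
MARKER_NAME = "ssd_marker"
# Assumption: OSD data directories live at <data_root>/<cluster>-<id>.
DEFAULT_DATA_ROOT = "/var/lib/ceph/osd"


def crush_location(cluster, osd_id, hostname, data_root=DEFAULT_DATA_ROOT):
    """Return the CRUSH location string for one OSD."""
    osd_dir = os.path.join(data_root, f"{cluster}-{osd_id}")
    if os.path.exists(os.path.join(osd_dir, MARKER_NAME)):
        # SSD OSDs go under 'root ssd' in a per-host bucket named
        # '<host>-ssd', as in the tree from the original report.
        return f"host={hostname}-ssd root=ssd"
    # Everything else keeps the default placement described in the thread.
    return f"host={hostname} root=default"


def main(argv=None):
    parser = argparse.ArgumentParser(description="CRUSH location hook sketch")
    parser.add_argument("--cluster", default="ceph")
    parser.add_argument("--id", default="0")
    parser.add_argument("--type", default="osd")
    # Tolerate arguments this sketch does not know about.
    args, _unknown = parser.parse_known_args(argv)
    short_host = socket.gethostname().split(".")[0]  # like `hostname -s`
    print(crush_location(args.cluster, args.id, short_host))


if __name__ == "__main__":
    main()
```

With something like this saved as an executable script and referenced
through 'osd crush location hook = /path/to/my/script', an OSD carrying the
marker file would report a location under 'root ssd' on restart, while
unmarked OSDs would keep the default 'host=<hostname> root=default'
placement.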