From mboxrd@z Thu Jan 1 00:00:00 1970
From: Martin Mailand
Subject: Re: Cluster sync doesn't finsh
Date: Fri, 18 Nov 2011 21:17:25 +0100
Message-ID: <4EC6BD55.5050709@tuxadero.com>
References: <4EC57338.9040004@tuxadero.com>
Reply-To: martin@tuxadero.com
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: QUOTED-PRINTABLE
Received: from einhorn.in-berlin.de ([192.109.42.8]:35001 "EHLO einhorn.in-berlin.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751156Ab1KRURe (ORCPT ); Fri, 18 Nov 2011 15:17:34 -0500
Sender: ceph-devel-owner@vger.kernel.org
List-ID:
To: Samuel Just
Cc: ceph-devel@vger.kernel.org

Hi Sam,
here is the crushmap:

http://85.214.49.87/ceph/crushmap.txt
http://85.214.49.87/ceph/crushmap

-martin

Samuel Just wrote:
> It looks like a crushmap related problem. Could you send us the crushmap?
>
> ceph osd getcrushmap
>
> Thanks
> -Sam
>
> On Fri, Nov 18, 2011 at 10:13 AM, Gregory Farnum wrote:
>> On Fri, Nov 18, 2011 at 10:05 AM, Tommi Virtanen wrote:
>>> On Thu, Nov 17, 2011 at 12:48, Martin Mailand wrote:
>>>> Hi,
>>>> I am doing a cluster failure test, where I shut down one OSD and wait for the
>>>> cluster to sync. But the sync never finishes; at around 4-5% it stops. I
>>>> stopped osd2.
>>> ...
>>>> 2011-11-17 16:42:45.520740 pg v1337: 600 pgs: 547 active+clean, 53
>>>> active+clean+degraded; 113 GB data, 184 GB used, 1141 GB / 1395 GB avail;
>>>> 4025/82404 degraded (4.884%)
>>> ...
>>>> The osd log, the ceph.conf, pg dump, and osd dump can be found here:
>>>>
>>>> http://85.214.49.87/ceph/
>>> This looks a bit worrying:
>>>
>>> 2011-11-17 17:56:35.771574 7f704c834700 -- 192.168.42.113:0/2424 >>
>>> 192.168.42.114:6802/21115 pipe(0x2596c80 sd=17 pgs=0 cs=0 l=0).connect
>>> claims to be 192.168.42.114:6802/21507 not 192.168.42.114:6802/21115 -
>>> wrong node!
>>>
>>> So osd.0 is basically refusing to talk to one of the other OSDs. I
>>> don't understand the messenger well enough to know why this would be,
>>> but it wouldn't surprise me if this problem kept the objects degraded
>>> -- it looks like a breakage in the osd<->osd communication.
>>>
>>> Now if this was the reason, I'd expect a restart of all the OSDs to
>>> get it back in shape; messenger state is ephemeral. Can you confirm
>>> that?
>> Probably not -- that wrong node thing can occur for a lot of different
>> reasons, some of which matter and most of which don't. Sam's looking
>> into the problem; there's something going wrong with the CRUSH
>> calculations or the monitor PG placement overrides or something...
>> -Greg
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
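
For anyone following the thread who wants to watch whether recovery is making progress, here is a minimal Python sketch that pulls the PG counts and the degraded-object ratio out of the status line quoted above. The function name parse_pg_status is hypothetical, and the line format is assumed only from the single example in this thread; it is not a documented or stable Ceph output format.

```python
import re

# Status line quoted in the thread, with the mail line wrapping removed.
STATUS = (
    "2011-11-17 16:42:45.520740 pg v1337: 600 pgs: 547 active+clean, "
    "53 active+clean+degraded; 113 GB data, 184 GB used, "
    "1141 GB / 1395 GB avail; 4025/82404 degraded (4.884%)"
)


def parse_pg_status(line):
    """Hypothetical helper: return (pg_total, states, degraded, objects).

    Assumes the layout of the one status line shown in this thread.
    """
    pg_total = int(re.search(r"(\d+) pgs:", line).group(1))

    # The text between "pgs:" and the first ";" is a comma-separated
    # list of "<count> <state>" pairs.
    states_part = line.split("pgs:", 1)[1].split(";", 1)[0]
    states = {}
    for item in states_part.split(","):
        count, state = item.split()
        states[state] = int(count)

    match = re.search(r"(\d+)/(\d+) degraded", line)
    degraded, objects = int(match.group(1)), int(match.group(2))
    return pg_total, states, degraded, objects


pg_total, states, degraded, objects = parse_pg_status(STATUS)
percent = 100.0 * degraded / objects
print(f"{states.get('active+clean+degraded', 0)} of {pg_total} PGs degraded, "
      f"{degraded}/{objects} objects ({percent:.3f}%)")
```

If the degraded object count stays flat across successive status lines, as reported here, recovery has stalled rather than merely slowed.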