From: Amon Ott
Subject: Re: Bug #1047 reproduced
Date: Wed, 21 Dec 2011 17:36:24 +0100
Message-ID: <201112211736.25725.a.ott@m-privacy.de>
References: <201112010937.37642.a.ott@m-privacy.de> <201112211337.44363.a.ott@m-privacy.de>
To: Gregory Farnum
Cc: Sage Weil, ceph-devel@vger.kernel.org, Alexandre Oliva

On Wednesday, 21 December 2011, Gregory Farnum wrote:
> On Wed, Dec 21, 2011 at 4:37 AM, Amon Ott wrote:
> > On Friday 02 December 2011 wrote Sage Weil:
> >> On Fri, 2 Dec 2011, Amon Ott wrote:
> >> > On Thursday 01 December 2011 you wrote:
> >> > > On all four nodes of my test cluster, MDS crashes with a trace like
> >> > > that in bug #1047. Example and ceph.conf attached. Ceph server side
> >> > > is from git master, last commit
> >> > > ce6572273943ffdca4b7dc5344152d6c35106a2d.
> >> > >
> >> > > MDS does not start on any node here, it reliably crashes with that
> >> > > assert.
> >> >
> >> > Does it make sense for you to keep the cluster in that broken state,
> >> > so that we can reproduce that bug or test a potential fix? Otherwise,
> >> > I would recreate the Ceph filesystem to make more tests. I also have a
> >> > full log of one mds from start to crash here.
> >>
> >> Can you attach the log to #1047 for posterity?  I'll take a quick look
> >> and see if there is any further info to gain from the log.  I'm guessing
> >> the actual bug occurred before the crash, when the anchor table wasn't
> >> updated properly, but there may be clues here.
> >
> > Did you find some time to look into this? The bug makes Ceph unusable for
> > us even with moderate load. All mds instances die with the same assert,
> > the only way to recover in that state is to recreate the complete ceph fs
> > and restore backups.
>
> Sage is gone on vacation right now (unless he decides not to be for a
> while), but we've been focusing our efforts on the OSDs lately so I
> don't think he's looked at it. I'll see if I can carve out some time
> tomorrow or Friday, but I can't promise anything.
>
> Alexandre, can you check this bug and make sure it looks like the same
> one you reported as #1850?

Thank you for looking into it. The behaviour in #1850 looks quite similar to
our bug, apart from the hardlinks. We copy many files in our tests, too.
The last time I hit the bug, I had indeed restarted the master MDS.

Amon Ott
-- 
Dr. Amon Ott
m-privacy GmbH           Tel: +49 30 24342334
Am Köllnischen Park 1    Fax: +49 30 24342336
10179 Berlin             http://www.m-privacy.de

Amtsgericht Charlottenburg, HRB 84946

Geschäftsführer:
 Dipl.-Kfm. Holger Maczkowsky,
 Roman Maczkowsky

GnuPG-Key-ID: 0x2DD3A649