From mboxrd@z Thu Jan 1 00:00:00 1970
From: Stefan Kooman
Subject: Re: Mimic cluster is offline and not healing
Date: Fri, 28 Sep 2018 09:09:23 +0200
Message-ID: <20180928070923.GC17567@shell.dmz.bit.nl>
References: <20180927131043.GB17567@shell.dmz.bit.nl>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Errors-To: ceph-users-bounces-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org
Sender: "ceph-users"
To: by morphin
Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org, ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
List-Id: ceph-devel.vger.kernel.org

Quoting by morphin (morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):
> Good news... :)
>
> After I tried everything. I decide to re-create my MONs from OSD's and
> I used the script:
> https://paste.ubuntu.com/p/rNMPdMPhT5/
>
> And it worked!!!

Congrats!

> I think when 2 server crashed and come back same time some how MON's
> confused and the maps just corrupted.
> After re-creation all the MONs was have the same map so it worked.
> But still I dont know how to hell the mons can cause endless %95 I/O ???
> This a bug anyway and if you dont want to leave the problem then do
> not "enable" your mons. Just start them manual! Another tough lesson.

The only time we needed to manually start the mons was at "bootstrap"
time. After a reboot they are brought up by systemd ... and it keeps on
working. Have you rebooted your mon(s) after the manual start?

>
> ceph -s: https://paste.ubuntu.com/p/m3hFF22jM9/
>
> As you can see below some of the OSDs are still down. And when I start
> them they dont start.
> Check start log: https://paste.ubuntu.com/p/ZJQG4khdbx/
> Debug log: https://paste.ubuntu.com/p/J3JyGShHym/
>
> What we can do for the problem?

Apply PR https://github.com/ceph/ceph/pull/24064

I see that you are running Mimic 13.2.1 ... 13.2.2 was released a few
days ago. Not sure if this fix has made it into 13.2.2.

> What is the cause of the problem?

Somehow it looks like you hit this issue:
https://tracker.ceph.com/issues/24866

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/  Kamer van Koophandel 09090351
| GPG: 0xD14839C6  +31 318 648 688 / info-68+x73Hep80@public.gmane.org
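
Since the reply turns on which Mimic release the daemons are running (13.2.1 versus the newer 13.2.2), here is a minimal sketch for summarising that per daemon type. It assumes the JSON layout printed by `ceph versions` (daemon type mapped to version string mapped to count); the helper names `parse_release` and `daemons_older_than` and the sample data, including the placeholder build hashes, are made up for illustration. It only reports release numbers and says nothing about whether a given release contains the fix from the PR above.

```python
import json
import re
from typing import Dict, Tuple


def parse_release(version_string: str) -> Tuple[int, ...]:
    """Extract the numeric release, e.g. (13, 2, 1), from a ceph version string."""
    match = re.search(r"ceph version (\d+)\.(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return tuple(int(part) for part in match.groups())


def daemons_older_than(versions_json: str, minimum: Tuple[int, ...]) -> Dict[str, int]:
    """Count, per daemon type, the daemons running a release older than `minimum`."""
    report = json.loads(versions_json)
    outdated: Dict[str, int] = {}
    for daemon_type, by_version in report.items():
        if daemon_type == "overall":
            continue  # "overall" repeats the per-type counts, so skip it
        count = sum(
            n for version, n in by_version.items()
            if parse_release(version) < minimum
        )
        if count:
            outdated[daemon_type] = count
    return outdated


# Illustrative sample only: the counts and build hashes are invented.
SAMPLE = json.dumps({
    "mon": {"ceph version 13.2.1 (0000000) mimic (stable)": 3},
    "osd": {
        "ceph version 13.2.1 (0000000) mimic (stable)": 10,
        "ceph version 13.2.2 (1111111) mimic (stable)": 2,
    },
    "overall": {
        "ceph version 13.2.1 (0000000) mimic (stable)": 13,
        "ceph version 13.2.2 (1111111) mimic (stable)": 2,
    },
})

if __name__ == "__main__":
    print(daemons_older_than(SAMPLE, (13, 2, 2)))
```

On a live cluster the input would be the captured output of `ceph versions` rather than `SAMPLE`; comparing releases as integer tuples avoids the string-ordering pitfall where "13.2.10" would sort before "13.2.2".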