public inbox for ceph-devel@vger.kernel.org
 help / color / mirror / Atom feed
* Mimic cluster is offline and not healing
@ 2018-09-27 12:19 by morphin
       [not found] ` <CAE-AtHqSpX09gnAfgXt1=nmyLKuvjgMMn+qKaiZ0nOUKwEARrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: by morphin @ 2018-09-27 12:19 UTC (permalink / raw)
  To: ceph-users-idqoXFIVOFJgJs9I8MT0rw; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA

Hello,

I am writing about an incident that started last weekend. Something
seems to be wrong with my e-mail: some of my messages did not go
through, so I decided to start a new thread here and begin from the
beginning. The related e-mail thread can be found here:
http://lists.ceph.com/pipermail/ceph-community-ceph.com/2018-September/000292.html

We have a cluster with 28 servers and 168 OSDs. The OSDs are BlueStore
on NL-SAS disks (non-SMR), with WAL+DB on NVMe. My distro is Arch Linux.

Last weekend I upgraded from 12.2.4 to 13.2.1, and the cluster did not
start because the OSDs were stuck in the booting state. Sage helped me
with it (thanks!) by re-creating the MONs' store.db from the OSDs via
ceph-objectstore-tool. At first everything was perfect.

However, two days later I had a most unfortunate accident: 7 of my
servers crashed at the same time. When they came back up, the cluster
was in HEALTH_ERR state. 2 of those servers were MONs (I have 3 in total).

I have been collecting data and testing for 3 days, but I could not
make any progress.

First of all I double-checked OS health, network health and disk
health; they have no problems. My further findings are these:
- I have an rbd pool holding 33TB of VM data.
- As soon as an OSD starts, it generates lots of I/O on the BlueStore
  disks (NL-SAS). This makes the OSD nearly unresponsive; you cannot
  even injectargs.
- The cluster does not settle. I left it alone for 24 hours, but the
  OSD up count dropped to ~50.
- The OSDs log too many slow requests.
- The OSDs log lots of heartbeat messages, and eventually they are
  marked down.

Latest cluster status: https://paste.ubuntu.com/p/BhCHmVNZsX/
Ceph.conf : https://paste.ubuntu.com/p/FtY9gfpncN/
Sample OSD log: https://paste.ubuntu.com/p/ZsqpcQVRsj/
Mon log: https://paste.ubuntu.com/p/9T8QtMYZWT/
I/O utilization on disks: https://paste.ubuntu.com/p/mrCTKYpBZR/



So I think my problem is really weird: somehow the pool cannot heal itself.

The OSDs drive disk I/O utilization to 95% and peering is way too
slow. The OSD I/O still had not finished after 72 hours.

Because of the high I/O, OSDs cannot get an answer from the other OSDs
and complain to the monitor. The monitor marks them "down", but I can
see the OSD processes still running.

For example, "ceph -s" says 50 OSDs are up, but I see 153 OSD
processes running in the background trying to reach the other OSDs. So
it is very confusing and certainly not progressing.

We are trying every possible strategy. We stopped all the OSDs, then
started them one at a time on a single server: start one OSD, wait for
its I/O to finish, then move on to the next OSD on the same server. We
found that even when the first OSD's I/O had finished, the second OSD
triggered it again. So when we started the sixth and final OSD, the
other five OSDs went to 95% I/O too. The first OSD's I/O finished in 8
minutes, but the sixth OSD's took 34 minutes!

Then we moved on to the next server. As soon as we started that
server's OSDs, the previously finished OSDs started doing I/O again, so
we gained nothing.
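For the "wait for the OSD's I/O to finish" step, instead of watching
iostat by hand, settling can be detected from /proc/diskstats. A
sketch: field 12 of /proc/diskstats is the in-flight I/O counter, and
the 30-second quiet window here is an arbitrary choice, not anything
Ceph-specific.

```shell
#!/bin/sh
# Print the number of in-flight I/Os for a block device
# (field 12 of /proc/diskstats is "I/Os currently in progress").
inflight() {
    dev=$1; stats=${2:-/proc/diskstats}
    awk -v d="$dev" '$3 == d { print $12 }' "$stats"
}

# Block until the device has reported zero in-flight I/Os
# for $2 consecutive seconds (default 30).
wait_settled() {
    dev=$1; quiet=${2:-30}; n=0
    while [ "$n" -lt "$quiet" ]; do
        if [ "$(inflight "$dev")" = "0" ]; then n=$((n + 1)); else n=0; fi
        sleep 1
    done
}
# usage: systemctl start ceph-osd@42 && wait_settled sdc && echo settled
```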

Now we are planning to set noup, start all 168 OSDs, and then unset
noup. Maybe this will prevent the OSDs from redoing the I/O over and
over again.
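The plan as a command sequence (a sketch: it only prints the commands
unless GO=1 is set, and ceph-osd.target is the stock systemd target
that starts every OSD on a host, so the start step has to be repeated
per host):

```shell
#!/bin/sh
# Sketch of the noup plan: keep booting OSDs from being marked up (and
# thus from peering one by one) until every osd process is running,
# then let them all come up in a single pass.
run() {
    echo "+ $*"
    if [ "${GO:-0}" = 1 ]; then "$@"; fi
}

bring_up_all() {
    run ceph osd set noup                 # booting OSDs stay "down" in the osdmap
    run systemctl start ceph-osd.target   # repeat on each of the 28 hosts
    # ...wait here until all 168 ceph-osd processes are up and disks are idle...
    run ceph osd unset noup               # now they all peer in a single round
}
# usage: bring_up_all          # preview; GO=1 bring_up_all to run for real
```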

After 72 hours of this, I believe we may have hit a bug. Any help
would be greatly appreciated.

We are on IRC 24/7. Thanks to Be:El, peetaur2, degreaser, Ti and IcePic.
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: Mimic cluster is offline and not healing
       [not found] ` <CAE-AtHqSpX09gnAfgXt1=nmyLKuvjgMMn+qKaiZ0nOUKwEARrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-09-27 13:10   ` Stefan Kooman
       [not found]     ` <20180927131043.GB17567-VkyGEX2O1ez1kYbDYJMsfg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Stefan Kooman @ 2018-09-27 13:10 UTC (permalink / raw)
  To: by morphin
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA

Quoting by morphin (morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):
> After 72 hours of this, I believe we may have hit a bug. Any help
> would be greatly appreciated.

Is it feasible for you to stop all client IO to the Ceph cluster, at
least until it stabilizes again? "ceph osd pause" would do the trick
("ceph osd unpause" would unset it).

What kind of workload are you running on the cluster? What does your
crush map look like (ceph osd getcrushmap -o /tmp/crush_raw;
crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?

I have seen a (test) Ceph cluster "healing" itself to the point where
there was nothing left to recover. In *that* case the disks were
overbooked (multiple OSDs per physical disk) ... The flags you set
(noout, nodown, nobackfill, norecover, noscrub, etc.) helped to get it
to recover again. I would try to get all OSDs online again (and
manually keep them up / restart them, because you have set nodown).

Does the cluster recover at all?

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info-68+x73Hep80@public.gmane.org


* Re: Mimic cluster is offline and not healing
       [not found]     ` <20180927131043.GB17567-VkyGEX2O1ez1kYbDYJMsfg@public.gmane.org>
@ 2018-09-27 13:27       ` by morphin
       [not found]         ` <CAE-AtHodr9iaGF3vhkrv+J8mHsYk384Ni8MpbMvW6Xg_Tdw4GQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: by morphin @ 2018-09-27 13:27 UTC (permalink / raw)
  To: stefan-68+x73Hep80
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA

I should not have any client I/O right now; all of my VMs are down.
There is only a single pool.

Here is my crush map: https://paste.ubuntu.com/p/Z9G5hSdqCR/

The cluster does not recover. After starting the OSDs with the
specified flags, the OSD up count drops from 168 to 50 within 24 hours.
Stefan Kooman <stefan@bit.nl> wrote on Thu, 27 Sep 2018 at 16:10:
>
> Quoting by morphin (morphinwithyou@gmail.com):
> > After 72 hours of this, I believe we may have hit a bug. Any help
> > would be greatly appreciated.
>
> Is it feasible for you to stop all client IO to the Ceph cluster? At
> least until it stabilizes again. "ceph osd pause" would do the trick
> (ceph osd unpause would unset it).
>
> What kind of workload are you running on the cluster? What does your
> crush map look like (ceph osd getcrushmap -o /tmp/crush_raw;
> crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?
>
> I have seen a (test) Ceph cluster "healing" itself to the point where
> there was nothing left to recover. In *that* case the disks were
> overbooked (multiple OSDs per physical disk) ... The flags you set
> (noout, nodown, nobackfill, norecover, noscrub, etc.) helped to get it
> to recover again. I would try to get all OSDs online again (and
> manually keep them up / restart them, because you have set nodown).
>
> Does the cluster recover at all?
>
> Gr. Stefan
>
> --
> | BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
> | GPG: 0xD14839C6                   +31 318 648 688 / info@bit.nl


* Re: Mimic cluster is offline and not healing
       [not found]         ` <CAE-AtHodr9iaGF3vhkrv+J8mHsYk384Ni8MpbMvW6Xg_Tdw4GQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-09-27 18:38           ` by morphin
       [not found]             ` <CAE-AtHpGLZu5ygyw0sLkOcB3mt-0pLfcLZiPKYuptDLAafy7uw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: by morphin @ 2018-09-27 18:38 UTC (permalink / raw)
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA

I think I might have found something. When I start an OSD it generates
high I/O, around 95%, and the other OSDs are also triggered and
generate the same I/O. This is true even when I have the noup flag set.
So all the OSDs generate high I/O whenever any OSD starts.

I think this is too much. I have 168 OSDs, and when I start them the
OSD I/O job never finishes. I left the cluster alone for 70 hours and
the high I/O never finished at all.

We are trying to start the OSDs host by host and wait for things to
settle, but it takes too much time. An OSD cannot even answer "ceph
tell osd.158 version", it is so busy, and this seems to be a loop,
since one OSD's startup triggers the other OSDs' I/O.

So I collected debug output, and I hope it can be examined.

This is the debug=20 OSD log:
Full log: https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
Shorter log, only the last part before the high I/O finished:
https://paste.ubuntu.com/p/7ZfwH8CBC5/
strace -f -p <osd pid>:
- When I start the OSD: https://paste.ubuntu.com/p/8n2kTvwnG6/
- After the I/O finished: https://paste.ubuntu.com/p/4sGfj7Bf4c/

Now some people on IRC say this is a bug and suggest trying Ubuntu and
the new Ceph repo; maybe that will help. I agree with them and will
give it a shot. What do you think?
by morphin <morphinwithyou@gmail.com> wrote on Thu, 27 Sep 2018 at 16:27:
>
> I should not have any client I/O right now; all of my VMs are down.
> There is only a single pool.
>
> Here is my crush map: https://paste.ubuntu.com/p/Z9G5hSdqCR/
>
> The cluster does not recover. After starting the OSDs with the
> specified flags, the OSD up count drops from 168 to 50 within 24 hours.
> Stefan Kooman <stefan@bit.nl> wrote on Thu, 27 Sep 2018 at 16:10:
> >
> > Quoting by morphin (morphinwithyou@gmail.com):
> > > After 72 hours of this, I believe we may have hit a bug. Any help
> > > would be greatly appreciated.
> >
> > Is it feasible for you to stop all client IO to the Ceph cluster? At
> > least until it stabilizes again. "ceph osd pause" would do the trick
> > (ceph osd unpause would unset it).
> >
> > What kind of workload are you running on the cluster? What does your
> > crush map look like (ceph osd getcrushmap -o /tmp/crush_raw;
> > crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?
> >
> > I have seen a (test) Ceph cluster "healing" itself to the point where
> > there was nothing left to recover. In *that* case the disks were
> > overbooked (multiple OSDs per physical disk) ... The flags you set
> > (noout, nodown, nobackfill, norecover, noscrub, etc.) helped to get it
> > to recover again. I would try to get all OSDs online again (and
> > manually keep them up / restart them, because you have set nodown).
> >
> > Does the cluster recover at all?
> >
> > Gr. Stefan
> >
> > --
> > | BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
> > | GPG: 0xD14839C6                   +31 318 648 688 / info@bit.nl


* Re: Mimic cluster is offline and not healing
       [not found]             ` <CAE-AtHpGLZu5ygyw0sLkOcB3mt-0pLfcLZiPKYuptDLAafy7uw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-09-27 20:52               ` by morphin
       [not found]                 ` <CAE-AtHo2UVSFcMHMXszSPJXs=BRKb0PELzryMyu4LVEv910pQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: by morphin @ 2018-09-27 20:52 UTC (permalink / raw)
  To: ceph-users-idqoXFIVOFJgJs9I8MT0rw; +Cc: ceph-devel-u79uwXL29TY76Z2rM5mHXA

Good news... :)

After trying everything else, I decided to re-create my MONs from the
OSDs, using this script:
https://paste.ubuntu.com/p/rNMPdMPhT5/

And it worked!!!
I think that when the 2 servers crashed and came back at the same time,
the MONs somehow got confused and the maps were corrupted.
After the re-creation all the MONs had the same map, so it worked.
But I still do not know how on earth the mons can cause endless 95% I/O???
This is a bug anyway, and if you do not want to run into this problem,
do not "enable" your mons; just start them manually! Another tough lesson.
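Concretely, the lesson amounts to something like this on each mon host
(a sketch: the `ceph-mon@<hostname>` unit instance name is an
assumption about the systemd setup, and it only prints the commands
unless GO=1 is set):

```shell
#!/bin/sh
# Sketch: take the mon out of the boot sequence so a possibly-corrupt
# store.db is never loaded automatically, and start it by hand instead.
# Default is a dry run (prints commands); set GO=1 to really execute.
run() {
    echo "+ $*"
    if [ "${GO:-0}" = 1 ]; then "$@"; fi
}

mon_manual() {
    mon=${1:-$(hostname -s)}                # assumed unit instance name
    run systemctl disable ceph-mon@"$mon"   # no auto-start on boot
    run systemctl start ceph-mon@"$mon"     # start it deliberately, now
}
# usage: mon_manual          # preview; GO=1 mon_manual to run for real
```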

ceph -s: https://paste.ubuntu.com/p/m3hFF22jM9/

As you can see below, some of the OSDs are still down, and when I
start them they do not come up.
Start log: https://paste.ubuntu.com/p/ZJQG4khdbx/
Debug log: https://paste.ubuntu.com/p/J3JyGShHym/

What can we do about the problem?
What is the cause of the problem?

Thank you everyone. You helped me a lot! :)
>
> I think I might have found something. When I start an OSD it generates
> high I/O, around 95%, and the other OSDs are also triggered and
> generate the same I/O. This is true even when I have the noup flag set.
> So all the OSDs generate high I/O whenever any OSD starts.
>
> I think this is too much. I have 168 OSDs, and when I start them the
> OSD I/O job never finishes. I left the cluster alone for 70 hours and
> the high I/O never finished at all.
>
> We are trying to start the OSDs host by host and wait for things to
> settle, but it takes too much time. An OSD cannot even answer "ceph
> tell osd.158 version", it is so busy, and this seems to be a loop,
> since one OSD's startup triggers the other OSDs' I/O.
>
> So I collected debug output, and I hope it can be examined.
>
> This is the debug=20 OSD log:
> Full log: https://www.dropbox.com/s/pwzqeajlsdwaoi1/ceph-osd.90.log?dl=0
> Shorter log, only the last part before the high I/O finished:
> https://paste.ubuntu.com/p/7ZfwH8CBC5/
> strace -f -p <osd pid>:
> - When I start the OSD: https://paste.ubuntu.com/p/8n2kTvwnG6/
> - After the I/O finished: https://paste.ubuntu.com/p/4sGfj7Bf4c/
>
> Now some people on IRC say this is a bug and suggest trying Ubuntu and
> the new Ceph repo; maybe that will help. I agree with them and will
> give it a shot. What do you think?
> by morphin <morphinwithyou@gmail.com> wrote on Thu, 27 Sep 2018 at 16:27:
> >
> > I should not have any client I/O right now; all of my VMs are down.
> > There is only a single pool.
> >
> > Here is my crush map: https://paste.ubuntu.com/p/Z9G5hSdqCR/
> >
> > The cluster does not recover. After starting the OSDs with the
> > specified flags, the OSD up count drops from 168 to 50 within 24 hours.
> > Stefan Kooman <stefan@bit.nl> wrote on Thu, 27 Sep 2018 at 16:10:
> > >
> > > Quoting by morphin (morphinwithyou@gmail.com):
> > > > After 72 hours of this, I believe we may have hit a bug. Any help
> > > > would be greatly appreciated.
> > >
> > > Is it feasible for you to stop all client IO to the Ceph cluster? At
> > > least until it stabilizes again. "ceph osd pause" would do the trick
> > > (ceph osd unpause would unset it).
> > >
> > > What kind of workload are you running on the cluster? What does your
> > > crush map look like (ceph osd getcrushmap -o /tmp/crush_raw;
> > > crushtool -d /tmp/crush_raw -o /tmp/crush_edit)?
> > >
> > > I have seen a (test) Ceph cluster "healing" itself to the point where
> > > there was nothing left to recover. In *that* case the disks were
> > > overbooked (multiple OSDs per physical disk) ... The flags you set
> > > (noout, nodown, nobackfill, norecover, noscrub, etc.) helped to get it
> > > to recover again. I would try to get all OSDs online again (and
> > > manually keep them up / restart them, because you have set nodown).
> > >
> > > Does the cluster recover at all?
> > >
> > > Gr. Stefan
> > >
> > > --
> > > | BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
> > > | GPG: 0xD14839C6                   +31 318 648 688 / info@bit.nl


* Re: Mimic cluster is offline and not healing
       [not found]                 ` <CAE-AtHo2UVSFcMHMXszSPJXs=BRKb0PELzryMyu4LVEv910pQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-09-28  7:09                   ` Stefan Kooman
  0 siblings, 0 replies; 6+ messages in thread
From: Stefan Kooman @ 2018-09-28  7:09 UTC (permalink / raw)
  To: by morphin
  Cc: ceph-users-idqoXFIVOFJgJs9I8MT0rw,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA

Quoting by morphin (morphinwithyou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org):
> Good news... :)
> 
> After I tried everything. I decide to re-create my MONs from OSD's and
> I used the script:
> https://paste.ubuntu.com/p/rNMPdMPhT5/
> 
> And it worked!!!

Congrats!

> I think that when the 2 servers crashed and came back at the same time,
> the MONs somehow got confused and the maps were corrupted.
> After the re-creation all the MONs had the same map, so it worked.
> But I still do not know how on earth the mons can cause endless 95% I/O???
> This is a bug anyway, and if you do not want to run into this problem,
> do not "enable" your mons; just start them manually! Another tough lesson.

The only time we needed to manually start the mons was at "bootstrap"
time. After a reboot they are brought up by systemd ... and that keeps
working. Have you rebooted your mon(s) since the manual start?

> 
> ceph -s: https://paste.ubuntu.com/p/m3hFF22jM9/
> 
> As you can see below, some of the OSDs are still down, and when I
> start them they do not come up.
> Check start log: https://paste.ubuntu.com/p/ZJQG4khdbx/
> Debug log: https://paste.ubuntu.com/p/J3JyGShHym/
> 
> What we can do for the problem?
Apply PR https://github.com/ceph/ceph/pull/24064

I see that you are running Mimic 13.2.1 ... 13.2.2 was released a few
days ago. I am not sure whether this fix has made it into 13.2.2.

> What is the cause of the problem?

It looks like you somehow hit this issue:
https://tracker.ceph.com/issues/24866

Gr. Stefan

-- 
| BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                   +31 318 648 688 / info-68+x73Hep80@public.gmane.org


end of thread, other threads:[~2018-09-28  7:09 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-09-27 12:19 Mimic cluster is offline and not healing by morphin
     [not found] ` <CAE-AtHqSpX09gnAfgXt1=nmyLKuvjgMMn+qKaiZ0nOUKwEARrA-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-09-27 13:10   ` Stefan Kooman
     [not found]     ` <20180927131043.GB17567-VkyGEX2O1ez1kYbDYJMsfg@public.gmane.org>
2018-09-27 13:27       ` by morphin
     [not found]         ` <CAE-AtHodr9iaGF3vhkrv+J8mHsYk384Ni8MpbMvW6Xg_Tdw4GQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-09-27 18:38           ` by morphin
     [not found]             ` <CAE-AtHpGLZu5ygyw0sLkOcB3mt-0pLfcLZiPKYuptDLAafy7uw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-09-27 20:52               ` by morphin
     [not found]                 ` <CAE-AtHo2UVSFcMHMXszSPJXs=BRKb0PELzryMyu4LVEv910pQQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-09-28  7:09                   ` Stefan Kooman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox