From: Craig Lewis <clewis-04jk9TcbgGYP2IHM84UzcNBPR1lH4CV8@public.gmane.org>
To: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
Cc: ceph-users <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>,
ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Ceph Community
<ceph-community-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
Subject: Re: 70+ OSD are DOWN and not coming up
Date: Thu, 22 May 2014 00:26:52 -0700
Message-ID: <537DA6BC.20704@centraldesktop.com>
In-Reply-To: <alpine.DEB.2.00.1405212111030.12847-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
On 5/21/14 21:15, Sage Weil wrote:
> On Wed, 21 May 2014, Craig Lewis wrote:
>> If you do this over IRC, can you please post a summary to the mailing
>> list?
>>
>> I believe I'm having this issue as well.
> In the other case, we found that some of the OSDs were behind processing
> maps (by several thousand epochs). The trick here to give them a chance
> to catch up is
>
> ceph osd set noup
> ceph osd set nodown
> ceph osd set noout
>
> and wait for them to stop spinning on the CPU. You can check which map
> each OSD is on with
>
> ceph daemon osd.NNN status
>
> to see which epoch they are on and compare that to
>
> ceph osd stat
>
> Once they are within 100 or less epochs,
>
> ceph osd unset noup
>
> and let them all start up.
>
> We haven't determined whether the original problem was caused by this or
> the other way around; we'll see once they are all caught up.
>
> sage
I was seeing the CPU spinning too, so I think it is the same issue.
Thanks for the explanation! I've been pulling my hair out for weeks.

I can give you a data point for the "how". My problems started with a
kswapd problem on Ubuntu 12.04.4 (kernel 3.5.0-46-generic
#70~precise1-Ubuntu). kswapd was consuming 100% CPU, and it was
blocking the ceph-osd processes. Once I prevented kswapd from doing
that, my OSDs couldn't recover. noout and nodown didn't help; the OSDs
would suicide and restart.

Upgrading to Ubuntu 14.04 seems to have helped. The cluster isn't all
clear yet, but it's getting better. The cluster is finally healthy
after 2 weeks of incomplete and stale PGs. It's still unresponsive, but
it's making progress. I am still seeing OSDs consuming 100% CPU, but
only the OSDs that are actively deep-scrubbing. Once the deep-scrub
finishes, the OSD starts behaving again. They seem to be slowly getting
better, which matches up with your explanation.

I'll go ahead and set noup. I don't think it's necessary at this point,
but it's not going to hurt.

I'm running Emperor, and it looks like osd status isn't supported. Not a
big deal though. Deep-scrub has made it through half of the PGs in the
last 36 hours, so I'll just watch for another day or two. This is a
slave cluster, so I have that luxury.
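
For anyone on a release where the status command does work, the
catch-up check Sage describes (compare each OSD's map epoch with the
cluster's epoch, and unset noup once everything is within 100 epochs)
could be sketched roughly as below. This is only an illustration, not
anything from the Ceph tools: the function names are made up, and the
epoch numbers are assumed to have been copied by hand from
"ceph osd stat" and "ceph daemon osd.NNN status".

```python
def lagging_osds(cluster_epoch, osd_epochs, max_lag=100):
    """Return {osd_id: lag} for OSDs more than max_lag epochs behind.

    cluster_epoch: current osdmap epoch (assumed read from `ceph osd stat`).
    osd_epochs:    {osd_id: epoch} (assumed read from each OSD's status).
    max_lag:       "within 100 or less epochs" from the advice above.
    """
    lagging = {}
    for osd_id, epoch in osd_epochs.items():
        lag = cluster_epoch - epoch
        if lag > max_lag:
            lagging[osd_id] = lag
    return lagging


def safe_to_unset_noup(cluster_epoch, osd_epochs, max_lag=100):
    """True once no OSD is more than max_lag epochs behind the cluster."""
    return not lagging_osds(cluster_epoch, osd_epochs, max_lag)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    epochs = {0: 4990, 1: 1200, 2: 4900}
    print(lagging_osds(5000, epochs))
    print(safe_to_unset_noup(5000, epochs))
```

With those made-up numbers, osd.1 is still several thousand epochs
behind, so noup would stay set until it catches up.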
--
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email clewis-04jk9TcbgGYP2IHM84UzcNBPR1lH4CV8@public.gmane.org