From: Craig Lewis <clewis-04jk9TcbgGYP2IHM84UzcNBPR1lH4CV8@public.gmane.org>
To: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
Cc: ceph-users <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>,
ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
Ceph Community
<ceph-community-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
Subject: Re: 70+ OSD are DOWN and not coming up
Date: Thu, 22 May 2014 00:26:52 -0700
Message-ID: <537DA6BC.20704@centraldesktop.com>
In-Reply-To: <alpine.DEB.2.00.1405212111030.12847-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>
On 5/21/14 21:15, Sage Weil wrote:
> On Wed, 21 May 2014, Craig Lewis wrote:
>> If you do this over IRC, can you please post a summary to the mailing
>> list?
>>
>> I believe I'm having this issue as well.
> In the other case, we found that some of the OSDs were behind on processing
> maps (by several thousand epochs). The trick here to give them a chance
> to catch up is
>
> ceph osd set noup
> ceph osd set nodown
> ceph osd set noout
>
> and wait for them to stop spinning on the CPU. You can check which map
> each OSD is on with
>
> ceph daemon osd.NNN status
>
> to see which epoch they are on and compare that to
>
> ceph osd stat
>
> Once they are within 100 epochs or fewer,
>
> ceph osd unset noup
>
> and let them all start up.
>
> We haven't determined whether the original problem was caused by this or
> the other way around; we'll see once they are all caught up.
>
> sage
I was seeing the CPU spinning too, so I think it is the same issue.
Thanks for the explanation! I've been pulling my hair out for weeks.
I can give you a data point for the "how". My problems started with a
kswapd problem on Ubuntu 12.04.4 (kernel 3.5.0-46-generic
#70~precise1-Ubuntu). kswapd was consuming 100% CPU, and it was
blocking the ceph-osd processes. Even after I stopped kswapd from doing
that, my OSDs couldn't recover; noout and nodown didn't help, and the
OSDs would just hit their suicide timeouts and restart.
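(For anyone who hits the same kswapd spin: I don't have the exact
commands I used in front of me, so the lines below are only an
illustration of the kind of knobs involved, not a record of what I ran;
whether any of them help depends on which kernel bug you're hitting.)

  # illustration only -- typical knobs people reach for when kswapd spins
  # on that kernel generation; values are examples, not a recommendation
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
  sysctl -w vm.min_free_kbytes=262144
  sysctl -w vm.swappiness=10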
Upgrading to Ubuntu 14.04 seems to have helped. The cluster isn't all
clear yet, but it's getting better: it's finally reporting healthy
after 2 weeks of incomplete and stale PGs. It's still fairly
unresponsive, but it's making progress. I am still seeing OSDs
consuming 100% CPU, but only the OSDs that are actively deep-scrubbing.
Once the deep-scrub finishes, the OSD starts behaving again. They seem
to be slowly getting better, which matches up with your explanation.
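(If the deep-scrub load had been hurting clients more than it is, my
understanding is that the cluster-wide scrub flags can pause it; I
haven't needed them here, and whether both flags exist may depend on
your release, so treat this as a note rather than advice:)

  ceph osd set nodeep-scrub     # stop scheduling new deep scrubs
  ceph osd set noscrub          # optionally stop regular scrubs as well
  # ...and once the cluster has settled down:
  ceph osd unset nodeep-scrub
  ceph osd unset noscrub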
I'll go ahead and set noup. I don't think it's necessary at this point,
but it's not going to hurt.
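For anyone finding this thread later, the whole sequence Sage described
comes out roughly like this. It's a sketch: osd.12 and the admin-socket
path are just examples, the JSON field names are from memory, and the
admin-socket status command isn't available on every release (see
below).

  ceph osd set noup
  ceph osd set nodown
  ceph osd set noout

  # current cluster epoch
  ceph osd stat

  # epoch a particular OSD has reached (run on that OSD's host)
  ceph daemon osd.12 status

  # or, for every OSD with an admin socket on this host
  # (field names assumed -- adjust if your version prints differently)
  for sock in /var/run/ceph/ceph-osd.*.asok; do
      echo -n "$sock: "
      ceph --admin-daemon "$sock" status | grep -E '"(oldest|newest)_map"'
  done

  # once the OSDs are within ~100 epochs of the cluster epoch:
  ceph osd unset noup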
I'm running Emperor, and it looks like the osd status command isn't
supported there. Not a big deal though. Deep-scrub has made it through
half of the PGs in the last 36 hours, so I'll just watch it for another
day or two. This is a slave cluster, so I have that luxury.
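(The way I'm watching deep-scrub progress is crude, by the way. The
line below just counts pg dump lines stamped with today's date; the
column layout differs between releases and this doesn't parse the
deep_scrub_stamp column properly, so it's only a rough gauge:)

  ceph pg dump 2>/dev/null | grep -c "$(date +%Y-%m-%d)"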
--
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email clewis-04jk9TcbgGYP2IHM84UzcNBPR1lH4CV8@public.gmane.org
*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website <http://www.centraldesktop.com/> | Twitter
<http://www.twitter.com/centraldesktop> | Facebook
<http://www.facebook.com/CentralDesktop> | LinkedIn
<http://www.linkedin.com/groups?gid=147417> | Blog
<http://cdblog.centraldesktop.com/>