Re: 70+ OSD are DOWN and not coming up

ceph-devel.vger.kernel.org archive mirror

From: Craig Lewis <clewis-04jk9TcbgGYP2IHM84UzcNBPR1lH4CV8@public.gmane.org>
To: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
Cc: ceph-users <ceph-users-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	Ceph Community
	<ceph-community-idqoXFIVOFJgJs9I8MT0rw@public.gmane.org>
Subject: Re: 70+  OSD are DOWN and not coming up
Date: Thu, 22 May 2014 00:26:52 -0700
Message-ID: <537DA6BC.20704@centraldesktop.com>
In-Reply-To: <alpine.DEB.2.00.1405212111030.12847-vIokxiIdD2AQNTJnQDzGJqxOck334EZe@public.gmane.org>


On 5/21/14 21:15 , Sage Weil wrote:
> On Wed, 21 May 2014, Craig Lewis wrote:
>> If you do this over IRC, can you please post a summary to the mailing
>> list?
>>
>> I believe I'm having this issue as well.
> In the other case, we found that some of the OSDs were behind processing
> maps (by several thousand epochs).  The trick here to give them a chance
> to catch up is
>
>   ceph osd set noup
>   ceph osd set nodown
>   ceph osd set noout
>
> and wait for them to stop spinning on the CPU.  You can check which map
> each OSD is on with
>
>   ceph daemon osd.NNN status
>
> to see which epoch they are on and compare that to
>
>   ceph osd stat
>
> Once they are within 100 epochs or fewer,
>
>   ceph osd unset noup
>
> and let them all start up.
>
> We haven't determined whether the original problem was caused by this or
> the other way around; we'll see once they are all caught up.
>
> sage

I was seeing the CPU spinning too, so I think it is the same issue. 
Thanks for the explanation!  I've been pulling my hair out for weeks.

I can give you a data point for the "how".  My problems started with a 
kswapd problem on Ubuntu 12.04.4 (kernel 3.5.0-46-generic
#70~precise1-Ubuntu).  kswapd was consuming 100% CPU, and it was 
blocking the ceph-osd processes.  Once I prevented kswapd from doing 
that, my OSDs couldn't recover.  noout and nodown didn't help; the OSDs 
would suicide and restart.

Upgrading to Ubuntu 14.04 seems to have helped.  The cluster isn't all
clear yet, but it's getting better.  After 2 weeks of incomplete and
stale PGs, the cluster is finally reporting healthy.  It's still
unresponsive, but it's making progress.  I am still seeing OSDs
consuming 100% CPU, but only the OSDs that are actively deep-scrubbing.
Once the deep-scrub finishes, the OSD starts behaving again.  They seem
to be slowly getting better, which matches up with your explanation.

I'll go ahead and set noup.  I don't think it's necessary at this point,
but it's not going to hurt.

I'm running Emperor, and it looks like osd status isn't supported.  Not a
big deal though.  Deep-scrub has made it through half of the PGs in the 
last 36 hours, so I'll just watch for another day or two. This is a 
slave cluster, so I have that luxury.
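
For anyone on a release where the status command does work, the
catch-up check Sage describes could be sketched roughly as below.  This
is only an illustration, not something I have run against this cluster:
the function names are made up, the 100-epoch threshold is the figure
from Sage's message, and the parser assumes the status output contains
a "newest_map" field with an integer value.

```shell
#!/bin/sh
# Sketch only: helpers for checking whether an OSD is close enough to the
# cluster's current osdmap epoch before clearing the noup flag.
# Function names are hypothetical; nothing here calls ceph by itself.

# epochs_behind CLUSTER_EPOCH OSD_NEWEST_MAP
# Prints how many epochs the OSD lags behind the cluster.
epochs_behind() {
    echo $(( $1 - $2 ))
}

# caught_up CLUSTER_EPOCH OSD_NEWEST_MAP [THRESHOLD]
# Exits 0 when the lag is at most THRESHOLD epochs (default 100).
caught_up() {
    lag=$(epochs_behind "$1" "$2")
    [ "$lag" -le "${3:-100}" ]
}

# newest_map_from_status
# Reads the output of `ceph daemon osd.NNN status` on stdin and prints the
# newest_map value.  Assumes the field looks like "newest_map": <integer>.
newest_map_from_status() {
    sed -n 's/.*"newest_map"[^0-9]*\([0-9][0-9]*\).*/\1/p'
}

# Example use on an OSD host (not executed here; osd.12 and the
# CLUSTER_EPOCH variable are placeholders, the epoch being the one
# reported by `ceph osd stat`):
#   osd_epoch=$(ceph daemon osd.12 status | newest_map_from_status)
#   caught_up "$CLUSTER_EPOCH" "$osd_epoch" && echo "osd.12 has caught up"
```

The helpers only compare numbers, so the cluster epoch still has to be
read off by hand and passed in.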

-- 

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email clewis-04jk9TcbgGYP2IHM84UzcNBPR1lH4CV8@public.gmane.org


