8 out of 12 OSDs died after expansion on 0.56.1 (void OSD::do_waiters())

From: Wido den Hollander <wido@widodh.nl>
To: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: 8 out of 12 OSDs died after expansion on 0.56.1 (void OSD::do_waiters())
Date: Wed, 16 Jan 2013 16:56:00 +0100
Message-ID: <50F6CD90.3070007@widodh.nl>

Hi,

I'm testing a small Ceph cluster with Asus C60M1-1 mainboards.

The setup is:
- AMD Fusion C60 CPU
- 8GB DDR3
- 1x Intel 520 120GB SSD (OS + Journaling)
- 4x 1TB disk

I had two of these systems running, but yesterday I wanted to add a 
third one.

So I had 8 OSDs (one per disk) running on 0.56.1 and I added one host 
bringing the total to 12.

The cluster went into a degraded state (about 50%) and started to 
recover until it reached roughly 48%.

In a matter of about 5 minutes all of the original 8 OSDs had crashed with 
the same backtrace:

     -1> 2013-01-15 17:20:29.058426 7f95a0fd8700 10 -- [2a00:f10:113:0:6051:e06c:df3:f374]:6803/4913 reaper done
      0> 2013-01-15 17:20:29.061054 7f959cfd0700 -1 osd/OSD.cc: In function 'void OSD::do_waiters()' thread 7f959cfd0700 time 2013-01-15 17:20:29.057714
osd/OSD.cc: 3318: FAILED assert(osd_lock.is_locked())

  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: (OSD::do_waiters()+0x2c3) [0x6251f3]
  2: (OSD::ms_dispatch(Message*)+0x1c4) [0x62d714]
  3: (DispatchQueue::entry()+0x349) [0x8ba289]
  4: (DispatchQueue::DispatchThread::entry()+0xd) [0x8137cd]
  5: (()+0x7e9a) [0x7f95a95dae9a]
  6: (clone()+0x6d) [0x7f95a805ecbd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.
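
For anyone checking their own OSD logs for the same crash, here is a minimal sketch that pulls the failed assertion (file, line number, condition) out of log text. The function name and regular expression are my own and assume the "FAILED assert(...)" line format shown above; this is not a Ceph tool.

```python
import re

# Assumed line format, taken from the backtrace above:
#   osd/OSD.cc: 3318: FAILED assert(osd_lock.is_locked())
ASSERT_RE = re.compile(
    r"^(?P<file>\S+): (?P<line>\d+): FAILED assert\((?P<cond>.*)\)$"
)


def find_failed_asserts(log_text):
    """Return a list of (file, line, condition) tuples found in log_text."""
    hits = []
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match:
            hits.append(
                (match.group("file"), int(match.group("line")), match.group("cond"))
            )
    return hits


sample = """osd/OSD.cc: 3318: FAILED assert(osd_lock.is_locked())
  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: (OSD::do_waiters()+0x2c3) [0x6251f3]"""

print(find_failed_asserts(sample))
```

Running it over the logs of all eight OSDs would show whether every crash hit the same assertion at the same line.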

So osd.0 - osd.7 were down and osd.8 - osd.11 (the new ones) were still 
running happily.

I have to note that during this recovery the load of the first two 
machines spiked to 10 and the CPUs were 0% idle.

This morning I started all the OSDs again with a default loglevel since 
I don't want to stress the CPUs even more.

I know the C60 CPU is kind of limited, but it's a test-case!

The recovery started again and it showed about 90MB/sec (Gbit network) 
coming into the new node.

After about 4 hours the recovery successfully completed:

1736 pgs: 1736 active+clean; 837 GB data, 1671 GB used, 9501 GB / 11172 
GB avail
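
As a sanity check on those capacity figures, here is a small sketch that parses the summary and derives a few ratios. The parsing function and field names are hypothetical; only the numbers come from the status line above. A raw-to-data ratio close to 2 would be consistent with two copies of each object, but that is an inference on my part and not something the status line states.

```python
import re


def parse_capacity(summary):
    """Extract the GB figures from a status summary and derive simple ratios."""
    text = " ".join(summary.split())  # undo mail-client line wrapping
    match = re.search(
        r"(\d+) GB data, (\d+) GB used, (\d+) GB / (\d+) GB avail", text
    )
    if match is None:
        raise ValueError("unrecognised summary format")
    data, used, avail, total = (int(g) for g in match.groups())
    return {
        "data_gb": data,
        "used_gb": used,
        "avail_gb": avail,
        "total_gb": total,
        # Raw space consumed per GB of stored data.
        "raw_to_data_ratio": used / data,
        # Fraction of the raw capacity that is in use.
        "used_fraction": used / total,
        # Used plus available should add up to the total.
        "used_plus_avail_equals_total": used + avail == total,
    }


stats = parse_capacity(
    "837 GB data, 1671 GB used, 9501 GB / 11172 \nGB avail"
)
for key, value in stats.items():
    print(key, value)
```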

Note that no high logging level was set on the OSDs prior to their crash, 
so I only have the default logs.

Nothing went wrong after I started them again; all 12 are up now.

Is this a known issue? If not, I'll file a bug in the tracker.

Wido
