All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wido den Hollander <wido@widodh.nl>
To: "ceph-devel@vger.kernel.org" <ceph-devel@vger.kernel.org>
Subject: "hit suicide timeout" message after upgrade to 0.56
Date: Thu, 03 Jan 2013 13:26:29 +0100	[thread overview]
Message-ID: <50E578F5.8070805@widodh.nl> (raw)

Hi,

I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening 
and found out this morning that I had 23 OSDs still up and in.

Investigating some logs I found these messages:

*********************************************************************
     -8> 2013-01-02 21:13:40.528936 7f9eb177a700  1 heartbeat_map 
is_healthy 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
     -7> 2013-01-02 21:13:40.528985 7f9eb177a700  1 heartbeat_map 
is_healthy 'OSD::op_tp thread 0x7f9ea375e700' had timed out after 30
     -6> 2013-01-02 21:13:41.311088 7f9eaff77700 10 monclient: 
_send_mon_message to mon.pri at 
[2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
     -5> 2013-01-02 21:13:45.047220 7f9e92282700  0 -- 
[2a00:f10:11b:cef0:225:90ff:fe32:cf64]:0/2882 >> 
[2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6805/2373 pipe(0x9d7ad80 sd=135 
:0 pgs=0 cs=0 l=1).fault
     -4> 2013-01-02 21:13:45.049225 7f9e962c2700  0 -- 
[2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6801/2882 >> 
[2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/2373 pipe(0x9d99000 sd=104 
:44363 pgs=99 cs=1 l=0).fault with nothing to send, going to standby
     -3> 2013-01-02 21:13:45.529075 7f9eb177a700  1 heartbeat_map 
is_healthy 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
     -2> 2013-01-02 21:13:45.529115 7f9eb177a700  1 heartbeat_map 
is_healthy 'OSD::op_tp thread 0x7f9ea2f5d700' had suicide timed out 
after 300
     -1> 2013-01-02 21:13:45.531952 7f9eb177a700 -1 
common/HeartbeatMap.cc: In function 'bool 
ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, 
time_t)' thread 7f9eb177a700 time 2013-01-02 21:13:45.529176
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

  ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, 
long)+0x107) [0x796877]
  2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x797207]
  3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x797453]
  4: (CephContextServiceThread::entry()+0x55) [0x8338d5]
  5: (()+0x7e9a) [0x7f9eb4571e9a]
  6: (clone()+0x6d) [0x7f9eb2ff5cbd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

      0> 2013-01-02 21:13:46.314478 7f9eaff77700 10 monclient: 
_send_mon_message to mon.pri at 
[2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
*********************************************************************

Reading these messages I'm trying to figure out why those messages came 
along.

Am I understanding this correctly that the heartbeat updates didn't come 
along in time and the OSDs committed suicide?

I read the code in common/HeartbeatMap.cc and it seems like that.

During the restart of the cluster the Atom CPUs were very busy, so could 
it be that the CPUs were just to busy and the OSDs weren't responding to 
heartbeats in time?

In total 16 of the 17 crashed OSDs are down with these log messages.

I'm now starting the 16 crashed OSDs one by one and that seems to go 
just fine.

I've set "osd recovery max active = 1" to prevent overloading the CPUs 
to much since I know Atoms are not that powerful. I'm just still trying 
to get it all working on them :)

Am I right this is probably a lack of CPU power during the heavy 
recovery which causes them to not respond to heartbeat updates in time?

Wido

             reply	other threads:[~2013-01-03 12:26 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-01-03 12:26 Wido den Hollander [this message]
2013-01-03 16:52 ` "hit suicide timeout" message after upgrade to 0.56 Sage Weil
2013-01-03 20:50   ` Wido den Hollander
2013-01-03 21:05     ` Sage Weil
2013-01-03 21:10       ` Wido den Hollander
2013-01-08 17:35   ` Gregory Farnum
2013-01-08 21:36     ` Sage Weil

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=50E578F5.8070805@widodh.nl \
    --to=wido@widodh.nl \
    --cc=ceph-devel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.