* Hit suicide timeout after adding new osd
@ 2013-01-17 14:35 Jens Kristian Søgaard
  2013-01-17 14:47 ` Wido den Hollander
  0 siblings, 1 reply; 37+ messages in thread
From: Jens Kristian Søgaard @ 2013-01-17 14:35 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi guys,

I had a functioning Ceph system that reported HEALTH_OK. It was running 
with 3 osds on 3 servers.

Then I added an extra osd on 1 of the servers using the commands from 
the documentation here:

http://ceph.com/docs/master/rados/operations/add-or-rm-osds/

Shortly after I did that, 2 of the existing osds crashed.

I restarted them and after some hours they were up and running again, 
but soon one of them crashed again - and a third existing osd crashed as 
well. I restarted those two and waited some hours for them to come up. A 
short while later one of them crashed again.

I then restarted that last one and watched the logs closely. The same 
pattern seems to repeat itself every time. It starts up doing its normal 
maintenance before going "up" (which takes a long while). Then it seems 
to be running, but logs the following every 5 seconds:

heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had timed 
out after 30
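
To see how often each thread trips this warning, lines of this shape can be tallied with a small script. This is only a sketch: the function names are made up for illustration, and it assumes the message sits on a single line in the actual log file (the wrapping above comes from the mail client). The optional "suicide " word is included so the same pattern also catches the fatal variant of the message.

```python
import re
from collections import Counter

# Pattern derived from the log line quoted above; not taken from Ceph itself.
PATTERN = re.compile(
    r"heartbeat_map is_healthy '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had (?P<suicide>suicide )?timed out after (?P<seconds>\d+)"
)


def parse_heartbeat_line(line):
    """Return a dict describing one heartbeat message, or None if no match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {
        "pool": m.group("pool"),
        "thread": m.group("thread"),
        "fatal": m.group("suicide") is not None,
        "seconds": int(m.group("seconds")),
    }


def count_timeouts(lines):
    """Count messages per (pool, thread, fatal) across an iterable of lines."""
    counts = Counter()
    for line in lines:
        rec = parse_heartbeat_line(line)
        if rec is not None:
            counts[(rec["pool"], rec["thread"], rec["fatal"])] += 1
    return counts
```

Feeding it an open log file, for example `count_timeouts(open(path))` with `path` pointing at the osd log, gives a per-thread tally.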

After some time it logs:

===================================================
heartbeat_map is_healthy 'OSD::op_tp thread 0x7f051b7f6700' had suicide 
timed out after 300

2013-01-17 15:24:35.051524 7f053f149700 -1 common/HeartbeatMap.cc: In 
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, 
const char*, time_t)' thread 7f053f149700 time 2013-01-17 15:24:33.849654
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, 
long)+0x2eb) [0x8462bb]
  2: (ceph::HeartbeatMap::is_healthy()+0x8e) [0x846a9e]
  3: (ceph::HeartbeatMap::check_touch_file()+0x28) [0x846cc8]
  4: (CephContextServiceThread::entry()+0x55) [0x8e01c5]
  5: /lib64/libpthread.so.0() [0x360de07d14]
  6: (clone()+0x6d) [0x360d6f167d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

2013-01-17 15:24:35.301183 7f053f149700 -1 *** Caught signal (Aborted) **
  in thread 7f053f149700

  ceph version 0.56.1 (e4a541624df62ef353e754391cbbb707f54b16f7)
  1: /usr/bin/ceph-osd() [0x82ea90]
  2: /lib64/libpthread.so.0() [0x360de0efe0]
  3: (gsignal()+0x35) [0x360d635925]
  4: (abort()+0x148) [0x360d6370d8]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x3611660dad]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.
===================================================
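
For readers unfamiliar with the two messages, the behaviour in the log is consistent with a two-threshold watchdog: a worker thread that has not checked in for 30 seconds is reported as unhealthy, and one that has not checked in for 300 seconds makes the daemon abort itself. The sketch below is an illustrative model only, not Ceph's HeartbeatMap code. The class and function names are hypothetical, and the two constants are simply the values seen in the log above.

```python
# Illustrative model of a two-threshold heartbeat check (not Ceph's code).
GRACE_SECONDS = 30     # value from the "timed out after 30" message
SUICIDE_SECONDS = 300  # value from the "suicide timed out after 300" message


class HeartbeatHandle:
    """Tracks when a worker thread last reported progress (hypothetical)."""

    def __init__(self, name, now):
        self.name = name
        self.last_touch = now

    def touch(self, now):
        # A healthy worker calls this regularly while doing its work.
        self.last_touch = now


def check(handle, now, grace=GRACE_SECONDS, suicide=SUICIDE_SECONDS):
    """Return True if healthy, False if past grace; raise if past suicide."""
    idle = now - handle.last_touch
    if idle > suicide:
        # In the log this corresponds to the failed assert and the abort.
        raise RuntimeError("hit suicide timeout")
    if idle > grace:
        # In the log this corresponds to the repeated warning.
        return False
    return True
```

In this model, a thread stuck on a single long operation never calls `touch`, so repeated checks first return `False` and eventually raise, which matches the sequence of repeated warnings followed by a crash.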

How can I avoid this? Is it a bug, or have I done something wrong?

I'm running Ceph 0.56.1 from the official RPMs on Fedora 17.
The underlying disks and network connectivity have been tested, and 
nothing seems to be wrong there.

Thanks in advance for your assistance!
-- 
Jens Kristian Søgaard, Mermaid Consulting ApS,
jens@mermaidconsulting.dk,
http://www.mermaidconsulting.com/



Thread overview: 37+ messages
2013-01-17 14:35 Hit suicide timeout after adding new osd Jens Kristian Søgaard
2013-01-17 14:47 ` Wido den Hollander
2013-01-17 14:50   ` Stefan Priebe
2013-01-17 15:33     ` Wido den Hollander
2013-01-17 15:37       ` Stefan Priebe
2013-01-17 17:17         ` Sage Weil
2013-01-17 20:32           ` Jens Kristian Søgaard
2013-01-17 22:03             ` Sage Weil
2013-01-18 11:24               ` Jens Kristian Søgaard
2013-01-18 21:28                 ` Sage Weil
2013-01-18 21:36                   ` Jens Kristian Søgaard
2013-01-18 21:44                     ` Sage Weil
2013-01-19  9:25                       ` Jens Kristian Søgaard
2013-01-19 16:44                         ` Sage Weil
2013-01-19 17:56                           ` Jens Kristian Søgaard
2013-01-19 18:19                             ` Sage Weil
2013-01-19 18:40                               ` Jens Kristian Søgaard
2013-01-19 20:08                                 ` Sage Weil
2013-01-19 20:29                                   ` Jens Kristian Søgaard
2013-01-19 22:04                                     ` Sage Weil
2013-01-21  0:14                                 ` Sage Weil
2013-01-21  6:59                                   ` Jens Kristian Søgaard
2013-01-21  7:11                                     ` Sage Weil
2013-01-23 12:14                                       ` Jens Kristian Søgaard
2013-01-23 12:26                                         ` Wido den Hollander
2013-01-23 12:29                                           ` Jens Kristian Søgaard
2013-01-23 13:13                                           ` Sage Weil
2013-01-23 20:59                                         ` Jens Kristian Søgaard
2013-01-23 22:56                                           ` Andrey Korolyov
2013-01-24  4:39                                             ` Sage Weil
2013-01-24  7:44                                               ` Andrey Korolyov
2013-01-24 18:01                                                 ` Sage Weil
2013-02-17 11:21                                                   ` Andrey Korolyov
2013-02-17 17:52                                                     ` Sage Weil
2013-01-24  4:28                                           ` Sage Weil
2013-01-24 10:08                                             ` Jens Kristian Søgaard
2013-01-24 18:06                                               ` Sage Weil
