"hit suicide timeout" message after upgrade to 0.56

All of lore.kernel.org
 help / color / mirror / Atom feed

* "hit suicide timeout" message after upgrade to 0.56
@ 2013-01-03 12:26 Wido den Hollander
  2013-01-03 16:52 ` Sage Weil
  0 siblings, 1 reply; 7+ messages in thread
From: Wido den Hollander @ 2013-01-03 12:26 UTC (permalink / raw)
  To: ceph-devel@vger.kernel.org

Hi,

I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening 
and found out this morning that I had 23 OSDs still up and in.

Investigating some logs I found these messages:

*********************************************************************
     -8> 2013-01-02 21:13:40.528936 7f9eb177a700  1 heartbeat_map 
is_healthy 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
     -7> 2013-01-02 21:13:40.528985 7f9eb177a700  1 heartbeat_map 
is_healthy 'OSD::op_tp thread 0x7f9ea375e700' had timed out after 30
     -6> 2013-01-02 21:13:41.311088 7f9eaff77700 10 monclient: 
_send_mon_message to mon.pri at 
[2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
     -5> 2013-01-02 21:13:45.047220 7f9e92282700  0 -- 
[2a00:f10:11b:cef0:225:90ff:fe32:cf64]:0/2882 >> 
[2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6805/2373 pipe(0x9d7ad80 sd=135 
:0 pgs=0 cs=0 l=1).fault
     -4> 2013-01-02 21:13:45.049225 7f9e962c2700  0 -- 
[2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6801/2882 >> 
[2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/2373 pipe(0x9d99000 sd=104 
:44363 pgs=99 cs=1 l=0).fault with nothing to send, going to standby
     -3> 2013-01-02 21:13:45.529075 7f9eb177a700  1 heartbeat_map 
is_healthy 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
     -2> 2013-01-02 21:13:45.529115 7f9eb177a700  1 heartbeat_map 
is_healthy 'OSD::op_tp thread 0x7f9ea2f5d700' had suicide timed out 
after 300
     -1> 2013-01-02 21:13:45.531952 7f9eb177a700 -1 
common/HeartbeatMap.cc: In function 'bool 
ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, 
time_t)' thread 7f9eb177a700 time 2013-01-02 21:13:45.529176
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")

  ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, 
long)+0x107) [0x796877]
  2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x797207]
  3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x797453]
  4: (CephContextServiceThread::entry()+0x55) [0x8338d5]
  5: (()+0x7e9a) [0x7f9eb4571e9a]
  6: (clone()+0x6d) [0x7f9eb2ff5cbd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.

      0> 2013-01-02 21:13:46.314478 7f9eaff77700 10 monclient: 
_send_mon_message to mon.pri at 
[2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
*********************************************************************

Reading these messages I'm trying to figure out why those messages came 
along.

Am I understanding this correctly that the heartbeat updates didn't come 
along in time and the OSDs committed suicide?

I read the code in common/HeartbeatMap.cc and it seems like that.

During the restart of the cluster the Atom CPUs were very busy, so could 
it be that the CPUs were just to busy and the OSDs weren't responding to 
heartbeats in time?

In total 16 of the 17 crashed OSDs are down with these log messages.

I'm now starting the 16 crashed OSDs one by one and that seems to go 
just fine.

I've set "osd recovery max active = 1" to prevent overloading the CPUs 
to much since I know Atoms are not that powerful. I'm just still trying 
to get it all working on them :)

Am I right this is probably a lack of CPU power during the heavy 
recovery which causes them to not respond to heartbeat updates in time?

Wido

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: "hit suicide timeout" message after upgrade to 0.56
  2013-01-03 12:26 "hit suicide timeout" message after upgrade to 0.56 Wido den Hollander
@ 2013-01-03 16:52 ` Sage Weil
  2013-01-03 20:50   ` Wido den Hollander
  2013-01-08 17:35   ` Gregory Farnum
  0 siblings, 2 replies; 7+ messages in thread
From: Sage Weil @ 2013-01-03 16:52 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel@vger.kernel.org

Hi Wido,

On Thu, 3 Jan 2013, Wido den Hollander wrote:
> Hi,
> 
> I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and
> found out this morning that I had 23 OSDs still up and in.
> 
> Investigating some logs I found these messages:

This sounds quite a bit #3714.  You might give wip-3714 a try...

sage


> 
> *********************************************************************
>     -8> 2013-01-02 21:13:40.528936 7f9eb177a700  1 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>     -7> 2013-01-02 21:13:40.528985 7f9eb177a700  1 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f9ea375e700' had timed out after 30
>     -6> 2013-01-02 21:13:41.311088 7f9eaff77700 10 monclient:
> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
>     -5> 2013-01-02 21:13:45.047220 7f9e92282700  0 --
> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:0/2882 >>
> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6805/2373 pipe(0x9d7ad80 sd=135 :0
> pgs=0 cs=0 l=1).fault
>     -4> 2013-01-02 21:13:45.049225 7f9e962c2700  0 --
> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6801/2882 >>
> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/2373 pipe(0x9d99000 sd=104 :44363
> pgs=99 cs=1 l=0).fault with nothing to send, going to standby
>     -3> 2013-01-02 21:13:45.529075 7f9eb177a700  1 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>     -2> 2013-01-02 21:13:45.529115 7f9eb177a700  1 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f9ea2f5d700' had suicide timed out after 300
>     -1> 2013-01-02 21:13:45.531952 7f9eb177a700 -1 common/HeartbeatMap.cc: In
> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const
> char*, time_t)' thread 7f9eb177a700 time 2013-01-02 21:13:45.529176
> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
> 
>  ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> long)+0x107) [0x796877]
>  2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x797207]
>  3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x797453]
>  4: (CephContextServiceThread::entry()+0x55) [0x8338d5]
>  5: (()+0x7e9a) [0x7f9eb4571e9a]
>  6: (clone()+0x6d) [0x7f9eb2ff5cbd]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> interpret this.
> 
>      0> 2013-01-02 21:13:46.314478 7f9eaff77700 10 monclient:
> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
> *********************************************************************
> 
> Reading these messages I'm trying to figure out why those messages came along.
> 
> Am I understanding this correctly that the heartbeat updates didn't come along
> in time and the OSDs committed suicide?
> 
> I read the code in common/HeartbeatMap.cc and it seems like that.
> 
> During the restart of the cluster the Atom CPUs were very busy, so could it be
> that the CPUs were just to busy and the OSDs weren't responding to heartbeats
> in time?
> 
> In total 16 of the 17 crashed OSDs are down with these log messages.
> 
> I'm now starting the 16 crashed OSDs one by one and that seems to go just
> fine.
> 
> I've set "osd recovery max active = 1" to prevent overloading the CPUs to much
> since I know Atoms are not that powerful. I'm just still trying to get it all
> working on them :)
> 
> Am I right this is probably a lack of CPU power during the heavy recovery
> which causes them to not respond to heartbeat updates in time?
> 
> Wido
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: "hit suicide timeout" message after upgrade to 0.56
  2013-01-03 16:52 ` Sage Weil
@ 2013-01-03 20:50   ` Wido den Hollander
  2013-01-03 21:05     ` Sage Weil
  2013-01-08 17:35   ` Gregory Farnum
  1 sibling, 1 reply; 7+ messages in thread
From: Wido den Hollander @ 2013-01-03 20:50 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org

Hi,

On 01/03/2013 05:52 PM, Sage Weil wrote:
> Hi Wido,
>
> On Thu, 3 Jan 2013, Wido den Hollander wrote:
>> Hi,
>>
>> I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and
>> found out this morning that I had 23 OSDs still up and in.
>>
>> Investigating some logs I found these messages:
>
> This sounds quite a bit #3714.  You might give wip-3714 a try...
>

I'll give it a try and see if I can reproduce. Probably just have to 
kill the whole cluster and restart everything to see if it behaves the 
same way.

I'll give it a try and let you know.

Wido

> sage
>
>
>>
>> *********************************************************************
>>      -8> 2013-01-02 21:13:40.528936 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>>      -7> 2013-01-02 21:13:40.528985 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea375e700' had timed out after 30
>>      -6> 2013-01-02 21:13:41.311088 7f9eaff77700 10 monclient:
>> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
>>      -5> 2013-01-02 21:13:45.047220 7f9e92282700  0 --
>> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:0/2882 >>
>> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6805/2373 pipe(0x9d7ad80 sd=135 :0
>> pgs=0 cs=0 l=1).fault
>>      -4> 2013-01-02 21:13:45.049225 7f9e962c2700  0 --
>> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6801/2882 >>
>> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/2373 pipe(0x9d99000 sd=104 :44363
>> pgs=99 cs=1 l=0).fault with nothing to send, going to standby
>>      -3> 2013-01-02 21:13:45.529075 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>>      -2> 2013-01-02 21:13:45.529115 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea2f5d700' had suicide timed out after 300
>>      -1> 2013-01-02 21:13:45.531952 7f9eb177a700 -1 common/HeartbeatMap.cc: In
>> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const
>> char*, time_t)' thread 7f9eb177a700 time 2013-01-02 21:13:45.529176
>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>
>>   ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>> long)+0x107) [0x796877]
>>   2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x797207]
>>   3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x797453]
>>   4: (CephContextServiceThread::entry()+0x55) [0x8338d5]
>>   5: (()+0x7e9a) [0x7f9eb4571e9a]
>>   6: (clone()+0x6d) [0x7f9eb2ff5cbd]
>>   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>>       0> 2013-01-02 21:13:46.314478 7f9eaff77700 10 monclient:
>> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
>> *********************************************************************
>>
>> Reading these messages I'm trying to figure out why those messages came along.
>>
>> Am I understanding this correctly that the heartbeat updates didn't come along
>> in time and the OSDs committed suicide?
>>
>> I read the code in common/HeartbeatMap.cc and it seems like that.
>>
>> During the restart of the cluster the Atom CPUs were very busy, so could it be
>> that the CPUs were just to busy and the OSDs weren't responding to heartbeats
>> in time?
>>
>> In total 16 of the 17 crashed OSDs are down with these log messages.
>>
>> I'm now starting the 16 crashed OSDs one by one and that seems to go just
>> fine.
>>
>> I've set "osd recovery max active = 1" to prevent overloading the CPUs to much
>> since I know Atoms are not that powerful. I'm just still trying to get it all
>> working on them :)
>>
>> Am I right this is probably a lack of CPU power during the heavy recovery
>> which causes them to not respond to heartbeat updates in time?
>>
>> Wido
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: "hit suicide timeout" message after upgrade to 0.56
  2013-01-03 20:50   ` Wido den Hollander
@ 2013-01-03 21:05     ` Sage Weil
  2013-01-03 21:10       ` Wido den Hollander
  0 siblings, 1 reply; 7+ messages in thread
From: Sage Weil @ 2013-01-03 21:05 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel@vger.kernel.org

On Thu, 3 Jan 2013, Wido den Hollander wrote:
> Hi,
> 
> On 01/03/2013 05:52 PM, Sage Weil wrote:
> > Hi Wido,
> > 
> > On Thu, 3 Jan 2013, Wido den Hollander wrote:
> > > Hi,
> > > 
> > > I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening
> > > and
> > > found out this morning that I had 23 OSDs still up and in.
> > > 
> > > Investigating some logs I found these messages:
> > 
> > This sounds quite a bit #3714.  You might give wip-3714 a try...
> > 
> 
> I'll give it a try and see if I can reproduce. Probably just have to kill the
> whole cluster and restart everything to see if it behaves the same way.
> 
> I'll give it a try and let you know.

You mean restart every ceph-osd with the updated code?  I hope/suspect 
that will do the trick.

BTW, these patches are now in 'testing'.. I'd run that code.

Thanks!
sage

> 
> Wido
> 
> > sage
> > 
> > 
> > > 
> > > *********************************************************************
> > >      -8> 2013-01-02 21:13:40.528936 7f9eb177a700  1 heartbeat_map
> > > is_healthy
> > > 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
> > >      -7> 2013-01-02 21:13:40.528985 7f9eb177a700  1 heartbeat_map
> > > is_healthy
> > > 'OSD::op_tp thread 0x7f9ea375e700' had timed out after 30
> > >      -6> 2013-01-02 21:13:41.311088 7f9eaff77700 10 monclient:
> > > _send_mon_message to mon.pri at
> > > [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
> > >      -5> 2013-01-02 21:13:45.047220 7f9e92282700  0 --
> > > [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:0/2882 >>
> > > [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6805/2373 pipe(0x9d7ad80 sd=135 :0
> > > pgs=0 cs=0 l=1).fault
> > >      -4> 2013-01-02 21:13:45.049225 7f9e962c2700  0 --
> > > [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6801/2882 >>
> > > [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/2373 pipe(0x9d99000 sd=104
> > > :44363
> > > pgs=99 cs=1 l=0).fault with nothing to send, going to standby
> > >      -3> 2013-01-02 21:13:45.529075 7f9eb177a700  1 heartbeat_map
> > > is_healthy
> > > 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
> > >      -2> 2013-01-02 21:13:45.529115 7f9eb177a700  1 heartbeat_map
> > > is_healthy
> > > 'OSD::op_tp thread 0x7f9ea2f5d700' had suicide timed out after 300
> > >      -1> 2013-01-02 21:13:45.531952 7f9eb177a700 -1
> > > common/HeartbeatMap.cc: In
> > > function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const
> > > char*, time_t)' thread 7f9eb177a700 time 2013-01-02 21:13:45.529176
> > > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
> > > 
> > >   ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
> > >   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> > > long)+0x107) [0x796877]
> > >   2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x797207]
> > >   3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x797453]
> > >   4: (CephContextServiceThread::entry()+0x55) [0x8338d5]
> > >   5: (()+0x7e9a) [0x7f9eb4571e9a]
> > >   6: (clone()+0x6d) [0x7f9eb2ff5cbd]
> > >   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> > > to
> > > interpret this.
> > > 
> > >       0> 2013-01-02 21:13:46.314478 7f9eaff77700 10 monclient:
> > > _send_mon_message to mon.pri at
> > > [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
> > > *********************************************************************
> > > 
> > > Reading these messages I'm trying to figure out why those messages came
> > > along.
> > > 
> > > Am I understanding this correctly that the heartbeat updates didn't come
> > > along
> > > in time and the OSDs committed suicide?
> > > 
> > > I read the code in common/HeartbeatMap.cc and it seems like that.
> > > 
> > > During the restart of the cluster the Atom CPUs were very busy, so could
> > > it be
> > > that the CPUs were just to busy and the OSDs weren't responding to
> > > heartbeats
> > > in time?
> > > 
> > > In total 16 of the 17 crashed OSDs are down with these log messages.
> > > 
> > > I'm now starting the 16 crashed OSDs one by one and that seems to go just
> > > fine.
> > > 
> > > I've set "osd recovery max active = 1" to prevent overloading the CPUs to
> > > much
> > > since I know Atoms are not that powerful. I'm just still trying to get it
> > > all
> > > working on them :)
> > > 
> > > Am I right this is probably a lack of CPU power during the heavy recovery
> > > which causes them to not respond to heartbeat updates in time?
> > > 
> > > Wido
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: "hit suicide timeout" message after upgrade to 0.56
  2013-01-03 21:05     ` Sage Weil
@ 2013-01-03 21:10       ` Wido den Hollander
  0 siblings, 0 replies; 7+ messages in thread
From: Wido den Hollander @ 2013-01-03 21:10 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel@vger.kernel.org



On 01/03/2013 10:05 PM, Sage Weil wrote:
> On Thu, 3 Jan 2013, Wido den Hollander wrote:
>> Hi,
>>
>> On 01/03/2013 05:52 PM, Sage Weil wrote:
>>> Hi Wido,
>>>
>>> On Thu, 3 Jan 2013, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening
>>>> and
>>>> found out this morning that I had 23 OSDs still up and in.
>>>>
>>>> Investigating some logs I found these messages:
>>>
>>> This sounds quite a bit #3714.  You might give wip-3714 a try...
>>>
>>
>> I'll give it a try and see if I can reproduce. Probably just have to kill the
>> whole cluster and restart everything to see if it behaves the same way.
>>
>> I'll give it a try and let you know.
>
> You mean restart every ceph-osd with the updated code?  I hope/suspect
> that will do the trick.
>

I need to restart for the new code indeed. But to trigger the heavy CPU 
usage I need to restart all the OSDs at once to make sure they are all 
very busy with recovery.

Not restart them with 5 minute intervals between each OSD.

That's my assumption.

Wido

> BTW, these patches are now in 'testing'.. I'd run that code.
>
> Thanks!
> sage
>
>>
>> Wido
>>
>>> sage
>>>
>>>
>>>>
>>>> *********************************************************************
>>>>       -8> 2013-01-02 21:13:40.528936 7f9eb177a700  1 heartbeat_map
>>>> is_healthy
>>>> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>>>>       -7> 2013-01-02 21:13:40.528985 7f9eb177a700  1 heartbeat_map
>>>> is_healthy
>>>> 'OSD::op_tp thread 0x7f9ea375e700' had timed out after 30
>>>>       -6> 2013-01-02 21:13:41.311088 7f9eaff77700 10 monclient:
>>>> _send_mon_message to mon.pri at
>>>> [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
>>>>       -5> 2013-01-02 21:13:45.047220 7f9e92282700  0 --
>>>> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:0/2882 >>
>>>> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6805/2373 pipe(0x9d7ad80 sd=135 :0
>>>> pgs=0 cs=0 l=1).fault
>>>>       -4> 2013-01-02 21:13:45.049225 7f9e962c2700  0 --
>>>> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6801/2882 >>
>>>> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/2373 pipe(0x9d99000 sd=104
>>>> :44363
>>>> pgs=99 cs=1 l=0).fault with nothing to send, going to standby
>>>>       -3> 2013-01-02 21:13:45.529075 7f9eb177a700  1 heartbeat_map
>>>> is_healthy
>>>> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>>>>       -2> 2013-01-02 21:13:45.529115 7f9eb177a700  1 heartbeat_map
>>>> is_healthy
>>>> 'OSD::op_tp thread 0x7f9ea2f5d700' had suicide timed out after 300
>>>>       -1> 2013-01-02 21:13:45.531952 7f9eb177a700 -1
>>>> common/HeartbeatMap.cc: In
>>>> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const
>>>> char*, time_t)' thread 7f9eb177a700 time 2013-01-02 21:13:45.529176
>>>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>>>
>>>>    ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>>>>    1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>>>> long)+0x107) [0x796877]
>>>>    2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x797207]
>>>>    3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x797453]
>>>>    4: (CephContextServiceThread::entry()+0x55) [0x8338d5]
>>>>    5: (()+0x7e9a) [0x7f9eb4571e9a]
>>>>    6: (clone()+0x6d) [0x7f9eb2ff5cbd]
>>>>    NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>>>> to
>>>> interpret this.
>>>>
>>>>        0> 2013-01-02 21:13:46.314478 7f9eaff77700 10 monclient:
>>>> _send_mon_message to mon.pri at
>>>> [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
>>>> *********************************************************************
>>>>
>>>> Reading these messages I'm trying to figure out why those messages came
>>>> along.
>>>>
>>>> Am I understanding this correctly that the heartbeat updates didn't come
>>>> along
>>>> in time and the OSDs committed suicide?
>>>>
>>>> I read the code in common/HeartbeatMap.cc and it seems like that.
>>>>
>>>> During the restart of the cluster the Atom CPUs were very busy, so could
>>>> it be
>>>> that the CPUs were just to busy and the OSDs weren't responding to
>>>> heartbeats
>>>> in time?
>>>>
>>>> In total 16 of the 17 crashed OSDs are down with these log messages.
>>>>
>>>> I'm now starting the 16 crashed OSDs one by one and that seems to go just
>>>> fine.
>>>>
>>>> I've set "osd recovery max active = 1" to prevent overloading the CPUs to
>>>> much
>>>> since I know Atoms are not that powerful. I'm just still trying to get it
>>>> all
>>>> working on them :)
>>>>
>>>> Am I right this is probably a lack of CPU power during the heavy recovery
>>>> which causes them to not respond to heartbeat updates in time?
>>>>
>>>> Wido
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: "hit suicide timeout" message after upgrade to 0.56
  2013-01-03 16:52 ` Sage Weil
  2013-01-03 20:50   ` Wido den Hollander
@ 2013-01-08 17:35   ` Gregory Farnum
  2013-01-08 21:36     ` Sage Weil
  1 sibling, 1 reply; 7+ messages in thread
From: Gregory Farnum @ 2013-01-08 17:35 UTC (permalink / raw)
  To: Sage Weil; +Cc: Wido den Hollander, ceph-devel@vger.kernel.org

I'm confused. Isn't the HeartbeatMap all about local thread
heartbeating (so, not pings with other OSDs)? I would assume the
upgrade and restart just caused a bunch of work and the CPUs got
overloaded.
-Greg

On Thu, Jan 3, 2013 at 8:52 AM, Sage Weil <sage@inktank.com> wrote:
> Hi Wido,
>
> On Thu, 3 Jan 2013, Wido den Hollander wrote:
>> Hi,
>>
>> I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and
>> found out this morning that I had 23 OSDs still up and in.
>>
>> Investigating some logs I found these messages:
>
> This sounds quite a bit #3714.  You might give wip-3714 a try...
>
> sage
>
>
>>
>> *********************************************************************
>>     -8> 2013-01-02 21:13:40.528936 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>>     -7> 2013-01-02 21:13:40.528985 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea375e700' had timed out after 30
>>     -6> 2013-01-02 21:13:41.311088 7f9eaff77700 10 monclient:
>> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
>>     -5> 2013-01-02 21:13:45.047220 7f9e92282700  0 --
>> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:0/2882 >>
>> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6805/2373 pipe(0x9d7ad80 sd=135 :0
>> pgs=0 cs=0 l=1).fault
>>     -4> 2013-01-02 21:13:45.049225 7f9e962c2700  0 --
>> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6801/2882 >>
>> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/2373 pipe(0x9d99000 sd=104 :44363
>> pgs=99 cs=1 l=0).fault with nothing to send, going to standby
>>     -3> 2013-01-02 21:13:45.529075 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
>>     -2> 2013-01-02 21:13:45.529115 7f9eb177a700  1 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f9ea2f5d700' had suicide timed out after 300
>>     -1> 2013-01-02 21:13:45.531952 7f9eb177a700 -1 common/HeartbeatMap.cc: In
>> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const
>> char*, time_t)' thread 7f9eb177a700 time 2013-01-02 21:13:45.529176
>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>
>>  ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
>>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
>> long)+0x107) [0x796877]
>>  2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x797207]
>>  3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x797453]
>>  4: (CephContextServiceThread::entry()+0x55) [0x8338d5]
>>  5: (()+0x7e9a) [0x7f9eb4571e9a]
>>  6: (clone()+0x6d) [0x7f9eb2ff5cbd]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>>      0> 2013-01-02 21:13:46.314478 7f9eaff77700 10 monclient:
>> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
>> *********************************************************************
>>
>> Reading these messages I'm trying to figure out why those messages came along.
>>
>> Am I understanding this correctly that the heartbeat updates didn't come along
>> in time and the OSDs committed suicide?
>>
>> I read the code in common/HeartbeatMap.cc and it seems like that.
>>
>> During the restart of the cluster the Atom CPUs were very busy, so could it be
>> that the CPUs were just to busy and the OSDs weren't responding to heartbeats
>> in time?
>>
>> In total 16 of the 17 crashed OSDs are down with these log messages.
>>
>> I'm now starting the 16 crashed OSDs one by one and that seems to go just
>> fine.
>>
>> I've set "osd recovery max active = 1" to prevent overloading the CPUs to much
>> since I know Atoms are not that powerful. I'm just still trying to get it all
>> working on them :)
>>
>> Am I right this is probably a lack of CPU power during the heavy recovery
>> which causes them to not respond to heartbeat updates in time?
>>
>> Wido
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: "hit suicide timeout" message after upgrade to 0.56
  2013-01-08 17:35   ` Gregory Farnum
@ 2013-01-08 21:36     ` Sage Weil
  0 siblings, 0 replies; 7+ messages in thread
From: Sage Weil @ 2013-01-08 21:36 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Wido den Hollander, ceph-devel@vger.kernel.org

On Tue, 8 Jan 2013, Gregory Farnum wrote:
> I'm confused. Isn't the HeartbeatMap all about local thread
> heartbeating (so, not pings with other OSDs)? I would assume the
> upgrade and restart just caused a bunch of work and the CPUs got
> overloaded.

It is.  In #3714's case, the OSD was down for a long time, and when it 
started a huge number of maps were queued up for the PGs and then the OSD 
activated (and booted, marked itself up).  The worker threads then had to 
calculate a huge number of prior_intervals (by loading old osdmaps and 
calculating mappings) and timed out.

sage


> -Greg
> 
> On Thu, Jan 3, 2013 at 8:52 AM, Sage Weil <sage@inktank.com> wrote:
> > Hi Wido,
> >
> > On Thu, 3 Jan 2013, Wido den Hollander wrote:
> >> Hi,
> >>
> >> I updated my 10 node 40 OSD cluster from 0.48 to 0.56 yesterday evening and
> >> found out this morning that I had 23 OSDs still up and in.
> >>
> >> Investigating some logs I found these messages:
> >
> > This sounds quite a bit #3714.  You might give wip-3714 a try...
> >
> > sage
> >
> >
> >>
> >> *********************************************************************
> >>     -8> 2013-01-02 21:13:40.528936 7f9eb177a700  1 heartbeat_map is_healthy
> >> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
> >>     -7> 2013-01-02 21:13:40.528985 7f9eb177a700  1 heartbeat_map is_healthy
> >> 'OSD::op_tp thread 0x7f9ea375e700' had timed out after 30
> >>     -6> 2013-01-02 21:13:41.311088 7f9eaff77700 10 monclient:
> >> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
> >>     -5> 2013-01-02 21:13:45.047220 7f9e92282700  0 --
> >> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:0/2882 >>
> >> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6805/2373 pipe(0x9d7ad80 sd=135 :0
> >> pgs=0 cs=0 l=1).fault
> >>     -4> 2013-01-02 21:13:45.049225 7f9e962c2700  0 --
> >> [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6801/2882 >>
> >> [2a00:f10:11b:cef0:225:90ff:fe33:49fe]:6804/2373 pipe(0x9d99000 sd=104 :44363
> >> pgs=99 cs=1 l=0).fault with nothing to send, going to standby
> >>     -3> 2013-01-02 21:13:45.529075 7f9eb177a700  1 heartbeat_map is_healthy
> >> 'OSD::op_tp thread 0x7f9ea2f5d700' had timed out after 30
> >>     -2> 2013-01-02 21:13:45.529115 7f9eb177a700  1 heartbeat_map is_healthy
> >> 'OSD::op_tp thread 0x7f9ea2f5d700' had suicide timed out after 300
> >>     -1> 2013-01-02 21:13:45.531952 7f9eb177a700 -1 common/HeartbeatMap.cc: In
> >> function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const
> >> char*, time_t)' thread 7f9eb177a700 time 2013-01-02 21:13:45.529176
> >> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
> >>
> >>  ceph version 0.56 (1a32f0a0b42f169a7b55ed48ec3208f6d4edc1e8)
> >>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*,
> >> long)+0x107) [0x796877]
> >>  2: (ceph::HeartbeatMap::is_healthy()+0x87) [0x797207]
> >>  3: (ceph::HeartbeatMap::check_touch_file()+0x23) [0x797453]
> >>  4: (CephContextServiceThread::entry()+0x55) [0x8338d5]
> >>  5: (()+0x7e9a) [0x7f9eb4571e9a]
> >>  6: (clone()+0x6d) [0x7f9eb2ff5cbd]
> >>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
> >> interpret this.
> >>
> >>      0> 2013-01-02 21:13:46.314478 7f9eaff77700 10 monclient:
> >> _send_mon_message to mon.pri at [2a00:f10:11b:cef0:230:48ff:fed3:b086]:6789/0
> >> *********************************************************************
> >>
> >> Reading these messages I'm trying to figure out why those messages came along.
> >>
> >> Am I understanding this correctly that the heartbeat updates didn't come along
> >> in time and the OSDs committed suicide?
> >>
> >> I read the code in common/HeartbeatMap.cc and it seems like that.
> >>
> >> During the restart of the cluster the Atom CPUs were very busy, so could it be
> >> that the CPUs were just to busy and the OSDs weren't responding to heartbeats
> >> in time?
> >>
> >> In total 16 of the 17 crashed OSDs are down with these log messages.
> >>
> >> I'm now starting the 16 crashed OSDs one by one and that seems to go just
> >> fine.
> >>
> >> I've set "osd recovery max active = 1" to prevent overloading the CPUs to much
> >> since I know Atoms are not that powerful. I'm just still trying to get it all
> >> working on them :)
> >>
> >> Am I right this is probably a lack of CPU power during the heavy recovery
> >> which causes them to not respond to heartbeat updates in time?
> >>
> >> Wido
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >>
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2013-01-08 21:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-01-03 12:26 "hit suicide timeout" message after upgrade to 0.56 Wido den Hollander
2013-01-03 16:52 ` Sage Weil
2013-01-03 20:50   ` Wido den Hollander
2013-01-03 21:05     ` Sage Weil
2013-01-03 21:10       ` Wido den Hollander
2013-01-08 17:35   ` Gregory Farnum
2013-01-08 21:36     ` Sage Weil

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.