OSD suicide

All of lore.kernel.org
 help / color / mirror / Atom feed

* OSD suicide
@ 2012-04-02 14:01 Бородин Владимир
  2012-04-02 14:55 ` Wido den Hollander
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Бородин Владимир @ 2012-04-02 14:01 UTC (permalink / raw)
  To: ceph-devel

Hi all.

I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
am trying to put into RADOS (through python API) 20 million objects 20
KB each. I have two problems:
1. the speed is not as good as I expect (but that's not the main problem now),
2. after I put 10 million objects, OSDs started to mark itself down
and out. The logs give something like that:

2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
2012-04-02 17:04:58.87
7752)
2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
2012-04-02 17:04:59.57
8079)
2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
2012-04-02 17:05:00.67
8421)
2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
2012-04-02 17:05:01.67
8751)
2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
2012-04-02 17:05:02.57
9069)
2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
common/HeartbeatMap.cc: In function 'bool
ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
time_t)' thread 7f2d2a213700 time 2012-04-02 17:
05:22.894637
common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
 ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
const*, long)+0x1fe) [0x7634ee]
 2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
 3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
 4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
 5: (()+0x69ca) [0x7f2d2beab9ca]
 6: (clone()+0x6d) [0x7f2d2a4fccdd]
 ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
 1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
const*, long)+0x1fe) [0x7634ee]
 2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
 3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
 4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
 5: (()+0x69ca) [0x7f2d2beab9ca]
 6: (clone()+0x6d) [0x7f2d2a4fccdd]
*** Caught signal (Aborted) **
 in thread 7f2d2a213700
 ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
 1: /usr/bin/ceph-osd() [0x661cb1]
 2: (()+0xf8f0) [0x7f2d2beb48f0]
 3: (gsignal()+0x35) [0x7f2d2a449a75]
 4: (abort()+0x180) [0x7f2d2a44d5c0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
 6: (()+0xb7736) [0x7f2d2acea736]
 7: (()+0xb7763) [0x7f2d2acea763]
 8: (()+0xb785e) [0x7f2d2acea85e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x841) [0x667541]
 10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
const*, long)+0x1fe) [0x7634ee]
 11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
 12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
 13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
 14: (()+0x69ca) [0x7f2d2beab9ca]
 15: (clone()+0x6d) [0x7f2d2a4fccdd]

Or something like that:

...
2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
'OSD::op_tp thread 0x7f7847369700' had timed out after 30
2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
'OSD::op_tp thread 0x7f7847369700' had timed out after 30
2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442 >>
84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
we reset (peer sent cseq 2), sending RESETSESSION
2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442 >>
84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
with nothing to send, going to standby
2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
dropping, sharing map
2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442 >>
84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
with nothing to send, going to standby
2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442 >>
84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
initiating reconnect
2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442 >>
84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
with nothing to send, going to standby
2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442 >>
84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
with nothing to send, going to standby
2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442 >>
84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
with nothing to send, going to standby
2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442 >>
84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
with nothing to send, going to standby
2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442 >>
84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
initiating reconnect
2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
2012-04-02 17:20:34.798943)
...
2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
2012-04-02 17:20:36.846704)
2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442 >>
84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
l=0).connect got RESETSESSION
2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
2012-04-02 17:21:51.678001)
...
2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442 >>
84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
l=0).connect got RESETSESSION
2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442 >>
84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
initiating reconnect
2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442 >>
84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
l=0).connect got RESETSESSION
2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
marked me down or wrong addr

In first case the OSD process terminates, it is marked out by the
cluster and re-replication is started. In second case the OSD resumes
over several time and then goes down again etc...

It seems that the problem is that the OSDs don't send/receive
heartbeat-messages, but why? The network seems to be good and time is
synchronized. The problem appeared only when I put a lot of objects
(and took more than a half of free space availible). I can give the
config file and any logs if needed.

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OSD suicide
  2012-04-02 14:01 OSD suicide Бородин Владимир
@ 2012-04-02 14:55 ` Wido den Hollander
  2012-04-04 18:26   ` Borodin Vladimir
  2012-04-02 20:14 ` Stefan Kleijkers
  2012-04-02 20:18 ` Stefan Kleijkers
  2 siblings, 1 reply; 10+ messages in thread
From: Wido den Hollander @ 2012-04-02 14:55 UTC (permalink / raw)
  To: Бородин Владимир
  Cc: ceph-devel

Hi,

On 04/02/2012 04:01 PM, Бородин Владимир wrote:
> Hi all.
>
> I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
> am trying to put into RADOS (through python API) 20 million objects 20
> KB each. I have two problems:
> 1. the speed is not as good as I expect (but that's not the main problem now),
> 2. after I put 10 million objects, OSDs started to mark itself down
> and out. The logs give something like that:
>
> 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
> 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:04:58.87
> 7752)
> 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:04:59.57
> 8079)
> 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:05:00.67
> 8421)
> 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:05:01.67
> 8751)
> 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:05:02.57
> 9069)
> 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
> 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
> common/HeartbeatMap.cc: In function 'bool
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
> time_t)' thread 7f2d2a213700 time 2012-04-02 17:
> 05:22.894637
> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x1fe) [0x7634ee]
>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>   5: (()+0x69ca) [0x7f2d2beab9ca]
>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x1fe) [0x7634ee]
>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>   5: (()+0x69ca) [0x7f2d2beab9ca]
>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
> *** Caught signal (Aborted) **
>   in thread 7f2d2a213700
>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>   1: /usr/bin/ceph-osd() [0x661cb1]
>   2: (()+0xf8f0) [0x7f2d2beb48f0]
>   3: (gsignal()+0x35) [0x7f2d2a449a75]
>   4: (abort()+0x180) [0x7f2d2a44d5c0]
>   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
>   6: (()+0xb7736) [0x7f2d2acea736]
>   7: (()+0xb7763) [0x7f2d2acea763]
>   8: (()+0xb785e) [0x7f2d2acea85e]
>   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x841) [0x667541]
>   10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x1fe) [0x7634ee]
>   11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>   12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>   13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>   14: (()+0x69ca) [0x7f2d2beab9ca]
>   15: (clone()+0x6d) [0x7f2d2a4fccdd]
>
> Or something like that:

I reported this one a couple of days ago: 
http://tracker.newdream.net/issues/2219

>
> ...
> 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
> 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
> 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
> 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
> 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
> we reset (peer sent cseq 2), sending RESETSESSION
> 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
> dropping, sharing map
> 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
> 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
> 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
> initiating reconnect
> 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
> 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
> 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
> 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
> initiating reconnect
> 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> 2012-04-02 17:20:34.798943)
> ...
> 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> 2012-04-02 17:20:36.846704)
> 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
> l=0).connect got RESETSESSION
> 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> 2012-04-02 17:21:51.678001)
> ...
> 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
> l=0).connect got RESETSESSION
> 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
> initiating reconnect
> 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
> l=0).connect got RESETSESSION
> 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
> marked me down or wrong addr
>
> In first case the OSD process terminates, it is marked out by the
> cluster and re-replication is started. In second case the OSD resumes
> over several time and then goes down again etc...
>
> It seems that the problem is that the OSDs don't send/receive
> heartbeat-messages, but why? The network seems to be good and time is
> synchronized. The problem appeared only when I put a lot of objects
> (and took more than a half of free space availible). I can give the
> config file and any logs if needed.

Last week new heartbeat code was merged into the master tree. The 
up/down going of OSDs was something I was seeing over the last months.

Sage wrote some new code, it might be worth trying that.

Since I'm running that code I haven't seen any suicides, so it could be 
related.

It's this commit: 
https://github.com/ceph/ceph/commit/374bef9c97266600b4c6b83100485d7250363213

Could you try to build from git?

git clone https://github.com/ceph/ceph.git
cd ceph
git submodule init
git submodule update
dpkg-buildpackage (if you are using Debian).

Wido

> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OSD suicide
  2012-04-02 14:55 ` Wido den Hollander
@ 2012-04-04 18:26   ` Borodin Vladimir
  2012-04-05 13:05     ` Wido den Hollander
  0 siblings, 1 reply; 10+ messages in thread
From: Borodin Vladimir @ 2012-04-04 18:26 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

Hello Wido,

Yesterday I built from git: $ ceph -v
ceph version 0.44.1-149-ge80126e
(commit:e80126ea689e9a972fbf09e8848fc4a2ade13c59)

The messages are a bit different but the problem seems to be the same.
Below is a part of the osd.48.log. Should I turn debug osd mode more
verbose or give any other information?

Regards,
Vladimir.

2012-04-04 22:17:35.854156 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:15.854125)
2012-04-04 22:17:35.854194 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20588.898191 seconds old
2012-04-04 22:17:36.854360 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:16.854343)
2012-04-04 22:17:36.854405 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20589.898402 seconds old
2012-04-04 22:17:37.027882 7fd55e6f3700 -- 84.201.161.48:0/22705
send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
22:17:37.027859) v2 because of no pipe on con 0x391d89b0
2012-04-04 22:17:37.027908 7fd55e6f3700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:17.027899)
2012-04-04 22:17:37.854569 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:17.854552)
2012-04-04 22:17:37.854593 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20590.898589 seconds old
2012-04-04 22:17:38.854768 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:18.854750)
2012-04-04 22:17:38.854795 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20591.898792 seconds old
2012-04-04 22:17:39.854960 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:19.854943)
2012-04-04 22:17:39.854986 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20592.898982 seconds old
2012-04-04 22:17:39.928092 7fd55e6f3700 -- 84.201.161.48:0/22705
send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
22:17:39.928071) v2 because of no pipe on con 0x391d89b0
2012-04-04 22:17:39.928117 7fd55e6f3700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:19.928107)
2012-04-04 22:17:40.087719 7fd57a531700 heartbeat_map is_healthy
'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
2012-04-04 22:17:40.855154 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:20.855137)
2012-04-04 22:17:40.855178 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20593.899174 seconds old
2012-04-04 22:17:41.855368 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:21.855350)
2012-04-04 22:17:41.855409 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20594.899406 seconds old
2012-04-04 22:17:42.855579 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:22.855562)
2012-04-04 22:17:42.855605 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20595.899601 seconds old
2012-04-04 22:17:43.855768 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:23.855751)
2012-04-04 22:17:43.855793 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20596.899789 seconds old
2012-04-04 22:17:44.628337 7fd55e6f3700 -- 84.201.161.48:0/22705
send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
22:17:44.628298) v2 because of no pipe on con 0x391d89b0
2012-04-04 22:17:44.628373 7fd55e6f3700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:24.628363)
2012-04-04 22:17:44.855969 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:24.855953)
2012-04-04 22:17:44.856001 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20597.899997 seconds old
2012-04-04 22:17:45.087874 7fd57a531700 heartbeat_map is_healthy
'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
2012-04-04 22:17:45.856191 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:25.856162)
2012-04-04 22:17:45.856230 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20598.900227 seconds old
2012-04-04 22:17:46.856388 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:26.856371)
2012-04-04 22:17:46.856432 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20599.900428 seconds old
2012-04-04 22:17:47.856603 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:27.856586)
2012-04-04 22:17:47.856628 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20600.900625 seconds old
2012-04-04 22:17:48.128570 7fd55e6f3700 -- 84.201.161.48:0/22705
send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
22:17:48.128545) v2 because of no pipe on con 0x391d89b0
2012-04-04 22:17:48.128595 7fd55e6f3700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:28.128587)
2012-04-04 22:17:48.856819 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:28.856791)
2012-04-04 22:17:48.856858 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20601.900854 seconds old
2012-04-04 22:17:49.857044 7fd575527700 osd.48 1910 heartbeat_check:
no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
2012-04-04 22:17:29.857019)
2012-04-04 22:17:49.857080 7fd575527700 --OSD::tracker--
ops_in_flight.size: 9; oldest is 20602.901076 seconds old
2012-04-04 22:17:50.088035 7fd57a531700 heartbeat_map is_healthy
'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60

2012/4/2 Wido den Hollander <wido@widodh.nl>:
> Hi,
>
>
> On 04/02/2012 04:01 PM, Бородин Владимир wrote:
>>
>> Hi all.
>>
>> I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
>> am trying to put into RADOS (through python API) 20 million objects 20
>> KB each. I have two problems:
>> 1. the speed is not as good as I expect (but that's not the main problem
>> now),
>> 2. after I put 10 million objects, OSDs started to mark itself down
>> and out. The logs give something like that:
>>
>> 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>> 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:04:58.87
>> 7752)
>> 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:04:59.57
>> 8079)
>> 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:05:00.67
>> 8421)
>> 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:05:01.67
>> 8751)
>> 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:05:02.57
>> 9069)
>> 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>> 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
>> common/HeartbeatMap.cc: In function 'bool
>> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
>> time_t)' thread 7f2d2a213700 time 2012-04-02 17:
>> 05:22.894637
>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>  ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>> const*, long)+0x1fe) [0x7634ee]
>>  2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>  3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>  4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>  5: (()+0x69ca) [0x7f2d2beab9ca]
>>  6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>  ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>> const*, long)+0x1fe) [0x7634ee]
>>  2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>  3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>  4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>  5: (()+0x69ca) [0x7f2d2beab9ca]
>>  6: (clone()+0x6d) [0x7f2d2a4fccdd]
>> *** Caught signal (Aborted) **
>>  in thread 7f2d2a213700
>>  ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>  1: /usr/bin/ceph-osd() [0x661cb1]
>>  2: (()+0xf8f0) [0x7f2d2beb48f0]
>>  3: (gsignal()+0x35) [0x7f2d2a449a75]
>>  4: (abort()+0x180) [0x7f2d2a44d5c0]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
>>  6: (()+0xb7736) [0x7f2d2acea736]
>>  7: (()+0xb7763) [0x7f2d2acea763]
>>  8: (()+0xb785e) [0x7f2d2acea85e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x841) [0x667541]
>>  10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>> const*, long)+0x1fe) [0x7634ee]
>>  11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>  12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>  13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>  14: (()+0x69ca) [0x7f2d2beab9ca]
>>  15: (clone()+0x6d) [0x7f2d2a4fccdd]
>>
>> Or something like that:
>
>
> I reported this one a couple of days ago:
> http://tracker.newdream.net/issues/2219
>
>
>>
>> ...
>> 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>> 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>> 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>> 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>> 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
>> we reset (peer sent cseq 2), sending RESETSESSION
>> 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
>> dropping, sharing map
>> 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
>> 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
>> 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
>> initiating reconnect
>> 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
>> 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
>> 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
>> initiating reconnect
>> 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>> 2012-04-02 17:20:34.798943)
>> ...
>> 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>> 2012-04-02 17:20:36.846704)
>> 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
>> l=0).connect got RESETSESSION
>> 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>> 2012-04-02 17:21:51.678001)
>> ...
>> 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
>> l=0).connect got RESETSESSION
>> 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
>> initiating reconnect
>> 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
>> l=0).connect got RESETSESSION
>> 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
>> marked me down or wrong addr
>>
>> In first case the OSD process terminates, it is marked out by the
>> cluster and re-replication is started. In second case the OSD resumes
>> over several time and then goes down again etc...
>>
>> It seems that the problem is that the OSDs don't send/receive
>> heartbeat-messages, but why? The network seems to be good and time is
>> synchronized. The problem appeared only when I put a lot of objects
>> (and took more than a half of free space availible). I can give the
>> config file and any logs if needed.
>
>
> Last week new heartbeat code was merged into the master tree. The up/down
> going of OSDs was something I was seeing over the last months.
>
> Sage wrote some new code, it might be worth trying that.
>
> Since I'm running that code I haven't seen any suicides, so it could be
> related.
>
> It's this commit:
> https://github.com/ceph/ceph/commit/374bef9c97266600b4c6b83100485d7250363213
>
> Could you try to build from git?
>
> git clone https://github.com/ceph/ceph.git
> cd ceph
> git submodule init
> git submodule update
> dpkg-buildpackage (if you are using Debian).
>
> Wido
>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OSD suicide
  2012-04-04 18:26   ` Borodin Vladimir
@ 2012-04-05 13:05     ` Wido den Hollander
  2012-04-05 14:46       ` Borodin Vladimir
  0 siblings, 1 reply; 10+ messages in thread
From: Wido den Hollander @ 2012-04-05 13:05 UTC (permalink / raw)
  To: Borodin Vladimir; +Cc: ceph-devel

Hi,

On 04/04/2012 08:26 PM, Borodin Vladimir wrote:
> Hello Wido,
>
> Yesterday I built from git: $ ceph -v
> ceph version 0.44.1-149-ge80126e
> (commit:e80126ea689e9a972fbf09e8848fc4a2ade13c59)

That commit should contain the new heartbeat code.

>
> The messages are a bit different but the problem seems to be the same.
> Below is a part of the osd.48.log. Should I turn debug osd mode more
> verbose or give any other information?

Are the other OSDs also saying that osd.49 is down? Right now it seems 
to me that osd.49 is really down. The "bouncing" I was seeing before 
caused a OSD to mark multiple OSDs down at once, it wouldn't complain 
over and over about the same OSD.

Are you still running XFS with mdraid under the OSDs?

Wido

>
> Regards,
> Vladimir.
>
> 2012-04-04 22:17:35.854156 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:15.854125)
> 2012-04-04 22:17:35.854194 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20588.898191 seconds old
> 2012-04-04 22:17:36.854360 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:16.854343)
> 2012-04-04 22:17:36.854405 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20589.898402 seconds old
> 2012-04-04 22:17:37.027882 7fd55e6f3700 -- 84.201.161.48:0/22705
> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
> 22:17:37.027859) v2 because of no pipe on con 0x391d89b0
> 2012-04-04 22:17:37.027908 7fd55e6f3700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:17.027899)
> 2012-04-04 22:17:37.854569 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:17.854552)
> 2012-04-04 22:17:37.854593 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20590.898589 seconds old
> 2012-04-04 22:17:38.854768 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:18.854750)
> 2012-04-04 22:17:38.854795 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20591.898792 seconds old
> 2012-04-04 22:17:39.854960 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:19.854943)
> 2012-04-04 22:17:39.854986 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20592.898982 seconds old
> 2012-04-04 22:17:39.928092 7fd55e6f3700 -- 84.201.161.48:0/22705
> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
> 22:17:39.928071) v2 because of no pipe on con 0x391d89b0
> 2012-04-04 22:17:39.928117 7fd55e6f3700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:19.928107)
> 2012-04-04 22:17:40.087719 7fd57a531700 heartbeat_map is_healthy
> 'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
> 2012-04-04 22:17:40.855154 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:20.855137)
> 2012-04-04 22:17:40.855178 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20593.899174 seconds old
> 2012-04-04 22:17:41.855368 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:21.855350)
> 2012-04-04 22:17:41.855409 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20594.899406 seconds old
> 2012-04-04 22:17:42.855579 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:22.855562)
> 2012-04-04 22:17:42.855605 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20595.899601 seconds old
> 2012-04-04 22:17:43.855768 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:23.855751)
> 2012-04-04 22:17:43.855793 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20596.899789 seconds old
> 2012-04-04 22:17:44.628337 7fd55e6f3700 -- 84.201.161.48:0/22705
> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
> 22:17:44.628298) v2 because of no pipe on con 0x391d89b0
> 2012-04-04 22:17:44.628373 7fd55e6f3700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:24.628363)
> 2012-04-04 22:17:44.855969 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:24.855953)
> 2012-04-04 22:17:44.856001 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20597.899997 seconds old
> 2012-04-04 22:17:45.087874 7fd57a531700 heartbeat_map is_healthy
> 'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
> 2012-04-04 22:17:45.856191 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:25.856162)
> 2012-04-04 22:17:45.856230 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20598.900227 seconds old
> 2012-04-04 22:17:46.856388 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:26.856371)
> 2012-04-04 22:17:46.856432 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20599.900428 seconds old
> 2012-04-04 22:17:47.856603 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:27.856586)
> 2012-04-04 22:17:47.856628 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20600.900625 seconds old
> 2012-04-04 22:17:48.128570 7fd55e6f3700 -- 84.201.161.48:0/22705
> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
> 22:17:48.128545) v2 because of no pipe on con 0x391d89b0
> 2012-04-04 22:17:48.128595 7fd55e6f3700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:28.128587)
> 2012-04-04 22:17:48.856819 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:28.856791)
> 2012-04-04 22:17:48.856858 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20601.900854 seconds old
> 2012-04-04 22:17:49.857044 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:29.857019)
> 2012-04-04 22:17:49.857080 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20602.901076 seconds old
> 2012-04-04 22:17:50.088035 7fd57a531700 heartbeat_map is_healthy
> 'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
>
> 2012/4/2 Wido den Hollander<wido@widodh.nl>:
>> Hi,
>>
>>
>> On 04/02/2012 04:01 PM, Бородин Владимир wrote:
>>>
>>> Hi all.
>>>
>>> I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
>>> am trying to put into RADOS (through python API) 20 million objects 20
>>> KB each. I have two problems:
>>> 1. the speed is not as good as I expect (but that's not the main problem
>>> now),
>>> 2. after I put 10 million objects, OSDs started to mark itself down
>>> and out. The logs give something like that:
>>>
>>> 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>>> 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:04:58.87
>>> 7752)
>>> 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:04:59.57
>>> 8079)
>>> 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:05:00.67
>>> 8421)
>>> 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:05:01.67
>>> 8751)
>>> 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:05:02.57
>>> 9069)
>>> 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>>> 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
>>> common/HeartbeatMap.cc: In function 'bool
>>> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
>>> time_t)' thread 7f2d2a213700 time 2012-04-02 17:
>>> 05:22.894637
>>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>> const*, long)+0x1fe) [0x7634ee]
>>>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>   5: (()+0x69ca) [0x7f2d2beab9ca]
>>>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>> const*, long)+0x1fe) [0x7634ee]
>>>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>   5: (()+0x69ca) [0x7f2d2beab9ca]
>>>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>> *** Caught signal (Aborted) **
>>>   in thread 7f2d2a213700
>>>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>   1: /usr/bin/ceph-osd() [0x661cb1]
>>>   2: (()+0xf8f0) [0x7f2d2beb48f0]
>>>   3: (gsignal()+0x35) [0x7f2d2a449a75]
>>>   4: (abort()+0x180) [0x7f2d2a44d5c0]
>>>   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
>>>   6: (()+0xb7736) [0x7f2d2acea736]
>>>   7: (()+0xb7763) [0x7f2d2acea763]
>>>   8: (()+0xb785e) [0x7f2d2acea85e]
>>>   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x841) [0x667541]
>>>   10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>> const*, long)+0x1fe) [0x7634ee]
>>>   11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>   12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>   13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>   14: (()+0x69ca) [0x7f2d2beab9ca]
>>>   15: (clone()+0x6d) [0x7f2d2a4fccdd]
>>>
>>> Or something like that:
>>
>>
>> I reported this one a couple of days ago:
>> http://tracker.newdream.net/issues/2219
>>
>>
>>>
>>> ...
>>> 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>>> 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>>> 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
>>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>>> 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
>>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>>> 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
>>> we reset (peer sent cseq 2), sending RESETSESSION
>>> 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
>>> dropping, sharing map
>>> 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
>>> 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
>>> 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
>>> initiating reconnect
>>> 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
>>> 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
>>> 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
>>> initiating reconnect
>>> 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>> 2012-04-02 17:20:34.798943)
>>> ...
>>> 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>> 2012-04-02 17:20:36.846704)
>>> 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
>>> l=0).connect got RESETSESSION
>>> 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>> 2012-04-02 17:21:51.678001)
>>> ...
>>> 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
>>> l=0).connect got RESETSESSION
>>> 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
>>> initiating reconnect
>>> 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
>>> l=0).connect got RESETSESSION
>>> 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
>>> marked me down or wrong addr
>>>
>>> In first case the OSD process terminates, it is marked out by the
>>> cluster and re-replication is started. In second case the OSD resumes
>>> over several time and then goes down again etc...
>>>
>>> It seems that the problem is that the OSDs don't send/receive
>>> heartbeat-messages, but why? The network seems to be good and time is
>>> synchronized. The problem appeared only when I put a lot of objects
>>> (and took more than a half of free space availible). I can give the
>>> config file and any logs if needed.
>>
>>
>> Last week new heartbeat code was merged into the master tree. The up/down
>> going of OSDs was something I was seeing over the last months.
>>
>> Sage wrote some new code, it might be worth trying that.
>>
>> Since I'm running that code I haven't seen any suicides, so it could be
>> related.
>>
>> It's this commit:
>> https://github.com/ceph/ceph/commit/374bef9c97266600b4c6b83100485d7250363213
>>
>> Could you try to build from git?
>>
>> git clone https://github.com/ceph/ceph.git
>> cd ceph
>> git submodule init
>> git submodule update
>> dpkg-buildpackage (if you are using Debian).
>>
>> Wido
>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OSD suicide
  2012-04-05 13:05     ` Wido den Hollander
@ 2012-04-05 14:46       ` Borodin Vladimir
  0 siblings, 0 replies; 10+ messages in thread
From: Borodin Vladimir @ 2012-04-05 14:46 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

Yes, I'm still running XFS with mdraid under the OSDs. Do you
recommend another configuration?

Right now I have:
$ ceph osd dump -o - | grep osd
dumped osdmap epoch 2311
max_osd 51
osd.47 down in  weight 1 up_from 2284 up_thru 2301 down_at 2310
last_clean_interval [1782,2276) 84.201.161.47:6800/11790
84.201.161.47:6801/11790 84.201.161.47:6802/11790 exists
osd.48 down in  weight 1 up_from 2280 up_thru 2285 down_at 2309
last_clean_interval [1781,2268) 84.201.161.48:6800/16880
84.201.161.48:6801/16880 84.201.161.48:6802/16880 exists
osd.49 up   in  weight 1 up_from 2285 up_thru 2304 down_at 2274
last_clean_interval [1785,2268) 84.201.161.49:6801/5228
84.201.161.49:6802/5228 84.201.161.49:6803/5228 exists,up
osd.50 up   in  weight 1 up_from 2306 up_thru 2310 down_at 2304
last_clean_interval [2287,2299) 84.201.161.50:6801/5422
84.201.161.50:6802/5422 84.201.161.50:6803/5422 exists,up

Osd.47 and osd.48 just fell. Their log has these messages:

2012-04-05 18:36:35.003031 7fe0d7ff7700 -- 84.201.161.47:0/11790
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:36:35.002950) v2 because of no pipe on con 0x2890da0
2012-04-05 18:36:35.503292 7fe0d7ff7700 -- 84.201.161.47:0/11790
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:36:35.503204) v2 because of no pipe on con 0x2890da0
2012-04-05 18:36:36.243577 7fe0db7fe700 -- 84.201.161.47:6800/11790
send_keepalive con 0x7fe0c5448920, no pipe.
2012-04-05 18:36:36.629511 7fe0d70bd700 -- 84.201.161.47:6801/11790 >>
84.201.161.50:6802/29817 pipe(0x34ee010 sd=85 pgs=2 cs=2 l=0).connect
claims to be 84.201.161.50:6802/5422 not 84.201.161.50:6802/29817 -
wrong node!
2012-04-05 18:36:36.822603 7fe0d70bd700 -- 84.201.161.47:6801/11790 >>
84.201.161.50:6802/29817 pipe(0x34ee010 sd=85 pgs=2 cs=2 l=0).connect
claims to be 84.201.161.50:6802/5422 not 84.201.161.50:6802/29817 -
wrong node!
2012-04-05 18:36:37.066666 7fe0d70bd700 -- 84.201.161.47:6801/11790 >>
84.201.161.50:6802/29817 pipe(0x34ee010 sd=85 pgs=2 cs=2 l=0).connect
claims to be 84.201.161.50:6802/5422 not 84.201.161.50:6802/29817 -
wrong node!
2012-04-05 18:36:37.803564 7fe0d7ff7700 -- 84.201.161.47:0/11790
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:36:37.803480) v2 because of no pipe on con 0x2890da0
2012-04-05 18:36:40.103824 7fe0d7ff7700 -- 84.201.161.47:0/11790
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:36:40.103740) v2 because of no pipe on con 0x2890da0

2012-04-05 18:35:56.999079 7f2eb0fe9700 -- 84.201.161.48:0/16880
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:35:56.998985) v2 because of no pipe on con 0x7f2ebd5c6360
2012-04-05 18:35:58.699376 7f2eb0fe9700 -- 84.201.161.48:0/16880
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:35:58.699273) v2 because of no pipe on con 0x7f2ebd5c6360
2012-04-05 18:35:59.799640 7f2eb0fe9700 -- 84.201.161.48:0/16880
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:35:59.799558) v2 because of no pipe on con 0x7f2ebd5c6360
2012-04-05 18:36:04.499920 7f2eb0fe9700 -- 84.201.161.48:0/16880
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:36:04.499823) v2 because of no pipe on con 0x7f2ebd5c6360
2012-04-05 18:36:10.400198 7f2eb0fe9700 -- 84.201.161.48:0/16880
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:36:10.400106) v2 because of no pipe on con 0x7f2ebd5c6360
2012-04-05 18:36:11.500497 7f2eb0fe9700 -- 84.201.161.48:0/16880
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:36:11.500392) v2 because of no pipe on con 0x7f2ebd5c6360
2012-04-05 18:36:13.200759 7f2eb0fe9700 -- 84.201.161.48:0/16880
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:36:13.200677) v2 because of no pipe on con 0x7f2ebd5c6360
2012-04-05 18:36:16.701023 7f2eb0fe9700 -- 84.201.161.48:0/16880
send_message dropped message osd_ping(ping e0 stamp 2012-04-05
18:36:16.700941) v2 because of no pipe on con 0x7f2ebd5c6360

At the time of writing previous letter "ceph osd dump -o -" said me
that osd.48 was down and out. That's why I posted a piece of it's log.

Can the problem be related to recovery of the data?
I have often OSD crashes (due to suicide for example) => PGs are
frequently moving, re-replicating and recovering => OSDs are busy with
data recovery and they do not have enough time to answer heartbeat
messages. Or it is some kind of science fiction? ☺
I can't remember the time when I last had all PGs in active+clean
state and I'm not sure the cluster will be able to recover to normal
state now.

Regards,
Vladimir.

5 апреля 2012 г. 17:05 пользователь Wido den Hollander <wido@widodh.nl> написал:
> Hi,
>
>
> On 04/04/2012 08:26 PM, Borodin Vladimir wrote:
>>
>> Hello Wido,
>>
>> Yesterday I built from git: $ ceph -v
>> ceph version 0.44.1-149-ge80126e
>> (commit:e80126ea689e9a972fbf09e8848fc4a2ade13c59)
>
>
> That commit should contain the new heartbeat code.
>
>
>>
>> The messages are a bit different but the problem seems to be the same.
>> Below is a part of the osd.48.log. Should I turn debug osd mode more
>> verbose or give any other information?
>
>
> Are the other OSDs also saying that osd.49 is down? Right now it seems to me
> that osd.49 is really down. The "bouncing" I was seeing before caused a OSD
> to mark multiple OSDs down at once, it wouldn't complain over and over about
> the same OSD.
>
> Are you still running XFS with mdraid under the OSDs?
>
> Wido
>
>
>>
>> Regards,
>> Vladimir.
>>
>> 2012-04-04 22:17:35.854156 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:15.854125)
>> 2012-04-04 22:17:35.854194 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20588.898191 seconds old
>> 2012-04-04 22:17:36.854360 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:16.854343)
>> 2012-04-04 22:17:36.854405 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20589.898402 seconds old
>> 2012-04-04 22:17:37.027882 7fd55e6f3700 -- 84.201.161.48:0/22705
>> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
>> 22:17:37.027859) v2 because of no pipe on con 0x391d89b0
>> 2012-04-04 22:17:37.027908 7fd55e6f3700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:17.027899)
>> 2012-04-04 22:17:37.854569 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:17.854552)
>> 2012-04-04 22:17:37.854593 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20590.898589 seconds old
>> 2012-04-04 22:17:38.854768 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:18.854750)
>> 2012-04-04 22:17:38.854795 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20591.898792 seconds old
>> 2012-04-04 22:17:39.854960 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:19.854943)
>> 2012-04-04 22:17:39.854986 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20592.898982 seconds old
>> 2012-04-04 22:17:39.928092 7fd55e6f3700 -- 84.201.161.48:0/22705
>> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
>> 22:17:39.928071) v2 because of no pipe on con 0x391d89b0
>> 2012-04-04 22:17:39.928117 7fd55e6f3700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:19.928107)
>> 2012-04-04 22:17:40.087719 7fd57a531700 heartbeat_map is_healthy
>> 'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
>> 2012-04-04 22:17:40.855154 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:20.855137)
>> 2012-04-04 22:17:40.855178 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20593.899174 seconds old
>> 2012-04-04 22:17:41.855368 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:21.855350)
>> 2012-04-04 22:17:41.855409 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20594.899406 seconds old
>> 2012-04-04 22:17:42.855579 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:22.855562)
>> 2012-04-04 22:17:42.855605 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20595.899601 seconds old
>> 2012-04-04 22:17:43.855768 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:23.855751)
>> 2012-04-04 22:17:43.855793 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20596.899789 seconds old
>> 2012-04-04 22:17:44.628337 7fd55e6f3700 -- 84.201.161.48:0/22705
>> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
>> 22:17:44.628298) v2 because of no pipe on con 0x391d89b0
>> 2012-04-04 22:17:44.628373 7fd55e6f3700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:24.628363)
>> 2012-04-04 22:17:44.855969 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:24.855953)
>> 2012-04-04 22:17:44.856001 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20597.899997 seconds old
>> 2012-04-04 22:17:45.087874 7fd57a531700 heartbeat_map is_healthy
>> 'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
>> 2012-04-04 22:17:45.856191 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:25.856162)
>> 2012-04-04 22:17:45.856230 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20598.900227 seconds old
>> 2012-04-04 22:17:46.856388 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:26.856371)
>> 2012-04-04 22:17:46.856432 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20599.900428 seconds old
>> 2012-04-04 22:17:47.856603 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:27.856586)
>> 2012-04-04 22:17:47.856628 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20600.900625 seconds old
>> 2012-04-04 22:17:48.128570 7fd55e6f3700 -- 84.201.161.48:0/22705
>> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
>> 22:17:48.128545) v2 because of no pipe on con 0x391d89b0
>> 2012-04-04 22:17:48.128595 7fd55e6f3700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:28.128587)
>> 2012-04-04 22:17:48.856819 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:28.856791)
>> 2012-04-04 22:17:48.856858 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20601.900854 seconds old
>> 2012-04-04 22:17:49.857044 7fd575527700 osd.48 1910 heartbeat_check:
>> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
>> 2012-04-04 22:17:29.857019)
>> 2012-04-04 22:17:49.857080 7fd575527700 --OSD::tracker--
>> ops_in_flight.size: 9; oldest is 20602.901076 seconds old
>> 2012-04-04 22:17:50.088035 7fd57a531700 heartbeat_map is_healthy
>> 'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
>>
>> 2012/4/2 Wido den Hollander<wido@widodh.nl>:
>>>
>>> Hi,
>>>
>>>
>>> On 04/02/2012 04:01 PM, Бородин Владимир wrote:
>>>>
>>>>
>>>> Hi all.
>>>>
>>>> I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
>>>> am trying to put into RADOS (through python API) 20 million objects 20
>>>> KB each. I have two problems:
>>>> 1. the speed is not as good as I expect (but that's not the main problem
>>>> now),
>>>> 2. after I put 10 million objects, OSDs started to mark itself down
>>>> and out. The logs give something like that:
>>>>
>>>> 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
>>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>>>> 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>>> 2012-04-02 17:04:58.87
>>>> 7752)
>>>> 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>>> 2012-04-02 17:04:59.57
>>>> 8079)
>>>> 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>>> 2012-04-02 17:05:00.67
>>>> 8421)
>>>> 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>>> 2012-04-02 17:05:01.67
>>>> 8751)
>>>> 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>>> 2012-04-02 17:05:02.57
>>>> 9069)
>>>> 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
>>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>>>> 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
>>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
>>>> common/HeartbeatMap.cc: In function 'bool
>>>> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
>>>> time_t)' thread 7f2d2a213700 time 2012-04-02 17:
>>>> 05:22.894637
>>>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>>>  ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>>> const*, long)+0x1fe) [0x7634ee]
>>>>  2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>>  3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>>  4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>>  5: (()+0x69ca) [0x7f2d2beab9ca]
>>>>  6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>>>  ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>>> const*, long)+0x1fe) [0x7634ee]
>>>>  2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>>  3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>>  4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>>  5: (()+0x69ca) [0x7f2d2beab9ca]
>>>>  6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>>> *** Caught signal (Aborted) **
>>>>  in thread 7f2d2a213700
>>>>  ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>>  1: /usr/bin/ceph-osd() [0x661cb1]
>>>>  2: (()+0xf8f0) [0x7f2d2beb48f0]
>>>>  3: (gsignal()+0x35) [0x7f2d2a449a75]
>>>>  4: (abort()+0x180) [0x7f2d2a44d5c0]
>>>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
>>>>  6: (()+0xb7736) [0x7f2d2acea736]
>>>>  7: (()+0xb7763) [0x7f2d2acea763]
>>>>  8: (()+0xb785e) [0x7f2d2acea85e]
>>>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>>> const*)+0x841) [0x667541]
>>>>  10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>>> const*, long)+0x1fe) [0x7634ee]
>>>>  11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>>  12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>>  13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>>  14: (()+0x69ca) [0x7f2d2beab9ca]
>>>>  15: (clone()+0x6d) [0x7f2d2a4fccdd]
>>>>
>>>> Or something like that:
>>>
>>>
>>>
>>> I reported this one a couple of days ago:
>>> http://tracker.newdream.net/issues/2219
>>>
>>>
>>>>
>>>> ...
>>>> 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
>>>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>>>> 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
>>>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>>>> 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
>>>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>>>> 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
>>>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>>>> 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
>>>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
>>>> we reset (peer sent cseq 2), sending RESETSESSION
>>>> 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
>>>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
>>>> with nothing to send, going to standby
>>>> 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
>>>> dropping, sharing map
>>>> 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
>>>> 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
>>>> with nothing to send, going to standby
>>>> 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
>>>> 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
>>>> initiating reconnect
>>>> 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
>>>> 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
>>>> with nothing to send, going to standby
>>>> 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
>>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
>>>> with nothing to send, going to standby
>>>> 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
>>>> 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
>>>> with nothing to send, going to standby
>>>> 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
>>>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
>>>> with nothing to send, going to standby
>>>> 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
>>>> 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
>>>> initiating reconnect
>>>> 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
>>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>>> 2012-04-02 17:20:34.798943)
>>>> ...
>>>> 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
>>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>>> 2012-04-02 17:20:36.846704)
>>>> 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
>>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
>>>> l=0).connect got RESETSESSION
>>>> 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
>>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>>> 2012-04-02 17:21:51.678001)
>>>> ...
>>>> 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
>>>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
>>>> l=0).connect got RESETSESSION
>>>> 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
>>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
>>>> initiating reconnect
>>>> 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
>>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
>>>> l=0).connect got RESETSESSION
>>>> 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
>>>> marked me down or wrong addr
>>>>
>>>> In first case the OSD process terminates, it is marked out by the
>>>> cluster and re-replication is started. In second case the OSD resumes
>>>> over several time and then goes down again etc...
>>>>
>>>> It seems that the problem is that the OSDs don't send/receive
>>>> heartbeat-messages, but why? The network seems to be good and time is
>>>> synchronized. The problem appeared only when I put a lot of objects
>>>> (and took more than a half of free space availible). I can give the
>>>> config file and any logs if needed.
>>>
>>>
>>>
>>> Last week new heartbeat code was merged into the master tree. The up/down
>>> going of OSDs was something I was seeing over the last months.
>>>
>>> Sage wrote some new code, it might be worth trying that.
>>>
>>> Since I'm running that code I haven't seen any suicides, so it could be
>>> related.
>>>
>>> It's this commit:
>>>
>>> https://github.com/ceph/ceph/commit/374bef9c97266600b4c6b83100485d7250363213
>>>
>>> Could you try to build from git?
>>>
>>> git clone https://github.com/ceph/ceph.git
>>> cd ceph
>>> git submodule init
>>> git submodule update
>>> dpkg-buildpackage (if you are using Debian).
>>>
>>> Wido
>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OSD suicide
  2012-04-02 14:01 OSD suicide Бородин Владимир
  2012-04-02 14:55 ` Wido den Hollander
@ 2012-04-02 20:14 ` Stefan Kleijkers
  2012-04-03 17:16   ` Borodin Vladimir
  2012-04-02 20:18 ` Stefan Kleijkers
  2 siblings, 1 reply; 10+ messages in thread
From: Stefan Kleijkers @ 2012-04-02 20:14 UTC (permalink / raw)
  To: Бородин Владимир
  Cc: ceph-devel

Hello,

A while back I had the same errors you are seeing. I had these problems 
only when using mdraid. After doing IO for some time the IO stalled and 
in most cases if you look at the cepg-osd daemon it's in D mode (waiting 
for IO). Also if you look with top you notice a very high load and IO wait.

I didn't find out what the exact reason was, but I stopped using mdraid 
and use a setup with an OSD per disk and the disks formatted with XFS. 
This gave me the best stability and performance.

Stefan

On 04/02/2012 04:01 PM, Бородин Владимир wrote:
> Hi all.
>
> I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
> am trying to put into RADOS (through python API) 20 million objects 20
> KB each. I have two problems:
> 1. the speed is not as good as I expect (but that's not the main problem now),
> 2. after I put 10 million objects, OSDs started to mark itself down
> and out. The logs give something like that:
>
> 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
> 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:04:58.87
> 7752)
> 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:04:59.57
> 8079)
> 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:05:00.67
> 8421)
> 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:05:01.67
> 8751)
> 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:05:02.57
> 9069)
> 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
> 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
> common/HeartbeatMap.cc: In function 'bool
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
> time_t)' thread 7f2d2a213700 time 2012-04-02 17:
> 05:22.894637
> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x1fe) [0x7634ee]
>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>   5: (()+0x69ca) [0x7f2d2beab9ca]
>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x1fe) [0x7634ee]
>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>   5: (()+0x69ca) [0x7f2d2beab9ca]
>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
> *** Caught signal (Aborted) **
>   in thread 7f2d2a213700
>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>   1: /usr/bin/ceph-osd() [0x661cb1]
>   2: (()+0xf8f0) [0x7f2d2beb48f0]
>   3: (gsignal()+0x35) [0x7f2d2a449a75]
>   4: (abort()+0x180) [0x7f2d2a44d5c0]
>   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
>   6: (()+0xb7736) [0x7f2d2acea736]
>   7: (()+0xb7763) [0x7f2d2acea763]
>   8: (()+0xb785e) [0x7f2d2acea85e]
>   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x841) [0x667541]
>   10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x1fe) [0x7634ee]
>   11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>   12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>   13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>   14: (()+0x69ca) [0x7f2d2beab9ca]
>   15: (clone()+0x6d) [0x7f2d2a4fccdd]
>
> Or something like that:
>
> ...
> 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
> 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
> 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
> 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
> 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
> we reset (peer sent cseq 2), sending RESETSESSION
> 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
> dropping, sharing map
> 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
> 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
> 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
> initiating reconnect
> 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
> 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
> 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
> 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
> initiating reconnect
> 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> 2012-04-02 17:20:34.798943)
> ...
> 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> 2012-04-02 17:20:36.846704)
> 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
> l=0).connect got RESETSESSION
> 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> 2012-04-02 17:21:51.678001)
> ...
> 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
> l=0).connect got RESETSESSION
> 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
> initiating reconnect
> 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
> l=0).connect got RESETSESSION
> 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
> marked me down or wrong addr
>
> In first case the OSD process terminates, it is marked out by the
> cluster and re-replication is started. In second case the OSD resumes
> over several time and then goes down again etc...
>
> It seems that the problem is that the OSDs don't send/receive
> heartbeat-messages, but why? The network seems to be good and time is
> synchronized. The problem appeared only when I put a lot of objects
> (and took more than a half of free space availible). I can give the
> config file and any logs if needed.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OSD suicide
  2012-04-02 20:14 ` Stefan Kleijkers
@ 2012-04-03 17:16   ` Borodin Vladimir
  2012-04-03 17:41     ` Stefan Kleijkers
  0 siblings, 1 reply; 10+ messages in thread
From: Borodin Vladimir @ 2012-04-03 17:16 UTC (permalink / raw)
  To: Stefan Kleijkers; +Cc: ceph-devel

Yes, Stefan. You are right. I'm not sure about the D state, but high
cpu usage is fact.
I do want to try an OSD per disk configuration but a bit later.

Thanks,
Vladimir.

2012/4/3 Stefan Kleijkers <stefan@unilogicnetworks.net>:
> Hello,
>
> A while back I had the same errors you are seeing. I had these problems only
> when using mdraid. After doing IO for some time the IO stalled and in most
> cases if you look at the cepg-osd daemon it's in D mode (waiting for IO).
> Also if you look with top you notice a very high load and IO wait.
>
> I didn't find out what the exact reason was, but I stopped using mdraid and
> use a setup with an OSD per disk and the disks formatted with XFS. This gave
> me the best stability and performance.
>
> Stefan
>
>
> On 04/02/2012 04:01 PM, Бородин Владимир wrote:
>>
>> Hi all.
>>
>> I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
>> am trying to put into RADOS (through python API) 20 million objects 20
>> KB each. I have two problems:
>> 1. the speed is not as good as I expect (but that's not the main problem
>> now),
>> 2. after I put 10 million objects, OSDs started to mark itself down
>> and out. The logs give something like that:
>>
>> 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>> 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:04:58.87
>> 7752)
>> 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:04:59.57
>> 8079)
>> 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:05:00.67
>> 8421)
>> 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:05:01.67
>> 8751)
>> 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>> 2012-04-02 17:05:02.57
>> 9069)
>> 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>> 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
>> common/HeartbeatMap.cc: In function 'bool
>> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
>> time_t)' thread 7f2d2a213700 time 2012-04-02 17:
>> 05:22.894637
>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>  ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>> const*, long)+0x1fe) [0x7634ee]
>>  2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>  3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>  4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>  5: (()+0x69ca) [0x7f2d2beab9ca]
>>  6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>  ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>  1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>> const*, long)+0x1fe) [0x7634ee]
>>  2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>  3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>  4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>  5: (()+0x69ca) [0x7f2d2beab9ca]
>>  6: (clone()+0x6d) [0x7f2d2a4fccdd]
>> *** Caught signal (Aborted) **
>>  in thread 7f2d2a213700
>>  ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>  1: /usr/bin/ceph-osd() [0x661cb1]
>>  2: (()+0xf8f0) [0x7f2d2beb48f0]
>>  3: (gsignal()+0x35) [0x7f2d2a449a75]
>>  4: (abort()+0x180) [0x7f2d2a44d5c0]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
>>  6: (()+0xb7736) [0x7f2d2acea736]
>>  7: (()+0xb7763) [0x7f2d2acea763]
>>  8: (()+0xb785e) [0x7f2d2acea85e]
>>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x841) [0x667541]
>>  10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>> const*, long)+0x1fe) [0x7634ee]
>>  11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>  12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>  13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>  14: (()+0x69ca) [0x7f2d2beab9ca]
>>  15: (clone()+0x6d) [0x7f2d2a4fccdd]
>>
>> Or something like that:
>>
>> ...
>> 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>> 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>> 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>> 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>> 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
>> we reset (peer sent cseq 2), sending RESETSESSION
>> 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
>> dropping, sharing map
>> 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
>> 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
>> 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
>> initiating reconnect
>> 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
>> 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
>> with nothing to send, going to standby
>> 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
>> 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
>> initiating reconnect
>> 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>> 2012-04-02 17:20:34.798943)
>> ...
>> 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>> 2012-04-02 17:20:36.846704)
>> 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
>> l=0).connect got RESETSESSION
>> 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>> 2012-04-02 17:21:51.678001)
>> ...
>> 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
>> l=0).connect got RESETSESSION
>> 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
>> initiating reconnect
>> 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
>> l=0).connect got RESETSESSION
>> 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
>> marked me down or wrong addr
>>
>> In first case the OSD process terminates, it is marked out by the
>> cluster and re-replication is started. In second case the OSD resumes
>> over several time and then goes down again etc...
>>
>> It seems that the problem is that the OSDs don't send/receive
>> heartbeat-messages, but why? The network seems to be good and time is
>> synchronized. The problem appeared only when I put a lot of objects
>> (and took more than a half of free space availible). I can give the
>> config file and any logs if needed.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OSD suicide
  2012-04-03 17:16   ` Borodin Vladimir
@ 2012-04-03 17:41     ` Stefan Kleijkers
  2012-04-03 17:49       ` Sage Weil
  0 siblings, 1 reply; 10+ messages in thread
From: Stefan Kleijkers @ 2012-04-03 17:41 UTC (permalink / raw)
  To: Borodin Vladimir; +Cc: ceph-devel

Hello Vladimir,

Well in that case you could try BTRFS. With BTRFS it's possible to grab 
all the disks in a node together in a RAID0/RAID1/RAID10 configuration. 
So you can run one or a few OSDs per node. But I would recommend the 
newest kernel possible. I haven't tried with the 3.3 range, but with the 
early 3.2.x kernels I got BTRFS crashes. And with the later 3.2.x 
kernels I saw a real slowdown after some time.

If you get it stabilised with the mdraid, please let me know, I'm still 
interested in that setup. With the current setup I have the problem that 
with a disk crash in most cases I can't umount the filesystem anymore 
and I have to reboot the node. I would like to avoid that and with 
mdraid it's possible to swap a disk without bringing the system down.

Stefan

On 04/03/2012 07:16 PM, Borodin Vladimir wrote:
> Yes, Stefan. You are right. I'm not sure about the D state, but high
> cpu usage is fact.
> I do want to try an OSD per disk configuration but a bit later.
>
> Thanks,
> Vladimir.
>
> 2012/4/3 Stefan Kleijkers<stefan@unilogicnetworks.net>:
>> Hello,
>>
>> A while back I had the same errors you are seeing. I had these problems only
>> when using mdraid. After doing IO for some time the IO stalled and in most
>> cases if you look at the cepg-osd daemon it's in D mode (waiting for IO).
>> Also if you look with top you notice a very high load and IO wait.
>>
>> I didn't find out what the exact reason was, but I stopped using mdraid and
>> use a setup with an OSD per disk and the disks formatted with XFS. This gave
>> me the best stability and performance.
>>
>> Stefan
>>
>>
>> On 04/02/2012 04:01 PM, Бородин Владимир wrote:
>>> Hi all.
>>>
>>> I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
>>> am trying to put into RADOS (through python API) 20 million objects 20
>>> KB each. I have two problems:
>>> 1. the speed is not as good as I expect (but that's not the main problem
>>> now),
>>> 2. after I put 10 million objects, OSDs started to mark itself down
>>> and out. The logs give something like that:
>>>
>>> 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>>> 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:04:58.87
>>> 7752)
>>> 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:04:59.57
>>> 8079)
>>> 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:05:00.67
>>> 8421)
>>> 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:05:01.67
>>> 8751)
>>> 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:05:02.57
>>> 9069)
>>> 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>>> 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
>>> common/HeartbeatMap.cc: In function 'bool
>>> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
>>> time_t)' thread 7f2d2a213700 time 2012-04-02 17:
>>> 05:22.894637
>>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>> const*, long)+0x1fe) [0x7634ee]
>>>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>   5: (()+0x69ca) [0x7f2d2beab9ca]
>>>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>> const*, long)+0x1fe) [0x7634ee]
>>>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>   5: (()+0x69ca) [0x7f2d2beab9ca]
>>>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>> *** Caught signal (Aborted) **
>>>   in thread 7f2d2a213700
>>>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>   1: /usr/bin/ceph-osd() [0x661cb1]
>>>   2: (()+0xf8f0) [0x7f2d2beb48f0]
>>>   3: (gsignal()+0x35) [0x7f2d2a449a75]
>>>   4: (abort()+0x180) [0x7f2d2a44d5c0]
>>>   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
>>>   6: (()+0xb7736) [0x7f2d2acea736]
>>>   7: (()+0xb7763) [0x7f2d2acea763]
>>>   8: (()+0xb785e) [0x7f2d2acea85e]
>>>   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x841) [0x667541]
>>>   10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>> const*, long)+0x1fe) [0x7634ee]
>>>   11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>   12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>   13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>   14: (()+0x69ca) [0x7f2d2beab9ca]
>>>   15: (clone()+0x6d) [0x7f2d2a4fccdd]
>>>
>>> Or something like that:
>>>
>>> ...
>>> 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>>> 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>>> 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
>>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>>> 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
>>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>>> 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
>>> we reset (peer sent cseq 2), sending RESETSESSION
>>> 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
>>> dropping, sharing map
>>> 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
>>> 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
>>> 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
>>> initiating reconnect
>>> 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
>>> 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
>>> 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
>>> initiating reconnect
>>> 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>> 2012-04-02 17:20:34.798943)
>>> ...
>>> 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>> 2012-04-02 17:20:36.846704)
>>> 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
>>> l=0).connect got RESETSESSION
>>> 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>> 2012-04-02 17:21:51.678001)
>>> ...
>>> 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
>>> l=0).connect got RESETSESSION
>>> 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
>>> initiating reconnect
>>> 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
>>> l=0).connect got RESETSESSION
>>> 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
>>> marked me down or wrong addr
>>>
>>> In first case the OSD process terminates, it is marked out by the
>>> cluster and re-replication is started. In second case the OSD resumes
>>> over several time and then goes down again etc...
>>>
>>> It seems that the problem is that the OSDs don't send/receive
>>> heartbeat-messages, but why? The network seems to be good and time is
>>> synchronized. The problem appeared only when I put a lot of objects
>>> (and took more than a half of free space availible). I can give the
>>> config file and any logs if needed.
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OSD suicide
  2012-04-03 17:41     ` Stefan Kleijkers
@ 2012-04-03 17:49       ` Sage Weil
  0 siblings, 0 replies; 10+ messages in thread
From: Sage Weil @ 2012-04-03 17:49 UTC (permalink / raw)
  To: Stefan Kleijkers; +Cc: Borodin Vladimir, ceph-devel

On Tue, 3 Apr 2012, Stefan Kleijkers wrote:
> Hello Vladimir,
> 
> Well in that case you could try BTRFS. With BTRFS it's possible to grab all
> the disks in a node together in a RAID0/RAID1/RAID10 configuration. So you can
> run one or a few OSDs per node. But I would recommend the newest kernel
> possible. I haven't tried with the 3.3 range, but with the early 3.2.x kernels
> I got BTRFS crashes. And with the later 3.2.x kernels I saw a real slowdown
> after some time.

I should mention that the large metadata blocks were just sent upstream 
and merged for 3.4-rc1.  If you specify a larger metadata block size at 
mkfs.btrfs time (I've been told 16 KB seems to work well) the slowdowns 
should go away.  (We haven't verified this yet ourselves.)

It's probably also worth mentioning that these patches all went into the 
SLES 11 SP2 kernel (based on 3.0).

The btrfs tree is also based on 3.0, so you should be able to merge it 
into any kernel since without pain.

sage


> If you get it stabilised with the mdraid, please let me know, I'm still
> interested in that setup. With the current setup I have the problem that with
> a disk crash in most cases I can't umount the filesystem anymore and I have to
> reboot the node. I would like to avoid that and with mdraid it's possible to
> swap a disk without bringing the system down.
> 
> Stefan
> 
> On 04/03/2012 07:16 PM, Borodin Vladimir wrote:
> > Yes, Stefan. You are right. I'm not sure about the D state, but high
> > cpu usage is fact.
> > I do want to try an OSD per disk configuration but a bit later.
> > 
> > Thanks,
> > Vladimir.
> > 
> > 2012/4/3 Stefan Kleijkers<stefan@unilogicnetworks.net>:
> > > Hello,
> > > 
> > > A while back I had the same errors you are seeing. I had these problems
> > > only
> > > when using mdraid. After doing IO for some time the IO stalled and in most
> > > cases if you look at the cepg-osd daemon it's in D mode (waiting for IO).
> > > Also if you look with top you notice a very high load and IO wait.
> > > 
> > > I didn't find out what the exact reason was, but I stopped using mdraid
> > > and
> > > use a setup with an OSD per disk and the disks formatted with XFS. This
> > > gave
> > > me the best stability and performance.
> > > 
> > > Stefan
> > > 
> > > 
> > > On 04/02/2012 04:01 PM, ??????? ???????? wrote:
> > > > Hi all.
> > > > 
> > > > I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
> > > > am trying to put into RADOS (through python API) 20 million objects 20
> > > > KB each. I have two problems:
> > > > 1. the speed is not as good as I expect (but that's not the main problem
> > > > now),
> > > > 2. after I put 10 million objects, OSDs started to mark itself down
> > > > and out. The logs give something like that:
> > > > 
> > > > 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
> > > > 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
> > > > 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> > > > no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> > > > 2012-04-02 17:04:58.87
> > > > 7752)
> > > > 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> > > > no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> > > > 2012-04-02 17:04:59.57
> > > > 8079)
> > > > 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> > > > no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> > > > 2012-04-02 17:05:00.67
> > > > 8421)
> > > > 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> > > > no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> > > > 2012-04-02 17:05:01.67
> > > > 8751)
> > > > 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> > > > no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> > > > 2012-04-02 17:05:02.57
> > > > 9069)
> > > > 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
> > > > 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
> > > > 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
> > > > 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
> > > > common/HeartbeatMap.cc: In function 'bool
> > > > ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
> > > > time_t)' thread 7f2d2a213700 time 2012-04-02 17:
> > > > 05:22.894637
> > > > common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
> > > >   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
> > > >   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> > > > const*, long)+0x1fe) [0x7634ee]
> > > >   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
> > > >   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
> > > >   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
> > > >   5: (()+0x69ca) [0x7f2d2beab9ca]
> > > >   6: (clone()+0x6d) [0x7f2d2a4fccdd]
> > > >   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
> > > >   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> > > > const*, long)+0x1fe) [0x7634ee]
> > > >   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
> > > >   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
> > > >   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
> > > >   5: (()+0x69ca) [0x7f2d2beab9ca]
> > > >   6: (clone()+0x6d) [0x7f2d2a4fccdd]
> > > > *** Caught signal (Aborted) **
> > > >   in thread 7f2d2a213700
> > > >   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
> > > >   1: /usr/bin/ceph-osd() [0x661cb1]
> > > >   2: (()+0xf8f0) [0x7f2d2beb48f0]
> > > >   3: (gsignal()+0x35) [0x7f2d2a449a75]
> > > >   4: (abort()+0x180) [0x7f2d2a44d5c0]
> > > >   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
> > > >   6: (()+0xb7736) [0x7f2d2acea736]
> > > >   7: (()+0xb7763) [0x7f2d2acea763]
> > > >   8: (()+0xb785e) [0x7f2d2acea85e]
> > > >   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > > const*)+0x841) [0x667541]
> > > >   10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> > > > const*, long)+0x1fe) [0x7634ee]
> > > >   11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
> > > >   12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
> > > >   13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
> > > >   14: (()+0x69ca) [0x7f2d2beab9ca]
> > > >   15: (clone()+0x6d) [0x7f2d2a4fccdd]
> > > > 
> > > > Or something like that:
> > > > 
> > > > ...
> > > > 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
> > > > 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
> > > > 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
> > > > 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
> > > > 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
> > > > 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
> > > > 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
> > > > 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
> > > > 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
> > > > 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
> > > > we reset (peer sent cseq 2), sending RESETSESSION
> > > > 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
> > > > 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
> > > > with nothing to send, going to standby
> > > > 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
> > > > dropping, sharing map
> > > > 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
> > > > 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
> > > > with nothing to send, going to standby
> > > > 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
> > > > 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
> > > > initiating reconnect
> > > > 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
> > > > 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
> > > > with nothing to send, going to standby
> > > > 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
> > > > 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
> > > > with nothing to send, going to standby
> > > > 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
> > > > 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
> > > > with nothing to send, going to standby
> > > > 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
> > > > 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
> > > > with nothing to send, going to standby
> > > > 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
> > > > 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
> > > > initiating reconnect
> > > > 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
> > > > no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> > > > 2012-04-02 17:20:34.798943)
> > > > ...
> > > > 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
> > > > no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> > > > 2012-04-02 17:20:36.846704)
> > > > 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
> > > > 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
> > > > l=0).connect got RESETSESSION
> > > > 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
> > > > no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> > > > 2012-04-02 17:21:51.678001)
> > > > ...
> > > > 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
> > > > 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
> > > > l=0).connect got RESETSESSION
> > > > 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
> > > > 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
> > > > initiating reconnect
> > > > 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
> > > > 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
> > > > l=0).connect got RESETSESSION
> > > > 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
> > > > marked me down or wrong addr
> > > > 
> > > > In first case the OSD process terminates, it is marked out by the
> > > > cluster and re-replication is started. In second case the OSD resumes
> > > > over several time and then goes down again etc...
> > > > 
> > > > It seems that the problem is that the OSDs don't send/receive
> > > > heartbeat-messages, but why? The network seems to be good and time is
> > > > synchronized. The problem appeared only when I put a lot of objects
> > > > (and took more than a half of free space availible). I can give the
> > > > config file and any logs if needed.
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: OSD suicide
  2012-04-02 14:01 OSD suicide Бородин Владимир
  2012-04-02 14:55 ` Wido den Hollander
  2012-04-02 20:14 ` Stefan Kleijkers
@ 2012-04-02 20:18 ` Stefan Kleijkers
  2 siblings, 0 replies; 10+ messages in thread
From: Stefan Kleijkers @ 2012-04-02 20:18 UTC (permalink / raw)
  To: Бородин Владимир
  Cc: ceph-devel

Hello,

A while back I had the same errors you are seeing. I had these problems 
only when using mdraid. After doing IO for some time the IO stalled and 
in most cases if you look at the cepg-osd daemon it's in D mode (waiting 
for IO). Also if you look with top you notice a very high load and IO wait.

I didn't find out what the exact reason was, but I stopped using mdraid 
and use a setup with an OSD per disk and the disks formatted with XFS. 
This gave me the best stability and performance.

Stefan

On 04/02/2012 04:01 PM, Бородин Владимир wrote:
> Hi all.
>
> I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
> am trying to put into RADOS (through python API) 20 million objects 20
> KB each. I have two problems:
> 1. the speed is not as good as I expect (but that's not the main problem now),
> 2. after I put 10 million objects, OSDs started to mark itself down
> and out. The logs give something like that:
>
> 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
> 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:04:58.87
> 7752)
> 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:04:59.57
> 8079)
> 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:05:00.67
> 8421)
> 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:05:01.67
> 8751)
> 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
> 2012-04-02 17:05:02.57
> 9069)
> 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
> 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
> common/HeartbeatMap.cc: In function 'bool
> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
> time_t)' thread 7f2d2a213700 time 2012-04-02 17:
> 05:22.894637
> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x1fe) [0x7634ee]
>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>   5: (()+0x69ca) [0x7f2d2beab9ca]
>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x1fe) [0x7634ee]
>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>   5: (()+0x69ca) [0x7f2d2beab9ca]
>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
> *** Caught signal (Aborted) **
>   in thread 7f2d2a213700
>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>   1: /usr/bin/ceph-osd() [0x661cb1]
>   2: (()+0xf8f0) [0x7f2d2beb48f0]
>   3: (gsignal()+0x35) [0x7f2d2a449a75]
>   4: (abort()+0x180) [0x7f2d2a44d5c0]
>   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
>   6: (()+0xb7736) [0x7f2d2acea736]
>   7: (()+0xb7763) [0x7f2d2acea763]
>   8: (()+0xb785e) [0x7f2d2acea85e]
>   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x841) [0x667541]
>   10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
> const*, long)+0x1fe) [0x7634ee]
>   11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>   12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>   13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>   14: (()+0x69ca) [0x7f2d2beab9ca]
>   15: (clone()+0x6d) [0x7f2d2a4fccdd]
>
> Or something like that:
>
> ...
> 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
> 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
> 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
> 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
> 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
> we reset (peer sent cseq 2), sending RESETSESSION
> 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
> dropping, sharing map
> 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
> 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
> 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
> initiating reconnect
> 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
> 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
> 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
> with nothing to send, going to standby
> 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
> 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
> initiating reconnect
> 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> 2012-04-02 17:20:34.798943)
> ...
> 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> 2012-04-02 17:20:36.846704)
> 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
> l=0).connect got RESETSESSION
> 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
> 2012-04-02 17:21:51.678001)
> ...
> 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
> l=0).connect got RESETSESSION
> 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
> initiating reconnect
> 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
> l=0).connect got RESETSESSION
> 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
> marked me down or wrong addr
>
> In first case the OSD process terminates, it is marked out by the
> cluster and re-replication is started. In second case the OSD resumes
> over several time and then goes down again etc...
>
> It seems that the problem is that the OSDs don't send/receive
> heartbeat-messages, but why? The network seems to be good and time is
> synchronized. The problem appeared only when I put a lot of objects
> (and took more than a half of free space availible). I can give the
> config file and any logs if needed.
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2012-04-05 14:46 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-04-02 14:01 OSD suicide Бородин Владимир
2012-04-02 14:55 ` Wido den Hollander
2012-04-04 18:26   ` Borodin Vladimir
2012-04-05 13:05     ` Wido den Hollander
2012-04-05 14:46       ` Borodin Vladimir
2012-04-02 20:14 ` Stefan Kleijkers
2012-04-03 17:16   ` Borodin Vladimir
2012-04-03 17:41     ` Stefan Kleijkers
2012-04-03 17:49       ` Sage Weil
2012-04-02 20:18 ` Stefan Kleijkers

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.