Re: OSD suicide - Wido den Hollander

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Wido den Hollander <wido@widodh.nl>
To: Borodin Vladimir <v.a.borodin@gmail.com>
Cc: ceph-devel@vger.kernel.org
Subject: Re: OSD suicide
Date: Thu, 05 Apr 2012 15:05:10 +0200	[thread overview]
Message-ID: <4F7D9886.20209@widodh.nl> (raw)
In-Reply-To: <CAGEBRh2hx6zSU3pfM+vmmoiE4iH7r7tD_aeFEhnic=gobfrsEQ@mail.gmail.com>

Hi,

On 04/04/2012 08:26 PM, Borodin Vladimir wrote:
> Hello Wido,
>
> Yesterday I built from git: $ ceph -v
> ceph version 0.44.1-149-ge80126e
> (commit:e80126ea689e9a972fbf09e8848fc4a2ade13c59)

That commit should contain the new heartbeat code.

>
> The messages are a bit different but the problem seems to be the same.
> Below is a part of the osd.48.log. Should I turn debug osd mode more
> verbose or give any other information?

Are the other OSDs also saying that osd.49 is down? Right now it seems 
to me that osd.49 is really down. The "bouncing" I was seeing before 
caused a OSD to mark multiple OSDs down at once, it wouldn't complain 
over and over about the same OSD.

Are you still running XFS with mdraid under the OSDs?

Wido

>
> Regards,
> Vladimir.
>
> 2012-04-04 22:17:35.854156 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:15.854125)
> 2012-04-04 22:17:35.854194 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20588.898191 seconds old
> 2012-04-04 22:17:36.854360 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:16.854343)
> 2012-04-04 22:17:36.854405 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20589.898402 seconds old
> 2012-04-04 22:17:37.027882 7fd55e6f3700 -- 84.201.161.48:0/22705
> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
> 22:17:37.027859) v2 because of no pipe on con 0x391d89b0
> 2012-04-04 22:17:37.027908 7fd55e6f3700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:17.027899)
> 2012-04-04 22:17:37.854569 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:17.854552)
> 2012-04-04 22:17:37.854593 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20590.898589 seconds old
> 2012-04-04 22:17:38.854768 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:18.854750)
> 2012-04-04 22:17:38.854795 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20591.898792 seconds old
> 2012-04-04 22:17:39.854960 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:19.854943)
> 2012-04-04 22:17:39.854986 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20592.898982 seconds old
> 2012-04-04 22:17:39.928092 7fd55e6f3700 -- 84.201.161.48:0/22705
> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
> 22:17:39.928071) v2 because of no pipe on con 0x391d89b0
> 2012-04-04 22:17:39.928117 7fd55e6f3700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:19.928107)
> 2012-04-04 22:17:40.087719 7fd57a531700 heartbeat_map is_healthy
> 'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
> 2012-04-04 22:17:40.855154 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:20.855137)
> 2012-04-04 22:17:40.855178 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20593.899174 seconds old
> 2012-04-04 22:17:41.855368 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:21.855350)
> 2012-04-04 22:17:41.855409 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20594.899406 seconds old
> 2012-04-04 22:17:42.855579 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:22.855562)
> 2012-04-04 22:17:42.855605 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20595.899601 seconds old
> 2012-04-04 22:17:43.855768 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:23.855751)
> 2012-04-04 22:17:43.855793 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20596.899789 seconds old
> 2012-04-04 22:17:44.628337 7fd55e6f3700 -- 84.201.161.48:0/22705
> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
> 22:17:44.628298) v2 because of no pipe on con 0x391d89b0
> 2012-04-04 22:17:44.628373 7fd55e6f3700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:24.628363)
> 2012-04-04 22:17:44.855969 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:24.855953)
> 2012-04-04 22:17:44.856001 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20597.899997 seconds old
> 2012-04-04 22:17:45.087874 7fd57a531700 heartbeat_map is_healthy
> 'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
> 2012-04-04 22:17:45.856191 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:25.856162)
> 2012-04-04 22:17:45.856230 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20598.900227 seconds old
> 2012-04-04 22:17:46.856388 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:26.856371)
> 2012-04-04 22:17:46.856432 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20599.900428 seconds old
> 2012-04-04 22:17:47.856603 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:27.856586)
> 2012-04-04 22:17:47.856628 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20600.900625 seconds old
> 2012-04-04 22:17:48.128570 7fd55e6f3700 -- 84.201.161.48:0/22705
> send_message dropped message osd_ping(ping e1910 stamp 2012-04-04
> 22:17:48.128545) v2 because of no pipe on con 0x391d89b0
> 2012-04-04 22:17:48.128595 7fd55e6f3700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:28.128587)
> 2012-04-04 22:17:48.856819 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:28.856791)
> 2012-04-04 22:17:48.856858 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20601.900854 seconds old
> 2012-04-04 22:17:49.857044 7fd575527700 osd.48 1910 heartbeat_check:
> no reply from osd.49 since 2012-04-04 19:33:38.761585 (cutoff
> 2012-04-04 22:17:29.857019)
> 2012-04-04 22:17:49.857080 7fd575527700 --OSD::tracker--
> ops_in_flight.size: 9; oldest is 20602.901076 seconds old
> 2012-04-04 22:17:50.088035 7fd57a531700 heartbeat_map is_healthy
> 'OSD::disk_tp thread 0x7fd55f6f5700' had timed out after 60
>
> 2012/4/2 Wido den Hollander<wido@widodh.nl>:
>> Hi,
>>
>>
>> On 04/02/2012 04:01 PM, Бородин Владимир wrote:
>>>
>>> Hi all.
>>>
>>> I have a cluster with 4 OSDs (mdRAID10 on each of them with XFS) and I
>>> am trying to put into RADOS (through python API) 20 million objects 20
>>> KB each. I have two problems:
>>> 1. the speed is not as good as I expect (but that's not the main problem
>>> now),
>>> 2. after I put 10 million objects, OSDs started to mark itself down
>>> and out. The logs give something like that:
>>>
>>> 2012-04-02 17:05:17.894395 7f2d2a213700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>>> 2012-04-02 17:05:18.877781 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:04:58.87
>>> 7752)
>>> 2012-04-02 17:05:19.578112 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:04:59.57
>>> 8079)
>>> 2012-04-02 17:05:20.678455 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:05:00.67
>>> 8421)
>>> 2012-04-02 17:05:21.678785 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:05:01.67
>>> 8751)
>>> 2012-04-02 17:05:22.579101 7f2d1a8f3700 osd.47 1673 heartbeat_check:
>>> no heartbeat from osd.49 since 2012-04-02 17:02:49.217108 (cutoff
>>> 2012-04-02 17:05:02.57
>>> 9069)
>>> 2012-04-02 17:05:22.894568 7f2d2a213700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had timed out after 30
>>> 2012-04-02 17:05:22.894601 7f2d2a213700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f2d1d0f8700' had suicide timed out after 300
>>> common/HeartbeatMap.cc: In function 'bool
>>> ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*,
>>> time_t)' thread 7f2d2a213700 time 2012-04-02 17:
>>> 05:22.894637
>>> common/HeartbeatMap.cc: 78: FAILED assert(0 == "hit suicide timeout")
>>>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>> const*, long)+0x1fe) [0x7634ee]
>>>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>   5: (()+0x69ca) [0x7f2d2beab9ca]
>>>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>   1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>> const*, long)+0x1fe) [0x7634ee]
>>>   2: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>   3: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>   4: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>   5: (()+0x69ca) [0x7f2d2beab9ca]
>>>   6: (clone()+0x6d) [0x7f2d2a4fccdd]
>>> *** Caught signal (Aborted) **
>>>   in thread 7f2d2a213700
>>>   ceph version 0.44.1 (commit:c89b7f22c8599eb974e75a2f7a5f855358199dee)
>>>   1: /usr/bin/ceph-osd() [0x661cb1]
>>>   2: (()+0xf8f0) [0x7f2d2beb48f0]
>>>   3: (gsignal()+0x35) [0x7f2d2a449a75]
>>>   4: (abort()+0x180) [0x7f2d2a44d5c0]
>>>   5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f2d2acec58d]
>>>   6: (()+0xb7736) [0x7f2d2acea736]
>>>   7: (()+0xb7763) [0x7f2d2acea763]
>>>   8: (()+0xb785e) [0x7f2d2acea85e]
>>>   9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x841) [0x667541]
>>>   10: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char
>>> const*, long)+0x1fe) [0x7634ee]
>>>   11: (ceph::HeartbeatMap::is_healthy()+0x7f) [0x76381f]
>>>   12: (ceph::HeartbeatMap::check_touch_file()+0x20) [0x763a50]
>>>   13: (CephContextServiceThread::entry()+0x5f) [0x65a31f]
>>>   14: (()+0x69ca) [0x7f2d2beab9ca]
>>>   15: (clone()+0x6d) [0x7f2d2a4fccdd]
>>>
>>> Or something like that:
>>
>>
>> I reported this one a couple of days ago:
>> http://tracker.newdream.net/issues/2219
>>
>>
>>>
>>> ...
>>> 2012-04-02 17:01:38.673223 7f7855486700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>>> 2012-04-02 17:01:38.673267 7f7855486700 heartbeat_map is_healthy
>>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>>> 2012-04-02 17:01:38.833509 7f7847369700 heartbeat_map reset_timeout
>>> 'OSD::op_tp thread 0x7f7847369700' had timed out after 30
>>> 2012-04-02 17:01:39.031229 7f7847b6a700 heartbeat_map reset_timeout
>>> 'OSD::op_tp thread 0x7f7847b6a700' had timed out after 30
>>> 2012-04-02 17:02:06.971487 7f784324b700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=0 cs=0 l=0).accept
>>> we reset (peer sent cseq 2), sending RESETSESSION
>>> 2012-04-02 17:02:49.321812 7f784324b700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6802/15581 pipe(0x47197280 sd=50 pgs=53 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:03:11.677528 7f784a470700 osd.48 1675 from dead osd.49,
>>> dropping, sharing map
>>> 2012-04-02 17:05:26.355673 7f784344d700 -- 84.201.161.48:0/17442>>
>>> 84.201.161.47:6802/17587 pipe(0x1bb4280 sd=375 pgs=783 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:05:26.355767 7f7844059700 -- 84.201.161.48:6804/17442>>
>>> 84.201.161.47:0/17587 pipe(0x4ba5c500 sd=201 pgs=17 cs=1 l=0).fault
>>> initiating reconnect
>>> 2012-04-02 17:05:26.355936 7f7843b54700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.47:6801/17587 pipe(0x3ad97780 sd=36 pgs=138 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:43.624220 7f7843049700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=37 pgs=55 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:43.974073 7f784445d700 -- 84.201.161.48:6804/17442>>
>>> 84.201.161.49:0/15581 pipe(0x4be36000 sd=38 pgs=15 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:47.556758 7f7842c45700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=41 pgs=178 cs=1 l=0).fault
>>> with nothing to send, going to standby
>>> 2012-04-02 17:18:47.775391 7f7844760700 -- 84.201.161.48:6804/17442>>
>>> 84.201.161.50:0/7906 pipe(0x46f10280 sd=42 pgs=31 cs=1 l=0).fault
>>> initiating reconnect
>>> 2012-04-02 17:20:54.798971 7f784fc7b700 osd.48 1678 heartbeat_check:
>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>> 2012-04-02 17:20:34.798943)
>>> ...
>>> 2012-04-02 17:20:56.846736 7f784fc7b700 osd.48 1678 heartbeat_check:
>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>> 2012-04-02 17:20:36.846704)
>>> 2012-04-02 17:21:15.408175 7f7842b44700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=55 cs=2
>>> l=0).connect got RESETSESSION
>>> 2012-04-02 17:22:11.678030 7f784fc7b700 osd.48 1680 heartbeat_check:
>>> no heartbeat from osd.47 since 2012-04-02 17:05:22.574731 (cutoff
>>> 2012-04-02 17:21:51.678001)
>>> ...
>>> 2012-04-02 17:22:26.012018 7f7844b64700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.50:6802/7906 pipe(0x47197c80 sd=39 pgs=178 cs=2
>>> l=0).connect got RESETSESSION
>>> 2012-04-02 17:22:26.064256 7f7842d46700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=50 pgs=64 cs=1 l=0).fault
>>> initiating reconnect
>>> 2012-04-02 17:22:26.065367 7f7842b44700 -- 84.201.161.48:6803/17442>>
>>> 84.201.161.49:6804/15581 pipe(0x47197a00 sd=41 pgs=64 cs=2
>>> l=0).connect got RESETSESSION
>>> 2012-04-02 17:24:07.987587 7f784ac71700 log [WRN] : map e1706 wrongly
>>> marked me down or wrong addr
>>>
>>> In first case the OSD process terminates, it is marked out by the
>>> cluster and re-replication is started. In second case the OSD resumes
>>> over several time and then goes down again etc...
>>>
>>> It seems that the problem is that the OSDs don't send/receive
>>> heartbeat-messages, but why? The network seems to be good and time is
>>> synchronized. The problem appeared only when I put a lot of objects
>>> (and took more than a half of free space availible). I can give the
>>> config file and any logs if needed.
>>
>>
>> Last week new heartbeat code was merged into the master tree. The up/down
>> going of OSDs was something I was seeing over the last months.
>>
>> Sage wrote some new code, it might be worth trying that.
>>
>> Since I'm running that code I haven't seen any suicides, so it could be
>> related.
>>
>> It's this commit:
>> https://github.com/ceph/ceph/commit/374bef9c97266600b4c6b83100485d7250363213
>>
>> Could you try to build from git?
>>
>> git clone https://github.com/ceph/ceph.git
>> cd ceph
>> git submodule init
>> git submodule update
>> dpkg-buildpackage (if you are using Debian).
>>
>> Wido
>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

next prev parent reply	other threads:[~2012-04-05 13:05 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-02 14:01 OSD suicide Бородин Владимир
2012-04-02 14:55 ` Wido den Hollander
2012-04-04 18:26   ` Borodin Vladimir
2012-04-05 13:05     ` Wido den Hollander [this message]
2012-04-05 14:46       ` Borodin Vladimir
2012-04-02 20:14 ` Stefan Kleijkers
2012-04-03 17:16   ` Borodin Vladimir
2012-04-03 17:41     ` Stefan Kleijkers
2012-04-03 17:49       ` Sage Weil
2012-04-02 20:18 ` Stefan Kleijkers

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4F7D9886.20209@widodh.nl \
    --to=wido@widodh.nl \
    --cc=ceph-devel@vger.kernel.org \
    --cc=v.a.borodin@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.