From mboxrd@z Thu Jan 1 00:00:00 1970 From: Wido den Hollander Subject: Re: Repeated messages of "heartbeat_check: no heartbeat from" Date: Thu, 23 Feb 2012 11:45:39 +0100 Message-ID: <4F4618D3.4030501@widodh.nl> References: <1312546310.2754.41.camel@wido-laptop.pcextreme.nl> <1312972246.2742.7.camel@wido-desktop> <4F3BBD50.3030905@widodh.nl> Mime-Version: 1.0 Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: QUOTED-PRINTABLE Return-path: Received: from smtp02.mail.pcextreme.nl ([109.72.87.138]:34924 "EHLO smtp02.mail.pcextreme.nl" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753315Ab2BWKpr (ORCPT ); Thu, 23 Feb 2012 05:45:47 -0500 In-Reply-To: Sender: ceph-devel-owner@vger.kernel.org List-ID: To: Gregory Farnum Cc: ceph-devel@vger.kernel.org Hi, On 02/22/2012 07:08 PM, Gregory Farnum wrote: > Wido, > Sorry we lost track of this last week =97 we were all distracted by F= AST 12! :) > No problem! > So it looks like they're both on the same map and osd.4 is sending > pings to osd.19, but osd.19 is just ignoring them? Or do you really > have on debug_os and not debug_osd? :) That was a typo, I have debug_osd set to 20. I haven't rebooted the OSD's since and now osd.4 and osd.19 are not=20 complaining anymore, but it's now a different set of OSD's who are=20 saying the other one is down. I'm still running v0.41 btw. I'm not going to touch the cluster until=20 this one is tracked down, it keeps coming back. Suggestions? Wido > -Greg > > On Wed, Feb 15, 2012 at 6:12 AM, Wido den Hollander = wrote: >> Hi, >> >> >> On 08/10/2011 12:30 PM, Wido den Hollander wrote: >>> >>> Hi, >>> >>> I tried reproducing this today with the new heartbeat code, couldn'= t. >>> >>> The new heartbeat code seems to work much better under load, I have= n't >>> seen any bouncing like I saw before. >> >> >> I'm still observing this bouncing, after a couple of days running my= cluster >> the OSD's started to say they are down again. >> >> [INF] osd.19 [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6809/14435 faile= d (by >> osd.4 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6800/15389 >> >> [INF] osd.8 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6806/15035 failed= (by >> osd.25 [2a00:f10:11b:cef0:225:90ff:fe33:49ca]:6803/14344 >> >> [INF] osd.22 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6806/14527 faile= d (by >> osd.13 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6803/14188 >> >> It's the same again, a couple of OSD's don't like eachother anymore = and >> start marking the other as down. >> >> My ceph.conf: http://zooi.widodh.nl/ceph/ceph.conf >> >> I'm taking osd.19 and osd.4 here as example. >> >> Some logs on Pastebin: http://pastebin.com/EYUHbPua >> >> All systems are idle, so this can't be an issue of limited CPU cycle= s. >> >> As this seems to keep coming back, what steps should we take next? >> >> debug_ms and os are on 20. >> >> What I noticed is that osd.4 never says: "osd.19 already has epoch 1= 9", but >> that is I guess since it thinks osd.19 is down? >> >> On osd.4 I never see I connection coming in from osd.19. I'm trying = to >> figure out which of the two is wrong here. >> >> Any suggestions? >> >> Wido >> >> >>> >>> Wido >>> >>> On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote: >>>> >>>> Hi, >>>> >>>> I'm seeing something weird which I don't want to report in a issue= yet >>>> in the tracker. >>>> >>>> I suffered a pretty big crash which caused me to loose 15 of my 40= OSDs. >>>> Some died because my WD Green 2TB disks failed, others since btrfs >>>> didn't survive the hard reboot I had to give them. >>>> >>>> When watching 'ceph -w' I noticed this: >>>> >>>> 2011-08-05 13:32:03.209958 pg v504313: 8008 pgs: 1 inactive, 33= 4 >>>> active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 ac= tive >>>> +clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peerin= g, 9 >>>> crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 = GB / >>>> 46575 GB avail; 8255/3391107 degraded (0.243%) >>>> 2011-08-05 13:32:03.225712 mds e6033: 1/1/1 up {0=3Ddga=3Dup:rep= lay(laggy >>>> or crashed)} >>>> 2011-08-05 13:32:03.225747 osd e9934: 40 osds: 25 up, 25 in >>>> 2011-08-05 13:32:03.225874 log 2011-08-05 13:32:01.646596 mon0 >>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37 >>>> [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7 >>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746) >>>> 2011-08-05 13:32:03.225964 mon e1: 1 mons at >>>> {dga=3D[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0} >>>> 2011-08-05 13:32:03.929650 log 2011-08-05 13:32:03.603681 mon0 >>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19 >>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5 >>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506) >>>> >>>> Here you see that osd5 is reporting that osd37 is down and osd7 is >>>> reporting that osd19 is down. >>>> >>>> These messages keep being repeated, over and over. My 'debug ms' i= s at >>>> 1, so I took a look at the logs: >>>> >>>> Aug 5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963 >>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04 >>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364) >>>> Aug 5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963 >>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04 >>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632) >>>> Aug 5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965 >>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04 >>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198) >>>> Aug 5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965 >>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04 >>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161) >>>> >>>> osd5 keeps claiming that osd19 is down, but when I check the logs = I see >>>> that osd19 is running and exchanging heartbeat messages with the o= ther >>>> OSDs in the cluster. >>>> >>>> What seems odd to me is that there are no log messages about osd7 >>>> claiming that osd37 is down, although it keeps being repeated by '= ceph >>>> -w'. >>>> >>>> $ ceph osd injectargs 7 '--debug-ms 1' >>>> $ ceph osd injectargs 37 '--debug-ms 1' >>>> >>>> I made sure that was injected, but still, no log messages. >>>> >>>> Going through the logs a bit further (after setting the logging to= 20) I >>>> saw osd.19 reporting: >>>> >>>> Aug 5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 -- >>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740>> >>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 = sd=3D56 >>>> pgs=3D1530 cs=3D1 l=3D0).reader got envelope type=3D70 src osd5 fr= ont=3D61 data=3D0 >>>> off 0 >>>> >>>> It seems that it is communicating with osd5, but osd5 does not see= m to >>>> communicate with osd19? >>>> >>>> The log also confirms that: >>>> >>>> Aug 5 14:10:05 atom1 osd.5[2507]: 7f7531579700 -- >>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 --> >>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19 >>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 --= ?+0 >>>> 0x6948000 con 0x1cd7710 >>>> Aug 5 14:10:05 atom1 osd.5[2507]: 7f7531579700 -- >>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message >>>> osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e1= 0020 >>>> v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 >>>> Aug 5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 -- >>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506>> >>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=3D13 = pgs=3D178 >>>> cs=3D1 l=3D1).writer encoding 17093 0x6948000 osd_failure(osd19 >>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 >>>> >>>> Now, to prevent (I tried...) my cluster from bouncing around I pla= ced >>>> this in the OSD section: >>>> >>>> osd min down reporters =3D 3 >>>> osd min down reports =3D 5 >>>> >>>> Since only osd5 and osd7 are sending out these messages osd19 nor = osd37 >>>> get marked down, but what is going on here? >>>> >>>> I haven't restarted the OSDs yet. >>>> >>>> I noticed that the OSD heartbeat process is being revised currentl= y, but >>>> I thought it might be helpful to report this. >>>> >>>> Wido >>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe ceph-dev= el" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> >>> >>> -- >>> To unsubscribe from this list: send the line "unsubscribe ceph-deve= l" in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel= " in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel"= in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" i= n the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html