Repeated messages of "heartbeat_check: no heartbeat from"

All of lore.kernel.org
 help / color / mirror / Atom feed

* Repeated messages of "heartbeat_check: no heartbeat from"
@ 2011-08-05 12:11 Wido den Hollander
  2011-08-10 10:30 ` Wido den Hollander
  0 siblings, 1 reply; 11+ messages in thread
From: Wido den Hollander @ 2011-08-05 12:11 UTC (permalink / raw)
  To: ceph-devel

Hi,

I'm seeing something weird which I don't want to report in a issue yet
in the tracker.

I suffered a pretty big crash which caused me to loose 15 of my 40 OSDs.
Some died because my WD Green 2TB disks failed, others since btrfs
didn't survive the hard reboot I had to give them.

When watching 'ceph -w' I noticed this:

2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 active
+clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
46575 GB avail; 8255/3391107 degraded (0.243%)
2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up {0=dga=up:replay(laggy
or crashed)}
2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
[2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
[2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
2011-08-05 13:32:03.225964   mon e1: 1 mons at
{dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
[2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
[2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)

Here you see that osd5 is reporting that osd37 is down and osd7 is
reporting that osd19 is down.

These messages keep being repeated, over and over. My 'debug ms' is at
1, so I took a look at the logs:

Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
heartbeat_check: no heartbeat from osd19 since 2011-08-04
17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
heartbeat_check: no heartbeat from osd19 since 2011-08-04
17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
heartbeat_check: no heartbeat from osd19 since 2011-08-04
17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
heartbeat_check: no heartbeat from osd19 since 2011-08-04
17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)

osd5 keeps claiming that osd19 is down, but when I check the logs I see
that osd19 is running and exchanging heartbeat messages with the other
OSDs in the cluster.

What seems odd to me is that there are no log messages about osd7
claiming that osd37 is down, although it keeps being repeated by 'ceph
-w'.

$ ceph osd injectargs 7 '--debug-ms 1'
$ ceph osd injectargs 37 '--debug-ms 1'

I made sure that was injected, but still, no log messages.

Going through the logs a bit further (after setting the logging to 20) I
saw osd.19 reporting:

Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
[2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740 >>
[2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 sd=56
pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61 data=0
off 0

It seems that it is communicating with osd5, but osd5 does not seem to
communicate with osd19?

The log also confirms that:

Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
[2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
[2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
0x6948000 con 0x1cd7710
Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
[2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
[2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 >>
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13 pgs=178
cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
[2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1

Now, to prevent (I tried...) my cluster from bouncing around I placed
this in the OSD section:

    osd min down reporters = 3
    osd min down reports = 5

Since only osd5 and osd7 are sending out these messages osd19 nor osd37
get marked down, but what is going on here?

I haven't restarted the OSDs yet.

I noticed that the OSD heartbeat process is being revised currently, but
I thought it might be helpful to report this.

Wido


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
  2011-08-05 12:11 Repeated messages of "heartbeat_check: no heartbeat from" Wido den Hollander
@ 2011-08-10 10:30 ` Wido den Hollander
  2011-10-14 14:26   ` Wido den Hollander
  2012-02-15 14:12   ` Wido den Hollander
  0 siblings, 2 replies; 11+ messages in thread
From: Wido den Hollander @ 2011-08-10 10:30 UTC (permalink / raw)
  To: ceph-devel

Hi,

I tried reproducing this today with the new heartbeat code, couldn't.

The new heartbeat code seems to work much better under load, I haven't
seen any bouncing like I saw before.

Wido

On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
> Hi,
> 
> I'm seeing something weird which I don't want to report in a issue yet
> in the tracker.
> 
> I suffered a pretty big crash which caused me to loose 15 of my 40 OSDs.
> Some died because my WD Green 2TB disks failed, others since btrfs
> didn't survive the hard reboot I had to give them.
> 
> When watching 'ceph -w' I noticed this:
> 
> 2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
> active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 active
> +clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
> crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
> 46575 GB avail; 8255/3391107 degraded (0.243%)
> 2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up {0=dga=up:replay(laggy
> or crashed)}
> 2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
> 2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
> [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
> [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
> 2011-08-05 13:32:03.225964   mon e1: 1 mons at
> {dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
> 2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)
> 
> Here you see that osd5 is reporting that osd37 is down and osd7 is
> reporting that osd19 is down.
> 
> These messages keep being repeated, over and over. My 'debug ms' is at
> 1, so I took a look at the logs:
> 
> Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
> heartbeat_check: no heartbeat from osd19 since 2011-08-04
> 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
> Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
> heartbeat_check: no heartbeat from osd19 since 2011-08-04
> 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
> Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
> heartbeat_check: no heartbeat from osd19 since 2011-08-04
> 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
> Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
> heartbeat_check: no heartbeat from osd19 since 2011-08-04
> 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
> 
> osd5 keeps claiming that osd19 is down, but when I check the logs I see
> that osd19 is running and exchanging heartbeat messages with the other
> OSDs in the cluster.
> 
> What seems odd to me is that there are no log messages about osd7
> claiming that osd37 is down, although it keeps being repeated by 'ceph
> -w'.
> 
> $ ceph osd injectargs 7 '--debug-ms 1'
> $ ceph osd injectargs 37 '--debug-ms 1'
> 
> I made sure that was injected, but still, no log messages.
> 
> Going through the logs a bit further (after setting the logging to 20) I
> saw osd.19 reporting:
> 
> Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
> [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740 >>
> [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 sd=56
> pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61 data=0
> off 0
> 
> It seems that it is communicating with osd5, but osd5 does not seem to
> communicate with osd19?
> 
> The log also confirms that:
> 
> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
> 0x6948000 con 0x1cd7710
> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
> osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
> v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
> Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 >>
> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13 pgs=178
> cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1
> 
> Now, to prevent (I tried...) my cluster from bouncing around I placed
> this in the OSD section:
> 
>     osd min down reporters = 3
>     osd min down reports = 5
> 
> Since only osd5 and osd7 are sending out these messages osd19 nor osd37
> get marked down, but what is going on here?
> 
> I haven't restarted the OSDs yet.
> 
> I noticed that the OSD heartbeat process is being revised currently, but
> I thought it might be helpful to report this.
> 
> Wido
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
  2011-08-10 10:30 ` Wido den Hollander
@ 2011-10-14 14:26   ` Wido den Hollander
  2011-10-14 15:16     ` Sage Weil
  2012-02-15 14:12   ` Wido den Hollander
  1 sibling, 1 reply; 11+ messages in thread
From: Wido den Hollander @ 2011-10-14 14:26 UTC (permalink / raw)
  To: ceph-devel

Hi,

I'm seeing this behavior again. In the monitor messages this started to
appear:

2011-10-14 16:22:19.853752   log 2011-10-14 16:22:14.887881 mon.0
[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 3281 : [INF] osd.12
[2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)

So, osd.0 is telling us that osd.12 is down, but the proces is running
and seems alive.

The logs are saying:

Oct 14 16:20:52 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
heartbeat_check: no heartbeat from osd.12 since 2011-10-14
16:00:07.636627 (cutoff 2011-10-14 16:20:32.025154)
Oct 14 16:20:52 atom0 osd.0[10679]: 7f81d2d93700 osd.0 511
heartbeat_check: no heartbeat from osd.12 since 2011-10-14
16:00:07.636627 (cutoff 2011-10-14 16:20:32.108369)
Oct 14 16:20:53 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
heartbeat_check: no heartbeat from osd.12 since 2011-10-14
16:00:07.636627 (cutoff 2011-10-14 16:20:33.025462)

While osd.12 reports getting messages from osd.0

Oct 14 16:20:54 atom3 osd.12[8024]: 7fa3621bb700 --
[2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024 <== osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 1317 ====
osd_ping(heartbeat e511 as_of 355) v1 ==== 61+0+0 (2126910847 0 0)
0x1e281c0 con 0x1f02c80

I see osd.0 sending pg_info messages:

Oct 14 16:20:58 atom0 osd.0[10679]: 7f81d769d700 --
[2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 --> osd.12
[2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 -- pg_info(1 pgs e512) v1
-- ?+0 0x3bf1e00

And osd.12 is also sending stuff back:

Oct 14 16:21:06 atom3 osd.12[8024]: 7fa3629bc700 --
[2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 --> osd.0
[2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 -- pg_info(1 pgs e518) v1
-- ?+0 0x6f3e540

I'm running the latest version, but I did do a complete powercycle of my
cluster this morning. It's after the powercycle these messages started
appearing.

Any ideas?

Wido

On Wed, 2011-08-10 at 12:30 +0200, Wido den Hollander wrote:
> Hi,
> 
> I tried reproducing this today with the new heartbeat code, couldn't.
> 
> The new heartbeat code seems to work much better under load, I haven't
> seen any bouncing like I saw before.
> 
> Wido
> 
> On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
> > Hi,
> > 
> > I'm seeing something weird which I don't want to report in a issue yet
> > in the tracker.
> > 
> > I suffered a pretty big crash which caused me to loose 15 of my 40 OSDs.
> > Some died because my WD Green 2TB disks failed, others since btrfs
> > didn't survive the hard reboot I had to give them.
> > 
> > When watching 'ceph -w' I noticed this:
> > 
> > 2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
> > active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 active
> > +clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
> > crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
> > 46575 GB avail; 8255/3391107 degraded (0.243%)
> > 2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up {0=dga=up:replay(laggy
> > or crashed)}
> > 2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
> > 2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
> > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
> > [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
> > [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
> > 2011-08-05 13:32:03.225964   mon e1: 1 mons at
> > {dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
> > 2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
> > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
> > [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
> > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)
> > 
> > Here you see that osd5 is reporting that osd37 is down and osd7 is
> > reporting that osd19 is down.
> > 
> > These messages keep being repeated, over and over. My 'debug ms' is at
> > 1, so I took a look at the logs:
> > 
> > Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
> > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
> > Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
> > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
> > Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
> > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
> > Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
> > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
> > 
> > osd5 keeps claiming that osd19 is down, but when I check the logs I see
> > that osd19 is running and exchanging heartbeat messages with the other
> > OSDs in the cluster.
> > 
> > What seems odd to me is that there are no log messages about osd7
> > claiming that osd37 is down, although it keeps being repeated by 'ceph
> > -w'.
> > 
> > $ ceph osd injectargs 7 '--debug-ms 1'
> > $ ceph osd injectargs 37 '--debug-ms 1'
> > 
> > I made sure that was injected, but still, no log messages.
> > 
> > Going through the logs a bit further (after setting the logging to 20) I
> > saw osd.19 reporting:
> > 
> > Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
> > [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740 >>
> > [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 sd=56
> > pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61 data=0
> > off 0
> > 
> > It seems that it is communicating with osd5, but osd5 does not seem to
> > communicate with osd19?
> > 
> > The log also confirms that:
> > 
> > Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
> > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
> > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
> > [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
> > 0x6948000 con 0x1cd7710
> > Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
> > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
> > osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
> > v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
> > Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
> > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 >>
> > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13 pgs=178
> > cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
> > [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1
> > 
> > Now, to prevent (I tried...) my cluster from bouncing around I placed
> > this in the OSD section:
> > 
> >     osd min down reporters = 3
> >     osd min down reports = 5
> > 
> > Since only osd5 and osd7 are sending out these messages osd19 nor osd37
> > get marked down, but what is going on here?
> > 
> > I haven't restarted the OSDs yet.
> > 
> > I noticed that the OSD heartbeat process is being revised currently, but
> > I thought it might be helpful to report this.
> > 
> > Wido
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
  2011-10-14 14:26   ` Wido den Hollander
@ 2011-10-14 15:16     ` Sage Weil
       [not found]       ` <4E9C4F45.8030704@widodh.nl>
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2011-10-14 15:16 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

On Fri, 14 Oct 2011, Wido den Hollander wrote:
> Hi,
> 
> I'm seeing this behavior again. In the monitor messages this started to
> appear:
> 
> 2011-10-14 16:22:19.853752   log 2011-10-14 16:22:14.887881 mon.0
> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 3281 : [INF] osd.12
> [2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)
> 
> So, osd.0 is telling us that osd.12 is down, but the proces is running
> and seems alive.
> 
> The logs are saying:
> 
> Oct 14 16:20:52 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
> heartbeat_check: no heartbeat from osd.12 since 2011-10-14
> 16:00:07.636627 (cutoff 2011-10-14 16:20:32.025154)
> Oct 14 16:20:52 atom0 osd.0[10679]: 7f81d2d93700 osd.0 511
> heartbeat_check: no heartbeat from osd.12 since 2011-10-14
> 16:00:07.636627 (cutoff 2011-10-14 16:20:32.108369)
> Oct 14 16:20:53 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
> heartbeat_check: no heartbeat from osd.12 since 2011-10-14
> 16:00:07.636627 (cutoff 2011-10-14 16:20:33.025462)
> 
> While osd.12 reports getting messages from osd.0
> 
> Oct 14 16:20:54 atom3 osd.12[8024]: 7fa3621bb700 --
> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024 <== osd.0
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 1317 ====
> osd_ping(heartbeat e511 as_of 355) v1 ==== 61+0+0 (2126910847 0 0)
> 0x1e281c0 con 0x1f02c80
> 
> I see osd.0 sending pg_info messages:
> 
> Oct 14 16:20:58 atom0 osd.0[10679]: 7f81d769d700 --
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 --> osd.12
> [2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 -- pg_info(1 pgs e512) v1
> -- ?+0 0x3bf1e00
> 
> And osd.12 is also sending stuff back:
> 
> Oct 14 16:21:06 atom3 osd.12[8024]: 7fa3629bc700 --
> [2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 --> osd.0
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 -- pg_info(1 pgs e518) v1
> -- ?+0 0x6f3e540
> 
> I'm running the latest version, but I did do a complete powercycle of my
> cluster this morning. It's after the powercycle these messages started
> appearing.
> 
> Any ideas?

Can you post full osd logs for a pair where there was some marking-down 
occuring?  I'll take a look.

Thanks!
sage

> 
> Wido
> 
> On Wed, 2011-08-10 at 12:30 +0200, Wido den Hollander wrote:
> > Hi,
> > 
> > I tried reproducing this today with the new heartbeat code, couldn't.
> > 
> > The new heartbeat code seems to work much better under load, I haven't
> > seen any bouncing like I saw before.
> > 
> > Wido
> > 
> > On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
> > > Hi,
> > > 
> > > I'm seeing something weird which I don't want to report in a issue yet
> > > in the tracker.
> > > 
> > > I suffered a pretty big crash which caused me to loose 15 of my 40 OSDs.
> > > Some died because my WD Green 2TB disks failed, others since btrfs
> > > didn't survive the hard reboot I had to give them.
> > > 
> > > When watching 'ceph -w' I noticed this:
> > > 
> > > 2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
> > > active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 active
> > > +clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
> > > crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
> > > 46575 GB avail; 8255/3391107 degraded (0.243%)
> > > 2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up {0=dga=up:replay(laggy
> > > or crashed)}
> > > 2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
> > > 2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
> > > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
> > > [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
> > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
> > > 2011-08-05 13:32:03.225964   mon e1: 1 mons at
> > > {dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
> > > 2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
> > > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
> > > [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
> > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)
> > > 
> > > Here you see that osd5 is reporting that osd37 is down and osd7 is
> > > reporting that osd19 is down.
> > > 
> > > These messages keep being repeated, over and over. My 'debug ms' is at
> > > 1, so I took a look at the logs:
> > > 
> > > Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
> > > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > > 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
> > > Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
> > > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > > 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
> > > Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
> > > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > > 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
> > > Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
> > > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > > 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
> > > 
> > > osd5 keeps claiming that osd19 is down, but when I check the logs I see
> > > that osd19 is running and exchanging heartbeat messages with the other
> > > OSDs in the cluster.
> > > 
> > > What seems odd to me is that there are no log messages about osd7
> > > claiming that osd37 is down, although it keeps being repeated by 'ceph
> > > -w'.
> > > 
> > > $ ceph osd injectargs 7 '--debug-ms 1'
> > > $ ceph osd injectargs 37 '--debug-ms 1'
> > > 
> > > I made sure that was injected, but still, no log messages.
> > > 
> > > Going through the logs a bit further (after setting the logging to 20) I
> > > saw osd.19 reporting:
> > > 
> > > Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
> > > [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740 >>
> > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 sd=56
> > > pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61 data=0
> > > off 0
> > > 
> > > It seems that it is communicating with osd5, but osd5 does not seem to
> > > communicate with osd19?
> > > 
> > > The log also confirms that:
> > > 
> > > Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
> > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
> > > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
> > > [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
> > > 0x6948000 con 0x1cd7710
> > > Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
> > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
> > > osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
> > > v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
> > > Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
> > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 >>
> > > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13 pgs=178
> > > cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
> > > [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1
> > > 
> > > Now, to prevent (I tried...) my cluster from bouncing around I placed
> > > this in the OSD section:
> > > 
> > >     osd min down reporters = 3
> > >     osd min down reports = 5
> > > 
> > > Since only osd5 and osd7 are sending out these messages osd19 nor osd37
> > > get marked down, but what is going on here?
> > > 
> > > I haven't restarted the OSDs yet.
> > > 
> > > I noticed that the OSD heartbeat process is being revised currently, but
> > > I thought it might be helpful to report this.
> > > 
> > > Wido
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

[parent not found: <4E9C4F45.8030704@widodh.nl>]

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
       [not found]       ` <4E9C4F45.8030704@widodh.nl>
@ 2011-10-25 23:06         ` Sage Weil
  2011-10-27  6:20           ` Wido den Hollander
  0 siblings, 1 reply; 11+ messages in thread
From: Sage Weil @ 2011-10-25 23:06 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

Hi Wido,

Finally got to looking through these logs :( and I see osd.12 pinging lots 
of others OSDs but not osd.0.  The logs don't go far back enough to see 
osd.0 requesting heartbeats, though.

Is this something you are able to reproduce?  A similar log for the 
marker-down that extends back to when the process started will have more 
info.

Thanks!
sage



On Mon, 17 Oct 2011, Wido den Hollander wrote:

> Hi,
> 
> On 10/14/2011 05:16 PM, Sage Weil wrote:
> > On Fri, 14 Oct 2011, Wido den Hollander wrote:
> > > Hi,
> > > 
> > > I'm seeing this behavior again. In the monitor messages this started to
> > > appear:
> > > 
> > > 2011-10-14 16:22:19.853752   log 2011-10-14 16:22:14.887881 mon.0
> > > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 3281 : [INF] osd.12
> > > [2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
> > > [2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)
> > > 
> > > So, osd.0 is telling us that osd.12 is down, but the proces is running
> > > and seems alive.
> > > 
> > > The logs are saying:
> > > 
> > > Oct 14 16:20:52 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
> > > heartbeat_check: no heartbeat from osd.12 since 2011-10-14
> > > 16:00:07.636627 (cutoff 2011-10-14 16:20:32.025154)
> > > Oct 14 16:20:52 atom0 osd.0[10679]: 7f81d2d93700 osd.0 511
> > > heartbeat_check: no heartbeat from osd.12 since 2011-10-14
> > > 16:00:07.636627 (cutoff 2011-10-14 16:20:32.108369)
> > > Oct 14 16:20:53 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
> > > heartbeat_check: no heartbeat from osd.12 since 2011-10-14
> > > 16:00:07.636627 (cutoff 2011-10-14 16:20:33.025462)
> > > 
> > > While osd.12 reports getting messages from osd.0
> > > 
> > > Oct 14 16:20:54 atom3 osd.12[8024]: 7fa3621bb700 --
> > > [2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024<== osd.0
> > > [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 1317 ====
> > > osd_ping(heartbeat e511 as_of 355) v1 ==== 61+0+0 (2126910847 0 0)
> > > 0x1e281c0 con 0x1f02c80
> > > 
> > > I see osd.0 sending pg_info messages:
> > > 
> > > Oct 14 16:20:58 atom0 osd.0[10679]: 7f81d769d700 --
> > > [2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 -->  osd.12
> > > [2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 -- pg_info(1 pgs e512) v1
> > > -- ?+0 0x3bf1e00
> > > 
> > > And osd.12 is also sending stuff back:
> > > 
> > > Oct 14 16:21:06 atom3 osd.12[8024]: 7fa3629bc700 --
> > > [2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 -->  osd.0
> > > [2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 -- pg_info(1 pgs e518) v1
> > > -- ?+0 0x6f3e540
> > > 
> > > I'm running the latest version, but I did do a complete powercycle of my
> > > cluster this morning. It's after the powercycle these messages started
> > > appearing.
> > > 
> > > Any ideas?
> > Can you post full osd logs for a pair where there was some marking-down
> > occuring?  I'll take a look.
> 
> Sure! Attached in the GZIP
> 
> I'm using remote-syslog, so the logs might be not how you want them. Let me
> know if so!
> 
> I used "ceph osd tell 0 injectargs '--debug-ms 20'" to set the debugging
> higher.
> 
> What I noticed was this:
> 
> $ tail -f osd.log |grep "osd\.0"|grep "osd\.12"
> 
> Oct 17 17:48:33 monitor mon.dga[560]: 7fa5fc520700 log [INF] : osd.12
> [2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)
> Oct 17 17:48:33 monitor mon.dga[560]: 2011-10-17 17:48:33.930666 mon.0
> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 2440 : [INF] osd.12
> [2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)
> Oct 17 17:48:36 atom3 osd.12[8024]: 7fa356039700 --
> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024 >>
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 pipe(0x6e0e000 sd=20 pgs=133
> cs=1 l=0).reader got envelope type=70 src osd.0 front=61 data=0 off 0
> Oct 17 17:48:36 atom3 osd.12[8024]: 7fa3621bb700 --
> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024 <== osd.0
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 278150 ==== osd_ping(heartbeat
> e577 as_of 355) v1 ==== 61+0+0 (2665319624 0 0) 0x52e1540 con 0x1f02c80
> 
> $ tail -f osd.log |grep "osd\.0"|grep "osd\.15" (Random other OSD)
> 
> Oct 17 17:49:29 atom3 osd.15[10295]: 7fd2ee88c700 --
> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/10295 <== osd.0
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 277738 ==== osd_ping(heartbeat
> e577 as_of 404) v1 ==== 61+0+0 (1619141758 0 0) 0x33568c0 con 0x56a0c80
> Oct 17 17:49:33 atom3 osd.15[10295]: 7fd2ee88c700 --
> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/10295 <== osd.0
> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 277742 ==== osd_ping(heartbeat
> e577 as_of 404) v1 ==== 61+0+0 (116581228 0 0) 0x6739c40 con 0x56a0c80
> Oct 17 17:49:34 atom0 osd.0[10679]: 7f81c91ff700 --
> [2a00:f10:113:1:225:90ff:fe33:49fe]:0/10679 >>
> [2a00:f10:113:1:225:90ff:fe33:49b0]:6811/10294 pipe(0x4bbda00 sd=87 pgs=19
> cs=1 l=0).reader got envelope type=70 src osd.15 front=61 data=0 off 0
> Oct 17 17:49:34 atom0 osd.0[10679]: 7f81d6e9c700 --
> [2a00:f10:113:1:225:90ff:fe33:49fe]:0/10679 <== osd.15
> [2a00:f10:113:1:225:90ff:fe33:49b0]:6811/10294 278117 ==== osd_ping(heartbeat
> e577 as_of 404) v1 ==== 61+0+0 (116581228 0 0) 0x2fd5c40 con 0x5a6cb40
> 
> If you look at the log you'll see that osd.0 is reporting that osd.12 and
> osd.39 are down.
> 
> The logs show me that osd.0 is never communicating with osd.12 nor osd.39 and
> simply reports them as down.
> 
> The last time I saw this it was exactly the same. One OSD was repeating that
> two other OSD's were down.
> 
> My guess is that osd.0 is the problem here.
> 
> > 
> > Thanks!
> > sage
> > 
> > > Wido
> > > 
> > > On Wed, 2011-08-10 at 12:30 +0200, Wido den Hollander wrote:
> > > > Hi,
> > > > 
> > > > I tried reproducing this today with the new heartbeat code, couldn't.
> > > > 
> > > > The new heartbeat code seems to work much better under load, I haven't
> > > > seen any bouncing like I saw before.
> > > > 
> > > > Wido
> > > > 
> > > > On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
> > > > > Hi,
> > > > > 
> > > > > I'm seeing something weird which I don't want to report in a issue yet
> > > > > in the tracker.
> > > > > 
> > > > > I suffered a pretty big crash which caused me to loose 15 of my 40
> > > > > OSDs.
> > > > > Some died because my WD Green 2TB disks failed, others since btrfs
> > > > > didn't survive the hard reboot I had to give them.
> > > > > 
> > > > > When watching 'ceph -w' I noticed this:
> > > > > 
> > > > > 2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
> > > > > active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 active
> > > > > +clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
> > > > > crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
> > > > > 46575 GB avail; 8255/3391107 degraded (0.243%)
> > > > > 2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up
> > > > > {0=dga=up:replay(laggy
> > > > > or crashed)}
> > > > > 2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
> > > > > 2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
> > > > > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
> > > > > [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
> > > > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
> > > > > 2011-08-05 13:32:03.225964   mon e1: 1 mons at
> > > > > {dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
> > > > > 2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
> > > > > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
> > > > > [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
> > > > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)
> > > > > 
> > > > > Here you see that osd5 is reporting that osd37 is down and osd7 is
> > > > > reporting that osd19 is down.
> > > > > 
> > > > > These messages keep being repeated, over and over. My 'debug ms' is at
> > > > > 1, so I took a look at the logs:
> > > > > 
> > > > > Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
> > > > > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > > > > 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
> > > > > Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
> > > > > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > > > > 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
> > > > > Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
> > > > > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > > > > 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
> > > > > Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
> > > > > heartbeat_check: no heartbeat from osd19 since 2011-08-04
> > > > > 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
> > > > > 
> > > > > osd5 keeps claiming that osd19 is down, but when I check the logs I
> > > > > see
> > > > > that osd19 is running and exchanging heartbeat messages with the other
> > > > > OSDs in the cluster.
> > > > > 
> > > > > What seems odd to me is that there are no log messages about osd7
> > > > > claiming that osd37 is down, although it keeps being repeated by 'ceph
> > > > > -w'.
> > > > > 
> > > > > $ ceph osd injectargs 7 '--debug-ms 1'
> > > > > $ ceph osd injectargs 37 '--debug-ms 1'
> > > > > 
> > > > > I made sure that was injected, but still, no log messages.
> > > > > 
> > > > > Going through the logs a bit further (after setting the logging to 20)
> > > > > I
> > > > > saw osd.19 reporting:
> > > > > 
> > > > > Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
> > > > > [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740>>
> > > > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0
> > > > > sd=56
> > > > > pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61
> > > > > data=0
> > > > > off 0
> > > > > 
> > > > > It seems that it is communicating with osd5, but osd5 does not seem to
> > > > > communicate with osd19?
> > > > > 
> > > > > The log also confirms that:
> > > > > 
> > > > > Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
> > > > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
> > > > > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
> > > > > [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
> > > > > 0x6948000 con 0x1cd7710
> > > > > Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
> > > > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
> > > > > osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
> > > > > v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
> > > > > Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
> > > > > [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506>>
> > > > > [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13
> > > > > pgs=178
> > > > > cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
> > > > > [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1
> > > > > 
> > > > > Now, to prevent (I tried...) my cluster from bouncing around I placed
> > > > > this in the OSD section:
> > > > > 
> > > > >      osd min down reporters = 3
> > > > >      osd min down reports = 5
> > > > > 
> > > > > Since only osd5 and osd7 are sending out these messages osd19 nor
> > > > > osd37
> > > > > get marked down, but what is going on here?
> > > > > 
> > > > > I haven't restarted the OSDs yet.
> > > > > 
> > > > > I noticed that the OSD heartbeat process is being revised currently,
> > > > > but
> > > > > I thought it might be helpful to report this.
> > > > > 
> > > > > Wido
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > > in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > 
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > > the body of a message to majordomo@vger.kernel.org
> > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
  2011-10-25 23:06         ` Sage Weil
@ 2011-10-27  6:20           ` Wido den Hollander
  0 siblings, 0 replies; 11+ messages in thread
From: Wido den Hollander @ 2011-10-27  6:20 UTC (permalink / raw)
  To: Sage Weil; +Cc: ceph-devel

Hi Sage,

On 10/26/2011 01:06 AM, Sage Weil wrote:
> Hi Wido,
>
> Finally got to looking through these logs :( and I see osd.12 pinging lots
> of others OSDs but not osd.0.  The logs don't go far back enough to see
> osd.0 requesting heartbeats, though.

Ah, bummer :(

>
> Is this something you are able to reproduce?  A similar log for the
> marker-down that extends back to when the process started will have more
> info.

No, not really. But I've seen it happening when I restarted my whole 
cluster and fired up the OSD's rapidly after eachother.

Not sure if I'm able to reproduce, but I sure can give it a try.

Wido

>
> Thanks!
> sage
>
>
>
> On Mon, 17 Oct 2011, Wido den Hollander wrote:
>
>> Hi,
>>
>> On 10/14/2011 05:16 PM, Sage Weil wrote:
>>> On Fri, 14 Oct 2011, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> I'm seeing this behavior again. In the monitor messages this started to
>>>> appear:
>>>>
>>>> 2011-10-14 16:22:19.853752   log 2011-10-14 16:22:14.887881 mon.0
>>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 3281 : [INF] osd.12
>>>> [2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
>>>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)
>>>>
>>>> So, osd.0 is telling us that osd.12 is down, but the proces is running
>>>> and seems alive.
>>>>
>>>> The logs are saying:
>>>>
>>>> Oct 14 16:20:52 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
>>>> heartbeat_check: no heartbeat from osd.12 since 2011-10-14
>>>> 16:00:07.636627 (cutoff 2011-10-14 16:20:32.025154)
>>>> Oct 14 16:20:52 atom0 osd.0[10679]: 7f81d2d93700 osd.0 511
>>>> heartbeat_check: no heartbeat from osd.12 since 2011-10-14
>>>> 16:00:07.636627 (cutoff 2011-10-14 16:20:32.108369)
>>>> Oct 14 16:20:53 atom0 osd.0[10679]: 7f81deeac700 osd.0 511
>>>> heartbeat_check: no heartbeat from osd.12 since 2011-10-14
>>>> 16:00:07.636627 (cutoff 2011-10-14 16:20:33.025462)
>>>>
>>>> While osd.12 reports getting messages from osd.0
>>>>
>>>> Oct 14 16:20:54 atom3 osd.12[8024]: 7fa3621bb700 --
>>>> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024<== osd.0
>>>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 1317 ====
>>>> osd_ping(heartbeat e511 as_of 355) v1 ==== 61+0+0 (2126910847 0 0)
>>>> 0x1e281c0 con 0x1f02c80
>>>>
>>>> I see osd.0 sending pg_info messages:
>>>>
>>>> Oct 14 16:20:58 atom0 osd.0[10679]: 7f81d769d700 --
>>>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 -->   osd.12
>>>> [2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 -- pg_info(1 pgs e512) v1
>>>> -- ?+0 0x3bf1e00
>>>>
>>>> And osd.12 is also sending stuff back:
>>>>
>>>> Oct 14 16:21:06 atom3 osd.12[8024]: 7fa3629bc700 --
>>>> [2a00:f10:113:1:225:90ff:fe33:49b0]:6801/8023 -->   osd.0
>>>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6812/10678 -- pg_info(1 pgs e518) v1
>>>> -- ?+0 0x6f3e540
>>>>
>>>> I'm running the latest version, but I did do a complete powercycle of my
>>>> cluster this morning. It's after the powercycle these messages started
>>>> appearing.
>>>>
>>>> Any ideas?
>>> Can you post full osd logs for a pair where there was some marking-down
>>> occuring?  I'll take a look.
>>
>> Sure! Attached in the GZIP
>>
>> I'm using remote-syslog, so the logs might be not how you want them. Let me
>> know if so!
>>
>> I used "ceph osd tell 0 injectargs '--debug-ms 20'" to set the debugging
>> higher.
>>
>> What I noticed was this:
>>
>> $ tail -f osd.log |grep "osd\.0"|grep "osd\.12"
>>
>> Oct 17 17:48:33 monitor mon.dga[560]: 7fa5fc520700 log [INF] : osd.12
>> [2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)
>> Oct 17 17:48:33 monitor mon.dga[560]: 2011-10-17 17:48:33.930666 mon.0
>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 2440 : [INF] osd.12
>> [2a00:f10:113:1:225:90ff:fe33:49b0]:6800/8023 failed (by osd.0
>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6800/10678)
>> Oct 17 17:48:36 atom3 osd.12[8024]: 7fa356039700 --
>> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024>>
>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 pipe(0x6e0e000 sd=20 pgs=133
>> cs=1 l=0).reader got envelope type=70 src osd.0 front=61 data=0 off 0
>> Oct 17 17:48:36 atom3 osd.12[8024]: 7fa3621bb700 --
>> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/8024<== osd.0
>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 278150 ==== osd_ping(heartbeat
>> e577 as_of 355) v1 ==== 61+0+0 (2665319624 0 0) 0x52e1540 con 0x1f02c80
>>
>> $ tail -f osd.log |grep "osd\.0"|grep "osd\.15" (Random other OSD)
>>
>> Oct 17 17:49:29 atom3 osd.15[10295]: 7fd2ee88c700 --
>> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/10295<== osd.0
>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 277738 ==== osd_ping(heartbeat
>> e577 as_of 404) v1 ==== 61+0+0 (1619141758 0 0) 0x33568c0 con 0x56a0c80
>> Oct 17 17:49:33 atom3 osd.15[10295]: 7fd2ee88c700 --
>> [2a00:f10:113:1:225:90ff:fe33:49b0]:0/10295<== osd.0
>> [2a00:f10:113:1:225:90ff:fe33:49fe]:6813/10678 277742 ==== osd_ping(heartbeat
>> e577 as_of 404) v1 ==== 61+0+0 (116581228 0 0) 0x6739c40 con 0x56a0c80
>> Oct 17 17:49:34 atom0 osd.0[10679]: 7f81c91ff700 --
>> [2a00:f10:113:1:225:90ff:fe33:49fe]:0/10679>>
>> [2a00:f10:113:1:225:90ff:fe33:49b0]:6811/10294 pipe(0x4bbda00 sd=87 pgs=19
>> cs=1 l=0).reader got envelope type=70 src osd.15 front=61 data=0 off 0
>> Oct 17 17:49:34 atom0 osd.0[10679]: 7f81d6e9c700 --
>> [2a00:f10:113:1:225:90ff:fe33:49fe]:0/10679<== osd.15
>> [2a00:f10:113:1:225:90ff:fe33:49b0]:6811/10294 278117 ==== osd_ping(heartbeat
>> e577 as_of 404) v1 ==== 61+0+0 (116581228 0 0) 0x2fd5c40 con 0x5a6cb40
>>
>> If you look at the log you'll see that osd.0 is reporting that osd.12 and
>> osd.39 are down.
>>
>> The logs show me that osd.0 is never communicating with osd.12 nor osd.39 and
>> simply reports them as down.
>>
>> The last time I saw this it was exactly the same. One OSD was repeating that
>> two other OSD's were down.
>>
>> My guess is that osd.0 is the problem here.
>>
>>>
>>> Thanks!
>>> sage
>>>
>>>> Wido
>>>>
>>>> On Wed, 2011-08-10 at 12:30 +0200, Wido den Hollander wrote:
>>>>> Hi,
>>>>>
>>>>> I tried reproducing this today with the new heartbeat code, couldn't.
>>>>>
>>>>> The new heartbeat code seems to work much better under load, I haven't
>>>>> seen any bouncing like I saw before.
>>>>>
>>>>> Wido
>>>>>
>>>>> On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I'm seeing something weird which I don't want to report in a issue yet
>>>>>> in the tracker.
>>>>>>
>>>>>> I suffered a pretty big crash which caused me to loose 15 of my 40
>>>>>> OSDs.
>>>>>> Some died because my WD Green 2TB disks failed, others since btrfs
>>>>>> didn't survive the hard reboot I had to give them.
>>>>>>
>>>>>> When watching 'ceph -w' I noticed this:
>>>>>>
>>>>>> 2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
>>>>>> active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 active
>>>>>> +clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
>>>>>> crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
>>>>>> 46575 GB avail; 8255/3391107 degraded (0.243%)
>>>>>> 2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up
>>>>>> {0=dga=up:replay(laggy
>>>>>> or crashed)}
>>>>>> 2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
>>>>>> 2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
>>>>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
>>>>>> [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
>>>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
>>>>>> 2011-08-05 13:32:03.225964   mon e1: 1 mons at
>>>>>> {dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
>>>>>> 2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
>>>>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
>>>>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
>>>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)
>>>>>>
>>>>>> Here you see that osd5 is reporting that osd37 is down and osd7 is
>>>>>> reporting that osd19 is down.
>>>>>>
>>>>>> These messages keep being repeated, over and over. My 'debug ms' is at
>>>>>> 1, so I took a look at the logs:
>>>>>>
>>>>>> Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
>>>>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>>>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
>>>>>> Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
>>>>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>>>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
>>>>>> Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
>>>>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>>>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
>>>>>> Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
>>>>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>>>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
>>>>>>
>>>>>> osd5 keeps claiming that osd19 is down, but when I check the logs I
>>>>>> see
>>>>>> that osd19 is running and exchanging heartbeat messages with the other
>>>>>> OSDs in the cluster.
>>>>>>
>>>>>> What seems odd to me is that there are no log messages about osd7
>>>>>> claiming that osd37 is down, although it keeps being repeated by 'ceph
>>>>>> -w'.
>>>>>>
>>>>>> $ ceph osd injectargs 7 '--debug-ms 1'
>>>>>> $ ceph osd injectargs 37 '--debug-ms 1'
>>>>>>
>>>>>> I made sure that was injected, but still, no log messages.
>>>>>>
>>>>>> Going through the logs a bit further (after setting the logging to 20)
>>>>>> I
>>>>>> saw osd.19 reporting:
>>>>>>
>>>>>> Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
>>>>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740>>
>>>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0
>>>>>> sd=56
>>>>>> pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61
>>>>>> data=0
>>>>>> off 0
>>>>>>
>>>>>> It seems that it is communicating with osd5, but osd5 does not seem to
>>>>>> communicate with osd19?
>>>>>>
>>>>>> The log also confirms that:
>>>>>>
>>>>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>>>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
>>>>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
>>>>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
>>>>>> 0x6948000 con 0x1cd7710
>>>>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>>>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
>>>>>> osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
>>>>>> v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
>>>>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
>>>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506>>
>>>>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13
>>>>>> pgs=178
>>>>>> cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
>>>>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1
>>>>>>
>>>>>> Now, to prevent (I tried...) my cluster from bouncing around I placed
>>>>>> this in the OSD section:
>>>>>>
>>>>>>       osd min down reporters = 3
>>>>>>       osd min down reports = 5
>>>>>>
>>>>>> Since only osd5 and osd7 are sending out these messages osd19 nor
>>>>>> osd37
>>>>>> get marked down, but what is going on here?
>>>>>>
>>>>>> I haven't restarted the OSDs yet.
>>>>>>
>>>>>> I noticed that the OSD heartbeat process is being revised currently,
>>>>>> but
>>>>>> I thought it might be helpful to report this.
>>>>>>
>>>>>> Wido
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>>> in
>>>>>> the body of a message to majordomo@vger.kernel.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>> the body of a message to majordomo@vger.kernel.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
  2011-08-10 10:30 ` Wido den Hollander
  2011-10-14 14:26   ` Wido den Hollander
@ 2012-02-15 14:12   ` Wido den Hollander
  2012-02-22 18:08     ` Gregory Farnum
  1 sibling, 1 reply; 11+ messages in thread
From: Wido den Hollander @ 2012-02-15 14:12 UTC (permalink / raw)
  To: ceph-devel

Hi,

On 08/10/2011 12:30 PM, Wido den Hollander wrote:
> Hi,
>
> I tried reproducing this today with the new heartbeat code, couldn't.
>
> The new heartbeat code seems to work much better under load, I haven't
> seen any bouncing like I saw before.

I'm still observing this bouncing, after a couple of days running my 
cluster the OSD's started to say they are down again.

[INF] osd.19 [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6809/14435 failed 
(by osd.4 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6800/15389

[INF] osd.8 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6806/15035 failed (by 
osd.25 [2a00:f10:11b:cef0:225:90ff:fe33:49ca]:6803/14344

[INF] osd.22 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6806/14527 failed 
(by osd.13 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6803/14188

It's the same again, a couple of OSD's don't like eachother anymore and 
start marking the other as down.

My ceph.conf: http://zooi.widodh.nl/ceph/ceph.conf

I'm taking osd.19 and osd.4 here as example.

Some logs on Pastebin: http://pastebin.com/EYUHbPua

All systems are idle, so this can't be an issue of limited CPU cycles.

As this seems to keep coming back, what steps should we take next?

debug_ms and os are on 20.

What I noticed is that osd.4 never says: "osd.19 already has epoch 19", 
but that is I guess since it thinks osd.19 is down?

On osd.4 I never see I connection coming in from osd.19. I'm trying to 
figure out which of the two is wrong here.

Any suggestions?

Wido

>
> Wido
>
> On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
>> Hi,
>>
>> I'm seeing something weird which I don't want to report in a issue yet
>> in the tracker.
>>
>> I suffered a pretty big crash which caused me to loose 15 of my 40 OSDs.
>> Some died because my WD Green 2TB disks failed, others since btrfs
>> didn't survive the hard reboot I had to give them.
>>
>> When watching 'ceph -w' I noticed this:
>>
>> 2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
>> active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 active
>> +clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
>> crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
>> 46575 GB avail; 8255/3391107 degraded (0.243%)
>> 2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up {0=dga=up:replay(laggy
>> or crashed)}
>> 2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
>> 2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
>> [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
>> 2011-08-05 13:32:03.225964   mon e1: 1 mons at
>> {dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
>> 2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)
>>
>> Here you see that osd5 is reporting that osd37 is down and osd7 is
>> reporting that osd19 is down.
>>
>> These messages keep being repeated, over and over. My 'debug ms' is at
>> 1, so I took a look at the logs:
>>
>> Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>> 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
>> Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>> 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
>> Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>> 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
>> Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>> 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
>>
>> osd5 keeps claiming that osd19 is down, but when I check the logs I see
>> that osd19 is running and exchanging heartbeat messages with the other
>> OSDs in the cluster.
>>
>> What seems odd to me is that there are no log messages about osd7
>> claiming that osd37 is down, although it keeps being repeated by 'ceph
>> -w'.
>>
>> $ ceph osd injectargs 7 '--debug-ms 1'
>> $ ceph osd injectargs 37 '--debug-ms 1'
>>
>> I made sure that was injected, but still, no log messages.
>>
>> Going through the logs a bit further (after setting the logging to 20) I
>> saw osd.19 reporting:
>>
>> Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740>>
>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 sd=56
>> pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61 data=0
>> off 0
>>
>> It seems that it is communicating with osd5, but osd5 does not seem to
>> communicate with osd19?
>>
>> The log also confirms that:
>>
>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
>> 0x6948000 con 0x1cd7710
>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
>> osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
>> v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506>>
>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13 pgs=178
>> cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1
>>
>> Now, to prevent (I tried...) my cluster from bouncing around I placed
>> this in the OSD section:
>>
>>      osd min down reporters = 3
>>      osd min down reports = 5
>>
>> Since only osd5 and osd7 are sending out these messages osd19 nor osd37
>> get marked down, but what is going on here?
>>
>> I haven't restarted the OSDs yet.
>>
>> I noticed that the OSD heartbeat process is being revised currently, but
>> I thought it might be helpful to report this.
>>
>> Wido
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
  2012-02-15 14:12   ` Wido den Hollander
@ 2012-02-22 18:08     ` Gregory Farnum
  2012-02-23 10:45       ` Wido den Hollander
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2012-02-22 18:08 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel

Wido,
Sorry we lost track of this last week — we were all distracted by FAST 12! :)

So it looks like they're both on the same map and osd.4 is sending
pings to osd.19, but osd.19 is just ignoring them? Or do you really
have on debug_os and not debug_osd? :)
-Greg

On Wed, Feb 15, 2012 at 6:12 AM, Wido den Hollander <wido@widodh.nl> wrote:
> Hi,
>
>
> On 08/10/2011 12:30 PM, Wido den Hollander wrote:
>>
>> Hi,
>>
>> I tried reproducing this today with the new heartbeat code, couldn't.
>>
>> The new heartbeat code seems to work much better under load, I haven't
>> seen any bouncing like I saw before.
>
>
> I'm still observing this bouncing, after a couple of days running my cluster
> the OSD's started to say they are down again.
>
> [INF] osd.19 [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6809/14435 failed (by
> osd.4 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6800/15389
>
> [INF] osd.8 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6806/15035 failed (by
> osd.25 [2a00:f10:11b:cef0:225:90ff:fe33:49ca]:6803/14344
>
> [INF] osd.22 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6806/14527 failed (by
> osd.13 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6803/14188
>
> It's the same again, a couple of OSD's don't like eachother anymore and
> start marking the other as down.
>
> My ceph.conf: http://zooi.widodh.nl/ceph/ceph.conf
>
> I'm taking osd.19 and osd.4 here as example.
>
> Some logs on Pastebin: http://pastebin.com/EYUHbPua
>
> All systems are idle, so this can't be an issue of limited CPU cycles.
>
> As this seems to keep coming back, what steps should we take next?
>
> debug_ms and os are on 20.
>
> What I noticed is that osd.4 never says: "osd.19 already has epoch 19", but
> that is I guess since it thinks osd.19 is down?
>
> On osd.4 I never see I connection coming in from osd.19. I'm trying to
> figure out which of the two is wrong here.
>
> Any suggestions?
>
> Wido
>
>
>>
>> Wido
>>
>> On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
>>>
>>> Hi,
>>>
>>> I'm seeing something weird which I don't want to report in a issue yet
>>> in the tracker.
>>>
>>> I suffered a pretty big crash which caused me to loose 15 of my 40 OSDs.
>>> Some died because my WD Green 2TB disks failed, others since btrfs
>>> didn't survive the hard reboot I had to give them.
>>>
>>> When watching 'ceph -w' I noticed this:
>>>
>>> 2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
>>> active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 active
>>> +clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
>>> crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
>>> 46575 GB avail; 8255/3391107 degraded (0.243%)
>>> 2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up {0=dga=up:replay(laggy
>>> or crashed)}
>>> 2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
>>> 2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
>>> [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
>>> 2011-08-05 13:32:03.225964   mon e1: 1 mons at
>>> {dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
>>> 2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)
>>>
>>> Here you see that osd5 is reporting that osd37 is down and osd7 is
>>> reporting that osd19 is down.
>>>
>>> These messages keep being repeated, over and over. My 'debug ms' is at
>>> 1, so I took a look at the logs:
>>>
>>> Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
>>> Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
>>> Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
>>> Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
>>>
>>> osd5 keeps claiming that osd19 is down, but when I check the logs I see
>>> that osd19 is running and exchanging heartbeat messages with the other
>>> OSDs in the cluster.
>>>
>>> What seems odd to me is that there are no log messages about osd7
>>> claiming that osd37 is down, although it keeps being repeated by 'ceph
>>> -w'.
>>>
>>> $ ceph osd injectargs 7 '--debug-ms 1'
>>> $ ceph osd injectargs 37 '--debug-ms 1'
>>>
>>> I made sure that was injected, but still, no log messages.
>>>
>>> Going through the logs a bit further (after setting the logging to 20) I
>>> saw osd.19 reporting:
>>>
>>> Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740>>
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 sd=56
>>> pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61 data=0
>>> off 0
>>>
>>> It seems that it is communicating with osd5, but osd5 does not seem to
>>> communicate with osd19?
>>>
>>> The log also confirms that:
>>>
>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
>>> 0x6948000 con 0x1cd7710
>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
>>> osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
>>> v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506>>
>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13 pgs=178
>>> cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1
>>>
>>> Now, to prevent (I tried...) my cluster from bouncing around I placed
>>> this in the OSD section:
>>>
>>>     osd min down reporters = 3
>>>     osd min down reports = 5
>>>
>>> Since only osd5 and osd7 are sending out these messages osd19 nor osd37
>>> get marked down, but what is going on here?
>>>
>>> I haven't restarted the OSDs yet.
>>>
>>> I noticed that the OSD heartbeat process is being revised currently, but
>>> I thought it might be helpful to report this.
>>>
>>> Wido
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
  2012-02-22 18:08     ` Gregory Farnum
@ 2012-02-23 10:45       ` Wido den Hollander
  2012-02-24  5:18         ` Gregory Farnum
  0 siblings, 1 reply; 11+ messages in thread
From: Wido den Hollander @ 2012-02-23 10:45 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel

Hi,


On 02/22/2012 07:08 PM, Gregory Farnum wrote:
> Wido,
> Sorry we lost track of this last week — we were all distracted by FAST 12! :)
>
No problem!

> So it looks like they're both on the same map and osd.4 is sending
> pings to osd.19, but osd.19 is just ignoring them? Or do you really
> have on debug_os and not debug_osd? :)

That was a typo, I have debug_osd set to 20.

I haven't rebooted the OSD's since and now osd.4 and osd.19 are not 
complaining anymore, but it's now a different set of OSD's who are 
saying the other one is down.

I'm still running v0.41 btw. I'm not going to touch the cluster until 
this one is tracked down, it keeps coming back.

Suggestions?

Wido

> -Greg
>
> On Wed, Feb 15, 2012 at 6:12 AM, Wido den Hollander<wido@widodh.nl>  wrote:
>> Hi,
>>
>>
>> On 08/10/2011 12:30 PM, Wido den Hollander wrote:
>>>
>>> Hi,
>>>
>>> I tried reproducing this today with the new heartbeat code, couldn't.
>>>
>>> The new heartbeat code seems to work much better under load, I haven't
>>> seen any bouncing like I saw before.
>>
>>
>> I'm still observing this bouncing, after a couple of days running my cluster
>> the OSD's started to say they are down again.
>>
>> [INF] osd.19 [2a00:f10:11b:cef0:225:90ff:fe33:49cc]:6809/14435 failed (by
>> osd.4 [2a00:f10:11b:cef0:225:90ff:fe32:cf64]:6800/15389
>>
>> [INF] osd.8 [2a00:f10:11b:cef0:225:90ff:fe33:49f2]:6806/15035 failed (by
>> osd.25 [2a00:f10:11b:cef0:225:90ff:fe33:49ca]:6803/14344
>>
>> [INF] osd.22 [2a00:f10:11b:cef0:225:90ff:fe33:497c]:6806/14527 failed (by
>> osd.13 [2a00:f10:11b:cef0:225:90ff:fe33:49b0]:6803/14188
>>
>> It's the same again, a couple of OSD's don't like eachother anymore and
>> start marking the other as down.
>>
>> My ceph.conf: http://zooi.widodh.nl/ceph/ceph.conf
>>
>> I'm taking osd.19 and osd.4 here as example.
>>
>> Some logs on Pastebin: http://pastebin.com/EYUHbPua
>>
>> All systems are idle, so this can't be an issue of limited CPU cycles.
>>
>> As this seems to keep coming back, what steps should we take next?
>>
>> debug_ms and os are on 20.
>>
>> What I noticed is that osd.4 never says: "osd.19 already has epoch 19", but
>> that is I guess since it thinks osd.19 is down?
>>
>> On osd.4 I never see I connection coming in from osd.19. I'm trying to
>> figure out which of the two is wrong here.
>>
>> Any suggestions?
>>
>> Wido
>>
>>
>>>
>>> Wido
>>>
>>> On Fri, 2011-08-05 at 14:11 +0200, Wido den Hollander wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm seeing something weird which I don't want to report in a issue yet
>>>> in the tracker.
>>>>
>>>> I suffered a pretty big crash which caused me to loose 15 of my 40 OSDs.
>>>> Some died because my WD Green 2TB disks failed, others since btrfs
>>>> didn't survive the hard reboot I had to give them.
>>>>
>>>> When watching 'ceph -w' I noticed this:
>>>>
>>>> 2011-08-05 13:32:03.209958    pg v504313: 8008 pgs: 1 inactive, 334
>>>> active+clean, 10 peering, 1 crashed+peering, 14 down+peering, 7 active
>>>> +clean+degraded, 7624 crashed+down+peering, 8 down+degraded+peering, 9
>>>> crashed+down+degraded+peering; 4417 GB data, 13121 GB used, 32504 GB /
>>>> 46575 GB avail; 8255/3391107 degraded (0.243%)
>>>> 2011-08-05 13:32:03.225712   mds e6033: 1/1/1 up {0=dga=up:replay(laggy
>>>> or crashed)}
>>>> 2011-08-05 13:32:03.225747   osd e9934: 40 osds: 25 up, 25 in
>>>> 2011-08-05 13:32:03.225874   log 2011-08-05 13:32:01.646596 mon0
>>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29796 : [INF] osd37
>>>> [2a00:f10:113:1:225:90ff:fe33:49a4]:6803/17881 failed (by osd7
>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6806/2746)
>>>> 2011-08-05 13:32:03.225964   mon e1: 1 mons at
>>>> {dga=[2a00:f10:113:1:230:48ff:fed3:b086]:6789/0}
>>>> 2011-08-05 13:32:03.929650   log 2011-08-05 13:32:03.603681 mon0
>>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 29797 : [INF] osd19
>>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 failed (by osd5
>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506)
>>>>
>>>> Here you see that osd5 is reporting that osd37 is down and osd7 is
>>>> reporting that osd19 is down.
>>>>
>>>> These messages keep being repeated, over and over. My 'debug ms' is at
>>>> 1, so I took a look at the logs:
>>>>
>>>> Aug  5 13:32:41 atom1 osd.5[2507]: 7f75227fc700 osd5 9963
>>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:21.916364)
>>>> Aug  5 13:32:42 atom1 osd.5[2507]: 7f7531579700 osd5 9963
>>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:22.381632)
>>>> Aug  5 13:32:43 atom1 osd.5[2507]: 7f7531579700 osd5 9965
>>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:23.474198)
>>>> Aug  5 13:32:44 atom1 osd.5[2507]: 7f75227fc700 osd5 9965
>>>> heartbeat_check: no heartbeat from osd19 since 2011-08-04
>>>> 17:01:02.504942 (cutoff 2011-08-05 13:32:24.027161)
>>>>
>>>> osd5 keeps claiming that osd19 is down, but when I check the logs I see
>>>> that osd19 is running and exchanging heartbeat messages with the other
>>>> OSDs in the cluster.
>>>>
>>>> What seems odd to me is that there are no log messages about osd7
>>>> claiming that osd37 is down, although it keeps being repeated by 'ceph
>>>> -w'.
>>>>
>>>> $ ceph osd injectargs 7 '--debug-ms 1'
>>>> $ ceph osd injectargs 37 '--debug-ms 1'
>>>>
>>>> I made sure that was injected, but still, no log messages.
>>>>
>>>> Going through the logs a bit further (after setting the logging to 20) I
>>>> saw osd.19 reporting:
>>>>
>>>> Aug  5 13:48:23 atom4 osd.19[2741]: 7f634ca8e700 --
>>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6810/2740>>
>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6808/2506 pipe(0x7f64e4193ec0 sd=56
>>>> pgs=1530 cs=1 l=0).reader got envelope type=70 src osd5 front=61 data=0
>>>> off 0
>>>>
>>>> It seems that it is communicating with osd5, but osd5 does not seem to
>>>> communicate with osd19?
>>>>
>>>> The log also confirms that:
>>>>
>>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 -->
>>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 -- osd_failure(osd19
>>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1 -- ?+0
>>>> 0x6948000 con 0x1cd7710
>>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f7531579700 --
>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506 submit_message
>>>> osd_failure(osd19 [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020
>>>> v10020) v1 remote [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0
>>>> Aug  5 14:10:05 atom1 osd.5[2507]: 7f75365a1700 --
>>>> [2a00:f10:113:1:225:90ff:fe32:cf64]:6800/2506>>
>>>> [2a00:f10:113:1:230:48ff:fed3:b086]:6789/0 pipe(0x1cd74a0 sd=13 pgs=178
>>>> cs=1 l=1).writer encoding 17093 0x6948000 osd_failure(osd19
>>>> [2a00:f10:113:1:225:90ff:fe33:49cc]:6806/2740 e10020 v10020) v1
>>>>
>>>> Now, to prevent (I tried...) my cluster from bouncing around I placed
>>>> this in the OSD section:
>>>>
>>>>      osd min down reporters = 3
>>>>      osd min down reports = 5
>>>>
>>>> Since only osd5 and osd7 are sending out these messages osd19 nor osd37
>>>> get marked down, but what is going on here?
>>>>
>>>> I haven't restarted the OSDs yet.
>>>>
>>>> I noticed that the OSD heartbeat process is being revised currently, but
>>>> I thought it might be helpful to report this.
>>>>
>>>> Wido
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
  2012-02-23 10:45       ` Wido den Hollander
@ 2012-02-24  5:18         ` Gregory Farnum
  2012-02-28 15:42           ` Wido den Hollander
  0 siblings, 1 reply; 11+ messages in thread
From: Gregory Farnum @ 2012-02-24  5:18 UTC (permalink / raw)
  To: Wido den Hollander; +Cc: ceph-devel, Sage Weil

On Thu, Feb 23, 2012 at 2:45 AM, Wido den Hollander <wido@widodh.nl> wrote:
> Hi,
>
>
>
> On 02/22/2012 07:08 PM, Gregory Farnum wrote:
>>
>> Wido,
>> Sorry we lost track of this last week — we were all distracted by FAST 12!
>> :)
>>
> No problem!
>
>
>> So it looks like they're both on the same map and osd.4 is sending
>> pings to osd.19, but osd.19 is just ignoring them? Or do you really
>> have on debug_os and not debug_osd? :)
>
>
> That was a typo, I have debug_osd set to 20.
>
> I haven't rebooted the OSD's since and now osd.4 and osd.19 are not
> complaining anymore, but it's now a different set of OSD's who are saying
> the other one is down.
>
> I'm still running v0.41 btw. I'm not going to touch the cluster until this
> one is tracked down, it keeps coming back.
>
> Suggestions?

Well, like Sage said long ago, this will be easiest to diagnose if
there are logs available for both OSDs that cover the entire time
after one requested heartbeats from the other.

If you do have these and can post them somewhere, I'm sure Sage or I
will find it interesting enough to look through...  ;)
If not, I'm out of ideas, although I'm not super-familiar with the
heartbeat code since Sage rewrote it so we may be able to come up with
something if we discuss it more.
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: Repeated messages of "heartbeat_check: no heartbeat from"
  2012-02-24  5:18         ` Gregory Farnum
@ 2012-02-28 15:42           ` Wido den Hollander
  0 siblings, 0 replies; 11+ messages in thread
From: Wido den Hollander @ 2012-02-28 15:42 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: ceph-devel, Sage Weil

Hi,

On 02/24/2012 06:18 AM, Gregory Farnum wrote:
> On Thu, Feb 23, 2012 at 2:45 AM, Wido den Hollander<wido@widodh.nl>  wrote:
>> Hi,
>>
>>
>>
>> On 02/22/2012 07:08 PM, Gregory Farnum wrote:
>>>
>>> Wido,
>>> Sorry we lost track of this last week — we were all distracted by FAST 12!
>>> :)
>>>
>> No problem!
>>
>>
>>> So it looks like they're both on the same map and osd.4 is sending
>>> pings to osd.19, but osd.19 is just ignoring them? Or do you really
>>> have on debug_os and not debug_osd? :)
>>
>>
>> That was a typo, I have debug_osd set to 20.
>>
>> I haven't rebooted the OSD's since and now osd.4 and osd.19 are not
>> complaining anymore, but it's now a different set of OSD's who are saying
>> the other one is down.
>>
>> I'm still running v0.41 btw. I'm not going to touch the cluster until this
>> one is tracked down, it keeps coming back.
>>
>> Suggestions?
>
> Well, like Sage said long ago, this will be easiest to diagnose if
> there are logs available for both OSDs that cover the entire time
> after one requested heartbeats from the other.
>
> If you do have these and can post them somewhere, I'm sure Sage or I
> will find it interesting enough to look through...  ;)
> If not, I'm out of ideas, although I'm not super-familiar with the
> heartbeat code since Sage rewrote it so we may be able to come up with
> something if we discuss it more.
> -Greg

I created an issue for this with logs attached: 
http://tracker.newdream.net/issues/2116

Thanks,

Wido

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2012-02-28 15:42 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-08-05 12:11 Repeated messages of "heartbeat_check: no heartbeat from" Wido den Hollander
2011-08-10 10:30 ` Wido den Hollander
2011-10-14 14:26   ` Wido den Hollander
2011-10-14 15:16     ` Sage Weil
     [not found]       ` <4E9C4F45.8030704@widodh.nl>
2011-10-25 23:06         ` Sage Weil
2011-10-27  6:20           ` Wido den Hollander
2012-02-15 14:12   ` Wido den Hollander
2012-02-22 18:08     ` Gregory Farnum
2012-02-23 10:45       ` Wido den Hollander
2012-02-24  5:18         ` Gregory Farnum
2012-02-28 15:42           ` Wido den Hollander

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.