Suicide

All of lore.kernel.org
 help / color / mirror / Atom feed

* Suicide
@ 2011-04-15 14:48 Zenon Panoussis
  2011-04-15 15:16 ` Suicide Zenon Panoussis
                   ` (2 more replies)
  0 siblings, 3 replies; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-15 14:48 UTC (permalink / raw)
  To: ceph-devel

Hello again.

A new day, a new problem:

ceph version 0.26.commit: 9981ff90968398da43c63106694d661f5e3d07d5. process: cmds. pid: 8613
2011-04-15 15:50:11.945105 7fd081c07700 mds-1.0 ms_handle_connect on 192.168.178.100:6789/0
2011-04-15 15:50:16.277166 7fd081c07700 mds-1.0 handle_mds_map standby
2011-04-15 15:50:17.662254 7fd081c07700 mds-1.0 handle_mds_map standby
2011-04-15 15:50:33.050509 7fd081c07700 mds0.41 handle_mds_map i am now mds0.41
2011-04-15 15:50:33.050614 7fd081c07700 mds0.41 handle_mds_map state change up:standby --> up:replay
2011-04-15 15:50:33.050812 7fd081c07700 mds0.41 replay_start
2011-04-15 15:50:33.050908 7fd081c07700 mds0.41  recovery set is
2011-04-15 15:50:33.050962 7fd081c07700 mds0.41  need osdmap epoch 184, have 178
2011-04-15 15:50:33.051143 7fd081c07700 mds0.cache handle_mds_failure mds0 : recovery peers are
2011-04-15 15:50:33.057263 7fd081c07700 mds0.41 ms_handle_connect on 192.168.178.100:6801/8729
2011-04-15 15:50:33.071290 7fd081c07700 mds0.41 ms_handle_connect on 192.168.178.101:6801/4572
2011-04-15 15:50:33.972268 7fd081c07700 mds0.cache creating system inode with ino:100
2011-04-15 15:50:33.972783 7fd081c07700 mds0.cache creating system inode with ino:1
2011-04-15 15:50:44.568883 7fd0735fc700 mds0.journaler try_read_entry got 0 len entry at offset 390071095
2011-04-15 15:50:44.569039 7fd0735fc700 mds0.log _replay journaler got error -22, aborting
2011-04-15 15:50:44.576781 7fd0735fc700 mds0.41 boot_start encountered an error, failing
2011-04-15 15:50:44.576871 7fd0735fc700 mds0.41 suicide.  wanted up:replay, now down:dne
2011-04-15 15:50:44.580226 7fd082c0a720 stopped.

Last night I mounted the ceph file system OK and started copying data to it.
Logging was at the sample.ceph.conf defaults, so pretty hefty. The logs go
to /var/log and the journal is in /var, which in this case are on different
partitions. At some point copying hung. It was really very late, so instead
of troubleshooting I switched off the machines and went to bed.

Today I tried to mount the ceph file system and failed, several times. I then
found that /var/log was full to the last byte on (only) one of my two nodes.
I cleaned up and restarted ceph, but trying to mount the file system continued
giving me "error 5" and the mds log entries above. It looks very much like issue
451 reproduced, but I don't grasp the logic of it. It would make more sense if
the journal had been on the full partition, but it wasn't.

cmds [-i node01|mds.node01|0] -c /etc/ceph/ceph.conf --journal_check -f
returns a usage message and no error, no matter if/how I specify the name
and/or -m IP:6789. What's the correct syntax for it?

The full logs (what's now left of them) are at
http://www.provocation.net/tmp/mds_logs.tar.bz2 . Is there anything I can do
to solve this, apart from re-creating the file system?

ceph 0.26, kernel 2.38.2, ext4, ceph.conf as posted yesterday in the "Mounting"
thread but without auth.

Z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-15 14:48 Suicide Zenon Panoussis
@ 2011-04-15 15:16 ` Zenon Panoussis
  2011-04-15 15:38 ` Suicide Zenon Panoussis
  2011-04-15 20:21 ` Suicide Gregory Farnum
  2 siblings, 0 replies; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-15 15:16 UTC (permalink / raw)
  To: ceph-devel

BTW, man cmds says "--journal_check" and cmds says

usage: cmds -i name [flags] [[--journal_check]|[--hot-standby][rank]]
<snip>
  --journal-check rank
        replay the journal for rank, then exit

with underscore in the summary and dash in the option details. Trying all
possibble combinations I found the right syntax:

cmds -i node01 --journal-check 0

The problem wasn't dash or underscore; both of them work. Omitting the rank
is what doesn't, so that's an error in the man page: [--journal_check [rank]]
should be [--journal_check rank].

Anyway, cmds --journal-check on both nodes didn't fix the problem. I also
tried --journal-reset. It returned

2011-04-15 17:07:44.494238 7f372f185720 -- 0.0.0.0:6804/11266 messenger.start BUG: there
are 1 child threads already started that will now die!  call messenger.start() sooner.

but didn't fix the suicide either.

Z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-15 14:48 Suicide Zenon Panoussis
  2011-04-15 15:16 ` Suicide Zenon Panoussis
@ 2011-04-15 15:38 ` Zenon Panoussis
  2011-04-15 20:21 ` Suicide Gregory Farnum
  2 siblings, 0 replies; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-15 15:38 UTC (permalink / raw)
  To: ceph-devel


Sorry for spamming. I found http://irclogs.ceph.widodh.nl/index.php?date=2011-01-07
and put my journals at http://www.provocation.net/tmp/journals.tar.bz2 .

Z



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-15 14:48 Suicide Zenon Panoussis
  2011-04-15 15:16 ` Suicide Zenon Panoussis
  2011-04-15 15:38 ` Suicide Zenon Panoussis
@ 2011-04-15 20:21 ` Gregory Farnum
  2011-04-15 22:38   ` Suicide Zenon Panoussis
  2 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2011-04-15 20:21 UTC (permalink / raw)
  To: Zenon Panoussis; +Cc: ceph-devel

I looked at your logs and they're very interesting. I suspect the reason your journal broke is because your logging partition filled up, although I can't be sure.
We have seen a similar error once before but were unable to diagnose it because there wasn't enough logging, and it looks like we don't have enough here either. The real problem isn't the failure to finish replay (that's a symptom), but the crash on node01 at 4:35:16.

Can you give me more details about how your cluster was behaving at that time? I can't quite match up what I'm seeing in the logs with your description -- looks like node02 ran out of disk space for logging at ~4:06:30 but I can't see from the logs why they were passing leadership back and forth, or why nothing happened between 4:06 and 4:33.

I'll take a look at the documentation bits later today, too.
-Greg
On Friday, April 15, 2011 at 7:48 AM, Zenon Panoussis wrote:

> Hello again.
> 
> A new day, a new problem:
> 
> ceph version 0.26.commit: 9981ff90968398da43c63106694d661f5e3d07d5. process: cmds. pid: 8613
> 2011-04-15 15:50:11.945105 7fd081c07700 mds-1.0 ms_handle_connect on 192.168.178.100:6789/0
> 2011-04-15 15:50:16.277166 7fd081c07700 mds-1.0 handle_mds_map standby
> 2011-04-15 15:50:17.662254 7fd081c07700 mds-1.0 handle_mds_map standby
> 2011-04-15 15:50:33.050509 7fd081c07700 mds0.41 handle_mds_map i am now mds0.41
> 2011-04-15 15:50:33.050614 7fd081c07700 mds0.41 handle_mds_map state change up:standby --> up:replay
> 2011-04-15 15:50:33.050812 7fd081c07700 mds0.41 replay_start
> 2011-04-15 15:50:33.050908 7fd081c07700 mds0.41 recovery set is
> 2011-04-15 15:50:33.050962 7fd081c07700 mds0.41 need osdmap epoch 184, have 178
> 2011-04-15 15:50:33.051143 7fd081c07700 mds0.cache handle_mds_failure mds0 : recovery peers are
> 2011-04-15 15:50:33.057263 7fd081c07700 mds0.41 ms_handle_connect on 192.168.178.100:6801/8729
> 2011-04-15 15:50:33.071290 7fd081c07700 mds0.41 ms_handle_connect on 192.168.178.101:6801/4572
> 2011-04-15 15:50:33.972268 7fd081c07700 mds0.cache creating system inode with ino:100
> 2011-04-15 15:50:33.972783 7fd081c07700 mds0.cache creating system inode with ino:1
> 2011-04-15 15:50:44.568883 7fd0735fc700 mds0.journaler try_read_entry got 0 len entry at offset 390071095
> 2011-04-15 15:50:44.569039 7fd0735fc700 mds0.log _replay journaler got error -22, aborting
> 2011-04-15 15:50:44.576781 7fd0735fc700 mds0.41 boot_start encountered an error, failing
> 2011-04-15 15:50:44.576871 7fd0735fc700 mds0.41 suicide. wanted up:replay, now down:dne
> 2011-04-15 15:50:44.580226 7fd082c0a720 stopped.
> 
> Last night I mounted the ceph file system OK and started copying data to it.
> Logging was at the sample.ceph.conf defaults, so pretty hefty. The logs go
> to /var/log and the journal is in /var, which in this case are on different
> partitions. At some point copying hung. It was really very late, so instead
> of troubleshooting I switched off the machines and went to bed.
> 
> Today I tried to mount the ceph file system and failed, several times. I then
> found that /var/log was full to the last byte on (only) one of my two nodes.
> I cleaned up and restarted ceph, but trying to mount the file system continued
> giving me "error 5" and the mds log entries above. It looks very much like issue
> 451 reproduced, but I don't grasp the logic of it. It would make more sense if
> the journal had been on the full partition, but it wasn't.
> 
> cmds [-i node01|mds.node01|0] -c /etc/ceph/ceph.conf --journal_check -f
> returns a usage message and no error, no matter if/how I specify the name
> and/or -m IP:6789. What's the correct syntax for it?
> 
> The full logs (what's now left of them) are at
> http://www.provocation.net/tmp/mds_logs.tar.bz2 . Is there anything I can do
> to solve this, apart from re-creating the file system?
> 
> ceph 0.26, kernel 2.38.2, ext4, ceph.conf as posted yesterday in the "Mounting"
> thread but without auth.
> 
> Z
> 
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-15 20:21 ` Suicide Gregory Farnum
@ 2011-04-15 22:38   ` Zenon Panoussis
  2011-04-15 23:06     ` Suicide Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-15 22:38 UTC (permalink / raw)
  To: ceph-devel

On 04/15/2011 10:21 PM, Gregory Farnum wrote:

> Can you give me more details about how your cluster was behaving at 
> that time? I can't quite match up what I'm seeing in the logs with 
> your description -- looks like node02 ran out of disk space for 
> logging at ~4:06:30 but I can't see from the logs why they were 
> passing leadership back and forth...

I have a hunch what might have happened, but please keep in mind that
I don't know ceph at all, so I'm much more likely to bark up a wrong
tree than the right one.

node01 is a Fujitsu Siemens C620, P4 3GHz, 2 GB RAM retrofitted with a
40-euro 2,5" SATA disk. It's a small and silent little box perfect for
testing stuff on, but not really a great performer. It runs the ceph
monitor and the only client. It looks like this:

[root@node01 ~]# df -m
Filesystem           1M-blocks      Used Available Use% Mounted on
/dev/sda1                15019      8267      5990  58% /
tmpfs                      993         0       993   0% /dev/shm
/dev/sda2                 7510      1308      5821  19% /var
/dev/sda5                15019       547     13710   4% /var/log
/dev/mapper/sda6        261029      1408    246362   1% /mnt/osd
192.168.178.100:6789:/
                        514984      2815    486009   1% /vol

where /mnt/osd is the raw ext4 store and /vol is the ceph-mount of it.

node02 is a Fujitsu Siemens RX-200 S2 with two dual-core 3GHz Xeon CPUs,
4 GB RAM and two SCSI 320 disks. If the amount of noise it makes is any
measure of its performance, it's in the LINPACK top 10.

My CRUSH rules say min_size 2 for all data types, meant to force a full
replication of everything to both nodes.

Last night I was copying an archived news spool of text-only newsgroups
to the ceph store with 'rsync -a spooldir node01:/vol/'. I have just over
a million files there, most of them between 1 and 2 KB in size. I noticed
a huge load on node01, top reporting a load of 15 to 25, but I let it grind
on anyway, until rsync got stuck and I gave up and went to bed. I never
checked the load on node02, but we can take it for granted that it was
way much lower.

Now, the cause of this tremendous load is a separate issue in itself, but
my hunch is that it might have triggered the chain of events that lead to
journal corruption. At 04:28 last night I wrote the following to this list,
but saved it as a draft instead of sending it because I wasn't sure I was
interpreting the log correctly:

===
2011-04-15 04:14:29.810679    pg v1605: 528 pgs: 528 active+clean; 452 MB data, 2011 MB used, 475 GB / 502 GB avail
2011-04-15 04:14:34.937142   log 2011-04-15 04:14:34.581267 mon0 192.168.178.100:6789/0 18 : [INF] osd0 192.168.178.100:6801/2615 failed (by osd1 192.168.178.101:6801/4630)
2011-04-15 04:14:35.118217    pg v1606: 528 pgs: 528 active+clean; 452 MB data, 2011 MB used, 475 GB / 502 GB avail
2011-04-15 04:14:39.899001   log 2011-04-15 04:14:39.645393 mon0 192.168.178.100:6789/0 19 : [INF] osd0 192.168.178.100:6801/2615 failed (by osd1 192.168.178.101:6801/4630)
2011-04-15 04:14:40.046747    pg v1607: 528 pgs: 528 active+clean; 453 MB data, 2011 MB used, 475 GB / 502 GB avail

While external host X is copying files to ceph node 0 which replicates on
node 1, node 1 claims that node 0 is dead, which it it isn't, and node 0
believes it and goes belly up.
===

Reading a bit of http://ceph.newdream.net/weil-thesis.pdf today, I noticed
this (page 127):

 The cluster initially elects a leader to serialize map updates and manage
 consistency. Once elected, the leader begins by requesting the map epochs
 stored by each monitor. Monitors have a fixed amount of time T (currently
 two seconds) to respond to the probe and join the quorum. The leader ensures
 that a majority of the monitors are active and that it has the most recent
 map epoch (requesting incremental updates from other monitors as necessary),
 and then begins distributing short-term leases to active monitors.

So I'm thinking now, what if node02, which I assume was alert and responsive,
was querying node01 and node01, because of the huge load that it was under,
was not responding within 2 seconds? Would node02 then tell all other nodes
that node01 is dead and node01 actually believe it, although the obituary was
about itself and it very well knew that it was alive and struggling?

If that is what happened, it would explain the passing of control back and
fro. In my mind it still doesn't explain the journal corruption, but that
might just mean that I need to do more reading.

As for the load, I'm copying the same news spool again today and I found
that 'rsync -a --bwlimit=20 spooldir node01:/vol/' gives a steady load of
2-3 on node01 and just around 1 on node02. That 20 is KBps, a ridiculously
low transfer speed, but the only way to space apart ceph's write operations.

> ...or why nothing happened between 4:06 and 4:33.

I had half an eye on ceph while doing other unrelated things. I saw rsync
hanging, checked some, but not all the logs, wrote the unposted posting
above, and then gave up and called it a day. 04:06 is probably the time
when the log partition filled up on node02 and 04:33 is likely the time
when I stopped ceph and shut down the machines. I think everything was hung
during that time, or rather deadlocked.

Z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-15 22:38   ` Suicide Zenon Panoussis
@ 2011-04-15 23:06     ` Gregory Farnum
  2011-04-15 23:29       ` Suicide Zenon Panoussis
  0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2011-04-15 23:06 UTC (permalink / raw)
  To: Zenon Panoussis; +Cc: ceph-devel


On Friday, April 15, 2011 at 3:38 PM, Zenon Panoussis wrote: 
> 
> On 04/15/2011 10:21 PM, Gregory Farnum wrote:
> 
> > Can you give me more details about how your cluster was behaving at 
> > that time? I can't quite match up what I'm seeing in the logs with 
> > your description -- looks like node02 ran out of disk space for 
> > logging at ~4:06:30 but I can't see from the logs why they were 
> > passing leadership back and forth...
> 
> I have a hunch what might have happened, but please keep in mind that
> I don't know ceph at all, so I'm much more likely to bark up a wrong
> tree than the right one.
> 
> node01 is a Fujitsu Siemens C620, P4 3GHz, 2 GB RAM retrofitted with a
> 40-euro 2,5" SATA disk. It's a small and silent little box perfect for
> testing stuff on, but not really a great performer. It runs the ceph
> monitor and the only client. It looks like this:
> 
> [root@node01 ~]# df -m
> Filesystem 1M-blocks Used Available Use% Mounted on
> /dev/sda1 15019 8267 5990 58% /
> tmpfs 993 0 993 0% /dev/shm
> /dev/sda2 7510 1308 5821 19% /var
> /dev/sda5 15019 547 13710 4% /var/log
> /dev/mapper/sda6 261029 1408 246362 1% /mnt/osd
> 192.168.178.100:6789:/
>  514984 2815 486009 1% /vol
> 
> where /mnt/osd is the raw ext4 store and /vol is the ceph-mount of it.
> 
> node02 is a Fujitsu Siemens RX-200 S2 with two dual-core 3GHz Xeon CPUs,
> 4 GB RAM and two SCSI 320 disks. If the amount of noise it makes is any
> measure of its performance, it's in the LINPACK top 10.
> 
> My CRUSH rules say min_size 2 for all data types, meant to force a full
> replication of everything to both nodes.
> 
> Last night I was copying an archived news spool of text-only newsgroups
> to the ceph store with 'rsync -a spooldir node01:/vol/'. I have just over
> a million files there, most of them between 1 and 2 KB in size. I noticed
> a huge load on node01, top reporting a load of 15 to 25, but I let it grind
> on anyway, until rsync got stuck and I gave up and went to bed. I never
> checked the load on node02, but we can take it for granted that it was
> way much lower.
> 
> Now, the cause of this tremendous load is a separate issue in itself, but
> my hunch is that it might have triggered the chain of events that lead to
> journal corruption. At 04:28 last night I wrote the following to this list,
> but saved it as a draft instead of sending it because I wasn't sure I was
> interpreting the log correctly:
> 
> ===
> 2011-04-15 04:14:29.810679 pg v1605: 528 pgs: 528 active+clean; 452 MB data, 2011 MB used, 475 GB / 502 GB avail
> 2011-04-15 04:14:34.937142 log 2011-04-15 04:14:34.581267 mon0 192.168.178.100:6789/0 18 : [INF] osd0 192.168.178.100:6801/2615 failed (by osd1 192.168.178.101:6801/4630)
> 2011-04-15 04:14:35.118217 pg v1606: 528 pgs: 528 active+clean; 452 MB data, 2011 MB used, 475 GB / 502 GB avail
> 2011-04-15 04:14:39.899001 log 2011-04-15 04:14:39.645393 mon0 192.168.178.100:6789/0 19 : [INF] osd0 192.168.178.100:6801/2615 failed (by osd1 192.168.178.101:6801/4630)
> 2011-04-15 04:14:40.046747 pg v1607: 528 pgs: 528 active+clean; 453 MB data, 2011 MB used, 475 GB / 502 GB avail
> 
> While external host X is copying files to ceph node 0 which replicates on
> node 1, node 1 claims that node 0 is dead, which it it isn't, and node 0
> believes it and goes belly up.
> ===
> 
> Reading a bit of http://ceph.newdream.net/weil-thesis.pdf today, I noticed
> this (page 127):
> 
>  The cluster initially elects a leader to serialize map updates and manage
>  consistency. Once elected, the leader begins by requesting the map epochs
>  stored by each monitor. Monitors have a fixed amount of time T (currently
>  two seconds) to respond to the probe and join the quorum. The leader ensures
>  that a majority of the monitors are active and that it has the most recent
>  map epoch (requesting incremental updates from other monitors as necessary),
>  and then begins distributing short-term leases to active monitors.
> 
> So I'm thinking now, what if node02, which I assume was alert and responsive,
> was querying node01 and node01, because of the huge load that it was under,
> was not responding within 2 seconds? Would node02 then tell all other nodes
> that node01 is dead and node01 actually believe it, although the obituary was
> about itself and it very well knew that it was alive and struggling?

Well, you're confusing the MDS and the Monitor here, but if there was a heavy load that could still cause the active MDS to pass back and forth if it starts taking too long to handle requests.

> 
> If that is what happened, it would explain the passing of control back and
> fro. In my mind it still doesn't explain the journal corruption, but that
> might just mean that I need to do more reading.
> 
> As for the load, I'm copying the same news spool again today and I found
> that 'rsync -a --bwlimit=20 spooldir node01:/vol/' gives a steady load of
> 2-3 on node01 and just around 1 on node02. That 20 is KBps, a ridiculously
> low transfer speed, but the only way to space apart ceph's write operations.
> 
> > ...or why nothing happened between 4:06 and 4:33.
> 
> I had half an eye on ceph while doing other unrelated things. I saw rsync
> hanging, checked some, but not all the logs, wrote the unposted posting
> above, and then gave up and called it a day. 04:06 is probably the time
> when the log partition filled up on node02 and 04:33 is likely the time
> when I stopped ceph and shut down the machines. I think everything was hung
> during that time, or rather deadlocked.

Hmm, that timeline doesn't quite make sense -- node01 takes over the MDS duties at 4:33 and crashes, but then it starts up again at 4:50. But it's possible that node02 took over in the interval there and we just don't see it because the log disk was full (I had erroneously thought that a filled disk would hang the daemon but that turns out not to be the case). So I'd guess you shut everything down sometime after 5:08, and that would make sense.

Unfortunately what we're really interested in is what caused the assert failure on node01 at 4:35 and the reasons for that aren't available in the logs we have. :(
This is the second time we've seen that assert but we've not been able to reproduce it or figure out how the invariant that it's checking against got broken. If you like we can come up with a hacky fix that should let your cluster come back up, but it's possible that you'd lose some data and this is a very rare condition so if it's not a big deal I'd just re-create your cluster.
-Greg




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-15 23:06     ` Suicide Gregory Farnum
@ 2011-04-15 23:29       ` Zenon Panoussis
  2011-04-16  0:00         ` Suicide Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-15 23:29 UTC (permalink / raw)
  To: ceph-devel

On 04/16/2011 01:06 AM, Gregory Farnum wrote:

> Hmm, that timeline doesn't quite make sense -- node01 takes over the MDS 
> duties at 4:33 and crashes, but then it starts up again at 4:50. But it's 
> possible that node02 took over in the interval there and we just don't see 
> it because the log disk was full (I had erroneously thought that a filled 
> disk would hang the daemon but that turns out not to be the case). So I'd 
> guess you shut everything down sometime after 5:08, and that would make 
> sense.

Indeed, you're probably right.

> Unfortunately what we're really interested in is what caused the assert 
> failure on node01 at 4:35 and the reasons for that aren't available in the 
> logs we have. :(

> This is the second time we've seen that assert but we've not been able to 
> reproduce it or figure out how the invariant that it's checking against got 
> broken. If you like we can come up with a hacky fix that should let your 
> cluster come back up, but it's possible that you'd lose some data and this 
> is a very rare condition so if it's not a big deal I'd just re-create your 
> cluster.

My data has been safe elsewhere all along and I have already re-created the
cluster. In other words I don't need the hacky fix, but someone else might
be desperate for it in the future, so creating it could be a good idea anyway.

However, the cause of the corruption is still an open issue that ought to be
understood and solved. The most likely place to be able to reproduce it at is
right here, so if you think it's useful, I'm willing to try to crash it again.
If you want me to, let's make a plan for it. These are just test boxes and
I have no problem even giving you root on them, if that can help pinpoint
the cause of the corruption.

Z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-15 23:29       ` Suicide Zenon Panoussis
@ 2011-04-16  0:00         ` Gregory Farnum
  2011-04-16  9:53           ` Suicide Zenon Panoussis
  2011-04-16 23:50           ` Suicide Zenon Panoussis
  0 siblings, 2 replies; 18+ messages in thread
From: Gregory Farnum @ 2011-04-16  0:00 UTC (permalink / raw)
  To: Zenon Panoussis; +Cc: ceph-devel

On Friday, April 15, 2011 at 4:29 PM, Zenon Panoussis wrote:

> On 04/16/2011 01:06 AM, Gregory Farnum wrote:
> 
> > Hmm, that timeline doesn't quite make sense -- node01 takes over the MDS 
> > duties at 4:33 and crashes, but then it starts up again at 4:50. But it's 
> > possible that node02 took over in the interval there and we just don't see 
> > it because the log disk was full (I had erroneously thought that a filled 
> > disk would hang the daemon but that turns out not to be the case). So I'd 
> > guess you shut everything down sometime after 5:08, and that would make 
> > sense.
> 
> Indeed, you're probably right.
> 
> > Unfortunately what we're really interested in is what caused the assert 
> > failure on node01 at 4:35 and the reasons for that aren't available in the 
> > logs we have. :(
> 
> > This is the second time we've seen that assert but we've not been able to 
> > reproduce it or figure out how the invariant that it's checking against got 
> > broken. If you like we can come up with a hacky fix that should let your 
> > cluster come back up, but it's possible that you'd lose some data and this 
> > is a very rare condition so if it's not a big deal I'd just re-create your 
> > cluster.
> 
> My data has been safe elsewhere all along and I have already re-created the
> cluster. In other words I don't need the hacky fix, but someone else might
> be desperate for it in the future, so creating it could be a good idea anyway.

Well, we'd like to develop a proper fix in the form of our fsck tools. That, however, will take some time and a hacky fix isn't really safe so we don't want to put it out there and have people use it while thinking it's safe. ;) 


> However, the cause of the corruption is still an open issue that ought to be
> understood and solved. The most likely place to be able to reproduce it at is
> right here, so if you think it's useful, I'm willing to try to crash it again.
> If you want me to, let's make a plan for it. These are just test boxes and
> I have no problem even giving you root on them, if that can help pinpoint
> the cause of the corruption.
We'd love to reproduce it and track it down! Unfortunately, the two times we've seen it so far (you're the second one) are on external users who have very sparse logging. :(

If you want to try to reproduce it for us in a useful fashion we'd love that. You'll need to add debug output to your MDS config. At a minimum we will need "debug journaler = 20". You should also add "debug ms = 1" and probably "debug mds = 10". Be warned that this will use a LOT of disk space, though. If you ran out before you're going to do so again and we will really need the logs that generated the journal and the logs that were replaying the journal to figure out what happened, so you'll need to come up with some way of handling them (writing to a big NFS disk -- though that'll impact networking, different disk, log rotation, etc). Then try and reproduce your previous conditions as exactly as possible and see if you run into that assert again.

Thanks!
-Greg




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-16  0:00         ` Suicide Gregory Farnum
@ 2011-04-16  9:53           ` Zenon Panoussis
  2011-04-16 23:50           ` Suicide Zenon Panoussis
  1 sibling, 0 replies; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-16  9:53 UTC (permalink / raw)
  To: ceph-devel

On 04/16/2011 02:00 AM, Gregory Farnum wrote:

> You'll need to add debug output to your MDS config. At a minimum we will 
> need "debug journaler = 20". You should also add "debug ms = 1" and probably 
> "debug mds = 10". 

http://ceph.newdream.net/wiki/Debugging puts "debug journal" in the osd
section and "debug journaler" in the userspace clients section, but I don't
have any userspace clients; only the kernel modules. Just to make 100% sure
that I get it right, which debug levels should I put in which section(s)?

> Be warned that this will use a LOT of disk space, though. If you ran out before 
> you're going to do so again and we will really need the logs that generated the 
> journal and the logs that were replaying the journal to figure out what happened, 
> so you'll need to come up with some way of handling them (writing to a big NFS 
> disk -- though that'll impact networking, different disk, log rotation, etc). 
> Then try and reproduce your previous conditions as exactly as possible...

The two most probable causes of death were (a) irresponsivness because of
the load and (b) the filling up of the log partition.

Now, if it was the filling up of the logs and I now put them where they won't
fill up, then the error won't be reproduced. Then again, if I put them where
they will fill up, we won't have any node02 logs after they fill up, which is
precisely those we need.

On the other hand, if it was irresponsiveness that caused it, more logging
will lead to more I/O, more trashing of that poor 2,5" drive and yet higher
load, so we might run into the error before the logs fill up.

Z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-16  0:00         ` Suicide Gregory Farnum
  2011-04-16  9:53           ` Suicide Zenon Panoussis
@ 2011-04-16 23:50           ` Zenon Panoussis
  2011-04-17  0:14             ` Suicide Zenon Panoussis
  2011-04-18 16:40             ` Suicide Gregory Farnum
  1 sibling, 2 replies; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-16 23:50 UTC (permalink / raw)
  To: ceph-devel

On 04/16/2011 02:00 AM, Gregory Farnum wrote:

> If you want to try to reproduce it for us in a useful fashion we'd 
> love that. 

I think I reproduced it, kind of. I got the hang and "error 5" on mount,
but not the journal corruption. This is the timeline:

ca 19:19: Restarted ceph, mounted it, started rsync. There were already
          a few hundred thousand files on the ceph mount, forcing rsync
          to do a read/compare for them all for about 20 minutes while
          ceph was producing gigabytes of useless logs.

ca 19:50: rsync begun copying new files at full speed.

ca 19:56: rsync hung; node01 load 6+, node02 load 0.15, osd partitions
          (i.e. the underlying ext4 partitions) stuck at differing sizes
          and not growing.

ca 20:01: rsync still hanging, node01 load 7.80, node02 load 0.08, osd
          partitions still different and not growing.

ca 20:07: rsync still hanging, node01 load 8.00, node02 load 0.00, osd
          partitions still different and not growing. Killed rsync:

feednews/time-00/40/05/7bb1-4696
        1697 100%    3.61kB/s    0:00:00 (xfer#3469, to-check=2515/339251)
feednews/time-00/40/05/7bb2-4696
        1691 100%    3.59kB/s    0:00:00 (xfer#3470, to-check=2514/339251)
feednews/time-00/40/05/7bb3-4696
        3476 100%    7.38kB/s    0:00:00 (xfer#3471, to-check=2513/339251)
feednews/time-00/40/05/7bb4-4696
^CKilled by signal 2.

ca 20:09: umount /ceph: device busy; lsof |grep /ceph stuck, not even
          responding to ^C.

ca 20:11: ca 20:09: umount /ceph OK, service stop -a ceph OK, osd partitions
          still at the same size as at 19:56.

ca 20:12: service ceph -a start OK, both osd partitions now started growing,
          mount mon:port /ceph gives error 5.

A couple of minutes later the osd partitions stabilized at the same size
and mounting ceph succeeded without problems.

I'm speculating that something deadlocked unrecoverably at 19:56, leaving
different data in the journals. Restarting ceph removed the deadlock and
allowed the journals to flush, restoring proper operation.

I no longer remember exactly what I did last time, but it's very likely
that I unmounted and stopped ceph before switching off the machines on
Thursday night, but kept restarting it on Friday when I couldn't mount it,
not giving it enough time to flush its journals between each failed mount
attempt and the next restart, until somehow the journals got corrupted.

You see above that the last file that appeared to be successfully transferred
by rsync this evening was feednews/time-00/40/05/7bb3-4696, which shows up
in the logs at 19:55:45.746146. However, the logs also say that another ~4000
files were transferred between rsync's apparent hang and my ^C. Those must be
the ones that ended up lingering in the journal. If they were written to the
journal but not to the OSD, I believe rsync should still have gotten a "write
OK" back from ceph, but it seems it didn't.

Anyway, trying to read and understand huge logs when you don't even know
what the log entries mean or what you're looking for, is very tiring and
not very efficient, so I'll call it a day for now.  I'd appreciate it if
you could give me some hints on what to look for, so I can go hunting again
in a more focused manner. In the meanwhile I've put the new logs at
http://www.provocation.net/tmp/logs_take_2.tar.bz2 for you.

Z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-16 23:50           ` Suicide Zenon Panoussis
@ 2011-04-17  0:14             ` Zenon Panoussis
  2011-04-18 16:40             ` Suicide Gregory Farnum
  1 sibling, 0 replies; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-17  0:14 UTC (permalink / raw)
  To: ceph-devel


On 04/17/2011 01:50 AM, I wrote:

> However, the logs also say that another ~4000
> files were transferred between rsync's apparent hang and my ^C. 

A new rsync file push shows that some - but very far from all - of
those files had to be pushed again, either because they didn't exist
on the ceph target or because of size/time/other attribute missmatches.

Z


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-16 23:50           ` Suicide Zenon Panoussis
  2011-04-17  0:14             ` Suicide Zenon Panoussis
@ 2011-04-18 16:40             ` Gregory Farnum
  2011-04-18 21:21               ` Suicide Gregory Farnum
  1 sibling, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2011-04-18 16:40 UTC (permalink / raw)
  To: Zenon Panoussis; +Cc: ceph-devel

On Sat, Apr 16, 2011 at 2:53 AM, Zenon Panoussis <oracle@provocation.net> wrote:
> http://ceph.newdream.net/wiki/Debugging puts "debug journal" in the osd
> section and "debug journaler" in the userspace clients section, but I don't
> have any userspace clients; only the kernel modules. Just to make 100% sure
> that I get it right, which debug levels should I put in which section(s)?
In this case it's referring to the client for the OSD, and the MDS is
an OSD client like any other. So it goes in the MDS section. :)

> The two most probable causes of death were (a) irresponsivness because of
> the load and (b) the filling up of the log partition.
>
> Now, if it was the filling up of the logs and I now put them where they won't
> fill up, then the error won't be reproduced. Then again, if I put them where
> they will fill up, we won't have any node02 logs after they fill up, which is
> precisely those we need.
>
> On the other hand, if it was irresponsiveness that caused it, more logging
> will lead to more I/O, more trashing of that poor 2,5" drive and yet higher
> load, so we might run into the error before the logs fill up.
You're right, but if the cause of the problem was the disk filling up
then we're basically screwed anyway. Also, the debug output is done
well enough that that's not really possible. So I don't think
providing enough space for the debug output will change anything in
those terms. :)



On Sat, Apr 16, 2011 at 4:50 PM, Zenon Panoussis <oracle@provocation.net> wrote:
>
>
> On 04/16/2011 02:00 AM, Gregory Farnum wrote:
>
>> If you want to try to reproduce it for us in a useful fashion we'd
>> love that.
>
> I think I reproduced it, kind of. I got the hang and "error 5" on mount,
> but not the journal corruption. This is the timeline:
Okay. I don't think you'll get the journal corruption again as it
really is quite rare. Though we still want to handle all the other
problems too, of course! ;)


> A couple of minutes later the osd partitions stabilized at the same size
> and mounting ceph succeeded without problems.
>
> I'm speculating that something deadlocked unrecoverably at 19:56, leaving
> different data in the journals. Restarting ceph removed the deadlock and
> allowed the journals to flush, restoring proper operation.
Some operations got hung for some reason. I'll check out your logs to
see if I can figure out what. That's why you couldn't umount -- due to
the hanging request the client couldn't flush out all its data so it
couldn't umount safely. It's also why you saw that some of the rsync
didn't complete -- you probably lost everything that was cached in
client-side buffers.

> I no longer remember exactly what I did last time, but it's very likely
> that I unmounted and stopped ceph before switching off the machines on
> Thursday night, but kept restarting it on Friday when I couldn't mount it,
> not giving it enough time to flush its journals between each failed mount
> attempt and the next restart, until somehow the journals got corrupted.
That's probably basically correct -- we have no idea how the invariant
got broken but unsuccessful restarts are the most likely scenario in
terms of what modifies the appropriate data.

> You see above that the last file that appeared to be successfully transferred
> by rsync this evening was feednews/time-00/40/05/7bb3-4696, which shows up
> in the logs at 19:55:45.746146. However, the logs also say that another ~4000
> files were transferred between rsync's apparent hang and my ^C. Those must be
> the ones that ended up lingering in the journal. If they were written to the
> journal but not to the OSD, I believe rsync should still have gotten a "write
> OK" back from ceph, but it seems it didn't.
I think you're misunderstanding the journal here -- there is an OSD
journal which is part of the OSD, and there's an MDS log which records
metadata changes but not full file data.
Anyway, if what you mean is you see 4k files transferred in the Ceph
logs then that's probably data being flushed out from client-side
buffers.

> Anyway, trying to read and understand huge logs when you don't even know
> what the log entries mean or what you're looking for, is very tiring and
> not very efficient, so I'll call it a day for now.  I'd appreciate it if
> you could give me some hints on what to look for, so I can go hunting again
> in a more focused manner. In the meanwhile I've put the new logs at
> http://www.provocation.net/tmp/logs_take_2.tar.bz2 for you.
Yep, I'm checking these out today. :)
Thanks!
-Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-18 16:40             ` Suicide Gregory Farnum
@ 2011-04-18 21:21               ` Gregory Farnum
  2011-04-18 22:38                 ` Suicide Zenon Panoussis
  0 siblings, 1 reply; 18+ messages in thread
From: Gregory Farnum @ 2011-04-18 21:21 UTC (permalink / raw)
  To: Zenon Panoussis; +Cc: ceph-devel

I looked through your logs a bit and noticed that the OSD on node01 is crashing due to high latencies on disk access (I think the defaults for this case are it asserts out if there's no progress after 10 minutes or something). 

Based on that, I pretty much have to guess that there's just too much stress on your disk and it's going to cause problems. You can try loosening the various configurable timeouts to let it run longer but it seems like really you just need beefier disks for the amount of stuff you're doing to them. IIRC you're running a monitor and an OSD on the same 2.5" physical disk, which means they're colliding on stuff like sync() calls.

This general slowness doesn't explain the mds log corruption, although it might be one of the trigger conditions. I added another assert in the Journaler code which might have caused the problem (though I don't think it could have) but don't have any other new ideas.
-Greg

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-18 21:21               ` Suicide Gregory Farnum
@ 2011-04-18 22:38                 ` Zenon Panoussis
  2011-04-18 23:02                   ` Suicide Gregory Farnum
  0 siblings, 1 reply; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-18 22:38 UTC (permalink / raw)
  To: ceph-devel

On 04/18/2011 11:21 PM, Gregory Farnum wrote:

> I looked through your logs a bit and noticed that the OSD on node01 is 
> crashing due to high latencies on disk access (I think the defaults for
> this case are it asserts out if there's no progress after 10 minutes or
> something). 

First of all, thank you for plowing through those huge logs. It's a feat
all and by itself.

Could you please post an example where you found the OSD crashing, so that
I and others know what log entries to look for?

> Based on that, I pretty much have to guess that there's just too much 
> stress on your disk and it's going to cause problems. You can try loosening
> the various configurable timeouts to let it run longer but it seems like
> really you just need beefier disks for the amount of stuff you're doing to
> them. 

My hardware is indeed very primitive, but in order to prevent this from
happening I would have to make sure that the disks always have more capacity
than the network. In a real-world setup, with gigabit or muti-gigabit
networking and multiple applications doing disk I/O simultaneously, this
is unfeasible. Also, I suspect that it would go against the hierarchy of
O/S subsystem layering.

What I mean is this: if an application tries to write data to the file
system and fails, the application should either hang or time out and
bail out; the file system itself should still not crash. The application
is always agnostic about the file system, so therefore the file system
should never acknowledge more data than it can promise to actually
process.

In the case of ceph things get complicated by the fact that ceph appears
as a file system to the applications using it, but depends itself on an
underlying file system for its disk access. As a result, ceph is responsible
for the data it accepts from applications, but has no way to meet this
responsibility if the underlying file system lets it down.

I don't know how this problem can be truly solved, but some trickery with
I/O buffers might go a long way towards mitigating it. Or perhaps some
available capacity calls between the monitor and the client. Every other
networked file system has a similar problem, so looking at how NFS or samba
deal with it could provide ideas or even ready code.

> IIRC you're running a monitor and an OSD on the same 2.5" physical disk, 
> which means they're colliding on stuff like sync() calls.

Indeed, I'm runing the entire system on a dirt cheap 2,5" disk. Still, good
software on bad hardware should run slow or not at all, but not try to run
fast and then crash and corrupt its data.

> This general slowness doesn't explain the mds log corruption, although it 
> might be one of the trigger conditions. I added another assert in the 
> Journaler code which might have caused the problem (though I don't think 
> it could have) but don't have any other new ideas.

I'll test again as soon as 0.27 is out (BTW, is 0.27 blocked by 0.26.1 or
do they run independent of each-other?).

Z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-18 22:38                 ` Suicide Zenon Panoussis
@ 2011-04-18 23:02                   ` Gregory Farnum
  2011-04-19  0:17                     ` Suicide Colin McCabe
  2011-04-19 10:45                     ` Suicide Zenon Panoussis
  0 siblings, 2 replies; 18+ messages in thread
From: Gregory Farnum @ 2011-04-18 23:02 UTC (permalink / raw)
  To: Zenon Panoussis; +Cc: ceph-devel

On Monday, April 18, 2011 at 3:38 PM, Zenon Panoussis wrote:

> On 04/18/2011 11:21 PM, Gregory Farnum wrote:
> 
> > I looked through your logs a bit and noticed that the OSD on node01 is 
> > crashing due to high latencies on disk access (I think the defaults for
> > this case are it asserts out if there's no progress after 10 minutes or
> > something). 
> 
> First of all, thank you for plowing through those huge logs. It's a feat
> all and by itself.
> 
> Could you please post an example where you found the OSD crashing, so that
> I and others know what log entries to look for?

In this case what I was actually searching for was client requests coming in and then getting lost on some wait list somewhere. I didn't see any of those, but I did see an OSD backtrace:
2011-04-16 20:11:30.304029 7fce00c24700 FileStore: sync_entry timed out after 600 seconds.
ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
2011-04-16 20:11:30.304483 1: (SafeTimer::timer_thread()+0x36b) [0x5f365b]
2011-04-16 20:11:30.304517 2: (SafeTimerThread::entry()+0xd) [0x5f5dfd]
2011-04-16 20:11:30.304545 3: /lib64/libpthread.so.0() [0x3ea6e077e1]
2011-04-16 20:11:30.304573 4: (clone()+0x6d) [0x3ea6ae18ed]
2011-04-16 20:11:30.304599 os/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)', in thread '0x7fce00c24700'
os/FileStore.cc: 2573: FAILED assert(0)
ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
1: (SyncEntryTimeout::finish(int)+0xf4) [0x5953d4]
2: (SafeTimer::timer_thread()+0x36b) [0x5f365b]
3: (SafeTimerThread::entry()+0xd) [0x5f5dfd]
4: /lib64/libpthread.so.0() [0x3ea6e077e1]
5: (clone()+0x6d) [0x3ea6ae18ed]
ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
1: (SyncEntryTimeout::finish(int)+0xf4) [0x5953d4]
2: (SafeTimer::timer_thread()+0x36b) [0x5f365b]
3: (SafeTimerThread::entry()+0xd) [0x5f5dfd]
4: /lib64/libpthread.so.0() [0x3ea6e077e1]
5: (clone()+0x6d) [0x3ea6ae18ed]




> > Based on that, I pretty much have to guess that there's just too much 
> > stress on your disk and it's going to cause problems. You can try loosening
> > the various configurable timeouts to let it run longer but it seems like
> > really you just need beefier disks for the amount of stuff you're doing to
> > them. 
> 
> My hardware is indeed very primitive, but in order to prevent this from
> happening I would have to make sure that the disks always have more capacity
> than the network. In a real-world setup, with gigabit or muti-gigabit
> networking and multiple applications doing disk I/O simultaneously, this
> is unfeasible. Also, I suspect that it would go against the hierarchy of
> O/S subsystem layering.
It's a bit more complicated than that. While we could probably do a better job of controlling bandwidths, there are a lot of pieces devoted to handling changes in disk performance and preventing the OSD from accepting more data than it can handle -- much of this is tunable (it's not well-documented but off the top of my head the ones to look at are osd_client_message_size_cap, for how much client data to hold in-memory waiting on disk [defaults to 200MB], and filestore_op_threads, which defaults to 2 but might be better at 1 or 0 for a very slow disk) . The specific crash that I saw here meant that the OSD called sync on its underlying filesystem and 6 minutes later the sync hadn't completed! The system can handle slow disks but it became basically unresponsive, at which point the proper r
 esponse in a replicated scenario is to kill itself (the data should all exist elsewhere). I think the problem you ran into is just that because the box was so overloaded your monitor fell behind and
 the timeouts that should have handled transferring everything got bogged down (or, possibly, something broke and the timeouts didn't trigger properly everywhere but I didn't see anything obvious in your logs).

> What I mean is this: if an application tries to write data to the file
> system and fails, the application should either hang or time out and
> bail out; the file system itself should still not crash. The application
> is always agnostic about the file system, so therefore the file system
> should never acknowledge more data than it can promise to actually
> process.
> 
> In the case of ceph things get complicated by the fact that ceph appears
> as a file system to the applications using it, but depends itself on an
> underlying file system for its disk access. As a result, ceph is responsible
> for the data it accepts from applications, but has no way to meet this
> responsibility if the underlying file system lets it down.
> 
> I don't know how this problem can be truly solved, but some trickery with
> I/O buffers might go a long way towards mitigating it. Or perhaps some
> available capacity calls between the monitor and the client. Every other
> networked file system has a similar problem, so looking at how NFS or samba
> deal with it could provide ideas or even ready code.

This is something we've put some thought into. Basically the reason that you're having problems is because
1) you've got a very small system using configuration defaults that don't match up with the capabilities of the hardware you're using, and
2) Ceph's still young so it turns all failures into node crashes (this is by far the simplest failure model and it's generally the appropriate one when dealing with an inherently-distributed system).

I'd recommend tuning down the configuration and seeing if that helps you at all.
-Greg




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-18 23:02                   ` Suicide Gregory Farnum
@ 2011-04-19  0:17                     ` Colin McCabe
  2011-04-19 10:45                     ` Suicide Zenon Panoussis
  1 sibling, 0 replies; 18+ messages in thread
From: Colin McCabe @ 2011-04-19  0:17 UTC (permalink / raw)
  To: Gregory Farnum; +Cc: Zenon Panoussis, ceph-devel

On Mon, Apr 18, 2011 at 4:02 PM, Gregory Farnum <gfarnum@gmail.com> wrote:
> On Monday, April 18, 2011 at 3:38 PM, Zenon Panoussis wrote:
>
>> On 04/18/2011 11:21 PM, Gregory Farnum wrote:
>>
>> > I looked through your logs a bit and noticed that the OSD on node01 is
>> > crashing due to high latencies on disk access (I think the defaults for
>> > this case are it asserts out if there's no progress after 10 minutes or
>> > something).
>>
>> First of all, thank you for plowing through those huge logs. It's a feat
>> all and by itself.
>>
>> Could you please post an example where you found the OSD crashing, so that
>> I and others know what log entries to look for?
>
> In this case what I was actually searching for was client requests coming in and then getting lost on some wait list somewhere. I didn't see any of those, but I did see an OSD backtrace:
> 2011-04-16 20:11:30.304029 7fce00c24700 FileStore: sync_entry timed out after 600 seconds.
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
> 2011-04-16 20:11:30.304483 1: (SafeTimer::timer_thread()+0x36b) [0x5f365b]
> 2011-04-16 20:11:30.304517 2: (SafeTimerThread::entry()+0xd) [0x5f5dfd]
> 2011-04-16 20:11:30.304545 3: /lib64/libpthread.so.0() [0x3ea6e077e1]
> 2011-04-16 20:11:30.304573 4: (clone()+0x6d) [0x3ea6ae18ed]
> 2011-04-16 20:11:30.304599 os/FileStore.cc: In function 'virtual void SyncEntryTimeout::finish(int)', in thread '0x7fce00c24700'
> os/FileStore.cc: 2573: FAILED assert(0)
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
> 1: (SyncEntryTimeout::finish(int)+0xf4) [0x5953d4]
> 2: (SafeTimer::timer_thread()+0x36b) [0x5f365b]
> 3: (SafeTimerThread::entry()+0xd) [0x5f5dfd]
> 4: /lib64/libpthread.so.0() [0x3ea6e077e1]
> 5: (clone()+0x6d) [0x3ea6ae18ed]
> ceph version 0.26 (commit:9981ff90968398da43c63106694d661f5e3d07d5)
> 1: (SyncEntryTimeout::finish(int)+0xf4) [0x5953d4]
> 2: (SafeTimer::timer_thread()+0x36b) [0x5f365b]
> 3: (SafeTimerThread::entry()+0xd) [0x5f5dfd]
> 4: /lib64/libpthread.so.0() [0x3ea6e077e1]
> 5: (clone()+0x6d) [0x3ea6ae18ed]

This particular timeout was added in response to a specific bug where
someone did a drive pull, and it was never reported to userspace.
Instead, the btrfs filesystem just blocked forever.

http://tracker.newdream.net/issues/735

We decided just to add a timeout that would trigger if a commit took
too long. This is basically a dirty hack. It would be better just to
have the drive pull event properly communicated to userspace. However,
we thought that it wouldn't really cause problems in practice since
the timeout was so long.

There are other timeouts in the code, like the OSD heartbeat timeout.
In general, as Greg pointed out, Ceph is supposed to assure data
security through redundant copies on multiple OSDs. If we let
unresponsive OSDs linger forever, we could get in a situation where
the upper layer (the Ceph filesystem) continues to wait forever for an
unresponsive OSD, even though there are lots of other OSDs that could
easily take up the slack. This is like starving to death at the
all-you-can-eat buffet, just because they're out of jell-o.

There is some logic in FileStore::queue_transactions  that is supposed
to throttle the requests we're making to the FileStore. I haven't ever
really examined that code. I guess it's possible that it is not doing
a very good job of preventing work from piling up faster than we can
take care of it.

In general, there is a lot of performance tuning and optimization work
left to do. We're working towards having better metrics so that it
will be easier to measure our progress in this area. Hopefully we can
get something like a nightly autotest that will email out performance
numbers for the master branch.

cheers,
Colin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-18 23:02                   ` Suicide Gregory Farnum
  2011-04-19  0:17                     ` Suicide Colin McCabe
@ 2011-04-19 10:45                     ` Zenon Panoussis
  2011-04-19 16:29                       ` Suicide Gregory Farnum
  1 sibling, 1 reply; 18+ messages in thread
From: Zenon Panoussis @ 2011-04-19 10:45 UTC (permalink / raw)
  To: ceph-devel

On 04/19/2011 01:02 AM, Gregory Farnum wrote:

> It's a bit more complicated than that. While we could probably do a better 
> job of controlling bandwidths, there are a lot of pieces devoted to handling 
> changes in disk performance and preventing the OSD from accepting more data 
> than it can handle -- much of this is tunable (it's not well-documented but 
> off the top of my head the ones to look at are osd_client_message_size_cap, 
> for how much client data to hold in-memory waiting on disk [defaults to 200MB], 
> and filestore_op_threads, which defaults to 2 but might be better at 1 or 0 
> for a very slow disk) . 

Indeed, I've seen cosd threads fighting each-other to write to disk. Do
I understand it correctly that OPTION(filestore_op_threads, 0, OPT_INT, 0)
in src/config.cc and recompile is the only way to change this?

> The specific crash that I saw here meant that the OSD called sync on its 
> underlying filesystem and 6 minutes later the sync hadn't completed! The 
> system can handle slow disks but it became basically unresponsive, at which 
> point the proper response in a replicated scenario is to kill itself (the 
> data should all exist elsewhere). 

It seems the right thing to do, provided that the data does exist elsewhere
in sufficient replicas. This touches something that Colin wrote in this same
thread, so I'll merge it in:

> If we let
> unresponsive OSDs linger forever, we could get in a situation where
> the upper layer (the Ceph filesystem) continues to wait forever for an
> unresponsive OSD, even though there are lots of other OSDs that could
> easily take up the slack.

The requirements and availability of OSDs at any given moment are known,
so the reaction to an unresponsive OSD can be calculated. Let's say,
given 20 OSDs and a CRUSH rule that says data {min_data 3; max_data 10},
ceph could acknowledge a write as long as committed to replicas =< 3 .
If, on the other hand, it can't commit the data to at least 3 OSDs (or
3 journals, as the case might be), it should throw an error back and tell
the application that it can't write to disk, so that the application can
react appropriately.

I have to admit though that I'm still rather confused as to how and in
which order data and metadata are passed around through memory and journal
to disk. The wiki says "The OSD does not acknowledge a write until it is
committed (stable) on disk" but does that mean "committed once" or "committed
on as many copies as data min_data"? I suspect the former, because that would
explain how the bottlenecks I'm seeing can build up. On the other hand, a
stricter flush-min_data-before-acknowledging strategy as I describe above
would automatically solve most network and disk bandwidth calculation problems
by blocking writes unless and until there is sufficient bandwidth to commit
all required copies to disk.

Of course these are design considerations that I'm sure you must have gone
through time and over again, so I could very well be missing some essential
point. I hope you bear with me.

> This is like starving to death at the
> all-you-can-eat buffet, just because they're out of jell-o.

:)) That's the funniest comparison I've seen in a long time, but I'm not
sure it fully applies. Yes, waiting forever for an unresponsive OSD when
there are others around would be exactly that, so ceph should (quickly)
time out and try elsewhere. But acknowledging a write before it knows
for sure whether it can meet its min_data requirements is another story;
that's stuffing data and hoping that some OSDs somewhere will accept it,
when in fact they might all have called it a day and gone fishing.

Z

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: Suicide
  2011-04-19 10:45                     ` Suicide Zenon Panoussis
@ 2011-04-19 16:29                       ` Gregory Farnum
  0 siblings, 0 replies; 18+ messages in thread
From: Gregory Farnum @ 2011-04-19 16:29 UTC (permalink / raw)
  To: Zenon Panoussis; +Cc: ceph-devel

On Tue, Apr 19, 2011 at 3:45 AM, Zenon Panoussis <oracle@provocation.net> wrote:
>> It's a bit more complicated than that. While we could probably do a better
>> job of controlling bandwidths, there are a lot of pieces devoted to handling
>> changes in disk performance and preventing the OSD from accepting more data
>> than it can handle -- much of this is tunable (it's not well-documented but
>> off the top of my head the ones to look at are osd_client_message_size_cap,
>> for how much client data to hold in-memory waiting on disk [defaults to 200MB],
>> and filestore_op_threads, which defaults to 2 but might be better at 1 or 0
>> for a very slow disk) .
>
> Indeed, I've seen cosd threads fighting each-other to write to disk. Do
> I understand it correctly that OPTION(filestore_op_threads, 0, OPT_INT, 0)
> in src/config.cc and recompile is the only way to change this?
Heavens no! You can set any of those options in your config file:
filestore op threads = 0
or on the command line when starting the osd:
cosd -i osd.a -c ceph.conf --filestore_op_threads 0


>> The specific crash that I saw here meant that the OSD called sync on its
>> underlying filesystem and 6 minutes later the sync hadn't completed! The
>> system can handle slow disks but it became basically unresponsive, at which
>> point the proper response in a replicated scenario is to kill itself (the
>> data should all exist elsewhere).
>
> It seems the right thing to do, provided that the data does exist elsewhere
> in sufficient replicas. This touches something that Colin wrote in this same
> thread, so I'll merge it in:
>
>> If we let
>> unresponsive OSDs linger forever, we could get in a situation where
>> the upper layer (the Ceph filesystem) continues to wait forever for an
>> unresponsive OSD, even though there are lots of other OSDs that could
>> easily take up the slack.
>
> The requirements and availability of OSDs at any given moment are known,
> so the reaction to an unresponsive OSD can be calculated. Let's say,
> given 20 OSDs and a CRUSH rule that says data {min_data 3; max_data 10},
> ceph could acknowledge a write as long as committed to replicas =< 3 .
> If, on the other hand, it can't commit the data to at least 3 OSDs (or
> 3 journals, as the case might be), it should throw an error back and tell
> the application that it can't write to disk, so that the application can
> react appropriately.
Mmm. There are problems with that too: applications often don't deal
well with the filesystem spitting back errors; you don't want errors
to propagate back to the application based on which OSD they're trying
to access, etc.
In general Ceph behaves about like a RAID array in recovery under
these kinds of situations. It counts on the replication for safety and
we could maybe add more smarts to prevent kicking out the last copy of
data but that would be complex and error-prone without a big gain (ie,
not appropriate at this stage of development). If at any point it
can't commit the data to disk then after a timeout the mapping of data
to OSDs will change and the write will get redirected to an active
OSD.

> I have to admit though that I'm still rather confused as to how and in
> which order data and metadata are passed around through memory and journal
> to disk. The wiki says "The OSD does not acknowledge a write until it is
> committed (stable) on disk" but does that mean "committed once" or "committed
> on as many copies as data min_data"? I suspect the former, because that would
> explain how the bottlenecks I'm seeing can build up. On the other hand, a
> stricter flush-min_data-before-acknowledging strategy as I describe above
> would automatically solve most network and disk bandwidth calculation problems
> by blocking writes unless and until there is sufficient bandwidth to commit
> all required copies to disk.
Actually it's neither. All data is mapped to a placement group for
reading and writing. The placement group is made up of a certain
number of OSDs which is set by the pool/CRUSH rules (by default, 2).
The data is sent to the first OSD and that then replicates it to the
other OSDs in its placement group and a write is considered safe when
it's on-disk on each OSD. However, there are complications. Because we
don't want to start moving data around when an OSD gets rebooted,
there's a distinction between an OSD being up or down (ie, accessible
or not accessible) and an OSD being in or out (ie, included in the
placement group). So the placement group might temporarily have only
one up OSD in which case the data is considered safe after that OSD
gets it.
In the typical case though then all the OSDs are up, and all of them
need to have the data before it's acked in any way.

I believe you're experiencing difficulty because (IIRC) you have an
OSD and a monitor both using the same (very slow) disk. Both of these
daemons call fsync et al *very frequently* for safety purposes and
that all on its own can really kill performance on a slow disk. It's
just a compounding cycle in your case which I suspect would be helped
a lot by putting your OSD and your monitor on separate disks.
-Greg

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2011-04-19 16:29 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-15 14:48 Suicide Zenon Panoussis
2011-04-15 15:16 ` Suicide Zenon Panoussis
2011-04-15 15:38 ` Suicide Zenon Panoussis
2011-04-15 20:21 ` Suicide Gregory Farnum
2011-04-15 22:38   ` Suicide Zenon Panoussis
2011-04-15 23:06     ` Suicide Gregory Farnum
2011-04-15 23:29       ` Suicide Zenon Panoussis
2011-04-16  0:00         ` Suicide Gregory Farnum
2011-04-16  9:53           ` Suicide Zenon Panoussis
2011-04-16 23:50           ` Suicide Zenon Panoussis
2011-04-17  0:14             ` Suicide Zenon Panoussis
2011-04-18 16:40             ` Suicide Gregory Farnum
2011-04-18 21:21               ` Suicide Gregory Farnum
2011-04-18 22:38                 ` Suicide Zenon Panoussis
2011-04-18 23:02                   ` Suicide Gregory Farnum
2011-04-19  0:17                     ` Suicide Colin McCabe
2011-04-19 10:45                     ` Suicide Zenon Panoussis
2011-04-19 16:29                       ` Suicide Gregory Farnum

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.