Unable to restart Mon after reboot


* Unable to restart Mon after reboot
@ 2012-06-23  0:31 David Blundell
  2012-06-23  5:34 ` Dan Mick
  0 siblings, 1 reply; 11+ messages in thread
From: David Blundell @ 2012-06-23  0:31 UTC
  To: ceph-devel@vger.kernel.org

Hi all,

I am testing Ceph 0.47.2 on btrfs with three servers running Fedora 17.  Following a reboot of the servers, one of the mon daemons crashes on startup with "FAILED assert(r>0)"

MDS and the OSD start and run fine as do the mon daemons on the other two servers.

The debug log is at http://pastebin.com/tXwvd44Z

I would really appreciate any comments - especially if I am missing something obvious.

David


* Re: Unable to restart Mon after reboot
  2012-06-23  0:31 Unable to restart Mon after reboot David Blundell
@ 2012-06-23  5:34 ` Dan Mick
  2012-06-23  5:48   ` Dan Mick
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Mick @ 2012-06-23  5:34 UTC
  To: David Blundell; +Cc: ceph-devel@vger.kernel.org

Hi David:

The code there is trying to read some stuff off the monitor's storage to 
initialize, and apparently failing in an odd way.  It's trying to read 
the file 'latest' from the monitor directory (/data/mon0);  the file can 
be opened, and stat says it's 4289 bytes long, but apparently the read 
is succeeding without error, but only getting back 0 bytes (i.e., not an 
error, but apparently end of file).

See if there's a file /data/mon0/latest of length 4289, and see if 
something is odd about its permissions (like maybe the read bits are 
turned off, or maybe the filesystem it's on has errors).
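
The check described above (compare the length stat reports with what a read actually returns) can be sketched in Python. This is only an illustration, not the monitor's own code: the function name check_readable is made up here, and the demo runs against a scratch file rather than a real monitor directory.

```python
import os
import tempfile


def check_readable(path):
    """Return (size reported by stat, bytes actually read, error or None)."""
    expected = os.stat(path).st_size
    try:
        with open(path, "rb") as f:
            data = f.read()
    except OSError as exc:
        # A filesystem-level problem may surface here as an I/O error.
        return expected, 0, str(exc)
    return expected, len(data), None


if __name__ == "__main__":
    # Scratch file standing in for the monitor's file; 4289 is the length
    # quoted in the message above.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"x" * 4289)
    expected, got, err = check_readable(tmp.name)
    print(f"stat size={expected} bytes read={got} error={err}")
    os.unlink(tmp.name)
```

A mismatch between the two numbers, or a non-empty error, would point at the file or the filesystem under it rather than at the monitor code.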

On 06/22/2012 05:31 PM, David Blundell wrote:
> Hi all,
>
> I am testing Ceph 0.47.2 on btrfs with three servers running Fedora 17.  Following a reboot of the servers, one of the mon daemons crashes on startup with "FAILED assert(r>0)"
>
> MDS and the OSD start and run fine as do the mon daemons on the other two servers.
>
> The debug log is at http://pastebin.com/tXwvd44Z
>
> I would really appreciate any comments - especially if I am missing something obvious.
>
> David
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Unable to restart Mon after reboot
  2012-06-23  5:34 ` Dan Mick
@ 2012-06-23  5:48   ` Dan Mick
  2012-06-23 10:43     ` David Blundell
  0 siblings, 1 reply; 11+ messages in thread
From: Dan Mick @ 2012-06-23  5:48 UTC
  To: David Blundell; +Cc: ceph-devel@vger.kernel.org

Sorry: that file would actually be /data/mon0/osdmap/latest

On 06/22/2012 10:34 PM, Dan Mick wrote:
> Hi David:
>
> The code there is trying to read some stuff off the monitor's storage to
> initialize, and apparently failing in an odd way. It's trying to read
> the file 'latest' from the monitor directory (/data/mon0); the file can
> be opened, and stat says it's 4289 bytes long, but apparently the read
> is succeeding without error, but only getting back 0 bytes (i.e., not an
> error, but apparently end of file).
>
> See if there's a file /data/mon0/latest of length 4289, and see if
> something is odd about its permissions (like maybe the read bits are
> turned off, or maybe the filesystem it's on has errors).
>
>
> On 06/22/2012 05:31 PM, David Blundell wrote:
>> Hi all,
>>
>> I am testing Ceph 0.47.2 on btrfs with three servers running Fedora
>> 17. Following a reboot of the servers, one of the mon daemons crashes
>> on startup with "FAILED assert(r>0)"
>>
>> MDS and the OSD start and run fine as do the mon daemons on the other
>> two servers.
>>
>> The debug log is at http://pastebin.com/tXwvd44Z
>>
>> I would really appreciate any comments - especially if I am missing
>> something obvious.
>>
>> David


* Re: Unable to restart Mon after reboot
  2012-06-23  5:48   ` Dan Mick
@ 2012-06-23 10:43     ` David Blundell
  2012-06-25 15:58       ` Tommi Virtanen
  0 siblings, 1 reply; 11+ messages in thread
From: David Blundell @ 2012-06-23 10:43 UTC
  To: ceph-devel@vger.kernel.org

Thanks Dan,

The btrfs checksum had failed on that file.

All three servers are running Fedora 17 with kernel 3.4.3

The logs on all three servers are full of messages like:
Jun 23 04:02:19 Store2 kernel: [63811.494955] ceph-osd: page allocation failure: order:3, mode:0x4020

The difference between the lines is that order: varies between 2, 3, 4 or 5

Is this likely to be a btrfs bug?

David

On 23 Jun 2012, at 06:48, Dan Mick wrote:

> Sorry: that file would actually be /data/mon0/osdmap/latest
> 
> On 06/22/2012 10:34 PM, Dan Mick wrote:
>> Hi David:
>> 
>> The code there is trying to read some stuff off the monitor's storage to
>> initialize, and apparently failing in an odd way. It's trying to read
>> the file 'latest' from the monitor directory (/data/mon0); the file can
>> be opened, and stat says it's 4289 bytes long, but apparently the read
>> is succeeding without error, but only getting back 0 bytes (i.e., not an
>> error, but apparently end of file).
>> 
>> See if there's a file /data/mon0/latest of length 4289, and see if
>> something is odd about its permissions (like maybe the read bits are
>> turned off, or maybe the filesystem it's on has errors).



* Re: Unable to restart Mon after reboot
  2012-06-23 10:43     ` David Blundell
@ 2012-06-25 15:58       ` Tommi Virtanen
  2012-07-03 16:35         ` David Blundell
  0 siblings, 1 reply; 11+ messages in thread
From: Tommi Virtanen @ 2012-06-25 15:58 UTC
  To: David Blundell; +Cc: ceph-devel@vger.kernel.org

On Sat, Jun 23, 2012 at 3:43 AM, David Blundell
<David.Blundell@100percentit.com> wrote:
> The logs on all three servers are full of messages like:
> Jun 23 04:02:19 Store2 kernel: [63811.494955] ceph-osd: page allocation failure: order:3, mode:0x4020
>
> The difference between the lines is that order: varies between 2, 3, 4 or 5
>
> Is this likely to be a btrfs bug?

That means you're running out of memory, in kernelspace. The order is
the power-of-two (2**n) of how many 4kB pages were requested, 0x4020 =
GFP_COMP|GFP_HIGH (compound & access emergency pools). Btrfs may be
indirectly related, it's not clear what's consuming all the memory,
but that doesn't sound all that likely. That message should be
followed by a stack dump, that might tell us more.
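
As a rough illustration of the arithmetic above, here is a small Python sketch that decodes one of the quoted log lines. The flag table holds only the two flags named in this message, with the bit values implied by 0x4020 = GFP_COMP|GFP_HIGH; other kernel versions may number them differently, and the function name decode is just for this example.

```python
import re

PAGE_SIZE = 4096  # 4 kB pages, as stated above

# Only the two flags named in this message; bit values are an assumption
# derived from 0x4020 = GFP_COMP|GFP_HIGH and may differ between kernels.
GFP_FLAGS = {0x20: "GFP_HIGH", 0x4000: "GFP_COMP"}

LINE_RE = re.compile(
    r"page allocation failure: order:(\d+), mode:(0x[0-9a-fA-F]+)"
)


def decode(line):
    """Extract order and mode from a log line; return None if absent."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    order = int(m.group(1))
    mode = int(m.group(2), 16)
    pages = 2 ** order  # order n means 2**n contiguous pages
    return {
        "order": order,
        "pages": pages,
        "bytes": pages * PAGE_SIZE,
        "flags": [name for bit, name in sorted(GFP_FLAGS.items()) if mode & bit],
        "unknown_bits": mode & ~sum(GFP_FLAGS),  # bits not in the table
    }


if __name__ == "__main__":
    sample = ("Jun 23 04:02:19 Store2 kernel: [63811.494955] ceph-osd: "
              "page allocation failure: order:3, mode:0x4020")
    print(decode(sample))
```

For the order:3 line this works out to 8 contiguous pages, i.e. 32 kB; the order:5 lines would be asking for 128 kB of contiguous memory.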

Are you using the Ceph distributed filesystem, or just the RADOS
level, e.g. RBD images?


* RE: Unable to restart Mon after reboot
  2012-06-25 15:58       ` Tommi Virtanen
@ 2012-07-03 16:35         ` David Blundell
  2012-07-03 16:44           ` Tommi Virtanen
  0 siblings, 1 reply; 11+ messages in thread
From: David Blundell @ 2012-07-03 16:35 UTC
  To: Tommi Virtanen; +Cc: ceph-devel@vger.kernel.org

> That means you're running out of memory, in kernelspace. The order is the
> power-of-two (2**n) of how many 4kB pages were requested, 0x4020 =
> GFP_COMP|GFP_HIGH (compound & access emergency pools). Btrfs may
> be indirectly related, it's not clear what's consuming all the memory, but
> that doesn't sound all that likely. That message should be followed by a
> stack dump, that might tell us more.
> 
> Are you using the Ceph distributed filesystem, or just the RADOS level, e.g.
> RBD images?

Hi Tommi,

Thanks for your help.  As an update to this thread, the problem proved to be btrfs on Fedora as scrubs showed bad inodes on all three servers.  We were just using RADOS to hold RBD images and there seemed to be plenty of free RAM - I'm unsure what could have consumed all of the memory.

We switched to Ubuntu 12.04 for the tests which stopped all btrfs problems.

We have now spent a week running the iotester corruption tests in KVM instances while live migrating them every 5 minutes (with and without cache), running iozone and trying everything we could to corrupt the VMs.  The tests on 0.47.x all performed flawlessly.

The upgrade to 0.48 went smoothly but since then we have had issues with slow requests showing up in the ceph logs and disk timeouts whenever we run iozone in the VMs.

I will wipe the current OSDs and start with a fresh 0.48 installation to see if I can reproduce the problem.

David


* Re: Unable to restart Mon after reboot
  2012-07-03 16:35         ` David Blundell
@ 2012-07-03 16:44           ` Tommi Virtanen
  2012-07-03 16:53             ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Tommi Virtanen @ 2012-07-03 16:44 UTC
  To: David Blundell; +Cc: ceph-devel@vger.kernel.org

On Tue, Jul 3, 2012 at 9:35 AM, David Blundell
<David.Blundell@100percentit.com> wrote:
> We switched to Ubuntu 12.04 for the tests which stopped all btrfs problems.
>
> We have now spent a week running the iotester corruption tests in KVM instances while live migrating them every 5 minutes (with and without cache), running iozone and trying everything we could to corrupt the VMs.  The tests on 0.47.x all performed flawlessly.

Nice!

> The upgrade to 0.48 went smoothly but since then we have had issues with slow requests showing up in the ceph logs and disk timeouts whenever we run iozone in the VMs.
>
> I will wipe the current OSDs and start with a fresh 0.48 installation to see if I can reproduce the problem.

We've seen similar issues with btrfs, and others have reported that
the large metadata btrfs option helps. We're still compiling
information, but as of right now I hear best performance tends to
happen with xfs; however, the lead position tends to shift around a
lot.

* Re: Unable to restart Mon after reboot
  2012-07-03 16:44           ` Tommi Virtanen
@ 2012-07-03 16:53             ` Christoph Hellwig
  2012-07-03 17:04               ` Gregory Farnum
  2012-07-03 17:09               ` Sage Weil
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2012-07-03 16:53 UTC
  To: Tommi Virtanen; +Cc: David Blundell, ceph-devel@vger.kernel.org

On Tue, Jul 03, 2012 at 09:44:38AM -0700, Tommi Virtanen wrote:
> We've seen similar issues with btrfs, and others have reported that
> the large metadata btrfs option helps. We're still compiling
> information, but as of right now I hear best performance tends to
> happen with xfs; however, the lead position tends to shift around a
> lot.

Btw, does anyone know which part of the btrfs metadata is hit hard?
It's been a while that I looked at the OSD code, but IIRC it didn't
create too big directories, does it?  For heavy directory operations
XFS filesystems created using large directory blocks (mkfs.xfs -n
size=64k) will also provide additional benefits.

Also IIRC the OSDs have a directory per VDI image - for that kind of
usage pattern the -o filestreams mount option of XFS should provide
even more performance advantages.  Either way make sure to mount with
-o inode64, and for not so recent kernels -o delaylog.


* Re: Unable to restart Mon after reboot
  2012-07-03 16:53             ` Christoph Hellwig
@ 2012-07-03 17:04               ` Gregory Farnum
  2012-07-03 17:09               ` Sage Weil
  1 sibling, 0 replies; 11+ messages in thread
From: Gregory Farnum @ 2012-07-03 17:04 UTC
  To: Christoph Hellwig
  Cc: Tommi Virtanen, David Blundell, ceph-devel@vger.kernel.org

On Tue, Jul 3, 2012 at 9:53 AM, Christoph Hellwig <hch@infradead.org> wrote:
> On Tue, Jul 03, 2012 at 09:44:38AM -0700, Tommi Virtanen wrote:
>> We've seen similar issues with btrfs, and others have reported that
>> the large metadata btrfs option helps. We're still compiling
>> information, but as of right now I hear best performance tends to
>> happen with xfs; however, the lead position tends to shift around a
>> lot.
>
> Btw, does anyone know which part of the btrfs metadata is hit hard?
> It's been a while that I looked at the OSD code, but IIRC it didn't
> create too big directories, does it?  For heavy directory operations
> XFS filesystems created using large directory blocks (mkfs.xfs -n
> size=64k) will also provide additional benefits.

I could be misremembering, but I believe the fragmentation had more to
do with our async snapshots and frequent updates driving the allocator
crazy than with directory sizes (which are limited to 255 entries or
something by default). My guess is our xattrs have more of an impact
on size, and the large metadata means everything gets copied together
instead of in pieces. But Sage or Sam might want to correct me.

> Also IIRC the OSDs have a directory per VDI image
I think you misunderstood when you looked into that. There's a
directory per placement group, but those are in no way tied to the RBD
images.

> - for that kind of
> usage pattern the -o filestreams mount option of XFS should provide
> even more performance advantages.  Either way make sure to mount with
> -o inode64, and for not so recent kernels -o delaylog.
I'm not familiar enough with xfs' allocation mechanisms to know if
filestreams is useful given the above correction.

* Re: Unable to restart Mon after reboot
  2012-07-03 16:53             ` Christoph Hellwig
  2012-07-03 17:04               ` Gregory Farnum
@ 2012-07-03 17:09               ` Sage Weil
  2012-07-03 17:24                 ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread
From: Sage Weil @ 2012-07-03 17:09 UTC
  To: Christoph Hellwig
  Cc: Tommi Virtanen, David Blundell, ceph-devel@vger.kernel.org

On Tue, 3 Jul 2012, Christoph Hellwig wrote:
> On Tue, Jul 03, 2012 at 09:44:38AM -0700, Tommi Virtanen wrote:
> > We've seen similar issues with btrfs, and others have reported that
> > the large metadata btrfs option helps. We're still compiling
> > information, but as of right now I hear best performance tends to
> > happen with xfs; however, the lead position tends to shift around a
> > lot.
> 
> Btw, does anyone know which part of the btrfs metadata is hit hard?
> It's been a while that I looked at the OSD code, but IIRC it didn't
> create too big directories, does it?  For heavy directory operations
> > XFS filesystems created using large directory blocks (mkfs.xfs -n
> size=64k) will also provide additional benefits.

The OSD keeps directories small on its own by breaking the contents of 
large directories into smaller subdirectories.
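
To make the idea concrete, here is a toy Python model of splitting a directory into hashed subdirectories once it grows past a threshold. It is a sketch only and not the OSD's actual code: the class name Dir, the SHA-1 hashing, and the tiny threshold are all assumptions chosen for the demo, and names are assumed to be unique.

```python
import hashlib

SPLIT_THRESHOLD = 4  # hypothetical limit, kept tiny so the demo splits


def hash_name(name):
    """Hex digest used to pick a subdirectory, one hex digit per level."""
    return hashlib.sha1(name.encode()).hexdigest()


class Dir:
    """In-memory stand-in for a directory that splits when it gets large."""

    def __init__(self, depth=0):
        self.depth = depth
        self.files = []    # names stored directly in this directory
        self.subdirs = {}  # hex digit -> Dir

    def _child(self, name):
        digit = hash_name(name)[self.depth]
        return self.subdirs.setdefault(digit, Dir(self.depth + 1))

    def add(self, name):
        if self.subdirs:
            # Already split: route the name down by its hash digit.
            self._child(name).add(name)
            return
        self.files.append(name)
        if len(self.files) > SPLIT_THRESHOLD:
            # Too big: rehash the contents into subdirectories.
            pending, self.files = self.files, []
            for entry in pending:
                self._child(entry).add(entry)

    def count(self):
        return len(self.files) + sum(d.count() for d in self.subdirs.values())

    def largest(self):
        sizes = [d.largest() for d in self.subdirs.values()]
        return max([len(self.files)] + sizes)


if __name__ == "__main__":
    root = Dir()
    for i in range(100):
        root.add(f"object_{i}")
    print(root.count(), "objects; largest directory holds", root.largest())
```

After a split the parent holds only subdirectories, which matches the shape described in the next paragraph: a directory that was once large and now has just a few subdirs in it.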

That said, on one system we did see what looked like crazy bad 
fragmentation on an XFS directory... it had maybe 5 subdirs in it and many 
many blocks.  That was probably shortly after it had been big and rehashed 
its contents into the subdirs.  Yehuda probably remembers more.

In any case, is there a way to prod XFS into defragging a specific 
directory?

sage


* Re: Unable to restart Mon after reboot
  2012-07-03 17:09               ` Sage Weil
@ 2012-07-03 17:24                 ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2012-07-03 17:24 UTC
  To: Sage Weil
  Cc: Christoph Hellwig, Tommi Virtanen, David Blundell,
	ceph-devel@vger.kernel.org

On Tue, Jul 03, 2012 at 10:09:33AM -0700, Sage Weil wrote:
> The OSD keeps directories small on its own by breaking the contents of 
> large directories into smaller subdirectories.

Right, that's what I remembered.  At least for XFS that'll actually
give you much worse allocation patterns as each new directory rotates
to a new allocation group.

> That said, on one system we did see what looked like crazy bad 
> fragmentation on an XFS directory... it had maybe 5 subdirs in it and many 
> many blocks.  That was probably shortly after it had been big and rehashed 
> its contents into the subdirs.  Yehuda probably remembers more.

Another reason why not doing the artificial directories is better...

> In any case, is there a way to prod XFS into defragging a specific 
> directory?

No.  XFS can only defragment regular files at the moment.

