Fwd: Kernel 3.0.0 + ext4 + ceph == ...

linux-ext4.vger.kernel.org archive mirror

* Fwd: Kernel 3.0.0 + ext4 + ceph == ...
       [not found] ` <CAO47_-_EC4s1HF1pOGNzPRitYGyigOd1hfgz1qDPy6dqwGMMQA@mail.gmail.com>
@ 2011-07-30 14:53   ` Christian Brunner
  2011-11-15 15:46     ` Eric Sandeen
  0 siblings, 1 reply; 4+ messages in thread
From: Christian Brunner @ 2011-07-30 14:53 UTC (permalink / raw)
  To: linux-ext4, ceph-devel

Fyodor and I are struggling to get a fully stable ceph cluster up and running.

When we run a Ceph object store (OSD) on top of an ext4 filesystem, we
get fsck errors when we check the filesystem (see below).

Fyodor is running 3.0.
I am running a RHEL6.1 Kernel (2.6.32-131.6.1.el6.x86_64).

Any help or hints on how to trace the bug would be appreciated.

Thanks,
Christian

2011/7/30 Fyodor Ustinov <ufm@ufm.su>:
> fail. Epic fail.
>
> Absolutely reproducible.
>
> I have ceph cluster with this configuration:
>
> 8 physical servers
> 14 osd servers.
> Each osd server has its own fs.
> 48T total size of ceph cluster.
> 17T used.
>
> Now, step by step:
>
> 1. Stop ceph server osd0
> /etc/init.d/ceph stop
>
> 2. Make fresh fs for osd
> umount /osd.0
> mkfs.ext4 /dev/sdc1
> tune2fs -o journal_data_writeback /dev/sdc1
> mount -a
> # string from /etc/fstab:
> # /dev/sdc1 /osd.0 ext4 user_xattr,rw,noexec,nodev,noatime,nodiratime,data=writeback,barrier=0 0 2
> ceph mon getmap -o /tmp/monmap
> cosd --mkfs -i 0 --monmap /tmp/monmap
>
> 3. Start ceph server osd0
> /etc/init.d/ceph start
>
> Now, make a big cup of coffee and begin to wait.
>
> After completion of rebalancing do:
> /etc/init.d/ceph stop
> umount /osd.0
> fsck.ext4 -fy /dev/sdc1
>
> and see many messages like:
>
> Inode 238551053, i_blocks is 24, should be 32.  Fix? yes
>
> Inode 238551054, i_blocks is 40, should be 32.  Fix? yes
>
> Inode 238551066, i_blocks is 24, should be 32.  Fix? yes
>
> Inode 238944257, i_blocks is 8, should be 16.  Fix? yes
>
> Inode 239206414, i_blocks is 8, should be 16.  Fix? yes
>
> Inode 239206416, i_blocks is 40, should be 32.  Fix? yes
>
> Inode 239206431, i_blocks is 8, should be 16.  Fix? yes
>
> Inode 239206441, i_blocks is 24, should be 32.  Fix? yes
>
> Voila.
>
> P.S. There are no messages in syslog and none on the console.
>
> WBR,
>    Fyodor.
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 4+ messages in thread
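
Every mismatch in the fsck output above is off by exactly 8 in one direction or the other. Assuming i_blocks is counted in the usual 512-byte units, that is one 4 KiB filesystem block per inode. A minimal sketch for summarizing a saved fsck log this way follows; the function name summarize_iblocks is made up for illustration and is not part of e2fsprogs.

```shell
#!/bin/sh
# Read e2fsck output on stdin and print one summary line per i_blocks mismatch.
# Assumption: i_blocks is in 512-byte units, so 8 units equal one 4 KiB block.
summarize_iblocks() {
    awk '
        /i_blocks is [0-9]+, should be [0-9]+/ {
            # Drop mail quote markers so the field positions are stable.
            sub(/^[> ]+/, "")
            # Fields: Inode N, i_blocks is HAVE, should be WANT.
            inode = $2; sub(/,$/, "", inode)
            have = $5;  sub(/,$/, "", have)
            want = $8;  sub(/\.$/, "", want)
            delta = want - have
            printf "%s delta=%d units (%d x 4 KiB)\n", inode, delta, delta / 8
        }
    '
}
```

Usage, assuming the fsck output was saved to a file called fsck.log: summarize_iblocks < fsck.log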

* Re: Fwd: Kernel 3.0.0 + ext4 + ceph == ...
  2011-07-30 14:53   ` Fwd: Kernel 3.0.0 + ext4 + ceph == Christian Brunner
@ 2011-11-15 15:46     ` Eric Sandeen
  0 siblings, 0 replies; 4+ messages in thread
From: Eric Sandeen @ 2011-11-15 15:46 UTC (permalink / raw)
  To: chb; +Cc: linux-ext4, ceph-devel

On 7/30/11 9:53 AM, Christian Brunner wrote:
> Fyodor and I are struggling to get a fully stable ceph cluster up and running.
> 
> When we run a Ceph object store (OSD) on top of an ext4 filesystem, we
> get fsck errors when we check the filesystem (see below).

BTW, this should be fixed now as of my commit
6d6a435190bdf2e04c9465cde5bdc3ac68cf11a4
("ext4: fix race in xattr block allocation path").

I think it made its way to a couple of older -stable kernels, too.

-Eric


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Kernel 3.0.0 + ext4 + ceph == ...
       [not found]                           ` <CAC-hyiHzmn25ryJkNUdzQvk7c7chwVDfmwDeo8X2+4zTbDuFGQ@mail.gmail.com>
@ 2011-08-08 20:07                             ` Christian Brunner
  2011-08-18  9:19                               ` Christian Brunner
  0 siblings, 1 reply; 4+ messages in thread
From: Christian Brunner @ 2011-08-08 20:07 UTC (permalink / raw)
  To: Yehuda Sadeh Weinraub
  Cc: Sage Weil, Theodore Tso, Fyodor Ustinov, ceph-devel, linux-ext4

I tried 3.0.1 today, which contains the commit Theodore suggested, and
I was no longer able to reproduce the problem.

So I think the corruption we have seen is indeed related to:

commit 7132de744ba76930d13033061018ddd7e3e8cd91
Author: Maxim Patlasov <maxim.patlasov@gmail.com>
Date:   Sun Jul 10 19:37:48 2011 -0400

   ext4: fix i_blocks/quota accounting when extent insertion fails


I will now try to apply this patch to the RHEL6.1 kernel and see what
happens...

Thanks for your help.

Christian


2011/8/3 Yehuda Sadeh Weinraub <yehuda.sadeh@dreamhost.com>:
> On Wed, Aug 3, 2011 at 7:16 AM, Christian Brunner <chb@muc.de> wrote:
> ...
>> I tried to reproduce this without ceph, but wasn't able to...
>>
>> In the meantime it seems that I can also see the side effects on the
>> librbd side: I get a "librbd: data error!" when I do an "rbd copy".
>>
>> When I look at the librbd code this is related to a sparse_read not
>> returning the right size of the object.
>>
>> I don't know if it helps, but I think that the problem is also related
>> to sparse file usage.
>>
>
> There were a few sparse-read issues that we fixed not too long ago,
> but those should already be fixed in at least the previous ceph
> version. I'm not sure what version you're using.
> There was an ext4 fiemap issue that I was hitting on specific
> environments but couldn't determine whether it was fixed in later
> kernel versions (I was using 2.6.32). Now is a good time to try and
> get to the bottom of it. Here's a script I was using to reproduce it:
>
> #!/bin/sh
> dd if=/dev/urandom of=bla bs=1 seek=$((0x6f000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x70000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x71000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x72000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x73000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x74000)) count=$((0x2000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x2ae000)) count=$((0x2000)); sync
>
> You can compile and run the following utility to dump all the extents:
> http://pastebin.com/h2Cnpk2Q
>
> Thanks,
> Yehuda
>
> Oh, btw, you can effectively disable the use of fiemap by setting the
> 'filestore fiemap threshold' config option to a large enough value
> (e.g., anything bigger than 4 MB should be enough for rbd).

^ permalink raw reply	[flat|nested] 4+ messages in thread
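
Yehuda's reproduction script above writes seven ranges into a sparse file. A sketch of the same writes as a loop follows, with the target file passed as a parameter (the original hard-codes "bla"). The function names write_sparse_ranges and report_allocation are made up for illustration, and report_allocation assumes GNU stat.

```shell
#!/bin/sh
# Write the same seven ranges as the original script into a sparse file.
# $1 is the target file.
write_sparse_ranges() {
    target=$1
    # "offset length" pairs copied from the original dd commands.
    for range in "0x6f000 0x1000" "0x70000 0x1000" "0x71000 0x1000" \
                 "0x72000 0x1000" "0x73000 0x1000" "0x74000 0x2000" \
                 "0x2ae000 0x2000"; do
        # Split the pair into $1 (offset) and $2 (length).
        set -- $range
        dd if=/dev/urandom of="$target" bs=1 seek=$(($1)) count=$(($2)) \
            2>/dev/null || return 1
        sync
    done
}

# Print the apparent size in bytes and the number of allocated blocks
# (GNU stat; the block unit is normally 512 bytes).
report_allocation() {
    stat -c '%s %b' "$1"
}
```

The extent-dumping utility linked above is not reproduced here. As a stand-in (not the same program), "filefrag -v FILE" from e2fsprogs also prints a file's extents using FIEMAP, so its output can be compared with what report_allocation prints.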

* Re: Kernel 3.0.0 + ext4 + ceph == ...
  2011-08-08 20:07                             ` Christian Brunner
@ 2011-08-18  9:19                               ` Christian Brunner
  0 siblings, 0 replies; 4+ messages in thread
From: Christian Brunner @ 2011-08-18  9:19 UTC (permalink / raw)
  To: linux-ext4; +Cc: Sage Weil, Theodore Tso, Fyodor Ustinov, ceph-devel

I'm sorry, but I have to correct this:

The problem is still happening with 3.0.1, although it only seems to
happen under high load now.

I also did some tracing (with 3.0.0, as the problem is easier to
reproduce there). It might be interesting to note that the corruption
does not occur when I run "strace -f cosd" (maybe a race condition?).

To reproduce the problem I have now set up a ceph cluster on a single machine
with replication between /ceph/osd.000 and /ceph/osd.001.

My setup now has only two active placement groups with 2 objects.

The corruption happens when I start replication from osd.000 to
osd.001. It is reproducible most of the time (but not always) when I
do the following:

# mkfs.ext4 -T largefile /dev/sdb1
# mount -o noatime,user_xattr /dev/sdb1 /ceph/osd.001/
# cosd -i 001 --mkjournal --mkfs --monmap /tmp/monmap
# /usr/bin/cosd -d -i 001 -c /etc/ceph/ceph.conf


### wait until replication has finished and then stop the cosd

# umount /dev/sdb1
# fsck.ext4 -f /dev/sdb1
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Inode 43, i_blocks is 8, should be 16.  Fix<y>? no

Inode 2078, i_blocks is 24, should be 16.  Fix<y>? no



I can also provide an e2image with the metadata and the strace output
of the cosd, if this would be helpful.

Regards,
Christian


^ permalink raw reply	[flat|nested] 4+ messages in thread
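
In the transcript above every "Fix?" prompt is answered with "no", so the filesystem is left as found. According to the e2fsck man page, the exit status is a sum of condition bits, and a run that leaves errors uncorrected sets bit 4. A small sketch for decoding that status in a reproduction script follows; the function name describe_e2fsck_status is made up for illustration.

```shell
#!/bin/sh
# Translate an e2fsck exit status (a sum of condition bits, per the e2fsck
# man page) into one line of text per condition that is set.
describe_e2fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    for entry in "1:errors corrected" \
                 "2:errors corrected, reboot recommended" \
                 "4:errors left uncorrected" \
                 "8:operational error" \
                 "16:usage or syntax error" \
                 "32:canceled by user" \
                 "128:shared library error"; do
        bit=${entry%%:*}
        text=${entry#*:}
        # Report the condition if its bit is set in the status.
        if [ $(( status & bit )) -ne 0 ]; then
            echo "$text"
        fi
    done
}
```

A possible use after a read-only check, assuming the OSD filesystem is on /dev/sdb1 and is unmounted: run "fsck.ext4 -fn /dev/sdb1" and then "describe_e2fsck_status $?". The -n option opens the filesystem read-only and answers "no" to all questions.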

end of thread, other threads:[~2011-11-15 15:46 UTC | newest]

Thread overview: 4+ messages
     [not found] <4E33D101.1050504@ufm.su>
     [not found] ` <CAO47_-_EC4s1HF1pOGNzPRitYGyigOd1hfgz1qDPy6dqwGMMQA@mail.gmail.com>
2011-07-30 14:53   ` Fwd: Kernel 3.0.0 + ext4 + ceph == Christian Brunner
2011-11-15 15:46     ` Eric Sandeen
     [not found] ` <9BF9E529-C532-4A94-8362-93C2D1B778DB@mit.edu>
     [not found]   ` <4E3432FC.9030204@ufm.su>
     [not found]     ` <20110730165001.GI7361@thunk.org>
     [not found]       ` <Pine.LNX.4.64.1107301016120.23447@cobra.newdream.net>
     [not found]         ` <20110730221900.GK7361@thunk.org>
     [not found]           ` <Pine.LNX.4.64.1107302149430.23447@cobra.newdream.net>
     [not found]             ` <4E353D9E.5080802@ufm.su>
     [not found]               ` <Pine.LNX.4.64.1107310951550.2348@cobra.newdream.net>
     [not found]                 ` <4E35B833.6070304@ufm.su>
     [not found]                   ` <Pine.LNX.4.64.1107311339530.23447@cobra.newdream.net>
     [not found]                     ` <80E3795B-C981-492F-9312-DC91D57E4017@mit.edu>
     [not found]                       ` <Pine.LNX.4.64.1108010918580.6290@cobra.newdream.net>
     [not found]                         ` <CAO47_-9DmxqfBsBF2K_8ScX_4d-HPz01QeQ-2FFwZS-nCDEOsw@mail.gmail.com>
     [not found]                           ` <CAC-hyiHzmn25ryJkNUdzQvk7c7chwVDfmwDeo8X2+4zTbDuFGQ@mail.gmail.com>
2011-08-08 20:07                             ` Christian Brunner
2011-08-18  9:19                               ` Christian Brunner
