* Fwd: Kernel 3.0.0 + ext4 + ceph == ...
From: Christian Brunner @ 2011-07-30 14:53 UTC
To: linux-ext4, ceph-devel

Fyodor and I are struggling to get a fully stable ceph cluster up and running.

When we run a Ceph object store (OSD) on top of an ext4 filesystem, we
get fsck errors when we check the filesystem (see below).

Fyodor is running 3.0.
I am running a RHEL6.1 kernel (2.6.32-131.6.1.el6.x86_64).

Any help or hints on how to trace the bug would be appreciated.

Thanks,
Christian

2011/7/30 Fyodor Ustinov <ufm@ufm.su>:
> fail. Epic fail.
>
> Absolutely reproducible.
>
> I have a ceph cluster with this configuration:
>
> 8 physical servers
> 14 osd servers.
> Each osd server has its own fs.
> 48T total size of ceph cluster.
> 17T used.
>
> Now, step by step:
>
> 1. Stop ceph server osd0
> /etc/init.d/ceph stop
>
> 2. Make a fresh fs for the osd
> umount /osd.0
> mkfs.ext4 /dev/sdc1
> tune2fs -o journal_data_writeback /dev/sdc1
> mount -a
> # string from /etc/fstab:
> # /dev/sdc1 /osd.0 ext4 user_xattr,rw,noexec,nodev,noatime,nodiratime,data=writeback,barrier=0 0 2
> ceph mon getmap -o /tmp/monmap
> cosd --mkfs -i 0 --monmap /tmp/monmap
>
> 3. Start ceph server osd0
> /etc/init.d/ceph start
>
> Now, make a big cup of coffee and begin to wait.
>
> After rebalancing has completed, do:
> /etc/init.d/ceph stop
> umount /osd.0
> fsck.ext4 -fy /dev/sdc1
>
> and see many, many messages like:
>
> Inode 238551053, i_blocks is 24, should be 32. Fix? yes
>
> Inode 238551054, i_blocks is 40, should be 32. Fix? yes
>
> Inode 238551066, i_blocks is 24, should be 32. Fix? yes
>
> Inode 238944257, i_blocks is 8, should be 16. Fix? yes
>
> Inode 239206414, i_blocks is 8, should be 16. Fix? yes
>
> Inode 239206416, i_blocks is 40, should be 32. Fix? yes
>
> Inode 239206431, i_blocks is 8, should be 16. Fix? yes
>
> Inode 239206441, i_blocks is 24, should be 32. Fix? yes
>
> Voila.
>
> P.S. No messages in syslog. No messages on the console.
>
> WBR,
> Fyodor.
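A minimal sketch for summarizing fsck output like the report above. It assumes the standard e2fsck wording "i_blocks is X, should be Y"; the helper name iblocks_delta is made up for illustration. i_blocks is normally counted in 512-byte units, so a difference of 8 would correspond to one 4 KiB filesystem block (the block size is an assumption; the thread does not state it).

```shell
#!/bin/sh
# iblocks_delta: read e2fsck output on stdin and print, for every
# "i_blocks is X, should be Y" report, the inode number followed by
# (Y - X) in 512-byte units. Hypothetical helper, not part of e2fsprogs.
iblocks_delta() {
    awk '/i_blocks is [0-9]+, should be [0-9]+/ {
        for (i = 1; i < NF; i++) {
            # Pick the number that follows each keyword and strip
            # the trailing punctuation e2fsck attaches to it.
            if ($i == "Inode") { inode = $(i + 1); sub(/,$/, "", inode) }
            if ($i == "is")    { got = $(i + 1);   sub(/,$/, "", got) }
            if ($i == "be")    { want = $(i + 1);  sub(/\.$/, "", want) }
        }
        print inode, want - got
    }'
}

# Usage (assumes the fsck output was saved to a file named fsck.log):
#   iblocks_delta < fsck.log | awk '{ print $2 }' | sort -n | uniq -c
```

Piping the result through sort and uniq, as in the usage comment, gives a quick histogram of the deltas, which makes it easy to see whether every mismatch is off by the same amount.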
* Re: Fwd: Kernel 3.0.0 + ext4 + ceph == ...
From: Eric Sandeen @ 2011-11-15 15:46 UTC
To: chb; +Cc: linux-ext4, ceph-devel

On 7/30/11 9:53 AM, Christian Brunner wrote:
> Fyodor and I are struggling to get a fully stable ceph cluster up and running.
>
> When we run an Ceph-Objectstore (OSD) ontop of an ext4 filesystem, we
> get fsck errors, when we check the filesystem (see below).

BTW, this should be fixed now as of my commit

6d6a435190bdf2e04c9465cde5bdc3ac68cf11a4 ext4: fix race in xattr block allocation path

I think it made its way to a couple older -stable kernels, too.

-Eric
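One way to check whether a given kernel tree carries the fix mentioned above, sketched under assumptions: you have a git clone of the kernel source, and stable backports mention the upstream commit id in their commit message (the usual stable-tree convention, but not guaranteed for vendor kernels). The helper name has_fix is hypothetical.

```shell
#!/bin/sh
# has_fix TREE COMMIT [REF]
# Succeeds if COMMIT is an ancestor of REF (default HEAD) in the git
# clone at TREE, or if some commit reachable from REF mentions COMMIT
# in its message, as stable backports usually do.
has_fix() {
    tree=$1 id=$2 ref=${3:-HEAD}
    # Direct ancestry: true for mainline kernels that include the commit.
    git -C "$tree" merge-base --is-ancestor "$id" "$ref" 2>/dev/null && return 0
    # Backport: look for the upstream id quoted in a commit message.
    [ -n "$(git -C "$tree" log -1 --format=%H --grep="$id" "$ref")" ]
}

# Usage (the path and tag are examples only):
#   has_fix ~/src/linux 6d6a435190bdf2e04c9465cde5bdc3ac68cf11a4 v3.0.9 \
#       && echo "fix present"
```

For a distribution kernel such as RHEL, the package changelog is likely a better source than git history, since vendor backports do not always quote the upstream id.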
* Re: Kernel 3.0.0 + ext4 + ceph == ...
From: Christian Brunner @ 2011-08-08 20:07 UTC
To: Yehuda Sadeh Weinraub
Cc: Sage Weil, Theodore Tso, Fyodor Ustinov, ceph-devel, linux-ext4

I tried 3.0.1 today, which contains the commit Theodore suggested, and
I was no longer able to reproduce the problem.

So I think the corruption we have seen is indeed related to:

commit 7132de744ba76930d13033061018ddd7e3e8cd91
Author: Maxim Patlasov <maxim.patlasov@gmail.com>
Date:   Sun Jul 10 19:37:48 2011 -0400

    ext4: fix i_blocks/quota accounting when extent insertion fails

I will now try to apply this patch to the RHEL6.1 kernel and see what
happens...

Thanks for your help.

Christian

2011/8/3 Yehuda Sadeh Weinraub <yehuda.sadeh@dreamhost.com>:
> On Wed, Aug 3, 2011 at 7:16 AM, Christian Brunner <chb@muc.de> wrote:
> ...
>> I tried to reproduce this without ceph, but wasn't able to...
>>
>> In the meantime it seems that I can also see the side effects on the
>> librbd side: I get a "librbd: data error!" when I do an "rbd copy".
>>
>> When I look at the librbd code, this is related to a sparse_read not
>> returning the right size of the object.
>>
>> I don't know if it helps, but I think that the problem is also related
>> to sparse file usage.
>>
>
> There were a few sparse-read issues that we fixed not too long ago,
> but they should have been fixed for at least the previous ceph version.
> I'm not sure what version you're using.
> There was an ext4 fiemap issue that I was hitting on specific
> environments but couldn't determine whether it was fixed in later
> kernel versions (I was using 2.6.32). Now is a good time to try and
> get to the bottom of it. Here's a script I was using to reproduce it:
>
> #!/bin/sh
> dd if=/dev/urandom of=bla bs=1 seek=$((0x6f000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x70000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x71000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x72000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x73000)) count=$((0x1000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x74000)) count=$((0x2000)); sync
> dd if=/dev/urandom of=bla bs=1 seek=$((0x2ae000)) count=$((0x2000)); sync
>
> You can compile and run the following utility to dump all the extents:
> http://pastebin.com/h2Cnpk2Q
>
> Thanks,
> Yehuda
>
> Oh, btw, you can effectively disable the use of fiemap by setting the
> 'filestore fiemap threshold' config option to a large enough value
> (e.g., anything bigger than 4 MB should be enough for rbd).
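The quoted reproduction script can be sketched as a loop, which makes the offset/length pairs easier to vary. This is an illustration, not Yehuda's original: make_sparse_file is a made-up helper name, and the filefrag suggestion in the comments assumes e2fsprogs is installed (filefrag -v lists a file's extents and is offered here only as a possible stand-in for the pastebin utility).

```shell
#!/bin/sh
# make_sparse_file FILE
# Recreate the sparse test file, one written region per loop pass.
# Hypothetical helper; the regions are copied from the dd lines above.
make_sparse_file() {
    file=$1
    rm -f "$file"
    # "offset:length" pairs in bytes. Offsets increase, so each dd
    # keeps the data already written before its seek position.
    for region in 0x6f000:0x1000 0x70000:0x1000 0x71000:0x1000 \
                  0x72000:0x1000 0x73000:0x1000 0x74000:0x2000 \
                  0x2ae000:0x2000; do
        off=${region%%:*}
        len=${region##*:}
        dd if=/dev/urandom of="$file" bs=1 \
           seek=$(($off)) count=$(($len)) 2>/dev/null
        sync    # flush after every region, as the original does
    done
}

# Usage:
#   make_sparse_file bla
#   ls -ls bla        # allocated blocks vs. apparent size
#   filefrag -v bla   # dump the extent map
```

The final apparent size is 0x2ae000 + 0x2000 = 0x2b0000 bytes, while only the written regions should be allocated; comparing the extent map against those regions is the point of the test.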
* Re: Kernel 3.0.0 + ext4 + ceph == ...
From: Christian Brunner @ 2011-08-18 9:19 UTC
To: linux-ext4; +Cc: Sage Weil, Theodore Tso, Fyodor Ustinov, ceph-devel

I'm sorry that I have to correct this: the problem is still happening
with 3.0.1, although it only seems to happen under high load now.

I also did some tracing (with 3.0.0, as the problem is easier to
reproduce there). What might be interesting to note is that the
corruption does not occur when I do an "strace -f cosd". (Maybe a race
condition?)

To reproduce the problem I have now set up a ceph cluster on a single
machine with replication between /ceph/osd.000 and /ceph/osd.001. My
setup now has only two active placement groups with 2 objects.

The corruption happens when I start replication from osd.000 to
osd.001. It is reproducible most of the time (but not always) when I
do the following:

# mkfs.ext4 -T largefile /dev/sdb1
# mount -o noatime,user_xattr /dev/sdb1 /ceph/osd.001/
# cosd -i 001 --mkjournal --mkfs --monmap /tmp/monmap
# /usr/bin/cosd -d -i 001 -c /etc/ceph/ceph.conf
### wait until replication has finished and then stop the cosd
# umount /dev/sdb1
# fsck.ext4 -f /dev/sdb1
e2fsck 1.41.12 (17-May-2010)
Pass 1: Checking inodes, blocks, and sizes
Inode 43, i_blocks is 8, should be 16. Fix<y>? no

Inode 2078, i_blocks is 24, should be 16. Fix<y>? no

I can also provide an e2image with the metadata and the strace output
of the cosd, if this would be helpful.

Regards,
Christian
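Because the corruption above shows up only on some runs, it can help to turn the final fsck step into a number that is easy to log per attempt. A minimal sketch, assuming the standard e2fsck message wording; count_mismatches and check_device are hypothetical helper names.

```shell
#!/bin/sh
# count_mismatches: count "i_blocks is X, should be Y" reports in
# e2fsck output read from stdin. Prints 0 when there are none.
count_mismatches() {
    grep -c 'i_blocks is [0-9]*, should be [0-9]*'
}

# check_device DEV: forced, read-only check of an unmounted device.
# -f checks even if the filesystem is marked clean; -n opens it
# read-only and answers "no" to every prompt, so nothing is repaired
# and the evidence stays on disk for later inspection.
check_device() {
    e2fsck -fn "$1" 2>&1 | count_mismatches
}

# Usage (device name taken from the steps above; run after umount):
#   n=$(check_device /dev/sdb1); echo "run $(date +%s): $n mismatches"
```

Using -n rather than -y keeps the filesystem untouched, so an e2image of the metadata can still be captured after a failing run.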