* fsck.ext4: Group descriptors look bad... trying backup blocks...
@ 2009-04-17 11:03 Jeremy Sanders
2009-04-17 11:26 ` Jeremy Sanders
` (2 more replies)
0 siblings, 3 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-17 11:03 UTC (permalink / raw)
To: linux-ext4
Hi - I'm trying out ext4 on a large 8.2 TB software raid device (md). On
rebooting (cleanly unmounting), I tried an fsck on the device. I get the
following:
[root@xback2 ~]# fsck /dev/md0
fsck 1.41.4 (27-Jan-2009)
e2fsck 1.41.4 (27-Jan-2009)
fsck.ext4: Group descriptors look bad... trying backup blocks...
Group descriptor 0 checksum is invalid. Fix<y>?
It then finds lots of bad group descriptors.
This is Fedora 10, 2.6.27.21-170.2.56.fc10.x86_64 and
e2fsprogs-1.41.4-4.fc10.x86_64.
Jeremy
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Jeremy Sanders @ 2009-04-17 11:26 UTC (permalink / raw)
To: linux-ext4
Some more information about the device:

[root@xback2 ~]# dumpe2fs /dev/md0| head -100
dumpe2fs 1.41.4 (27-Jan-2009)
Filesystem volume name: <none>
Last mounted on: <not available>
Filesystem UUID: 508aee62-79af-4b4c-95a6-222b3868834c
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 549314560
Block count: 2197239840
Reserved block count: 0
Free blocks: 1508301443
Free inodes: 545311753
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 500
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
RAID stride: 8
RAID stripe width: 72
Flex block group size: 16
Filesystem created: Fri Apr 10 17:13:08 2009
Last mount time: Mon Apr 13 11:00:22 2009
Last write time: Fri Apr 17 11:53:05 2009
Mount count: 1
Maximum mount count: -1
Last checked: Fri Apr 10 17:13:08 2009
Check interval: 0 (<none>)
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 9c9b9fd6-5af2-4ee0-bceb-25827cb008f9
Journal backup: inode blocks
Journal size: 128M

Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Checksum 0xd3b2, unused inodes 4032
  Primary superblock at 0, Group descriptors at 1-524
  Reserved GDT blocks at 525-1024
  Block bitmap at 1025 (+1025), Inode bitmap at 1041 (+1041)
  Inode table at 1057-1568 (+1057)
  856 free blocks, 4032 free inodes, 268 directories, 4032 unused inodes
Group 1: (Blocks 32768-65535) [INODE_UNINIT, ITABLE_ZEROED]
  Checksum 0xc586, unused inodes 8192
  Backup superblock at 32768, Group descriptors at 32769-33292
  Reserved GDT blocks at 33293-33792
  Block bitmap at 1026 (+4294935554), Inode bitmap at 1042 (+4294935570)
  Inode table at 1569-2080 (+4294936097)
  939 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Group 2: (Blocks 65536-98303) [INODE_UNINIT, ITABLE_ZEROED]
...
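The geometry in this dumpe2fs output can be cross-checked with a little arithmetic. The sketch below is not part of the original message; it assumes 32-byte group descriptors (the classic layout for a filesystem without the 64bit feature), and the helper names are made up for illustration. It reproduces the 524 descriptor blocks reported for group 0 and suggests that the very large "+4294935554" offsets in group 1 are simply negative offsets (the bitmaps live in group 0 because of flex_bg) displayed as unsigned 32-bit numbers.

```python
# Figures copied from the dumpe2fs output in this message.
BLOCK_COUNT = 2197239840
BLOCKS_PER_GROUP = 32768
BLOCK_SIZE = 4096
# Assumption, not stated in the thread: 32-byte group descriptors.
DESC_SIZE = 32


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def group_count(blocks, per_group):
    """Number of block groups needed to cover the filesystem."""
    return ceil_div(blocks, per_group)


def gdt_blocks(groups, desc_size, block_size):
    """Number of blocks occupied by the group descriptor table."""
    return ceil_div(groups * desc_size, block_size)


def relative_offset_u32(block, group_start):
    """Offset of a block from its group's first block, as unsigned 32-bit."""
    return (block - group_start) % 2**32


groups = group_count(BLOCK_COUNT, BLOCKS_PER_GROUP)
print(groups)                                     # 67055
print(gdt_blocks(groups, DESC_SIZE, BLOCK_SIZE))  # 524, matching "1-524"
print(relative_offset_u32(1026, 32768))           # 4294935554, as printed for group 1
```

If this reading is right, those offsets are a display artifact rather than evidence of corruption.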
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Theodore Tso @ 2009-04-17 11:56 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-ext4
On Fri, Apr 17, 2009 at 12:03:33PM +0100, Jeremy Sanders wrote:
> Hi - I'm trying out ext4 on a large 8.2 TB software raid device (md). On
> rebooting (cleanly unmounting), I tried an fsck on the device. I get the
> following:
>
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix<y>?
>
> It then finds lots of bad group descriptors.

What happened afterwards? Did fsck complete successfully?

I see from the dumpe2fs that you sent it had only been in use for a
week. How were you using the filesystem? Did you try using the
online resize feature at any time?

The problem is that any number of things could have caused the block
group descriptors to be corrupted.

- Ted
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Jeremy Sanders @ 2009-04-17 12:16 UTC (permalink / raw)
To: linux-ext4
Theodore Tso wrote:
> What happened afterwards? Did fsck complete successfully?

I was waiting to see whether you wanted me to do something else. I've just
tried it and it didn't:

[root@xback2 ~]# fsck -a /dev/md0
fsck 1.41.4 (27-Jan-2009)
/dev/md0: Group descriptor 384 checksum is invalid. FIXED.
/dev/md0: Group descriptor 385 checksum is invalid. FIXED.
/dev/md0: Group descriptor 386 checksum is invalid. FIXED.
/dev/md0: Group descriptor 387 checksum is invalid. FIXED.
/dev/md0: Group descriptor 388 checksum is invalid. FIXED.
/dev/md0: Group descriptor 389 checksum is invalid. FIXED.
/dev/md0: Group descriptor 390 checksum is invalid. FIXED.
/dev/md0: Group descriptor 391 checksum is invalid. FIXED.
/dev/md0: Group descriptor 392 checksum is invalid. FIXED.
/dev/md0: Group descriptor 393 checksum is invalid. FIXED.
/dev/md0: Group descriptor 394 checksum is invalid. FIXED.
/dev/md0: Group descriptor 395 checksum is invalid. FIXED.
/dev/md0: Group descriptor 396 checksum is invalid. FIXED.
/dev/md0: Group descriptor 397 checksum is invalid. FIXED.
/dev/md0: Group descriptor 398 checksum is invalid. FIXED.
/dev/md0: Group descriptor 399 checksum is invalid. FIXED.
/dev/md0: Group descriptor 400 checksum is invalid. FIXED.
/dev/md0: Group descriptor 401 checksum is invalid. FIXED.
/dev/md0: Group descriptor 402 checksum is invalid. FIXED.
/dev/md0: Group descriptor 403 checksum is invalid. FIXED.
/dev/md0: Group descriptor 404 checksum is invalid. FIXED.
/dev/md0: Note: if several inode or block bitmap blocks or part
of the inode table require relocation, you may wish to try
running e2fsck with the '-b 32768' option first. The problem
may lie only with the primary block group descriptors, and
the backup block group descriptors may be OK.
/dev/md0: Block bitmap for group 405 is not in group. (block 3393946179)
/dev/md0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
(i.e., without -a or -p options)

**

When I run it manually I get:

Pass 1: Checking inodes, blocks, and sizes
Inode 8355 has imagic flag set. Clear<y>? yes
Inode 8355 has a extra size (62017) which is invalid Fix<y>? yes
Inode 8355 has compression flag set on filesystem without compression support. Clear<y>? yes
Inode 8355 has a bad extended attribute block 2170352193. Clear<y>? yes
Inode 8355 has INDEX_FL flag set but is not a directory. Clear HTree index<y>? yes
Inode 8355, i_size is 9321591691907232321, should be 0. Fix<y>? yes
Inode 8355, i_blocks is 266363157148225, should be 0. Fix<y>? yes
Inode 8356 is in use, but has dtime set. Fix<y>? yes
Inode 8356 has imagic flag set. Clear<y>? yes
Inode 8356 has a extra size (62017) which is invalid Fix<y>? yes
Inode 8356 has compression flag set on filesystem without compression support. Clear<y>? yes
Inode 8356 has a bad extended attribute block 2170352193. Clear<y>? yes
Inode 8356 has INDEX_FL flag set but is not a directory. Clear HTree index<y>? yes
Inode 8356, i_size is 9321591691907232321, should be 0. Fix<y>? yes
Inode 8356, i_blocks is 266363157148225, should be 0. Fix<y>? yes
Inode 8357 is in use, but has dtime set. Fix<y>? yes
Inode 8357 has imagic flag set. Clear<y>? yes
Inode 8357 has a extra size (62017) which is invalid Fix<y>? yes
Inode 8357 has compression flag set on filesystem without compression support. Clear<y>? yes
Inode 8357 has a bad extended attribute block 2170352193. Clear<y>? yes
Inode 8357 has INDEX_FL flag set but is not a directory. Clear HTree index<y>? yes

> I see from the dumpe2fs that you sent it had only been in use for a
> week. How were you using the filesystem? Did you try using the
> online resize feature at any time?

No. The filesystem was used to store rsync snapshots of other file systems
(using the hard link feature). I had only rsynced the initial data and run a
couple of rsync backups on to it. The filesystem was created using:

mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0

> The problem is that any number of things could have caused the block
> group descriptors to be corrupted.

Oh dear. The system has ECC ram (though linux doesn't know about it, so it
may not be working) and the md device is using 10 drives on raid5 and a
3ware controller. Maybe I should force a md raid5 resync to check the drives
agree with each other.

Jeremy
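The mkfs.ext4 options quoted here line up with the array described in the same message. As a rough sketch (not from the thread; the function name is illustrative), stride is the RAID chunk size in filesystem blocks and stripe-width is stride multiplied by the number of data-bearing disks, so the values imply a 32 KiB chunk and 9 data disks, which fits a 10-drive RAID5 with one drive's worth of parity.

```python
def raid_geometry(stride_blocks, stripe_width_blocks, block_size):
    """Derive chunk size (bytes) and data-disk count from mke2fs -E options."""
    chunk_bytes = stride_blocks * block_size
    data_disks = stripe_width_blocks // stride_blocks
    full_stripe_bytes = stripe_width_blocks * block_size
    return chunk_bytes, data_disks, full_stripe_bytes


# Values from: mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0
chunk, data_disks, full_stripe = raid_geometry(8, 72, 4096)
print(chunk // 1024, "KiB chunk")          # 32 KiB chunk
print(data_disks, "data disks")            # 9 data disks
print(full_stripe // 1024, "KiB stripe")   # 288 KiB stripe
```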
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Eric Sandeen @ 2009-04-17 17:10 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-ext4
Jeremy Sanders wrote:
> Theodore Tso wrote:

...

>> I see from the dumpe2fs that you sent it had only been in use for a
>> week. How were you using the filesystem? Did you try using the
>> online resize feature at any time?
>
> No. The filesystem was used to store rsync snapshots of other file systems
> (using the hard link feature). I had only rsynced the initial data and run a
> couple of rsync backups on to it. The filesystem was created using:
>
> mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0

Can you show us exactly how you're using rsync? is this with
rdiff-backup or some similar tool?

Thanks,
-Eric
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Jeremy Sanders @ 2009-04-17 18:51 UTC (permalink / raw)
To: Eric Sandeen; +Cc: linux-ext4
On Fri, 17 Apr 2009, Eric Sandeen wrote:
> Jeremy Sanders wrote:
>> Theodore Tso wrote:
>
> ...
>
>>> I see from the dumpe2fs that you sent it had only been in use for a
>>> week. How were you using the filesystem? Did you try using the
>>> online resize feature at any time?
>>
>> No. The filesystem was used to store rsync snapshots of other file systems
>> (using the hard link feature). I had only rsynced the initial data and run a
>> couple of rsync backups on to it. The filesystem was created using:
>>
>> mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0
>
> Can you show us exactly how you're using rsync? is this with
> rdiff-backup or some similar tool?

No, plain rsync. We have a script which does something like

rsync -raHSx --stats --whole-file --numeric-ids \
 --link-dest=/mnt/username/20090418/ host:/data/username/ \
 /mnt/username/20090419/

for a set of users. This command copies the files from /data/username on
host to /mnt/username/20090419, but creates hard links to the previous copy
(/mnt/username/20090418/) for unchanged files.

It worked fine on ext3, at least for a 2.4TB device.

Jeremy
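The effect of --link-dest described above can be illustrated without rsync. The following is only a sketch of the hard-link idea, not of rsync itself: it creates two dated directories in a temporary location (the directory and file names are hypothetical) and links an "unchanged" file from the older snapshot into the newer one, so both names share one inode and the data is stored once.

```python
import os
import tempfile


def link_unchanged_file():
    """Hard-link a file from a 'previous' snapshot directory into a 'current' one."""
    with tempfile.TemporaryDirectory() as root:
        prev_dir = os.path.join(root, "20090418")  # hypothetical snapshot names
        cur_dir = os.path.join(root, "20090419")
        os.mkdir(prev_dir)
        os.mkdir(cur_dir)

        prev_file = os.path.join(prev_dir, "data.txt")
        with open(prev_file, "w") as handle:
            handle.write("unchanged since yesterday\n")

        # An unchanged file becomes a second name for the same inode.
        cur_file = os.path.join(cur_dir, "data.txt")
        os.link(prev_file, cur_file)

        prev_stat = os.stat(prev_file)
        cur_stat = os.stat(cur_file)
        return prev_stat.st_ino == cur_stat.st_ino, cur_stat.st_nlink


same_inode, link_count = link_unchanged_file()
print(same_inode, link_count)
```

A workload like this creates many directory entries and inodes relative to the amount of new data written, which is part of why it is a useful stress test for filesystem metadata.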
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Jeremy Sanders @ 2009-04-17 12:24 UTC (permalink / raw)
To: linux-ext4
Theodore Tso wrote:
> I see from the dumpe2fs that you sent it had only been in use for a
> week. How were you using the filesystem? Did you try using the
> online resize feature at any time?

I assume that this isn't enough to corrupt the filesystem?

[root@xback2 ~]# tune2fs -i -1 /dev/md0
tune2fs 1.41.4 (27-Jan-2009)
Setting interval between checks to 18446744073709465216 seconds

Jeremy
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Theodore Tso @ 2009-04-17 16:36 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-ext4
On Fri, Apr 17, 2009 at 01:24:16PM +0100, Jeremy Sanders wrote:
> Theodore Tso wrote:
>
> > I see from the dumpe2fs that you sent it had only been in use for a
> > week. How were you using the filesystem? Did you try using the
> > online resize feature at any time?
>
> I assume that this isn't enough to corrupt the filesystem?
>
> [root@xback2 ~]# tune2fs -i -1 /dev/md0
> tune2fs 1.41.4 (27-Jan-2009)
> Setting interval between checks to 18446744073709465216 seconds

No, but it won't do what you want, either. To disable time-based
checks, you should use "tune2fs -i 0 /dev/md0". Tune2fs should have
flagged an error when you specified -1; I'll have to fix that.

- Ted
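The odd interval that tune2fs printed has a simple numerical explanation. The sketch below does not reproduce tune2fs internals, which are not shown in the thread; it only demonstrates that -1 day converted to seconds and then read as an unsigned 64-bit integer gives exactly the value in the output.

```python
SECONDS_PER_DAY = 86400


def as_unsigned_64(value):
    """Interpret a (possibly negative) integer as an unsigned 64-bit value."""
    return value % 2**64


# "-i -1" means an interval of -1 days.
interval = as_unsigned_64(-1 * SECONDS_PER_DAY)
print(interval)  # 18446744073709465216
```

This is consistent with Ted's remark that the negative argument should have been rejected rather than accepted.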
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Eric Sandeen @ 2009-04-17 17:00 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-ext4
Jeremy Sanders wrote:
> Hi - I'm trying out ext4 on a large 8.2 TB software raid device (md). On
> rebooting (cleanly unmounting), I tried an fsck on the device. I get the
> following:
>
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix<y>?
>
> It then finds lots of bad group descriptors.
>
> This is Fedora 10, 2.6.27.21-170.2.56.fc10.x86_64 and
> e2fsprogs-1.41.4-4.fc10.x86_64.
>

Jeremy, if you're willing, could you upgrade to the 2.6.29 kernel that's
in F10 updates-testing? That way the ext4 code is a bit more of a
recent, common codebase. Also, if this is a test fs, re-mkfs'ing from
scratch might not be a bad way to go.

Depending on how hard it is to reproduce, it may also be interesting to
try a filesystem just shy of 8TB (2^31) blocks in case there is some
32-bit wrap-around there, since you're at 8.2T....

-Eric
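The suspicion about a 32-bit wrap-around is easy to check against the numbers already posted. A minimal sketch, using only the block count and block size from the dumpe2fs output (the helper name is illustrative): the filesystem has more than 2^31 blocks, so the highest block numbers do not fit in a signed 32-bit integer, although they still fit in an unsigned one.

```python
BLOCK_COUNT = 2197239840  # "Block count" from dumpe2fs
BLOCK_SIZE = 4096         # "Block size" from dumpe2fs


def to_signed_32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value %= 2**32
    return value - 2**32 if value >= 2**31 else value


size_tib = BLOCK_COUNT * BLOCK_SIZE / 2**40
blocks_past_limit = BLOCK_COUNT - 2**31
last_block = BLOCK_COUNT - 1

print(round(size_tib, 2))        # 8.19, i.e. the "8.2 TB" in the thread
print(blocks_past_limit)         # 49756192 blocks lie beyond 2^31
print(BLOCK_COUNT < 2**32)       # True: still fits in unsigned 32 bits
print(to_signed_32(last_block))  # negative if stored in a signed 32-bit variable
```

Any code path that holds a block number in a signed 32-bit variable would therefore only misbehave for the last few percent of the device, which may explain why small tests did not trigger the problem.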
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Jeremy Sanders @ 2009-04-20 9:33 UTC (permalink / raw)
To: linux-ext4
Eric Sandeen wrote:
> Jeremy, if you're willing, could you upgrade to the 2.6.29 kernel that's
> in F10 updates-testing? That way the ext4 code is a bit more of a
> recent, common codebase. Also, if this is a test fs, re-mkfs'ing from
> scratch might not be a bad way to go.
>
> Depending on how hard it is to reproduce, it may also be interesting to
> try a filesystem just shy of 8TB (2^31) blocks in case there is some
> 32-bit wrap-around there, since you're at 8.2T....

I wasn't able to trivially reproduce the problem with the old kernel, but I
updated to 2.6.29.1-30.fc10.x86_64 in updates testing. This introduced some
further problems with a USB issue and some sort of stack dump probably
associated with the r8169 driver (see bugzilla).

However, the system seems to mostly work, so I recreated the ext4 device,
I've just run my backup script again and fsck'd the device. It seems the
problem is reproducible with the new kernel:

[root@xback2 ~]# fsck /dev/md0
fsck 1.41.4 (27-Jan-2009)
e2fsck 1.41.4 (27-Jan-2009)
fsck.ext4: Group descriptors look bad... trying backup blocks...
Group descriptor 0 checksum is invalid. Fix<y>?

Looks like there's a real problem in ext4 causing this under certain
circumstances (unless an obscure hardware error is somehow giving the same
problem).

To cause this, all I did was rsync a set of directories to the disk. No hard
link trees were created.

Jeremy
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Theodore Tso @ 2009-04-20 11:35 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-ext4
On Mon, Apr 20, 2009 at 10:33:09AM +0100, Jeremy Sanders wrote:
>
> However, the system seems to mostly work, so I recreated the ext4 device,
> I've just run my backup script again and fsck'd the device. It seems the
> problem is reproducible with the new kernel:

When you say reproducible, how many times have you tried it, and were
you able to reproduce it every single time? 50% of time? I do
believe there is a problem, but we haven't been able to find something
where it's easily reproducible. So if you can easily reproduce this,
this is definitely very exciting.

> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix<y>?

Do you have to reboot to see this, or is it enough to unmount the
filesystem? How big is the ext4 filesystem, and how big was the
amount of data that you rsync'ed? One thing that would be worth
trying if you can easily reproduce is whether it happens on a single
device disk, or whether it only shows up when you use a /dev/mdX
device.

Thanks,

- Ted
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Jeremy Sanders @ 2009-04-20 11:43 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4
On Mon, 20 Apr 2009, Theodore Tso wrote:
> On Mon, Apr 20, 2009 at 10:33:09AM +0100, Jeremy Sanders wrote:
>>
>> However, the system seems to mostly work, so I recreated the ext4 device,
>> I've just run my backup script again and fsck'd the device. It seems the
>> problem is reproducible with the new kernel:
>
> When you say reproducible, how many times have you tried it, and were
> you able to reproduce it every single time? 50% of time? I do
> believe there is a problem, but we haven't been able to find something
> where it's easily reproducible. So if you can easily reproduce this,
> this is definitely very exciting.

It takes a day or two to do the sync. I've only done it twice (once with
the old kernel, once with the new fedora testing kernel) and it happened
both times. I'm afraid the statistics are rather low number here.

I did a different faster test (just copying my home directory lots of
times), but I wasn't able to get it to fail. That test didn't use much
disk space, however. Maybe it's worth just dd'ing a few TB of data onto
the device and seeing whether that fails.

>> [root@xback2 ~]# fsck /dev/md0
>> fsck 1.41.4 (27-Jan-2009)
>> e2fsck 1.41.4 (27-Jan-2009)
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> Group descriptor 0 checksum is invalid. Fix<y>?
>
> Do you have to reboot to see this, or is it enough to unmount the
> filesystem? How big is the ext4 filesystem, and how big was the
> amount of data that you rsync'ed? One thing that would be worth
> trying if you can easily reproduce is whether it happens on a single
> device disk, or whether it only shows up when you use a /dev/mdX
> device.

I didn't reboot this time - I did last time. I just unmounted the file
system and fsckd it. The filesystem is 8.2TB and the data is around
2.5TB.

The drives are on a 3ware card, so I could configure the card as a single
raid5 device and try to reproduce it there. It may take a day or two to copy
the data if I try this.

Jeremy
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Theodore Tso @ 2009-04-20 12:48 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-ext4
On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
> It takes a day or two to do the sync. I've only done it twice (once with
> the old kernel, once with the new fedora testing kernel) and it happened
> both times. I'm afraid the statistics are rather low number here.
>
> I did a different faster test (just copying my home directory lots of
> times), but I wasn't able to get it to fail. That test didn't use much
> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
> the device and seeing whether that fails.
>
> I didn't reboot this time - I did last time. I just unmounted the file
> system and fsckd it. The filesystem is 8.2TB and the data is around
> 2.5TB.

That's useful data. I wish we could make it fail more quickly
on a smaller rsync, but the fact that you didn't need to reboot is
definitely useful information.

And this is a fresh rsync so no files were being deleted, rsync should
have just been writing new files to .filename.XXXXX and then renaming
the filename to filename.XXXXX when it is done, right?

OK, let me think about this a little. I think we can create a patch
which checks for writes to the block group descriptors and dumps a
stack trace. That would allow us catch the failing code in question
in the act, and maybe figure out what is going on.

- Ted
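The check Ted proposes boils down to asking whether a block number about to be written falls inside a group descriptor table. The following is a user-space illustration of that test only, not the kernel patch itself. It assumes the geometry from the dumpe2fs output earlier in the thread (32768 blocks per group, descriptors occupying 524 blocks right after each superblock copy, 67055 groups derived from the block count) and the standard sparse_super rule that backups live in groups 0, 1 and powers of 3, 5 and 7. The function names are hypothetical.

```python
BLOCKS_PER_GROUP = 32768   # from dumpe2fs
GDT_BLOCKS = 524           # "Group descriptors at 1-524"
GROUP_COUNT = 67055        # ceil(2197239840 / 32768)


def is_power_of(n, base):
    """True if n is base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def has_superblock_backup(group):
    """sparse_super: copies in groups 0, 1 and powers of 3, 5, 7."""
    if group in (0, 1):
        return True
    return any(is_power_of(group, base) for base in (3, 5, 7))


def is_group_descriptor_block(block):
    """True if block lies in a primary or backup group descriptor table."""
    group, offset = divmod(block, BLOCKS_PER_GROUP)
    if group >= GROUP_COUNT or not has_superblock_backup(group):
        return False
    # Offset 0 holds the superblock copy; descriptors follow it.
    return 1 <= offset <= GDT_BLOCKS


print(is_group_descriptor_block(1))      # True: primary table, "1-524"
print(is_group_descriptor_block(33292))  # True: backup in group 1, "32769-33292"
print(is_group_descriptor_block(1025))   # False: block bitmap of group 0
```

A real instrumentation patch would apply a test like this on the write path and dump a stack trace on a hit from an unexpected caller.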
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Jeremy Sanders @ 2009-04-20 12:54 UTC (permalink / raw)
To: Theodore Tso; +Cc: linux-ext4
On Mon, 20 Apr 2009, Theodore Tso wrote:
> That's useful data. I wish we could make it fail more quickly
> on a smaller rsync, but the fact that you didn't need to reboot is
> definitely useful information.
>
> And this is a fresh rsync so no files were being deleted, rsync should
> have just been writing new files to .filename.XXXXX and then renaming
> the filename to filename.XXXXX when it is done, right?

That's what I'd guess. It was onto a clean filesystem, so there shouldn't be
any deletions.

> OK, let me think about this a little. I think we can create a patch
> which checks for writes to the block group descriptors and dumps a
> stack trace. That would allow us catch the failing code in question
> in the act, and maybe figure out what is going on.

Ok.

Jeremy
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Eric Sandeen @ 2009-04-20 14:49 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jeremy Sanders, linux-ext4
Theodore Tso wrote:
> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>> It takes a day or two to do the sync. I've only done it twice (once with
>> the old kernel, once with the new fedora testing kernel) and it happened
>> both times. I'm afraid the statistics are rather low number here.
>>
>> I did a different faster test (just copying my home directory lots of
>> times), but I wasn't able to get it to fail. That test didn't use much
>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>> the device and seeing whether that fails.
>>
>> I didn't reboot this time - I did last time. I just unmounted the file
>> system and fsckd it. The filesystem is 8.2TB and the data is around
>> 2.5TB.

I think trying a filesystem with just under 8T would be a useful test too.

> That's useful data. I wish we could make it fail more quickly
> on a smaller rsync, but the fact that you didn't need to reboot is
> definitely useful information.
>
> And this is a fresh rsync so no files were being deleted, rsync should
> have just been writing new files to .filename.XXXXX and then renaming
> the filename to filename.XXXXX when it is done, right?
>
> OK, let me think about this a little. I think we can create a patch
> which checks for writes to the block group descriptors and dumps a
> stack trace. That would allow us catch the failing code in question
> in the act, and maybe figure out what is going on.

XFS has block-zero tests, because there was once a bug where
uninitialized block numbers in buffers were clobbering the superblock at
block 0. It was helpful, so I think this is a good idea, Ted.

-Eric
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Eric Sandeen @ 2009-04-20 15:51 UTC (permalink / raw)
To: Theodore Tso; +Cc: Jeremy Sanders, linux-ext4
Eric Sandeen wrote:
> Theodore Tso wrote:
>> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>>> It takes a day or two to do the sync. I've only done it twice (once with
>>> the old kernel, once with the new fedora testing kernel) and it happened
>>> both times. I'm afraid the statistics are rather low number here.
>>>
>>> I did a different faster test (just copying my home directory lots of
>>> times), but I wasn't able to get it to fail. That test didn't use much
>>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>>> the device and seeing whether that fails.
>>>
>>> I didn't reboot this time - I did last time. I just unmounted the file
>>> system and fsckd it. The filesystem is 8.2TB and the data is around
>>> 2.5TB.
>
> I think trying a filesystem with just under 8T would be a useful test too.

One other question - do you make use of xattrs on this filesystem?

In case it's not obvious we are very interested in this reproducible
testcase, thank you for being so willing to provide feedback and
testing ....

-Eric
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Jeremy Sanders @ 2009-04-20 15:53 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Theodore Tso, linux-ext4
On Mon, 20 Apr 2009, Eric Sandeen wrote:
> Eric Sandeen wrote:
>> Theodore Tso wrote:
>>> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>>>> It takes a day or two to do the sync. I've only done it twice (once with
>>>> the old kernel, once with the new fedora testing kernel) and it happened
>>>> both times. I'm afraid the statistics are rather low number here.
>>>>
>>>> I did a different faster test (just copying my home directory lots of
>>>> times), but I wasn't able to get it to fail. That test didn't use much
>>>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>>>> the device and seeing whether that fails.
>>>>
>>>> I didn't reboot this time - I did last time. I just unmounted the file
>>>> system and fsckd it. The filesystem is 8.2TB and the data is around
>>>> 2.5TB.
>>
>> I think trying a filesystem with just under 8T would be a useful test too.
>
> One other question - do you make use of xattrs on this filesystem?

No.

Jeremy
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
From: Eric Sandeen @ 2009-04-20 16:26 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: Theodore Tso, linux-ext4
Jeremy Sanders wrote:
> On Mon, 20 Apr 2009, Eric Sandeen wrote:

...

>> One other question - do you make use of xattrs on this filesystem?
>
> No.

I've commandeered about 10T of disk space to see if I can hit this.
Would you mind providing dumpe2fs -h output for your 8.2T filesystem so
I can exactly replicate the geometry?

Thanks,
-Eric
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-20 16:26 ` Eric Sandeen
@ 2009-04-20 16:40 ` Jeremy Sanders
0 siblings, 0 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-20 16:40 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Theodore Tso, linux-ext4

On Mon, 20 Apr 2009, Eric Sandeen wrote:
> Jeremy Sanders wrote:
>> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>
> ...
>
>>> One other question - do you make use of xattrs on this filesystem?
>>
>> No.
>
> I've commandeered about 10T of disk space to see if I can hit this.
> Would you mind providing dumpe2fs -h output for your 8.2T filesystem so
> I can exactly replicate the geometry?

I formatted with
mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72 /dev/md0

[root@xback2 ~]# dumpe2fs -h /dev/md0
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          34fefacb-0494-4df7-b189-e11b2064dd90
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              549314560
Block count:              2197239840
Reserved block count:     0
Free blocks:              2162717221
Free inodes:              549314549
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      500
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              8
RAID stripe width:        72
Flex block group size:    16
Filesystem created:       Mon Apr 20 17:29:14 2009
Last mount time:          n/a
Last write time:          Mon Apr 20 17:38:28 2009
Mount count:              0
Maximum mount count:      38
Last checked:             Mon Apr 20 17:29:14 2009
Check interval:           15552000 (6 months)
Next check after:         Sat Oct 17 17:29:14 2009
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      06d43af3-a75c-405a-8f25-e51517dae7f6
Journal backup:           inode blocks
Journal size:             128M

^ permalink raw reply [flat|nested] 31+ messages in thread
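As a rough cross-check of the geometry in the dumpe2fs -h output above, the sketch below derives the filesystem size, the number of block groups, and the inode count from the reported block count, block size, and per-group values. The function names fs_bytes and group_count are made up for illustration. The comparison against 2^31 blocks (8 TiB at a 4096-byte block size) is only an assumption about why "just under 8T" might be an interesting boundary; the thread itself does not say so.

```shell
#!/bin/sh
# Values copied from the dumpe2fs -h output in this message.
BLOCK_COUNT=2197239840
BLOCK_SIZE=4096
BLOCKS_PER_GROUP=32768
INODES_PER_GROUP=8192

# Total size in bytes (assumes the shell does 64-bit integer arithmetic).
fs_bytes() { echo $(( $1 * $2 )); }

# Number of block groups, rounding up for a partial last group.
group_count() { echo $(( ($1 + $2 - 1) / $2 )); }

bytes=$(fs_bytes "$BLOCK_COUNT" "$BLOCK_SIZE")
groups=$(group_count "$BLOCK_COUNT" "$BLOCKS_PER_GROUP")

echo "bytes:  $bytes"
echo "groups: $groups"
# Should match the "Inode count" line if every group carries a full inode table.
echo "inodes: $(( groups * INODES_PER_GROUP ))"

# Size in TiB, via awk because sh has no floating point.
awk -v b="$bytes" 'BEGIN { printf "TiB:    %.2f\n", b / (1024 ^ 4) }'

# Assumption for illustration only: 2^31 blocks is the 8 TiB mark at 4 KiB blocks.
if [ "$BLOCK_COUNT" -gt 2147483648 ]; then
    echo "block count is above 2^31"
else
    echo "block count is at or below 2^31"
fi
```

The derived inode count agrees with the "Inode count" line in the dump, which suggests the reported geometry is internally consistent.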
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-20 15:53 ` Jeremy Sanders
2009-04-20 16:26 ` Eric Sandeen
@ 2009-04-20 18:28 ` Andreas Dilger
2009-04-20 18:55 ` Jeremy Sanders
1 sibling, 1 reply; 31+ messages in thread
From: Andreas Dilger @ 2009-04-20 18:28 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: Eric Sandeen, Theodore Tso, linux-ext4

On Apr 20, 2009 16:53 +0100, Jeremy Sanders wrote:
> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> Theodore Tso wrote:
>>>> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>>>>> It takes a day or two to do the sync. I've only done it twice (one with
>>>>> the old kernel, once with the new fedora testing kernel) and it happened
>>>>> both times. I'm afraid the statistics are rather low number here.
>>>>>
>>>>> I did a different faster test (just copying my home directory lots of
>>>>> times), but I wasn't able to get it to fail. That test didn't use much
>>>>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>>>>> the device and seeing whether that fails.
>>>>>
>>>>> I didn't reboot this time - I did last time. I just unmounted the file
>>>>> system and fsckd it. The filesystem is 8.2TB and the data is around
>>>>> 2.5TB.
>>>
>>> I think trying a filesystem with just under 8T would be a useful test too.
>>
>> One other question - do you make use of xattrs on this filesystem?
>
> No.

If you use anything like SELinux or ACLs you would also (indirectly) be
using xattrs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-20 18:28 ` Andreas Dilger
@ 2009-04-20 18:55 ` Jeremy Sanders
2009-04-20 20:45 ` Andreas Dilger
0 siblings, 1 reply; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-20 18:55 UTC (permalink / raw)
To: Andreas Dilger; +Cc: Eric Sandeen, Theodore Tso, linux-ext4

On Mon, 20 Apr 2009, Andreas Dilger wrote:
> On Apr 20, 2009 16:53 +0100, Jeremy Sanders wrote:
>> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>>> One other question - do you make use of xattrs on this filesystem?
>>
>> No.
>
> If you use anything like SELinux or ACLs you would also (indirectly) be
> using xattrs.

SELinux is switched off and we haven't (knowingly) been using xattrs, but
I remember rsync might copy xattrs, so perhaps they get written in
some way...

Jeremy
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-20 18:55 ` Jeremy Sanders
@ 2009-04-20 20:45 ` Andreas Dilger
2009-04-22 9:34 ` Jeremy Sanders
0 siblings, 1 reply; 31+ messages in thread
From: Andreas Dilger @ 2009-04-20 20:45 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: Eric Sandeen, Theodore Tso, linux-ext4

On Apr 20, 2009 19:55 +0100, Jeremy Sanders wrote:
> On Mon, 20 Apr 2009, Andreas Dilger wrote:
>> On Apr 20, 2009 16:53 +0100, Jeremy Sanders wrote:
>>> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>>>> One other question - do you make use of xattrs on this filesystem?
>>>
>>> No.
>>
>> If you use anything like SELinux or ACLs you would also (indirectly) be
>> using xattrs.
>
> SELinux is switched off and we haven't (knowingly) been using xattrs, but
> I remember rsync might copy xattrs, so perhaps they get written in
> some way...

You can check this with:

debugfs -c -R "stat {path to file inside filesystem}" /dev/XXX

and check if the "File ACL" field is non-zero:

debugfs -c -R "stat etc/hosts" /dev/sda2
debugfs 1.40.11.sun1 (17-June-2008)
/dev/sda2: catastrophic mode - not reading inode or group bitmaps
Inode: 259128   Type: regular    Mode:  0644   Flags: 0x0   Generation: 2075236634
User:     0   Group:     0   Size: 2258
File ACL: 0    Directory ACL: 0
^^^^^^^^^^^ ##### this would be non-zero #####
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x49812ef4 -- Wed Jan 28 21:22:12 2009
atime: 0x49ebdce3 -- Sun Apr 19 20:24:35 2009
mtime: 0x49812ef4 -- Wed Jan 28 21:22:12 2009
Size of extra inode fields: 4
Inode version: 0
BLOCKS:
(0):534546
TOTAL: 1

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

^ permalink raw reply [flat|nested] 31+ messages in thread
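The "File ACL" check described above can be scripted. What follows is a minimal sketch, not something taken from the thread: the helper names file_acl_block and has_file_acl are hypothetical, and the parsing assumes the stat output contains a "File ACL: <number>" field as in the sample from this message. It only looks at that one field and makes no claim about other places where attributes might be stored.

```shell
#!/bin/sh
# Print the number that follows "File ACL:" in debugfs "stat" output read from stdin.
file_acl_block() {
    sed -n 's/.*File ACL: *\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Exit 0 if the stat output on stdin reports a non-zero "File ACL" value, 1 otherwise.
has_file_acl() {
    acl=$(file_acl_block)
    [ -n "$acl" ] && [ "$acl" -ne 0 ]
}

# Demonstration on the relevant line of the sample output quoted in this message.
sample='File ACL: 0    Directory ACL: 0'
if printf '%s\n' "$sample" | has_file_acl; then
    echo "File ACL is non-zero"
else
    echo "File ACL is zero"
fi

# Hypothetical use against a real device (the path and /dev/XXX are placeholders):
#   debugfs -c -R "stat etc/hosts" /dev/XXX | has_file_acl && echo "xattr block present"
```

Feeding the function the piped output of debugfs keeps the check read-only, in line with the -c (catastrophic mode) invocation shown in the message.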
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-20 20:45 ` Andreas Dilger
@ 2009-04-22 9:34 ` Jeremy Sanders
0 siblings, 0 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-22 9:34 UTC (permalink / raw)
To: linux-ext4

Andreas Dilger wrote:
> File ACL: 0    Directory ACL: 0
> ^^^^^^^^^^^ ##### this would be non-zero #####

These are zero on this device.

Jeremy
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-20 14:49 ` Eric Sandeen
2009-04-20 15:51 ` Eric Sandeen
@ 2009-04-22 9:07 ` Jeremy Sanders
2009-04-22 9:59 ` Thierry Vignaud
1 sibling, 1 reply; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-22 9:07 UTC (permalink / raw)
To: linux-ext4

Eric Sandeen wrote:
> I think trying a filesystem with just under 8T would be a useful test too.

Okay, I tried partitioning the md device so that it was under 8T (7.1T in
fact). Unfortunately I wasn't able to reproduce it in this configuration.

So, either it is an 8T+ problem, which disagrees with the other report, or
the geometry has some sort of impact, or it is because the files I'm copying
keep changing, so it may have gone away, or it is not 100% reproducible.

Shall I wait to see if you have a useful testcase from Thierry Vignaud
before trying something else?

Jeremy
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-22 9:07 ` Jeremy Sanders
@ 2009-04-22 9:59 ` Thierry Vignaud
0 siblings, 0 replies; 31+ messages in thread
From: Thierry Vignaud @ 2009-04-22 9:59 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-ext4

Jeremy Sanders <jss@ast.cam.ac.uk> writes:
> Shall I wait to see if you have a useful testcase from Thierry Vignaud
> before trying something else?

I wasn't able to reproduce it yet :-(

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-20 11:35 ` Theodore Tso
2009-04-20 11:43 ` Jeremy Sanders
@ 2009-04-24 8:27 ` Jeremy Sanders
1 sibling, 0 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-24 8:27 UTC (permalink / raw)
To: linux-ext4

Theodore Tso wrote:
> Do you have to reboot to see this, or is it enough to unmount the
> filesystem? How big is the ext4 filesystem, and how big was the
> amount of data that you rsync'ed? One thing that would be worth
> trying if you can easily reproduce is whether it happens on a single
> device disk, or whether it only shows up when you use a /dev/mdX
> device.

I've been able to reproduce it on a single device disk (It was partitioned
to have the same number of blocks as the md device).

Jeremy
--
Jeremy Sanders <jss@ast.cam.ac.uk> http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-20 9:33 ` Jeremy Sanders
2009-04-20 11:35 ` Theodore Tso
@ 2009-04-21 15:14 ` Thierry Vignaud
2009-04-21 15:52 ` Eric Sandeen
2009-04-21 16:43 ` Theodore Tso
1 sibling, 2 replies; 31+ messages in thread
From: Thierry Vignaud @ 2009-04-21 15:14 UTC (permalink / raw)
To: Jeremy Sanders; +Cc: linux-ext4

Jeremy Sanders <jss@ast.cam.ac.uk> writes:
> However, the system seems to mostly work, so I recreated the ext4 device,
> I've just run my backup script again and fsck'd the device. It seems the
> problem is reproducible with the new kernel:
>
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid. Fix<y>?
>
> Looks like there's a real problem in ext4 causing this under certain
> circumstances (unless an obscure hardware error is somehow giving the same
> problem).
>
> To cause this, all I did was rsync a set of directories to the disk. No hard
> link trees were created.

For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
On reboot, I got such errors.
The hard disk was partitioned (all ext4) as:
/ (5GB) | /usr (20GB) | /pub (1.5TB)

The smaller system filesystems didn't see those errors.

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-21 15:14 ` Thierry Vignaud
@ 2009-04-21 15:52 ` Eric Sandeen
[not found] ` <m23ac255pp.fsf@vador.mandriva.com>
2009-04-21 16:43 ` Theodore Tso
1 sibling, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2009-04-21 15:52 UTC (permalink / raw)
To: Thierry Vignaud; +Cc: Jeremy Sanders, linux-ext4

Thierry Vignaud wrote:
> Jeremy Sanders <jss@ast.cam.ac.uk> writes:
>
>> However, the system seems to mostly work, so I recreated the ext4 device,
>> I've just run my backup script again and fsck'd the device. It seems the
>> problem is reproducible with the new kernel:
>>
>> [root@xback2 ~]# fsck /dev/md0
>> fsck 1.41.4 (27-Jan-2009)
>> e2fsck 1.41.4 (27-Jan-2009)
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> Group descriptor 0 checksum is invalid. Fix<y>?
>>
>> Looks like there's a real problem in ext4 causing this under certain
>> circumstances (unless an obscure hardware error is somehow giving the same
>> problem).
>>
>> To cause this, all I did was rsync a set of directories to the disk. No hard
>> link trees were created.
>
> For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
> 1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
> On reboot, I got such errors.
> The hard disk was partitioned (all ext4) as:
> / (5GB) | /usr (20GB) | /pub (1.5TB)
>
> The smaller system filesystems didn't see those errors.

Can you provide a little more info on how you copied the 20GB, and
exactly what the errors were?

Thanks,
-Eric

^ permalink raw reply [flat|nested] 31+ messages in thread
[parent not found: <m23ac255pp.fsf@vador.mandriva.com>]
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
[not found] ` <m23ac255pp.fsf@vador.mandriva.com>
@ 2009-04-21 16:40 ` Eric Sandeen
2009-04-21 16:56 ` Thierry Vignaud
0 siblings, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2009-04-21 16:40 UTC (permalink / raw)
To: Thierry Vignaud; +Cc: Jeremy Sanders, linux-ext4

Thierry Vignaud wrote:
> Eric Sandeen <sandeen@redhat.com> writes:
>
>>>> However, the system seems to mostly work, so I recreated the ext4 device,
>>>> I've just run my backup script again and fsck'd the device. It seems the
>>>> problem is reproducible with the new kernel:
>>>>
>>>> [root@xback2 ~]# fsck /dev/md0
>>>> fsck 1.41.4 (27-Jan-2009)
>>>> e2fsck 1.41.4 (27-Jan-2009)
>>>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>>>> Group descriptor 0 checksum is invalid. Fix<y>?
>>>>
>>>> Looks like there's a real problem in ext4 causing this under certain
>>>> circumstances (unless an obscure hardware error is somehow giving the same
>>>> problem).
>>>>
>>>> To cause this, all I did was rsync a set of directories to the disk. No hard
>>>> link trees were created.
>>> For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
>>> 1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
>>> On reboot, I got such errors.
>>> The hard disk was partitioned (all ext4) as:
>>> / (5GB) | /usr (20GB) | /pub (1.5TB)
>>>
>>> The smaller system filesystems didn't see those errors.
>> Can you provide a little more info on how you copied the 20GB, and
>> exactly what the errors were?
>
> I just copied some files from a USB hard disk with cp onto the big
> partition (the one that showed the issues).
> The other system partitions (that showed _no_ problems) were filled with
> something like "rsync -rvltpx / /where/it/was/mounted"
>
> Here's the fsck log:
>
>
>-----------------------------------------------------------------------

Wow, awful.

Could you send me dumpe2fs -h output of the large target device, as
well as an "e2image -r" image of the source filesystem? That way I
can hopefully perfectly replicate your target filesystem as well as
the data you're using to populate it, try the cp myself, and see if I
hit the same thing.

e2image only sends metadata information, not data. If you are
concerned about filenames, use -s to scramble them, though this
*might* impact my ability to reproduce it...

Thanks,
-Eric

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-21 16:40 ` Eric Sandeen
@ 2009-04-21 16:56 ` Thierry Vignaud
0 siblings, 0 replies; 31+ messages in thread
From: Thierry Vignaud @ 2009-04-21 16:56 UTC (permalink / raw)
To: Eric Sandeen; +Cc: Jeremy Sanders, linux-ext4

Eric Sandeen <sandeen@redhat.com> writes:
> Could you send me dumpe2fs -h output of the large target device, as
> well as an "e2image -r" image of the source filesystem? That way I
> can hopefully perfectly replicate your target filesystem as well as
> the data you're using to populate it, try the cp myself, and see if I
> hit the same thing.
>
> e2image only sends metadata information, not data. If you are
> concerned about filenames, use -s to scramble them, though this
> *might* impact my ability to reproduce it...

I'll do that (the disk is at home).

Filesystems were formatted with standard mkfs.ext4 (some were formatted
with mkfs.ext4 -F, which is what diskdrake defaults to), that is, using the
standard /etc/mke2fs.conf.

^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
2009-04-21 15:14 ` Thierry Vignaud
2009-04-21 15:52 ` Eric Sandeen
@ 2009-04-21 16:43 ` Theodore Tso
1 sibling, 0 replies; 31+ messages in thread
From: Theodore Tso @ 2009-04-21 16:43 UTC (permalink / raw)
To: Thierry Vignaud; +Cc: Jeremy Sanders, linux-ext4

On Tue, Apr 21, 2009 at 05:14:40PM +0200, Thierry Vignaud wrote:
> For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
> 1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
> On reboot, I got such errors.
> The hard disk was partitioned (all ext4) as:
> / (5GB) | /usr (20GB) | /pub (1.5TB)

Thierry, are you willing to try to see if you can get a reliable
reproduction case? That's what we need, very badly. The fact that
you only copied 20GB is very good; better than 2TB. If you can
reliably reproduce the failure 2 or 3 times, can you give us exact
reproduction instructions? That would be extremely useful.

Thanks in advance,
- Ted

^ permalink raw reply [flat|nested] 31+ messages in thread