fsck.ext4: Group descriptors look bad... trying backup blocks...

linux-ext4.vger.kernel.org archive mirror

* fsck.ext4: Group descriptors look bad... trying backup blocks...
@ 2009-04-17 11:03 Jeremy Sanders
  2009-04-17 11:26 ` Jeremy Sanders
                   ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-17 11:03 UTC (permalink / raw)
  To: linux-ext4

Hi - I'm trying out ext4 on a large 8.2 TB software raid device (md). On 
rebooting (cleanly unmounting), I tried an fsck on the device. I get the 
following:

[root@xback2 ~]# fsck /dev/md0
fsck 1.41.4 (27-Jan-2009)
e2fsck 1.41.4 (27-Jan-2009)
fsck.ext4: Group descriptors look bad... trying backup blocks...
Group descriptor 0 checksum is invalid.  Fix<y>?

It then finds lots of bad group descriptors.

This is Fedora 10, 2.6.27.21-170.2.56.fc10.x86_64 and 
e2fsprogs-1.41.4-4.fc10.x86_64.

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-17 11:03 fsck.ext4: Group descriptors look bad... trying backup blocks Jeremy Sanders
@ 2009-04-17 11:26 ` Jeremy Sanders
  2009-04-17 11:56 ` Theodore Tso
  2009-04-17 17:00 ` Eric Sandeen
  2 siblings, 0 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-17 11:26 UTC (permalink / raw)
  To: linux-ext4

Some more information about the device:

[root@xback2 ~]# dumpe2fs /dev/md0 | head -100
dumpe2fs 1.41.4 (27-Jan-2009)                                                                              
Filesystem volume name:   <none>                                                                           
Last mounted on:          <not available>                                                                  
Filesystem UUID:          508aee62-79af-4b4c-95a6-222b3868834c                                             
Filesystem magic number:  0xEF53                                                                           
Filesystem revision #:    1 (dynamic)                                                                      
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash                                                             
Default mount options:    (none)                                                                            
Filesystem state:         clean                                                                             
Errors behavior:          Continue                                                                          
Filesystem OS type:       Linux                                                                             
Inode count:              549314560                                                                         
Block count:              2197239840                                                                        
Reserved block count:     0                                                                                 
Free blocks:              1508301443                                                                        
Free inodes:              545311753                                                                         
First block:              0                                                                                 
Block size:               4096                                                                              
Fragment size:            4096                                                                              
Reserved GDT blocks:      500                                                                               
Blocks per group:         32768                                                                             
Fragments per group:      32768                                                                             
Inodes per group:         8192                                                                              
Inode blocks per group:   512                                                                               
RAID stride:              8                                                                                 
RAID stripe width:        72                                                                                
Flex block group size:    16                                                                                
Filesystem created:       Fri Apr 10 17:13:08 2009                                                          
Last mount time:          Mon Apr 13 11:00:22 2009                                                          
Last write time:          Fri Apr 17 11:53:05 2009                                                          
Mount count:              1                                                                                 
Maximum mount count:      -1                                                                                
Last checked:             Fri Apr 10 17:13:08 2009                                                          
Check interval:           0 (<none>)                                                                        
Reserved blocks uid:      0 (user root)                                                                     
Reserved blocks gid:      0 (group root)                                                                    
First inode:              11                                                                                
Inode size:               256                                                                               
Required extra isize:     28                                                                                
Desired extra isize:      28                                                                                
Journal inode:            8                                                                                 
Default directory hash:   half_md4                                                                          
Directory Hash Seed:      9c9b9fd6-5af2-4ee0-bceb-25827cb008f9                                              
Journal backup:           inode blocks                                                                      
Journal size:             128M                                                                              


Group 0: (Blocks 0-32767) [ITABLE_ZEROED]
  Checksum 0xd3b2, unused inodes 4032    
  Primary superblock at 0, Group descriptors at 1-524
  Reserved GDT blocks at 525-1024                    
  Block bitmap at 1025 (+1025), Inode bitmap at 1041 (+1041)
  Inode table at 1057-1568 (+1057)                          
  856 free blocks, 4032 free inodes, 268 directories, 4032 unused inodes
Group 1: (Blocks 32768-65535) [INODE_UNINIT, ITABLE_ZEROED]             
  Checksum 0xc586, unused inodes 8192
  Backup superblock at 32768, Group descriptors at 32769-33292
  Reserved GDT blocks at 33293-33792
  Block bitmap at 1026 (+4294935554), Inode bitmap at 1042 (+4294935570)
  Inode table at 1569-2080 (+4294936097)
  939 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
Group 2: (Blocks 65536-98303) [INODE_UNINIT, ITABLE_ZEROED]
...
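
The very large offsets shown for group 1 above look alarming, but they are consistent with negative offsets printed as unsigned 32-bit numbers. With flex_bg, group 1's bitmaps and inode table are packed into group 0, so they lie before the group's own first block (32768). A minimal Python sketch, assuming dumpe2fs 1.41.4 prints (block - group start) modulo 2^32; that wrap width is inferred from the numbers, not taken from the dumpe2fs source:

```python
# Values copied from the dumpe2fs output for group 1.
GROUP_START = 32768   # first block of group 1
U32 = 2 ** 32         # assumed width of the printed offset


def printed_offset(block, group_start=GROUP_START):
    """Offset of `block` from the group start, wrapped to unsigned 32 bits."""
    return (block - group_start) % U32


for name, block in [("block bitmap", 1026),
                    ("inode bitmap", 1042),
                    ("inode table", 1569)]:
    print(f"{name}: block {block} -> +{printed_offset(block)}")
```

The three results match the (+4294935554), (+4294935570) and (+4294936097) values in the listing, so these particular numbers do not by themselves indicate corruption.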





* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-17 11:03 fsck.ext4: Group descriptors look bad... trying backup blocks Jeremy Sanders
  2009-04-17 11:26 ` Jeremy Sanders
@ 2009-04-17 11:56 ` Theodore Tso
  2009-04-17 12:16   ` Jeremy Sanders
  2009-04-17 12:24   ` Jeremy Sanders
  2009-04-17 17:00 ` Eric Sandeen
  2 siblings, 2 replies; 31+ messages in thread
From: Theodore Tso @ 2009-04-17 11:56 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-ext4

On Fri, Apr 17, 2009 at 12:03:33PM +0100, Jeremy Sanders wrote:
> Hi - I'm trying out ext4 on a large 8.2 TB software raid device (md). On 
> rebooting (cleanly unmounting), I tried an fsck on the device. I get the 
> following:
> 
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid.  Fix<y>?
> 
> It then finds lots of bad group descriptors.

What happened afterwards?   Did fsck complete successfully?

I see from the dumpe2fs that you sent it had only been in use for a
week.  How were you using the filesystem?  Did you try using the
online resize feature at any time?

The problem is that any number of things could have caused the block
group descriptors to be corrupted.

						- Ted


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-17 11:56 ` Theodore Tso
@ 2009-04-17 12:16   ` Jeremy Sanders
  2009-04-17 17:10     ` Eric Sandeen
  2009-04-17 12:24   ` Jeremy Sanders
  1 sibling, 1 reply; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-17 12:16 UTC (permalink / raw)
  To: linux-ext4

Theodore Tso wrote:

> What happened afterwards?   Did fsck complete successfully?

I was waiting to see whether you wanted me to do something else.  I've just 
tried it and it didn't:

[root@xback2 ~]# fsck -a /dev/md0
fsck 1.41.4 (27-Jan-2009)
/dev/md0: Group descriptor 384 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 385 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 386 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 387 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 388 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 389 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 390 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 391 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 392 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 393 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 394 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 395 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 396 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 397 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 398 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 399 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 400 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 401 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 402 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 403 checksum is invalid.  FIXED.
/dev/md0: Group descriptor 404 checksum is invalid.  FIXED.
/dev/md0: Note: if several inode or block bitmap blocks or part
of the inode table require relocation, you may wish to try
running e2fsck with the '-b 32768' option first.  The problem
may lie only with the primary block group descriptors, and
the backup block group descriptors may be OK.

/dev/md0: Block bitmap for group 405 is not in group.  (block 3393946179)

/dev/md0: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
        (i.e., without -a or -p options)
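
The block number in the "not in group" message can be checked against the geometry from the earlier dumpe2fs output. A small Python sketch using only the "Block count" and "Blocks per group" values reported there; the helper name is made up for illustration:

```python
BLOCK_COUNT = 2197239840     # "Block count" from dumpe2fs
BLOCKS_PER_GROUP = 32768     # "Blocks per group" from dumpe2fs


def group_range(group):
    """First and last block number belonging to a block group."""
    first = group * BLOCKS_PER_GROUP
    last = min(first + BLOCKS_PER_GROUP, BLOCK_COUNT) - 1
    return first, last


reported = 3393946179        # block bitmap location reported for group 405
first, last = group_range(405)
print("group 405 spans blocks", first, "to", last)
print("reported block inside filesystem:", reported < BLOCK_COUNT)
```

The reported block is larger than the total block count, so it cannot be a valid bitmap location under any layout, which suggests the descriptor contents themselves are garbage rather than merely carrying a stale checksum.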

** When I run it manually I get:

Pass 1: Checking inodes, blocks, and sizes
Inode 8355 has imagic flag set.  Clear<y>? yes

Inode 8355 has a extra size (62017) which is invalid
Fix<y>? yes

Inode 8355 has compression flag set on filesystem without compression 
support.  Clear<y>? yes

Inode 8355 has a bad extended attribute block 2170352193.  Clear<y>? yes

Inode 8355 has INDEX_FL flag set but is not a directory.
Clear HTree index<y>? yes

Inode 8355, i_size is 9321591691907232321, should be 0.  Fix<y>? yes

Inode 8355, i_blocks is 266363157148225, should be 0.  Fix<y>? yes

Inode 8356 is in use, but has dtime set.  Fix<y>? yes

Inode 8356 has imagic flag set.  Clear<y>? yes

Inode 8356 has a extra size (62017) which is invalid
Fix<y>? yes

Inode 8356 has compression flag set on filesystem without compression 
support.  Clear<y>? yes

Inode 8356 has a bad extended attribute block 2170352193.  Clear<y>? yes

Inode 8356 has INDEX_FL flag set but is not a directory.
Clear HTree index<y>? yes

Inode 8356, i_size is 9321591691907232321, should be 0.  Fix<y>? yes

Inode 8356, i_blocks is 266363157148225, should be 0.  Fix<y>? yes

Inode 8357 is in use, but has dtime set.  Fix<y>? yes

Inode 8357 has imagic flag set.  Clear<y>? yes

Inode 8357 has a extra size (62017) which is invalid
Fix<y>? yes

Inode 8357 has compression flag set on filesystem without compression 
support.  Clear<y>? yes

Inode 8357 has a bad extended attribute block 2170352193.  Clear<y>? yes

Inode 8357 has INDEX_FL flag set but is not a directory.
Clear HTree index<y>? yes

> I see from the dumpe2fs that you sent it had only been in use for a
> week.  How were you using the filesystem?  Did you try using the
> online resize feature at any time?

No. The filesystem was used to store rsync snapshots of other file systems 
(using the hard link feature). I had only rsynced the initial data and run a 
couple of rsync backups on to it. The filesystem was created using:

mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72  /dev/md0

> The problem is that any number of things could have caused the block
> group descriptors to be corrupted.

Oh dear. The system has ECC ram (though linux doesn't know about it, so it 
may not be working) and the md device is using 10 drives on raid5 and a 
3ware controller.
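
For reference, the stride and stripe-width values passed to mkfs.ext4 above are consistent with a 10-drive RAID5 array. A short Python sketch of the usual arithmetic; the 32 KiB chunk size is an assumption implied by stride=8 with 4 KiB blocks, since the thread does not state it:

```python
def raid5_ext4_geometry(drives, chunk_kib, block_kib=4):
    """Return (stride, stripe_width) in filesystem blocks for a RAID5 array."""
    data_disks = drives - 1              # RAID5 spends one disk's worth on parity
    stride = chunk_kib // block_kib      # filesystem blocks per RAID chunk
    stripe_width = stride * data_disks   # filesystem blocks per full data stripe
    return stride, stripe_width


print(raid5_ext4_geometry(drives=10, chunk_kib=32))
```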

Maybe I should force a md raid5 resync to check the drives agree with each 
other.

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-17 12:16   ` Jeremy Sanders
@ 2009-04-17 17:10     ` Eric Sandeen
  2009-04-17 18:51       ` Jeremy Sanders
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2009-04-17 17:10 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-ext4

Jeremy Sanders wrote:
> Theodore Tso wrote:

...

>> I see from the dumpe2fs that you sent it had only been in use for a
>> week.  How were you using the filesystem?  Did you try using the
>> online resize feature at any time?
> 
> No. The filesystem was used to store rsync snapshots of other file systems 
> (using the hard link feature). I had only rsynced the initial data and run a 
> couple of rsync backups on to it. The filesystem was created using:
> 
> mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72  /dev/md0

Can you show us exactly how you're using rsync?  is this with
rdiff-backup or some similar tool?

Thanks,
-Eric


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-17 17:10     ` Eric Sandeen
@ 2009-04-17 18:51       ` Jeremy Sanders
  0 siblings, 0 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-17 18:51 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-ext4

On Fri, 17 Apr 2009, Eric Sandeen wrote:

> Jeremy Sanders wrote:
>> Theodore Tso wrote:
>
> ...
>
>>> I see from the dumpe2fs that you sent it had only been in use for a
>>> week.  How were you using the filesystem?  Did you try using the
>>> online resize feature at any time?
>>
>> No. The filesystem was used to store rsync snapshots of other file systems
>> (using the hard link feature). I had only rsynced the initial data and run a
>> couple of rsync backups on to it. The filesystem was created using:
>>
>> mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72  /dev/md0
>
> Can you show us exactly how you're using rsync?  is this with
> rdiff-backup or some similar tool?

No, plain rsync. We have a script which does something like

rsync -raHSx --stats --whole-file --numeric-ids \
--link-dest=/mnt/username/20090418/ host:/data/username/ \
/mnt/username/20090419/

for a set of users.

This command copies the files from /data/username on host to 
/mnt/username/20090419, but creates hard links to the previous copy 
(/mnt/username/20090418/) for unchanged files.
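
The hard-link behaviour described above can be illustrated with a small local Python sketch. This is not how rsync is implemented: it compares file contents, whereas rsync by default compares size and modification time, and the function name and arguments are hypothetical.

```python
import filecmp
import os
import shutil


def snapshot(source, previous, current):
    """Copy `source` into `current`, hard-linking files unchanged since `previous`."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        os.makedirs(os.path.normpath(os.path.join(current, rel)), exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            old = os.path.normpath(os.path.join(previous, rel, name))
            new = os.path.normpath(os.path.join(current, rel, name))
            if os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, new)        # unchanged: share the existing inode
            else:
                shutil.copy2(src, new)   # new or changed: store a real copy
```

Each snapshot directory then looks like a full copy, while unchanged files cost only a directory entry. On the filesystem this means many directory entries and link-count updates but comparatively few new data blocks.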

It worked fine on ext3, at least for a 2.4TB device.

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-17 11:56 ` Theodore Tso
  2009-04-17 12:16   ` Jeremy Sanders
@ 2009-04-17 12:24   ` Jeremy Sanders
  2009-04-17 16:36     ` Theodore Tso
  1 sibling, 1 reply; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-17 12:24 UTC (permalink / raw)
  To: linux-ext4

Theodore Tso wrote:

> I see from the dumpe2fs that you sent it had only been in use for a
> week.  How were you using the filesystem?  Did you try using the
> online resize feature at any time?

I assume that this isn't enough to corrupt the filesystem?

[root@xback2 ~]# tune2fs -i -1 /dev/md0
tune2fs 1.41.4 (27-Jan-2009)
Setting interval between checks to 18446744073709465216 seconds

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053




* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-17 12:24   ` Jeremy Sanders
@ 2009-04-17 16:36     ` Theodore Tso
  0 siblings, 0 replies; 31+ messages in thread
From: Theodore Tso @ 2009-04-17 16:36 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-ext4

On Fri, Apr 17, 2009 at 01:24:16PM +0100, Jeremy Sanders wrote:
> Theodore Tso wrote:
> 
> > I see from the dumpe2fs that you sent it had only been in use for a
> > week.  How were you using the filesystem?  Did you try using the
> > online resize feature at any time?
> 
> I assume that this isn't enough to corrupt the filesystem?
> 
> [root@xback2 ~]# tune2fs -i -1 /dev/md0
> tune2fs 1.41.4 (27-Jan-2009)
> Setting interval between checks to 18446744073709465216 seconds

No, but it won't do what you want, either.  To disable time-based
checks, you should use "tune2fs -i 0 /dev/md0".

Tune2fs should have flagged an error when you specified -1; I'll have
to fix that.
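
The enormous interval printed by tune2fs is consistent with -1 day being converted to seconds and then shown as an unsigned 64-bit value. A quick Python check; the 64-bit width is an assumption inferred from the printed number, not taken from the tune2fs source:

```python
SECONDS_PER_DAY = 86400


def wrapped_interval(days, bits=64):
    """Interval in seconds after wrapping to an unsigned integer of `bits` bits."""
    return (days * SECONDS_PER_DAY) % (2 ** bits)


print(wrapped_interval(-1))   # compare with the value in the tune2fs output
print(wrapped_interval(0))    # "-i 0" gives an interval of zero
```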

	    	       		     - Ted




* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-17 11:03 fsck.ext4: Group descriptors look bad... trying backup blocks Jeremy Sanders
  2009-04-17 11:26 ` Jeremy Sanders
  2009-04-17 11:56 ` Theodore Tso
@ 2009-04-17 17:00 ` Eric Sandeen
  2009-04-20  9:33   ` Jeremy Sanders
  2 siblings, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2009-04-17 17:00 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-ext4

Jeremy Sanders wrote:
> Hi - I'm trying out ext4 on a large 8.2 TB software raid device (md). On 
> rebooting (cleanly unmounting), I tried an fsck on the device. I get the 
> following:
> 
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid.  Fix<y>?
> 
> It then finds lots of bad group descriptors.
> 
> This is Fedora 10, 2.6.27.21-170.2.56.fc10.x86_64 and 
> e2fsprogs-1.41.4-4.fc10.x86_64.
> 

Jeremy, if you're willing, could you upgrade to the 2.6.29 kernel that's
in F10 updates-testing?  That way the ext4 code is a bit more of a
recent, common codebase.  Also, if this is a test fs, re-mkfs'ing from
scratch might not be a bad way to go.

Depending on how hard it is to reproduce, it may also be interesting to
try a filesystem just shy of 8TB (2^31 blocks) in case there is some
32-bit wrap-around there, since you're at 8.2T....
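
The numbers from the dumpe2fs output show why the 2^31 boundary is a plausible suspect. A short Python sketch using the reported block count and block size; the signed-32-bit helper only illustrates what a wrap-around would look like and is not a claim about where such a bug might be:

```python
BLOCK_COUNT = 2197239840   # "Block count" from dumpe2fs
BLOCK_SIZE = 4096          # "Block size" from dumpe2fs

size_tib = BLOCK_COUNT * BLOCK_SIZE / 2 ** 40
blocks_past_2_31 = BLOCK_COUNT - 2 ** 31
fits_unsigned_32 = BLOCK_COUNT < 2 ** 32


def as_signed32(value):
    """Reinterpret the low 32 bits of `value` as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - 2 ** 32 if value >= 2 ** 31 else value


print(f"size: {size_tib:.3f} TiB")
print("blocks beyond 2^31:", blocks_past_2_31)
print("fits in unsigned 32 bits:", fits_unsigned_32)
print("last block as signed 32-bit:", as_signed32(BLOCK_COUNT - 1))
```

The filesystem has roughly 50 million blocks above 2^31, so any code path that handled a block number as a signed 32-bit integer would see negative values for the top part of the device, while a filesystem just under 8 TiB would never trigger that.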

-Eric


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-17 17:00 ` Eric Sandeen
@ 2009-04-20  9:33   ` Jeremy Sanders
  2009-04-20 11:35     ` Theodore Tso
  2009-04-21 15:14     ` Thierry Vignaud
  0 siblings, 2 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-20  9:33 UTC (permalink / raw)
  To: linux-ext4

Eric Sandeen wrote:

> Jeremy, if you're willing, could you upgrade to the 2.6.29 kernel that's
> in F10 updates-testing?  That way the ext4 code is a bit more of a
> recent, common codebase.  Also, if this is a test fs, re-mkfs'ing from
> scratch might not be a bad way to go.
> 
> Depending on how hard it is to reproduce, it may also be interesting to
> try a filesystem just shy of 8TB (2^31 blocks) in case there is some
> 32-bit wrap-around there, since you're at 8.2T....

I wasn't able to trivially reproduce the problem with the old kernel, but I 
updated to 2.6.29.1-30.fc10.x86_64 in updates testing. This introduced some 
further problems with a USB issue and some sort of stack dump probably 
associated with the r8169 driver (see bugzilla).

However, the system seems to mostly work, so I recreated the ext4 device, 
I've just run my backup script again and fsck'd the device. It seems the 
problem is reproducible with the new kernel:

[root@xback2 ~]# fsck /dev/md0
fsck 1.41.4 (27-Jan-2009)
e2fsck 1.41.4 (27-Jan-2009)
fsck.ext4: Group descriptors look bad... trying backup blocks...
Group descriptor 0 checksum is invalid.  Fix<y>?

Looks like there's a real problem in ext4 causing this under certain 
circumstances (unless an obscure hardware error is somehow giving the same 
problem).

To cause this, all I did was rsync a set of directories to the disk. No hard 
link trees were created.

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20  9:33   ` Jeremy Sanders
@ 2009-04-20 11:35     ` Theodore Tso
  2009-04-20 11:43       ` Jeremy Sanders
  2009-04-24  8:27       ` Jeremy Sanders
  2009-04-21 15:14     ` Thierry Vignaud
  1 sibling, 2 replies; 31+ messages in thread
From: Theodore Tso @ 2009-04-20 11:35 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-ext4

On Mon, Apr 20, 2009 at 10:33:09AM +0100, Jeremy Sanders wrote:
> 
> However, the system seems to mostly work, so I recreated the ext4 device, 
> I've just run my backup script again and fsck'd the device. It seems the 
> problem is reproducible with the new kernel:

When you say reproducible, how many times have you tried it, and were
you able to reproduce it every single time?  50% of the time?  I do
believe there is a problem, but we haven't been able to find a case
where it's easily reproducible.  So if you can easily reproduce this,
this is definitely very exciting.

> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid.  Fix<y>?

Do you have to reboot to see this, or is it enough to unmount the
filesystem?  How big is the ext4 filesystem, and how big was the
amount of data that you rsync'ed?  One thing that would be worth
trying if you can easily reproduce is whether it happens on a single
device disk, or whether it only shows up when you use a /dev/mdX
device.

Thanks,

						- Ted


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 11:35     ` Theodore Tso
@ 2009-04-20 11:43       ` Jeremy Sanders
  2009-04-20 12:48         ` Theodore Tso
  2009-04-24  8:27       ` Jeremy Sanders
  1 sibling, 1 reply; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-20 11:43 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

On Mon, 20 Apr 2009, Theodore Tso wrote:

> On Mon, Apr 20, 2009 at 10:33:09AM +0100, Jeremy Sanders wrote:
>>
>> However, the system seems to mostly work, so I recreated the ext4 device,
>> I've just run my backup script again and fsck'd the device. It seems the
>> problem is reproducible with the new kernel:
>
> When you say reproducible, how many times have you tried it, and were
> you able to reproduce it every single time?  50% of the time?  I do
> believe there is a problem, but we haven't been able to find a case
> where it's easily reproducible.  So if you can easily reproduce this,
> this is definitely very exciting.

It takes a day or two to do the sync. I've only done it twice (once with 
the old kernel, once with the new Fedora testing kernel) and it happened 
both times. I'm afraid the sample size is rather small here.

I did a different faster test (just copying my home directory lots of 
times), but I wasn't able to get it to fail. That test didn't use much 
disk space, however. Maybe it's worth just dd'ing a few TB of data onto 
the device and seeing whether that fails.

>> [root@xback2 ~]# fsck /dev/md0
>> fsck 1.41.4 (27-Jan-2009)
>> e2fsck 1.41.4 (27-Jan-2009)
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> Group descriptor 0 checksum is invalid.  Fix<y>?
>
> Do you have to reboot to see this, or is it enough to unmount the
> filesystem?  How big is the ext4 filesystem, and how big was the
> amount of data that you rsync'ed?  One thing that would be worth
> trying if you can easily reproduce is whether it happens on a single
> device disk, or whether it only shows up when you use a /dev/mdX
> device.

I didn't reboot this time - I did last time. I just unmounted the file 
system and fsckd it. The filesystem is 8.2TB and the data is around 2.5TB.

The drives are on a 3ware card, so I could configure the card as a single 
raid5 device and try to reproduce it there. It may take a day or two to 
copy the data if I try this.

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 11:43       ` Jeremy Sanders
@ 2009-04-20 12:48         ` Theodore Tso
  2009-04-20 12:54           ` Jeremy Sanders
  2009-04-20 14:49           ` Eric Sandeen
  0 siblings, 2 replies; 31+ messages in thread
From: Theodore Tso @ 2009-04-20 12:48 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-ext4

On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
> It takes a day or two to do the sync. I've only done it twice (once with  
> the old kernel, once with the new Fedora testing kernel) and it happened  
> both times. I'm afraid the sample size is rather small here.
>
> I did a different faster test (just copying my home directory lots of  
> times), but I wasn't able to get it to fail. That test didn't use much  
> disk space, however. Maybe it's worth just dd'ing a few TB of data onto  
> the device and seeing whether that fails.
>
> I didn't reboot this time - I did last time. I just unmounted the file  
> system and fsckd it. The filesystem is 8.2TB and the data is around 
> 2.5TB.

That's useful data.  I wish we could make it fail more quickly
on a smaller rsync, but the fact that you didn't need to reboot is
definitely useful information.

And this is a fresh rsync so no files were being deleted; rsync should
have just been writing new files to .filename.XXXXX and then renaming
each one to filename when it is done, right? 

OK, let me think about this a little.  I think we can create a patch
which checks for writes to the block group descriptors and dumps a
stack trace.  That would allow us to catch the failing code in question
in the act, and maybe figure out what is going on.
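
The range test such a debugging patch would need can be sketched in user space. The following Python is only an illustration of the check, not the kernel patch itself; it uses the geometry from the dumpe2fs output (524 descriptor blocks starting one block after each superblock copy) and the sparse_super rule that backups live in groups 0, 1 and powers of 3, 5 and 7. All function names are hypothetical.

```python
BLOCK_COUNT = 2197239840
BLOCKS_PER_GROUP = 32768
GDT_BLOCKS = 524                                   # "Group descriptors at 1-524"
GROUP_COUNT = -(-BLOCK_COUNT // BLOCKS_PER_GROUP)  # ceiling division


def is_power_of(n, base):
    """True if n == base**k for some k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def has_superblock_backup(group):
    """sparse_super: groups 0, 1 and powers of 3, 5 and 7 carry copies."""
    return group <= 1 or any(is_power_of(group, b) for b in (3, 5, 7))


def is_group_descriptor_block(block):
    """True if `block` holds primary or backup group descriptors."""
    group, offset = divmod(block, BLOCKS_PER_GROUP)
    return (group < GROUP_COUNT
            and has_superblock_backup(group)
            and 1 <= offset <= GDT_BLOCKS)


for blk in (1, 524, 525, 32769, 65537):
    print(blk, is_group_descriptor_block(blk))
```

A write path instrumented with a test like this could log a stack trace whenever an unexpected caller targets one of these blocks.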

					- Ted


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 12:48         ` Theodore Tso
@ 2009-04-20 12:54           ` Jeremy Sanders
  2009-04-20 14:49           ` Eric Sandeen
  1 sibling, 0 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-20 12:54 UTC (permalink / raw)
  To: Theodore Tso; +Cc: linux-ext4

On Mon, 20 Apr 2009, Theodore Tso wrote:

> That's useful data.  I wish we could make it fail more quickly
> on a smaller rsync, but the fact that you didn't need to reboot is
> definitely useful information.
>
> And this is a fresh rsync so no files were being deleted; rsync should
> have just been writing new files to .filename.XXXXX and then renaming
> each one to filename when it is done, right?

That's what I'd guess. It was onto a clean filesystem, so there shouldn't 
be any deletions.

> OK, let me think about this a little.  I think we can create a patch
> which checks for writes to the block group descriptors and dumps a
> stack trace.  That would allow us to catch the failing code in question
> in the act, and maybe figure out what is going on.

Ok.

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 12:48         ` Theodore Tso
  2009-04-20 12:54           ` Jeremy Sanders
@ 2009-04-20 14:49           ` Eric Sandeen
  2009-04-20 15:51             ` Eric Sandeen
  2009-04-22  9:07             ` Jeremy Sanders
  1 sibling, 2 replies; 31+ messages in thread
From: Eric Sandeen @ 2009-04-20 14:49 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Jeremy Sanders, linux-ext4

Theodore Tso wrote:
> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>> It takes a day or two to do the sync. I've only done it twice (once with  
>> the old kernel, once with the new Fedora testing kernel) and it happened  
>> both times. I'm afraid the sample size is rather small here.
>>
>> I did a different faster test (just copying my home directory lots of  
>> times), but I wasn't able to get it to fail. That test didn't use much  
>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto  
>> the device and seeing whether that fails.
>>
>> I didn't reboot this time - I did last time. I just unmounted the file  
>> system and fsckd it. The filesystem is 8.2TB and the data is around 
>> 2.5TB.

I think trying a filesystem with just under 8T would be a useful test too.

> That's useful data.  I wish we could make it fail more quickly
> on a smaller rsync, but the fact that you didn't need to reboot is
> definitely useful information.
> 
> And this is a fresh rsync so no files were being deleted; rsync should
> have just been writing new files to .filename.XXXXX and then renaming
> each one to filename when it is done, right? 
> 
> OK, let me think about this a little.  I think we can create a patch
> which checks for writes to the block group descriptors and dumps a
> stack trace.  That would allow us to catch the failing code in question
> in the act, and maybe figure out what is going on.

XFS has block-zero tests, because there was once a bug where
uninitialized block numbers in buffers were clobbering the superblock at
block 0.  It was helpful, so I think this is a good idea, Ted.

-Eric


* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 14:49           ` Eric Sandeen
@ 2009-04-20 15:51             ` Eric Sandeen
  2009-04-20 15:53               ` Jeremy Sanders
  2009-04-22  9:07             ` Jeremy Sanders
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2009-04-20 15:51 UTC (permalink / raw)
  To: Theodore Tso; +Cc: Jeremy Sanders, linux-ext4

Eric Sandeen wrote:
> Theodore Tso wrote:
>> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>>> It takes a day or two to do the sync. I've only done it twice (once with
>>> the old kernel, once with the new fedora testing kernel) and it happened
>>> both times. I'm afraid that is a rather small sample size.
>>>
>>> I did a different faster test (just copying my home directory lots of  
>>> times), but I wasn't able to get it to fail. That test didn't use much  
>>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto  
>>> the device and seeing whether that fails.
>>>
>>> I didn't reboot this time - I did last time. I just unmounted the file  
>>> system and fsckd it. The filesystem is 8.2TB and the data is around 
>>> 2.5TB.
> 
> I think trying a filesystem with just under 8T would be a useful test too.

One other question - do you make use of xattrs on this filesystem?

In case it's not obvious, we are very interested in this reproducible
testcase. Thank you for being so willing to provide feedback and testing.

-Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 15:51             ` Eric Sandeen
@ 2009-04-20 15:53               ` Jeremy Sanders
  2009-04-20 16:26                 ` Eric Sandeen
  2009-04-20 18:28                 ` Andreas Dilger
  0 siblings, 2 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-20 15:53 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Theodore Tso, linux-ext4

On Mon, 20 Apr 2009, Eric Sandeen wrote:

> Eric Sandeen wrote:
>> Theodore Tso wrote:
>>> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>>>> It takes a day or two to do the sync. I've only done it twice (once with
>>>> the old kernel, once with the new fedora testing kernel) and it happened
>>>> both times. I'm afraid that is a rather small sample size.
>>>>
>>>> I did a different faster test (just copying my home directory lots of
>>>> times), but I wasn't able to get it to fail. That test didn't use much
>>>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>>>> the device and seeing whether that fails.
>>>>
>>>> I didn't reboot this time - I did last time. I just unmounted the file
>>>> system and fsckd it. The filesystem is 8.2TB and the data is around
>>>> 2.5TB.
>>
>> I think trying a filesystem with just under 8T would be a useful test too.
>
> One other question - do you make use of xattrs on this filesystem?

No.

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 15:53               ` Jeremy Sanders
@ 2009-04-20 16:26                 ` Eric Sandeen
  2009-04-20 16:40                   ` Jeremy Sanders
  2009-04-20 18:28                 ` Andreas Dilger
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2009-04-20 16:26 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: Theodore Tso, linux-ext4

Jeremy Sanders wrote:
> On Mon, 20 Apr 2009, Eric Sandeen wrote:
...

>> One other question - do you make use of xattrs on this filesystem?
> 
> No.

I've commandeered about 10T of disk space to see if I can hit this.
Would you mind providing dumpe2fs -h output for your 8.2T filesystem so
I can exactly replicate the geometry?

Thanks,
-Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 16:26                 ` Eric Sandeen
@ 2009-04-20 16:40                   ` Jeremy Sanders
  0 siblings, 0 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-20 16:40 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Theodore Tso, linux-ext4

On Mon, 20 Apr 2009, Eric Sandeen wrote:

> Jeremy Sanders wrote:
>> On Mon, 20 Apr 2009, Eric Sandeen wrote:
> ...
>
>>> One other question - do you make use of xattrs on this filesystem?
>>
>> No.
>
> I've commandeered about 10T of disk space to see if I can hit this.
> Would you mind providing dumpe2fs -h output for your 8.2T filesystem so
> I can exactly replicate the geometry?

I formatted with

mkfs.ext4 -m0 -b 4096 -E stride=8,stripe-width=72  /dev/md0

[root@xback2 ~]#  dumpe2fs -h /dev/md0
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID:          34fefacb-0494-4df7-b189-e11b2064dd90
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              549314560
Block count:              2197239840
Reserved block count:     0
Free blocks:              2162717221
Free inodes:              549314549
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      500
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              8
RAID stripe width:        72
Flex block group size:    16
Filesystem created:       Mon Apr 20 17:29:14 2009
Last mount time:          n/a
Last write time:          Mon Apr 20 17:38:28 2009
Mount count:              0
Maximum mount count:      38
Last checked:             Mon Apr 20 17:29:14 2009
Check interval:           15552000 (6 months)
Next check after:         Sat Oct 17 17:29:14 2009
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:	          256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      06d43af3-a75c-405a-8f25-e51517dae7f6
Journal backup:           inode blocks
Journal size:             128M
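
For anyone replicating the geometry, the figures above can be
cross-checked with a little arithmetic. This is a minimal sketch: the
constants are copied from the dumpe2fs output, and the 32-byte group
descriptor size is an assumption (the feature list shows no 64bit flag).

```python
# Values copied from the dumpe2fs -h output above.
BLOCK_COUNT = 2197239840
BLOCK_SIZE = 4096
BLOCKS_PER_GROUP = 32768
INODE_COUNT = 549314560
INODES_PER_GROUP = 8192
FIRST_BLOCK = 0

# Assumption: 32-byte group descriptors (no 64bit feature in the list).
DESC_SIZE = 32


def ceil_div(a, b):
    """Integer division rounding up."""
    return -(-a // b)


# Number of block groups; the last group may be only partly filled.
groups = ceil_div(BLOCK_COUNT - FIRST_BLOCK, BLOCKS_PER_GROUP)

# Blocks needed to hold one descriptor per group.
descs_per_block = BLOCK_SIZE // DESC_SIZE
gdt_blocks = ceil_div(groups, descs_per_block)

# Filesystem size in TiB.
size_tib = BLOCK_COUNT * BLOCK_SIZE / 2**40

print(f"block groups:          {groups}")
print(f"descriptor blocks:     {gdt_blocks}")
print(f"inodes match groups:   {groups * INODES_PER_GROUP == INODE_COUNT}")
print(f"size:                  {size_tib:.2f} TiB")
```

Under these assumptions there are 67055 block groups, which matches the
inode count (67055 x 8192), and the descriptors occupy 524 blocks.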


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 15:53               ` Jeremy Sanders
  2009-04-20 16:26                 ` Eric Sandeen
@ 2009-04-20 18:28                 ` Andreas Dilger
  2009-04-20 18:55                   ` Jeremy Sanders
  1 sibling, 1 reply; 31+ messages in thread
From: Andreas Dilger @ 2009-04-20 18:28 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: Eric Sandeen, Theodore Tso, linux-ext4

On Apr 20, 2009  16:53 +0100, Jeremy Sanders wrote:
> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> Theodore Tso wrote:
>>>> On Mon, Apr 20, 2009 at 12:43:37PM +0100, Jeremy Sanders wrote:
>>>>> It takes a day or two to do the sync. I've only done it twice (once with
>>>>> the old kernel, once with the new fedora testing kernel) and it happened
>>>>> both times. I'm afraid that is a rather small sample size.
>>>>>
>>>>> I did a different faster test (just copying my home directory lots of
>>>>> times), but I wasn't able to get it to fail. That test didn't use much
>>>>> disk space, however. Maybe it's worth just dd'ing a few TB of data onto
>>>>> the device and seeing whether that fails.
>>>>>
>>>>> I didn't reboot this time - I did last time. I just unmounted the file
>>>>> system and fsckd it. The filesystem is 8.2TB and the data is around
>>>>> 2.5TB.
>>>
>>> I think trying a filesystem with just under 8T would be a useful test too.
>>
>> One other question - do you make use of xattrs on this filesystem?
>
> No.

If you use anything like SELinux or ACLs you would also (indirectly) be
using xattrs.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 18:28                 ` Andreas Dilger
@ 2009-04-20 18:55                   ` Jeremy Sanders
  2009-04-20 20:45                     ` Andreas Dilger
  0 siblings, 1 reply; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-20 18:55 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Eric Sandeen, Theodore Tso, linux-ext4

On Mon, 20 Apr 2009, Andreas Dilger wrote:

> On Apr 20, 2009  16:53 +0100, Jeremy Sanders wrote:
>> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>>> One other question - do you make use of xattrs on this filesystem?
>>
>> No.
>
> If you use anything like SELinux or ACLs you would also (indirectly) be
> using xattrs.

SELinux is switched off and we haven't (knowingly) been using xattrs, but
I remember rsync might copy xattrs, so perhaps they get written in
some way...

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 18:55                   ` Jeremy Sanders
@ 2009-04-20 20:45                     ` Andreas Dilger
  2009-04-22  9:34                       ` Jeremy Sanders
  0 siblings, 1 reply; 31+ messages in thread
From: Andreas Dilger @ 2009-04-20 20:45 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: Eric Sandeen, Theodore Tso, linux-ext4

On Apr 20, 2009  19:55 +0100, Jeremy Sanders wrote:
> On Mon, 20 Apr 2009, Andreas Dilger wrote:
>> On Apr 20, 2009  16:53 +0100, Jeremy Sanders wrote:
>>> On Mon, 20 Apr 2009, Eric Sandeen wrote:
>>>> One other question - do you make use of xattrs on this filesystem?
>>>
>>> No.
>>
>> If you use anything like SELinux or ACLs you would also (indirectly) be
>> using xattrs.
>
> SELinux is switched off and we haven't (knowingly) been using xattrs, but
> I remember rsync might copy xattrs, so perhaps they get written in
> some way...

You can check this with:

	debugfs -c -R "stat {path to file inside filesystem}" /dev/XXX

and check if the "File ACL" field is non-zero:

debugfs -c -R "stat etc/hosts" /dev/sda2
debugfs 1.40.11.sun1 (17-June-2008)
/dev/sda2: catastrophic mode - not reading inode or group bitmaps
Inode: 259128   Type: regular    Mode:  0644   Flags: 0x0   Generation: 2075236634
User:     0   Group:     0   Size: 2258
File ACL: 0    Directory ACL: 0
^^^^^^^^^^^ ##### this would be non-zero #####

Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x49812ef4 -- Wed Jan 28 21:22:12 2009
atime: 0x49ebdce3 -- Sun Apr 19 20:24:35 2009
mtime: 0x49812ef4 -- Wed Jan 28 21:22:12 2009
Size of extra inode fields: 4
Inode version: 0
BLOCKS:
(0):534546
TOTAL: 1
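
To check many files, the same field can be pulled out of saved debugfs
output with a small script. A minimal sketch, assuming the output has
already been captured as text; the function name file_acl_block and the
shortened sample are made up for illustration.

```python
import re


def file_acl_block(stat_output):
    """Return the 'File ACL' value from debugfs stat output, or None."""
    match = re.search(r"File ACL:\s*(\d+)", stat_output)
    return int(match.group(1)) if match else None


# Shortened sample based on the debugfs output above.
sample = """\
Inode: 259128   Type: regular    Mode:  0644   Flags: 0x0
User:     0   Group:     0   Size: 2258
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 8
"""

acl = file_acl_block(sample)
if acl is None:
    print("no File ACL field found")
elif acl:
    print(f"xattr block in use: {acl}")
else:
    print("File ACL is zero")
```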



Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 20:45                     ` Andreas Dilger
@ 2009-04-22  9:34                       ` Jeremy Sanders
  0 siblings, 0 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-22  9:34 UTC (permalink / raw)
  To: linux-ext4

Andreas Dilger wrote:

> File ACL: 0    Directory ACL: 0
> ^^^^^^^^^^^ ##### this would be non-zero #####

These are zero on this device.

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 14:49           ` Eric Sandeen
  2009-04-20 15:51             ` Eric Sandeen
@ 2009-04-22  9:07             ` Jeremy Sanders
  2009-04-22  9:59               ` Thierry Vignaud
  1 sibling, 1 reply; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-22  9:07 UTC (permalink / raw)
  To: linux-ext4

Eric Sandeen wrote:

> I think trying a filesystem with just under 8T would be a useful test too.

Okay, I tried partitioning the md device so that it was under 8T (7.1T in 
fact). Unfortunately I wasn't able to reproduce it in this configuration.

So, either it is an 8T+ problem, which disagrees with the other report, or
the geometry has some sort of impact, or it is because the files I'm copying 
keep changing, so it may have gone away, or it is not 100% reproducible.
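
One way to see why 8T is an interesting boundary: with 4096-byte blocks,
8 TiB is exactly 2^31 blocks, so a larger filesystem has block numbers
that no longer fit in a signed 32-bit integer. Whether that plays any
part in this bug is pure speculation; the sketch below only does the
arithmetic. The 7.1T and 1.5TB figures are approximate, and
blocks_for_tib is a made-up helper.

```python
BLOCK_SIZE = 4096
LIMIT_BLOCKS = 2**31   # first block count that overflows a signed 32-bit int


def blocks_for_tib(tib):
    """Approximate number of 4096-byte blocks in a filesystem of `tib` TiB."""
    return int(tib * 2**40) // BLOCK_SIZE


cases = {
    "8.2T md device (block count from dumpe2fs)": 2197239840,
    "7.1T partition (approximate)": blocks_for_tib(7.1),
    "1.5TB disk from the other report (approximate)": blocks_for_tib(1.5),
}

for name, blocks in cases.items():
    side = "at or above" if blocks >= LIMIT_BLOCKS else "below"
    print(f"{name}: {blocks} blocks, {side} 2^31")
```

Only the 8.2T device is above the boundary, so the 1.5TB report is what
argues against a purely size-related explanation.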

Shall I wait to see if you have a useful testcase from Thierry Vignaud 
before trying something else?

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-22  9:07             ` Jeremy Sanders
@ 2009-04-22  9:59               ` Thierry Vignaud
  0 siblings, 0 replies; 31+ messages in thread
From: Thierry Vignaud @ 2009-04-22  9:59 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-ext4

Jeremy Sanders <jss@ast.cam.ac.uk> writes:

> Shall I wait to see if you have a useful testcase from Thierry Vignaud
> before trying something else?

I wasn't able to reproduce it yet :-(

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20 11:35     ` Theodore Tso
  2009-04-20 11:43       ` Jeremy Sanders
@ 2009-04-24  8:27       ` Jeremy Sanders
  1 sibling, 0 replies; 31+ messages in thread
From: Jeremy Sanders @ 2009-04-24  8:27 UTC (permalink / raw)
  To: linux-ext4

Theodore Tso wrote:

> Do you have to reboot to see this, or is it enough to unmount the
> filesystem?  How big is the ext4 filesystem, and how big was the
> amount of data that you rsync'ed?  One thing that would be worth
> trying if you can easily reproduce is whether it happens on a single
> device disk, or whether it only shows up when you use a /dev/mdX
> device.

I've been able to reproduce it on a single device disk (it was partitioned
to have the same number of blocks as the md device).

Jeremy

-- 
Jeremy Sanders <jss@ast.cam.ac.uk>   http://www-xray.ast.cam.ac.uk/~jss/
X-Ray Group, Institute of Astronomy, University of Cambridge, UK.
Public Key Server PGP Key ID: E1AAE053



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-20  9:33   ` Jeremy Sanders
  2009-04-20 11:35     ` Theodore Tso
@ 2009-04-21 15:14     ` Thierry Vignaud
  2009-04-21 15:52       ` Eric Sandeen
  2009-04-21 16:43       ` Theodore Tso
  1 sibling, 2 replies; 31+ messages in thread
From: Thierry Vignaud @ 2009-04-21 15:14 UTC (permalink / raw)
  To: Jeremy Sanders; +Cc: linux-ext4

Jeremy Sanders <jss@ast.cam.ac.uk> writes:

> However, the system seems to mostly work, so I recreated the ext4 device, 
> I've just run my backup script again and fsck'd the device. It seems the 
> problem is reproducible with the new kernel:
> 
> [root@xback2 ~]# fsck /dev/md0
> fsck 1.41.4 (27-Jan-2009)
> e2fsck 1.41.4 (27-Jan-2009)
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> Group descriptor 0 checksum is invalid.  Fix<y>?
> 
> Looks like there's a real problem in ext4 causing this under certain 
> circumstances (unless an obscure hardware error is somehow giving the same 
> problem).
> 
> To cause this, all I did was rsync a set of directories to the disk. No hard 
> link trees were created.

For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
On reboot, I got the same errors.
The hd was partitioned (all ext4) as:
/ (5GB)  |  /usr (20GB)  |  /pub (1.5TB)

The smaller system filesystems didn't see those errors.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-21 15:14     ` Thierry Vignaud
@ 2009-04-21 15:52       ` Eric Sandeen
       [not found]         ` <m23ac255pp.fsf@vador.mandriva.com>
  2009-04-21 16:43       ` Theodore Tso
  1 sibling, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2009-04-21 15:52 UTC (permalink / raw)
  To: Thierry Vignaud; +Cc: Jeremy Sanders, linux-ext4

Thierry Vignaud wrote:
> Jeremy Sanders <jss@ast.cam.ac.uk> writes:
> 
>> However, the system seems to mostly work, so I recreated the ext4 device, 
>> I've just run my backup script again and fsck'd the device. It seems the 
>> problem is reproducible with the new kernel:
>>
>> [root@xback2 ~]# fsck /dev/md0
>> fsck 1.41.4 (27-Jan-2009)
>> e2fsck 1.41.4 (27-Jan-2009)
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> Group descriptor 0 checksum is invalid.  Fix<y>?
>>
>> Looks like there's a real problem in ext4 causing this under certain 
>> circumstances (unless an obscure hardware error is somehow giving the same 
>> problem).
>>
>> To cause this, all I did was rsync a set of directories to the disk. No hard 
>> link trees were created.
> 
> For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
> 1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
> On reboot, I got the same errors.
> The hd was partitioned (all ext4) as:
> / (5GB)  |  /usr (20GB)  |  /pub (1.5TB)
> 
> The smaller system filesystems didn't see those errors.

Can you provide a little more info on how you copied the 20GB, and
exactly what the errors were?

Thanks,
-Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

[parent not found: <m23ac255pp.fsf@vador.mandriva.com>]

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
       [not found]         ` <m23ac255pp.fsf@vador.mandriva.com>
@ 2009-04-21 16:40           ` Eric Sandeen
  2009-04-21 16:56             ` Thierry Vignaud
  0 siblings, 1 reply; 31+ messages in thread
From: Eric Sandeen @ 2009-04-21 16:40 UTC (permalink / raw)
  To: Thierry Vignaud; +Cc: Jeremy Sanders, linux-ext4

Thierry Vignaud wrote:
> Eric Sandeen <sandeen@redhat.com> writes:
> 
>>>> However, the system seems to mostly work, so I recreated the ext4 device, 
>>>> I've just run my backup script again and fsck'd the device. It seems the 
>>>> problem is reproducible with the new kernel:
>>>>
>>>> [root@xback2 ~]# fsck /dev/md0
>>>> fsck 1.41.4 (27-Jan-2009)
>>>> e2fsck 1.41.4 (27-Jan-2009)
>>>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>>>> Group descriptor 0 checksum is invalid.  Fix<y>?
>>>>
>>>> Looks like there's a real problem in ext4 causing this under certain 
>>>> circumstances (unless an obscure hardware error is somehow giving the same 
>>>> problem).
>>>>
>>>> To cause this, all I did was rsync a set of directories to the disk. No hard 
>>>> link trees were created.
>>> For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
>>> 1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
>>> On reboot, I got the same errors.
>>> The hd was partitioned (all ext4) as:
>>> / (5GB)  |  /usr (20GB)  |  /pub (1.5TB)
>>>
>>> The smaller system filesystems didn't see those errors.
>> Can you provide a little more info on how you copied the 20GB, and
>> exactly what the errors were?
> 
> I just copied some files from a USB hard disk with cp onto the big
> partition (the one that showed the issues).
> The other system partitions (which showed _no_ problems) were filled with
> something like "rsync -rvltpx / /where/it/was/mounted"
> 
> Here's the fsck log:
> 
> 
>-----------------------------------------------------------------------

Wow, awful.

Could you send me dumpe2fs -h output of the large target device, as well
as an "e2image -r" image of the source filesystem?  That way I can
hopefully perfectly replicate your target filesystem as well as the data
you're using to populate it, try the cp myself, and see if I hit the
same thing.

e2image only sends metadata information, not data.  If you are concerned
about filenames, use -s to scramble them, though this *might* impact my
ability to reproduce it...

Thanks,
-Eric

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-21 16:40           ` Eric Sandeen
@ 2009-04-21 16:56             ` Thierry Vignaud
  0 siblings, 0 replies; 31+ messages in thread
From: Thierry Vignaud @ 2009-04-21 16:56 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Jeremy Sanders, linux-ext4

Eric Sandeen <sandeen@redhat.com> writes:

> Could you send me dumpe2fs -h output of the large target device, as
> well as an "e2image -r" image of the source filesystem?  That way I
> can hopefully perfectly replicate your target filesystem as well as
> the data you're using to populate it, try the cp myself, and see if I
> hit the same thing.
> 
> e2image only sends metadata information, not data.  If you are
> concerned about filenames, use -s to scramble them, though this
> *might* impact my ability to reproduce it...

I'll do that (the disk is at home).

The filesystems were formatted with standard mkfs.ext4 (some were formatted
with mkfs.ext4 -F, which is what diskdrake defaults to), that is, using the
standard /etc/mke2fs.conf.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: fsck.ext4: Group descriptors look bad... trying backup blocks...
  2009-04-21 15:14     ` Thierry Vignaud
  2009-04-21 15:52       ` Eric Sandeen
@ 2009-04-21 16:43       ` Theodore Tso
  1 sibling, 0 replies; 31+ messages in thread
From: Theodore Tso @ 2009-04-21 16:43 UTC (permalink / raw)
  To: Thierry Vignaud; +Cc: Jeremy Sanders, linux-ext4

On Tue, Apr 21, 2009 at 05:14:40PM +0200, Thierry Vignaud wrote:
> For the record, I reproduced this bug with 2.6.30-rc2-git6 on a new
> 1.5TB disk. Formatted as ext4, using relatime, copied 20GB.
> On reboot, I got the same errors.
> The hd was partitioned (all ext4) as:
> / (5GB)  |  /usr (20GB)  |  /pub (1.5TB)

Thierry, are you willing to try to see if you can get a reliable
reproduction case?  That's what we need, very badly.  The fact that
you only copied 20GB is very good; better than 2TB.  If you can
reliably reproduce the failure 2 or 3 times, can you give us exact
reproduction instructions?  That would be extremely useful.

Thanks in advance,

						- Ted

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2009-04-24  9:13 UTC | newest]

Thread overview: 31+ messages
2009-04-17 11:03 fsck.ext4: Group descriptors look bad... trying backup blocks Jeremy Sanders
2009-04-17 11:26 ` Jeremy Sanders
2009-04-17 11:56 ` Theodore Tso
2009-04-17 12:16   ` Jeremy Sanders
2009-04-17 17:10     ` Eric Sandeen
2009-04-17 18:51       ` Jeremy Sanders
2009-04-17 12:24   ` Jeremy Sanders
2009-04-17 16:36     ` Theodore Tso
2009-04-17 17:00 ` Eric Sandeen
2009-04-20  9:33   ` Jeremy Sanders
2009-04-20 11:35     ` Theodore Tso
2009-04-20 11:43       ` Jeremy Sanders
2009-04-20 12:48         ` Theodore Tso
2009-04-20 12:54           ` Jeremy Sanders
2009-04-20 14:49           ` Eric Sandeen
2009-04-20 15:51             ` Eric Sandeen
2009-04-20 15:53               ` Jeremy Sanders
2009-04-20 16:26                 ` Eric Sandeen
2009-04-20 16:40                   ` Jeremy Sanders
2009-04-20 18:28                 ` Andreas Dilger
2009-04-20 18:55                   ` Jeremy Sanders
2009-04-20 20:45                     ` Andreas Dilger
2009-04-22  9:34                       ` Jeremy Sanders
2009-04-22  9:07             ` Jeremy Sanders
2009-04-22  9:59               ` Thierry Vignaud
2009-04-24  8:27       ` Jeremy Sanders
2009-04-21 15:14     ` Thierry Vignaud
2009-04-21 15:52       ` Eric Sandeen
     [not found]         ` <m23ac255pp.fsf@vador.mandriva.com>
2009-04-21 16:40           ` Eric Sandeen
2009-04-21 16:56             ` Thierry Vignaud
2009-04-21 16:43       ` Theodore Tso
