linux-ext4.vger.kernel.org archive mirror
* Recovering a damaged ext4 fs - revisited.
@ 2009-02-06  3:06 J.D. Bakker
  2009-02-06  4:02 ` Eric Sandeen
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: J.D. Bakker @ 2009-02-06  3:06 UTC (permalink / raw)
  To: linux-ext4

Hi,

My 4TB ext4 RAID-6 has just become damaged for the second time in two 
months. While I do have backups for most of my data, it would be good 
to know if there is a recovery procedure or a way to avoid these 
crashes. The symptoms are massive group descriptor corruption, 
similar to what was mentioned in 
http://thread.gmane.org/gmane.comp.file-systems.ext4/10844 and 
http://article.gmane.org/gmane.comp.file-systems.ext4/11195 .

The bad news: on the first occurrence I didn't record any information 
but decided to zero the partitions and restart from scratch. This 
second time my kernel is tainted by the nvidia module (as I since 
switched to an nVidia 8500-card from the Radeon X1300 I'd borrowed to 
get the system up).

The machine is an Intel i720 on an Asus P6T with 3GB RAM, running 
2.6.28 x86_64. /dev/md0 is a RAID-6 over six 1TB drives. Details:

http://lartmaker.nl/ext4/kernel-config.txt
http://lartmaker.nl/ext4/dmesg.txt
http://lartmaker.nl/ext4/lspci.txt
http://lartmaker.nl/ext4/proc-mdstat.txt
http://lartmaker.nl/ext4/proc-partitions.txt

This afternoon I issued an rm on a file which was a few hundred MB 
large. The rm process kept running at 100% CPU for over a minute, and 
could not be terminated through either CTRL-C or kill -9 (process 
would remain in the 'R'-state). The kernel reported a soft lockup, 
with the following call trace:

   [<ffffffff8050f1b7>] ? _spin_lock+0x16/0x19
   [<ffffffff80308a23>] ? ext4_mb_init_cache+0x6d2/0x876
   [<ffffffff802754de>] ? __lru_cache_add+0x8a/0xb2
   [<ffffffff80308cd6>] ? ext4_mb_load_buddy+0x10f/0x2f2
   [<ffffffff80309d15>] ? ext4_mb_free_blocks+0x2b3/0x611
   [<ffffffff802f0aa8>] ? ext4_free_blocks+0x75/0xa8
   [<ffffffff80303839>] ? ext4_ext_truncate+0x3f9/0x832
   [<ffffffff802f848e>] ? ext4_truncate+0x67/0x5bc
   [<ffffffff80316279>] ? jbd2_journal_dirty_metadata+0x124/0x146
   [<ffffffff80305ba6>] ? __ext4_journal_dirty_metadata+0x1e/0x46
   [<ffffffff802f3e9b>] ? ext4_mark_iloc_dirty+0x3fa/0x463
   [<ffffffff802f4a81>] ? ext4_mark_inode_dirty+0x134/0x147
   [<ffffffff802f8b2b>] ? ext4_delete_inode+0x148/0x209
   [<ffffffff802f89e3>] ? ext4_delete_inode+0x0/0x209
   [<ffffffff802a7472>] ? generic_delete_inode+0x82/0x108
   [<ffffffff8029ff76>] ? do_unlinkat+0xe2/0x13b
   [<ffffffff8050f8ba>] ? error_exit+0x0/0x70
   [<ffffffff8020bf5a>] ? system_call_fastpath+0x16/0x1b

(full log at http://lartmaker.nl/ext4/softlock-log.txt).

The system was otherwise still responsive, as long as processes 
didn't access the ext4 fs on the RAID array. I tried to halt the 
system, which did not work. Finally I powered the machine down 
manually.

On reboot the system refused to auto-fsck /dev/md0. A manual e2fsck 
-nv /dev/md0 reported:

   e2fsck 1.41.4 (27-Jan-2009)
   ./e2fsck/e2fsck: Group descriptors look bad... trying backup blocks...
   Group descriptor 0 checksum is invalid.  Fix? no
   Group descriptor 1 checksum is invalid.  Fix? no
   Group descriptor 2 checksum is invalid.  Fix? no
   [...]
   Group descriptor 29808 checksum is invalid.  Fix? no
   newraidfs contains a file system with errors, check forced.
   Pass 1: Checking inodes, blocks, and sizes
   Pass 2: Checking directory structure
   Pass 3: Checking directory connectivity
   Pass 4: Checking reference counts
   Pass 5: Checking group summary information
   Block bitmap differences:  [...]
   Fix? no
   Free blocks count wrong for group #0 (23513, counted=464).
   Fix? no
   Free blocks count wrong for group #1 (31743, counted=509).
   Fix? no
   [...]
   Free inodes count wrong for group #7748 (8192, counted=940).
   Fix? no
   Directories count wrong for group #7748 (0, counted=1).
   Fix? no
   Free inodes count wrong for group #7749 (8192, counted=8059).
   Fix? no
   Free inodes count wrong (244195317, counted=237646747).
   Fix? no
   newraidfs: ***** FILE SYSTEM WAS MODIFIED *****
   newraidfs: ********** WARNING: Filesystem still has errors **********
         11 inodes used (0.00%)
      41796 non-contiguous files (379963.6%)
       3002 non-contiguous directories (27290.9%)
            # of inodes with ind/dind/tind blocks: 0/0/0
            Extent depth histogram: 4423417/4694/3
   15377150 blocks used (1.57%)
          0 bad blocks
        106 large files

    3738164 regular files
     685644 directories
       3663 character device files
       8709 block device files
         19 fifos
    2180635 links
      47335 symbolic links (43028 fast symbolic links)
         54 sockets
   --------
    6664223 files
   Error writing block 1 (Attempt to write block from filesystem resulted in short write).  Ignore error? no
   Error writing block 2 (Attempt to write block from filesystem resulted in short write).  Ignore error? no
   Error writing block 3 (Attempt to write block from filesystem resulted in short write).  Ignore error? no
   [...]
   Error writing block 231 (Attempt to write block from filesystem resulted in short write).  Ignore error? no
   Error writing block 232 (Attempt to write block from filesystem resulted in short write).  Ignore error? no

(full log at http://lartmaker.nl/ext4/e2fsck-md0.txt)

As suggested in the earlier threads I ran dumpe2fs; once without the 
-b option, once with -b 32768 and once with -b 98304:

http://lartmaker.nl/ext4/dumpe2fs-md0.txt
http://lartmaker.nl/ext4/dumpe2fs-md0-32768.txt
http://lartmaker.nl/ext4/dumpe2fs-md0-98304.txt
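
For reference, the two -b values above match the usual sparse_super layout, in which superblock backups sit at the start of group 1 and of every group whose number is a power of 3, 5 or 7. A minimal Python sketch, assuming 4 KiB blocks and 32768 blocks per group (neither is stated in this message; the authoritative values are in the dumpe2fs output) and taking the group count from the descriptors numbered 0 to 29808 in the e2fsck log:

```python
BLOCKS_PER_GROUP = 32768   # assumption: 4 KiB blocks, default group size
GROUP_COUNT = 29809        # inferred from group descriptors 0..29808


def is_power_of(n, base):
    """Return True if n == base**k for some integer k >= 0."""
    while n > 1 and n % base == 0:
        n //= base
    return n == 1


def backup_groups(group_count):
    """Groups that carry a superblock backup under sparse_super."""
    return [g for g in range(1, group_count)
            if any(is_power_of(g, base) for base in (3, 5, 7))]


# With 4 KiB blocks the first data block is 0, so group g starts at g * 32768.
backup_blocks = [g * BLOCKS_PER_GROUP for g in backup_groups(GROUP_COUNT)]
print(backup_blocks[:5])   # [32768, 98304, 163840, 229376, 294912]
```

The first two entries are the 32768 and 98304 used for the dumpe2fs runs; the later ones are further candidates for -b if both of those turn out to be damaged.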

Output of findsuper:

http://lartmaker.nl/ext4/findsuper.txt

Please let me know if you need more information.

As I said, is there anything I can do to recover my data, or to make 
sure this doesn't happen again?

Thanks,

JDB.
-- 
LART. 250 MIPS under one Watt. Free hardware design files.
http://www.lartmaker.nl/


* Re: Recovering a damaged ext4 fs - revisited.
  2009-02-06  3:06 Recovering a damaged ext4 fs - revisited J.D. Bakker
@ 2009-02-06  4:02 ` Eric Sandeen
  2009-02-06 12:18   ` J.D. Bakker
  2009-02-06  6:29 ` Andreas Dilger
  2009-02-06 22:15 ` Ric Wheeler
  2 siblings, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2009-02-06  4:02 UTC (permalink / raw)
  To: J.D. Bakker; +Cc: linux-ext4

J.D. Bakker wrote:
> Hi,
> 
> My 4TB ext4 RAID-6 has just become damaged for the second time in two 
> months. While I do have backups for most of my data, it would be good 
> to know if there is a recovery procedure or a way to avoid these 
> crashes. The symptoms are massive group descriptor corruption, 
> similar to what was mentioned in 
> http://thread.gmane.org/gmane.comp.file-systems.ext4/10844 and 
> http://article.gmane.org/gmane.comp.file-systems.ext4/11195 .

.... snip ....

>    Error writing block 1 (Attempt to write block from filesystem 
> resulted in short write).  Ignore error? no
>    Error writing block 2 (Attempt to write block from filesystem 
> resulted in short write).  Ignore error? no
>    Error writing block 3 (Attempt to write block from filesystem 
> resulted in short write).  Ignore error? no
>    [...]
>    Error writing block 231 (Attempt to write block from filesystem 
> resulted in short write).  Ignore error? no
>    Error writing block 232 (Attempt to write block from filesystem 
> resulted in short write).  Ignore error? no
> 
> (full log at http://lartmaker.nl/ext4/e2fsck-md0.txt)

Those seem a bit odd; why are these writes failing?  Anything in the
kernel logs when this happens?  I'm just wondering if there could be
some underlying storage problem?

Thanks,
-Eric


* Re: Recovering a damaged ext4 fs - revisited.
  2009-02-06  3:06 Recovering a damaged ext4 fs - revisited J.D. Bakker
  2009-02-06  4:02 ` Eric Sandeen
@ 2009-02-06  6:29 ` Andreas Dilger
  2009-02-06 12:23   ` J.D. Bakker
  2009-02-06 22:15 ` Ric Wheeler
  2 siblings, 1 reply; 10+ messages in thread
From: Andreas Dilger @ 2009-02-06  6:29 UTC (permalink / raw)
  To: J.D. Bakker; +Cc: linux-ext4

On Feb 06, 2009  04:06 +0100, J.D. Bakker wrote:
> On reboot the system refused to auto-fsck /dev/md0. A manual e2fsck -nv 
> /dev/md0 reported:
>
>   e2fsck 1.41.4 (27-Jan-2009)
>   ./e2fsck/e2fsck: Group descriptors look bad... trying backup blocks...

Not sure why it considers the initial group descriptors bad.

>   Group descriptor 0 checksum is invalid.  Fix? no
>   Group descriptor 1 checksum is invalid.  Fix? no
>   Group descriptor 2 checksum is invalid.  Fix? no
>   [...]
>   Group descriptor 29808 checksum is invalid.  Fix? no

Note that the checksums in the backup blocks are probably incorrect, so
this isn't itself a problem.

>   newraidfs contains a file system with errors, check forced.
>   Pass 1: Checking inodes, blocks, and sizes
>   Pass 2: Checking directory structure
>   Pass 3: Checking directory connectivity
>   Pass 4: Checking reference counts
>   Pass 5: Checking group summary information
>   Block bitmap differences:  [...]
>   Fix? no
>   Free blocks count wrong for group #0 (23513, counted=464).
>   Fix? no
>   Free blocks count wrong for group #1 (31743, counted=509).
>   Fix? no
>   [...]
>   Free inodes count wrong for group #7748 (8192, counted=940).
>   Fix? no
>   Directories count wrong for group #7748 (0, counted=1).
>   Fix? no
>   Free inodes count wrong for group #7749 (8192, counted=8059).
>   Fix? no
>   Free inodes count wrong (244195317, counted=237646747).
>   Fix? no

These also look like trivial errors, due to using the backup group
descriptors (which are not kept up-to-date).  It appears from this
e2fsck output that there isn't really anything wrong with the fs.

>   Error writing block 1 (Attempt to write block from filesystem resulted 
> in short write).  Ignore error? no
>   Error writing block 2 (Attempt to write block from filesystem resulted 
> in short write).  Ignore error? no
>   Error writing block 3 (Attempt to write block from filesystem resulted 
> in short write).  Ignore error? no

This is a serious problem.

> As I said, is there anything I can do to recover my data, or to make  
> sure this doesn't happen again?

I would say to run "e2fsck -fp /dev/XXX" and your data _should_ be
there.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: Recovering a damaged ext4 fs - revisited.
  2009-02-06  4:02 ` Eric Sandeen
@ 2009-02-06 12:18   ` J.D. Bakker
  2009-02-06 15:23     ` Eric Sandeen
  0 siblings, 1 reply; 10+ messages in thread
From: J.D. Bakker @ 2009-02-06 12:18 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: linux-ext4

At 22:02 -0600 05-02-2009, Eric Sandeen wrote:
>J.D. Bakker wrote:
>  >    Error writing block 1 (Attempt to write block from filesystem
>>  resulted in short write).  Ignore error? no
>>     Error writing block 2 (Attempt to write block from filesystem
>>  resulted in short write).  Ignore error? no
>>     Error writing block 3 (Attempt to write block from filesystem
>>  resulted in short write).  Ignore error? no
>>     [...]
>>     Error writing block 231 (Attempt to write block from filesystem
>>  resulted in short write).  Ignore error? no
>>     Error writing block 232 (Attempt to write block from filesystem
>>  resulted in short write).  Ignore error? no
>>
>>  (full log at http://lartmaker.nl/ext4/e2fsck-md0.txt)
>
>Those seem a bit odd; why are these writes failing?  Anything in the
>kernel logs when this happens?  I'm just wondering if there could be
>some underlying storage problem?

No, nothing in the logs.

Isn't this a side-effect of me passing the -n option to e2fsck? I 
haven't traced the full path in the e2fsprogs-source, but it would 
appear that the -n option sets E2F_OPT_NO, which sets 
E2F_OPT_READONLY, which clears EXT2_FLAG_RW, which (in a few places) 
clears IO_FLAG_RW, which appears to open the fs RO (as expected).
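
The last step of that chain can be reproduced outside e2fsck: a write() on a descriptor that was opened read-only fails immediately. A minimal Python sketch using a scratch temporary file rather than a block device; it does not use e2fsprogs, and whether e2fsck 1.41.4 reports such a failure as a "short write" is an assumption that this thread does not confirm:

```python
import errno
import os
import tempfile


def write_on_readonly_fd():
    """Open a scratch file read-only, attempt a write, return the errno name."""
    with tempfile.NamedTemporaryFile() as scratch:
        fd = os.open(scratch.name, os.O_RDONLY)
        try:
            os.write(fd, b"\0" * 4096)   # one 4 KiB block of zeroes
        except OSError as exc:
            return errno.errorcode[exc.errno]
        finally:
            os.close(fd)
    return None                          # the write unexpectedly succeeded


print(write_on_readonly_fd())            # EBADF on Linux
```

If this is what happens, nothing reaches the disk, which would also explain why the kernel log stays empty.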

JDB.
[I passed -n to e2fsck as I want to keep the fs as untouched as 
possible, and I don't have 4TB in scratch space handy to park a copy]
-- 
LART. 250 MIPS under one Watt. Free hardware design files.
http://www.lartmaker.nl/


* Re: Recovering a damaged ext4 fs - revisited.
  2009-02-06  6:29 ` Andreas Dilger
@ 2009-02-06 12:23   ` J.D. Bakker
  2009-02-06 21:44     ` Andreas Dilger
  0 siblings, 1 reply; 10+ messages in thread
From: J.D. Bakker @ 2009-02-06 12:23 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: linux-ext4

At 23:29 -0700 05-02-2009, Andreas Dilger wrote:
>On Feb 06, 2009  04:06 +0100, J.D. Bakker wrote:
>  > On reboot the system refused to auto-fsck /dev/md0. A manual e2fsck -nv
>  > /dev/md0 reported:
>  > [...]
>  >   Error writing block 1 (Attempt to write block from filesystem resulted
>>  in short write).  Ignore error? no
>>    Error writing block 2 (Attempt to write block from filesystem resulted
>>  in short write).  Ignore error? no
>>    Error writing block 3 (Attempt to write block from filesystem resulted
>>  in short write).  Ignore error? no
>
>This is a serious problem.

Could this be caused by my using the -n option on e2fsck (see my 
reply to Eric)?

>  > As I said, is there anything I can do to recover my data, or to make 
>>  sure this doesn't happen again?
>
>I would say to run "e2fsck -fp /dev/XXX" and your data _should_ be
>there.

No dice:

   newraidfs: Note: if several inode or block bitmap blocks or part
   of the inode table require relocation, you may wish to try
   running e2fsck with the '-b 32768' option first.  The problem
   may lie only with the primary block group descriptors, and
   the backup block group descriptors may be OK.

   newraidfs: Block bitmap for group 7808 is not in group.  (block 3731742663)

   newraidfs: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
	  (i.e., without -a or -p options)
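
A quick sanity check on that block number, as a sketch under assumptions that are not confirmed in this message: 4 KiB blocks and 32768 blocks per group (as the '-b 32768' hint suggests), and 29809 groups (inferred from the descriptors numbered 0 to 29808 in the earlier -n run):

```python
BLOCKS_PER_GROUP = 32768        # assumption: 4 KiB blocks, default group size
GROUP_COUNT = 29809             # inferred from group descriptors 0..29808
GROUP = 7808                    # group named in the e2fsck message
BAD_BITMAP_BLOCK = 3731742663   # block number from the e2fsck message

first_block = GROUP * BLOCKS_PER_GROUP
last_block = first_block + BLOCKS_PER_GROUP - 1
fs_blocks_upper_bound = GROUP_COUNT * BLOCKS_PER_GROUP
implied_group = BAD_BITMAP_BLOCK // BLOCKS_PER_GROUP

print("group 7808 spans blocks", first_block, "to", last_block)
print("filesystem has at most", fs_blocks_upper_bound, "blocks")
print("recorded bitmap block would fall in group", implied_group)
```

Under these assumptions group 7808 covers blocks 255852544 to 255885311 and the whole filesystem has fewer than a billion blocks, so the recorded location is not merely outside the group but past the end of the device. That looks more like a garbage descriptor than a relocated bitmap.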

Thanks,

JDB.
-- 
LART. 250 MIPS under one Watt. Free hardware design files.
http://www.lartmaker.nl/


* Re: Recovering a damaged ext4 fs - revisited.
  2009-02-06 12:18   ` J.D. Bakker
@ 2009-02-06 15:23     ` Eric Sandeen
  0 siblings, 0 replies; 10+ messages in thread
From: Eric Sandeen @ 2009-02-06 15:23 UTC (permalink / raw)
  To: J.D. Bakker; +Cc: linux-ext4

J.D. Bakker wrote:
> At 22:02 -0600 05-02-2009, Eric Sandeen wrote:
>> J.D. Bakker wrote:
>>  >    Error writing block 1 (Attempt to write block from filesystem
>>>  resulted in short write).  Ignore error? no
>>>     Error writing block 2 (Attempt to write block from filesystem
>>>  resulted in short write).  Ignore error? no
>>>     Error writing block 3 (Attempt to write block from filesystem
>>>  resulted in short write).  Ignore error? no
>>>     [...]
>>>     Error writing block 231 (Attempt to write block from filesystem
>>>  resulted in short write).  Ignore error? no
>>>     Error writing block 232 (Attempt to write block from filesystem
>>>  resulted in short write).  Ignore error? no
>>>
>>>  (full log at http://lartmaker.nl/ext4/e2fsck-md0.txt)
>> Those seem a bit odd; why are these writes failing?  Anything in the
>> kernel logs when this happens?  I'm just wondering if there could be
>> some underlying storage problem?
> 
> No, nothing in the logs.
> 
> Isn't this a side-effect of me passing the -n option to e2fsck? I 
> haven't traced the full path in the e2fsprogs-source, but it would 
> appear that the -n option sets E2F_OPT_NO, which sets 
> E2F_OPT_READONLY, which clears EXT2_FLAG_RW, which (in a few places) 
> clears IO_FLAG_RW, which appears to open the fs RO (as expected).

oh, perhaps.  I'll have to look more closely; I'd hope (I thought...)
that running it in test mode wouldn't issue such dire error messages :)

-Eric


* Re: Recovering a damaged ext4 fs - revisited.
  2009-02-06 12:23   ` J.D. Bakker
@ 2009-02-06 21:44     ` Andreas Dilger
  0 siblings, 0 replies; 10+ messages in thread
From: Andreas Dilger @ 2009-02-06 21:44 UTC (permalink / raw)
  To: J.D. Bakker; +Cc: linux-ext4

On Feb 06, 2009  13:23 +0100, J.D. Bakker wrote:
> At 23:29 -0700 05-02-2009, Andreas Dilger wrote:
>> On Feb 06, 2009  04:06 +0100, J.D. Bakker wrote:
>>  > On reboot the system refused to auto-fsck /dev/md0. A manual e2fsck -nv
>>  > /dev/md0 reported:
>>  > [...]
>>  >   Error writing block 1 (Attempt to write block from filesystem resulted
>>>  in short write).  Ignore error? no
>>>    Error writing block 2 (Attempt to write block from filesystem resulted
>>>  in short write).  Ignore error? no
>>>    Error writing block 3 (Attempt to write block from filesystem resulted
>>>  in short write).  Ignore error? no
>>
>> This is a serious problem.
>
> Could this be caused by my using the -n option on e2fsck (see my reply to 
> Eric)?

Then it is a bug in e2fsck.  e2fsck shouldn't even TRY to write to the
filesystem without asking the user first, and "-n" means the answer is
always no so it should never do this.

>>  > As I said, is there anything I can do to recover my data, or to make 
>>>  sure this doesn't happen again?
>>
>> I would say to run "e2fsck -fp /dev/XXX" and your data _should_ be
>> there.
>
> No dice:
>
>   newraidfs: Note: if several inode or block bitmap blocks or part
>   of the inode table require relocation, you may wish to try
>   running e2fsck with the '-b 32768' option first.  The problem
>   may lie only with the primary block group descriptors, and
>   the backup block group descriptors may be OK.
>
>   newraidfs: Block bitmap for group 7808 is not in group.  (block 3731742663)
>
>   newraidfs: UNEXPECTED INCONSISTENCY; RUN fsck MANUALLY.
> 	  (i.e., without -a or -p options)

Then you should run it with "-f" and maybe "-b32768" (if it doesn't
do this on its own).  If you want to avoid hitting "y" 27000 times,
you should also add "-y".

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: Recovering a damaged ext4 fs - revisited.
  2009-02-06  3:06 Recovering a damaged ext4 fs - revisited J.D. Bakker
  2009-02-06  4:02 ` Eric Sandeen
  2009-02-06  6:29 ` Andreas Dilger
@ 2009-02-06 22:15 ` Ric Wheeler
  2009-02-06 22:34   ` J.D. Bakker
  2 siblings, 1 reply; 10+ messages in thread
From: Ric Wheeler @ 2009-02-06 22:15 UTC (permalink / raw)
  To: J.D. Bakker; +Cc: linux-ext4

J.D. Bakker wrote:
> Hi,
>
> My 4TB ext4 RAID-6 has just become damaged for the second time in two 
> months. While I do have backups for most of my data, it would be good 
> to know if there is a recovery procedure or a way to avoid these 
> crashes. The symptoms are massive group descriptor corruption, similar 
> to what was mentioned in 
> http://thread.gmane.org/gmane.comp.file-systems.ext4/10844 and 
> http://article.gmane.org/gmane.comp.file-systems.ext4/11195 .

What kind of RAID 6 device are you using? Is it MD raid or some vendor
array?

Ric





* Re: Recovering a damaged ext4 fs - revisited.
  2009-02-06 22:15 ` Ric Wheeler
@ 2009-02-06 22:34   ` J.D. Bakker
  2009-02-06 22:43     ` Ric Wheeler
  0 siblings, 1 reply; 10+ messages in thread
From: J.D. Bakker @ 2009-02-06 22:34 UTC (permalink / raw)
  To: Ric Wheeler; +Cc: linux-ext4

At 17:15 -0500 06-02-2009, Ric Wheeler wrote:
>J.D. Bakker wrote:
>>Hi,
>>
>>My 4TB ext4 RAID-6 has just become damaged for the second time in 
>>two months. While I do have backups for most of my data, it would 
>>be good to know if there is a recovery procedure or a way to avoid 
>>these crashes. The symptoms are massive group descriptor 
>>corruption, similar to what was mentioned in 
>>http://thread.gmane.org/gmane.comp.file-systems.ext4/10844 and 
>>http://article.gmane.org/gmane.comp.file-systems.ext4/11195 .
>What kind of RAID 6 device are you using? Is it MD raid or some vendor array?

md, as shown in the linked config and dmesg.

>>http://lartmaker.nl/ext4/kernel-config.txt
>>http://lartmaker.nl/ext4/dmesg.txt
>>http://lartmaker.nl/ext4/lspci.txt
>>http://lartmaker.nl/ext4/proc-mdstat.txt
>>http://lartmaker.nl/ext4/proc-partitions.txt

JDB.
-- 
LART. 250 MIPS under one Watt. Free hardware design files.
http://www.lartmaker.nl/


* Re: Recovering a damaged ext4 fs - revisited.
  2009-02-06 22:34   ` J.D. Bakker
@ 2009-02-06 22:43     ` Ric Wheeler
  0 siblings, 0 replies; 10+ messages in thread
From: Ric Wheeler @ 2009-02-06 22:43 UTC (permalink / raw)
  To: J.D. Bakker; +Cc: Ric Wheeler, linux-ext4

J.D. Bakker wrote:
> At 17:15 -0500 06-02-2009, Ric Wheeler wrote:
>> J.D. Bakker wrote:
>>> Hi,
>>>
>>> My 4TB ext4 RAID-6 has just become damaged for the second time in 
>>> two months. While I do have backups for most of my data, it would be 
>>> good to know if there is a recovery procedure or a way to avoid 
>>> these crashes. The symptoms are massive group descriptor corruption, 
>>> similar to what was mentioned in 
>>> http://thread.gmane.org/gmane.comp.file-systems.ext4/10844 and 
>>> http://article.gmane.org/gmane.comp.file-systems.ext4/11195 .
>> What kind of RAID 6 device are you using? Is it MD raid or some 
>> vendor array?
>
> md, as shown in the linked config and dmesg.
>
>>> http://lartmaker.nl/ext4/kernel-config.txt
>>> http://lartmaker.nl/ext4/dmesg.txt
>>> http://lartmaker.nl/ext4/lspci.txt
>>> http://lartmaker.nl/ext4/proc-mdstat.txt
>>> http://lartmaker.nl/ext4/proc-partitions.txt
>
> JDB.

RAID6 is not that new, but it is newer than MD raid5. Does RAID5/6 
handle the write barriers correctly these days? I think that barriers 
are enabled only for RAID1 which means that your disks might be holding 
up lots of volatile data that will go "poof" if you power off or reboot.

You can "fix" this by disabling the write cache on your drives, but you 
will have a performance hit (at least for S-ATA drives).
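
On Linux the drive write cache is typically toggled with hdparm -W. A dry-run sketch in Python that only prints the commands and runs nothing; the device names are hypothetical placeholders (the real array members are listed in proc-mdstat.txt, not in this thread), and on some drives the setting may need to be reapplied after a power cycle:

```python
import shlex

# Hypothetical member disks; substitute the devices that make up /dev/md0.
MEMBER_DISKS = ["/dev/sda", "/dev/sdb", "/dev/sdc",
                "/dev/sdd", "/dev/sde", "/dev/sdf"]


def write_cache_off_commands(disks):
    """Build one 'hdparm -W 0' command line per disk (write cache off)."""
    return [shlex.join(["hdparm", "-W", "0", disk]) for disk in disks]


for command in write_cache_off_commands(MEMBER_DISKS):
    print(command)   # review, then run as root by hand
```

Printing instead of executing keeps the sketch harmless; running "hdparm -W /dev/sdX" with no value afterwards reports the current setting.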

Ric


