resize2fs: Should never happen: resize inode corrupt!

linux-ext4.vger.kernel.org archive mirror

* resize2fs: Should never happen: resize inode corrupt! - lost key inodes
@ 2015-08-11 18:15 Johan Harvyl
  2015-08-11 22:47 ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Johan Harvyl @ 2015-08-11 18:15 UTC
  To: linux-ext4

Hi,

I recently attempted an operation I have done many, many times before: adding a
drive to a RAID array, followed by an offline resize2fs to expand the ext4 fs
on it.

This time however it failed miserably and key parts of the filesystem appear
so corrupt that it can no longer be mounted.

Here is what triggered all this:
# umount /dev/md0
# fsck.ext4 -f /dev/md0
# resize2fs /dev/md0
Should never happen: resize inode corrupt!

It looks to me like there is some sanity check missing in resize2fs, and I
would like to figure out what.

Scanning through the linux-ext4 archives a bit I found the
"64bit + resize2fs... this is Not Good" thread:
http://www.spinics.net/lists/linux-ext4/msg35039.html

His problem looks somewhat similar to mine although I do not see the same
possible root cause.

Googling, I also found a few threads like:
http://www.spinics.net/lists/linux-ext4/msg27511.html
which suggest that it would not be possible to resize a 64bit fs with
resize_inode and flex_bg, but those threads are old and resize2fs 1.42.13 (my
version) did not report that combination as a problem.

Any input on what resize2fs has actually done and suggestions on what to try
to recover would be greatly appreciated.

The md array has been re-started read-only and will remain so for the time
being. I want a clear understanding of what has actually happened before I
try something possibly destructive (like disabling the journal and running
e2fsck -f). To be honest, part of me enjoys getting my hands dirty digging
through the filesystem internals, and there are backups of the important
stuff, but there is still some data I would like to recover.

What I would like is something along the lines of a read-only fsck that lets me
work with the fixed-up fs without actually modifying the underlying block
device, as I do not quite trust e2fsprogs to make further changes to that
filesystem.

The best I have found so far is UFS Explorer, which looks promising. It does
find a lot of the files and has options to copy entire directories onto another
filesystem, but I have no way of knowing that the contents of the files are
actually intact, so it may or may not be worth spending money on.

I will now try to go through a bit of what I have tried and found so far.

For reference, here is the md reshape. At the end of this post there is
some further history on how the md and ext4fs were created and expanded:
# mdadm --add /dev/md0 /dev/sdr
mdadm: added /dev/sdr
# mdadm --grow /dev/md0 --raid-devices=8

[119591.811743] md0: detected capacity change from 20003262300160 to 24003914760192
[119592.891563] VFS: busy inodes on changed media or resized disk md0

Attempt at mounting /dev/md0:
[146160.561297] EXT4-fs (md0): no journal found

Attempt at mounting /dev/md0 with -o ro,noload:
[146592.329911] EXT4-fs (md0): get root inode failed
[146592.329914] EXT4-fs (md0): mount failed

debugfs:  stat <2>
Inode: 2   Type: bad type    Mode:  0000   Flags: 0x0
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 0
File ACL: 0    Directory ACL: 0
Links: 0   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x00000000 -- Thu Jan  1 01:00:00 1970
atime: 0x00000000 -- Thu Jan  1 01:00:00 1970
mtime: 0x00000000 -- Thu Jan  1 01:00:00 1970
Size of extra inode fields: 0
BLOCKS:

debugfs:  stat <7>
Inode: 7   Type: bad type    Mode:  0000   Flags: 0x0
Generation: 0    Version: 0x00000000
User:     0   Group:     0   Size: 0
File ACL: 0    Directory ACL: 0
Links: 0   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x00000000 -- Thu Jan  1 01:00:00 1970
atime: 0x00000000 -- Thu Jan  1 01:00:00 1970
mtime: 0x00000000 -- Thu Jan  1 01:00:00 1970
Size of extra inode fields: 0
BLOCKS:

Manual check of the root inode on the broken filesystem:
  Group  0: block bitmap at 2881, inode bitmap at 2897, inode table at 2913
            4294963995 free blocks, 501 free inodes, 2 used directories, 501 unused inodes
            [Checksum 0x404c]

Clearly, 4294963995 free blocks in a block group of 32768 blocks does not make
sense.
00001000  41 0B 00 00  51 0B 00 00   61 0B 00 00  1B F3 F5 01
00001010  02 00 04 00  00 00 00 00   00 00 00 00  F5 01 4C 40
00001020  00 00 00 00  00 00 00 00   00 00 00 00 *FF FF*00 00
00001030  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
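
A minimal Python sketch of how the 64 bytes above map onto the free-block count, assuming the standard 64-byte ext4 group descriptor layout (the dumpe2fs output in this message reports "Group descriptor size: 64"). The field names are shortened labels chosen for illustration, and the highlight asterisks from the dump are dropped. It shows that the odd count comes from the high 16 bits being 0xFFFF, so the 32-bit value reads as a small negative number when treated as signed.

```python
import struct

# Group 0 descriptor exactly as dumped at offset 0x1000 in this message.
GD_RAW = bytes.fromhex(
    "410B0000 510B0000 610B0000 1BF3F501"
    "02000400 00000000 00000000 F5014C40"
    "00000000 00000000 00000000 FFFF0000"
    "00000000 00000000 00000000 00000000"
)

# Illustrative labels for the little-endian fields of a 64-byte descriptor.
FIELDS = (
    "block_bitmap_lo", "inode_bitmap_lo", "inode_table_lo",
    "free_blocks_lo", "free_inodes_lo", "used_dirs_lo", "flags",
    "exclude_bitmap_lo",
    "block_bitmap_csum_lo", "inode_bitmap_csum_lo", "itable_unused_lo", "checksum",
    "block_bitmap_hi", "inode_bitmap_hi", "inode_table_hi",
    "free_blocks_hi", "free_inodes_hi", "used_dirs_hi", "itable_unused_hi",
    "exclude_bitmap_hi",
    "block_bitmap_csum_hi", "inode_bitmap_csum_hi",
    "reserved",
)
GD = struct.Struct("<3I4HI4H3I4HI2HI")
assert GD.size == 64 and len(FIELDS) == 23

gd = dict(zip(FIELDS, GD.unpack(GD_RAW)))

# The free-block count is split into a low and a high 16-bit half.
free_blocks = (gd["free_blocks_hi"] << 16) | gd["free_blocks_lo"]
# Reinterpret as signed 32-bit to see how far "below zero" the counter is.
free_blocks_signed = free_blocks - (1 << 32) if free_blocks >= (1 << 31) else free_blocks

print("inode table at", gd["inode_table_lo"])
print("free blocks   ", free_blocks, "signed:", free_blocks_signed)
print("checksum      ", hex(gd["checksum"]))
```

Read this way, the counter looks like an underflow rather than random garbage, although that alone does not say which tool produced it.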

In [72]: hex(2913 * 4096 + 1 * 256)
Out[72]: '0xb61100'

00B61100  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
00B61110  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
00B61120  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
00B61130  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
...
00B61700  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
00B61710  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
00B61720  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
00B61730  00 00 00 00  00 00 00 00   00 00 00 00  00 00 00 00
Uh oh, where did the root inode, and the resize inode go?
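
The offset arithmetic used here can be written as a small helper. This is a sketch that assumes the values reported by dumpe2fs in this message (4096-byte blocks, 256-byte inodes, 512 inodes per group); `inode_offset` is a hypothetical name, not an e2fsprogs function. Inode numbers start at 1, so inode N occupies slot N-1 of its group's inode table.

```python
def inode_offset(ino, inode_table_block, block_size=4096, inode_size=256,
                 inodes_per_group=512):
    """Byte offset of inode `ino`, given the inode table start of its group."""
    index = (ino - 1) % inodes_per_group   # slot within the group's table
    return inode_table_block * block_size + index * inode_size

# Root inode (2) with the inode table at block 2913, as computed above.
print(hex(inode_offset(2, 2913)))   # 0xb61100
# Resize inode (7) and journal inode (8) in the same table.
print(hex(inode_offset(7, 2913)))   # 0xb61600
print(hex(inode_offset(8, 2913)))   # 0xb61700
```

By this arithmetic the second dumped region, 0xB61700, is the slot of inode 8 (the journal inode), while the resize inode (7) would sit at 0xB61600, inside the range the dump elides.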

Just to confirm the math, here is the same thing on a clean reference filesystem:
  Group  0: block bitmap at 2641, inode bitmap at 2657, inode table at 2673
            19 free blocks, 501 free inodes, 2 used directories, 501 unused inodes
            [Checksum 0x5791]

In [42]: hex(2673*4096 + 1*256)
Out[42]: '0xa71100'

00A71100  ED 41 00 00  00 10 00 00   D9 D3 BD 55  B7 D3 BD 55
00A71110  B7 D3 BD 55  00 00 00 00   00 00 13 00  08 00 00 00
00A71120  00 00 08 00  23 00 00 00   0A F3 01 00  04 00 00 00
00A71130  00 00 00 00  00 00 00 00   01 00 00 00  EF 5F 00 00
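
For readers following along, the first 64 bytes of that healthy root inode can be decoded as sketched below. The offsets assume the standard ext4 inode layout with an extent tree stored in i_block (the reference filesystem is assumed to have the extent feature, which the extent magic in the dump supports); the variable names are illustrative.

```python
import struct

# First 64 bytes of the reference root inode, as dumped at 0xA71100.
INODE_RAW = bytes.fromhex(
    "ED410000 00100000 D9D3BD55 B7D3BD55"
    "B7D3BD55 00000000 00001300 08000000"
    "00000800 23000000 0AF30100 04000000"
    "00000000 00000000 01000000 EF5F0000"
)

mode, uid, size_lo = struct.unpack_from("<HHI", INODE_RAW, 0)
(links,) = struct.unpack_from("<H", INODE_RAW, 26)
(flags,) = struct.unpack_from("<I", INODE_RAW, 32)

# i_block starts at offset 40: a 12-byte extent header, then 12-byte entries.
magic, entries, max_entries, depth, _gen = struct.unpack_from("<HHHHI", INODE_RAW, 40)
ee_block, ee_len, start_hi, start_lo = struct.unpack_from("<IHHI", INODE_RAW, 52)
first_data_block = (start_hi << 32) | start_lo

print("mode", oct(mode), "size", size_lo, "links", links)
print("extent magic", hex(magic), "depth", depth, "entries", entries)
print("first extent: logical", ee_block, "len", ee_len,
      "physical block", hex(first_data_block))
```

A depth of 0 means the extent entry in the inode points directly at data, so the single directory block of / lives at the physical block printed on the last line.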

The dirent for / is at 0x5FEF * 4096:
05FEF000  02 00 00 00  0C 00 01 02   2E 00 00 00  02 00 00 00
05FEF010  0C 00 02 02  2E 2E 00 00   0B 00 00 00  14 00 0A 02
05FEF020  6C 6F 73 74  2B 66 6F 75   6E 64 00 00  01 80 46 02
In other words ".", "..", "lost+found" and so on...
<END of reference clean file system data>
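
The directory block dumped above can be walked with a few lines of Python. This sketch assumes the ext4 directory entry format used with the filetype feature (4-byte inode, 2-byte record length, 1-byte name length, 1-byte file type, then the name); `parse_dirents` is a hypothetical helper and it stops at the first entry that does not fit in the bytes available.

```python
import struct

# The 48 bytes shown at 0x05FEF000 on the reference filesystem.
DIR_RAW = bytes.fromhex(
    "02000000 0C000102 2E000000 02000000"
    "0C000202 2E2E0000 0B000000 14000A02"
    "6C6F7374 2B666F75 6E640000 01804602"
)

def parse_dirents(buf):
    """Yield (inode, rec_len, file_type, name) for each complete entry in buf."""
    pos = 0
    while pos + 8 <= len(buf):
        ino, rec_len, name_len, ftype = struct.unpack_from("<IHBB", buf, pos)
        if rec_len < 8 or pos + 8 + name_len > len(buf):
            break  # truncated or implausible entry: stop rather than guess
        name = buf[pos + 8 : pos + 8 + name_len].decode("ascii", "replace")
        yield ino, rec_len, ftype, name
        pos += rec_len

entries = list(parse_dirents(DIR_RAW))
for ino, rec_len, ftype, name in entries:
    print(f"inode {ino:>3}  rec_len {rec_len:>2}  type {ftype}  {name!r}")
```

File type 2 is a directory, and inode 11 is the usual home of lost+found on a freshly created filesystem.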

Going back to the broken filesystem again, the root dirent is at:
01DE8000  02 00 00 00  0C 00 01 02   2E 00 00 00  02 00 00 00
01DE8010  0C 00 02 02  2E 2E 00 00   0B 00 00 00  14 00 0A 02
01DE8020  6C 6F 73 74  2B 66 6F 75   6E 64 00 00  0C 40 8C 03
But again where is its inode?

I have not been able to find an inode that references that block, at least
not in the same way I see on other filesystems.

###
Current kernel (stock debian):
4.0.0-2-amd64 #1 SMP Debian 4.0.8-2 (2015-07-22) x86_64 GNU/Linux
Current (when the failing resize2fs was executed) e2fsprogs version (stock debian): 1.42.13-1

MD and FS information
---
/dev/md0:
      Raid Level : raid6
      Array Size : 23441323008 (22355.39 GiB 24003.91 GB)
   Used Dev Size : 3906887168 (3725.90 GiB 4000.65 GB)
    Raid Devices : 8
   Total Devices : 8

# dumpe2fs -h /dev/md0
dumpe2fs 1.42.13 (17-May-2015)
Filesystem volume name:   <none>
Last mounted on:          /mnt/r0
Filesystem UUID: 13c2eb37-e951-4ad1-b194-21f0880556db
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              91568128
Block count:              5860330752
Reserved block count:     0
Free blocks:              1013128185
Free inodes:              88364147
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         512
Inode blocks per group:   32
RAID stride:              128
RAID stripe width:        512
Flex block group size:    16
Filesystem created:       Wed Jun 25 23:22:06 2014
Last mount time:          Fri Jul 31 15:35:09 2015
Last write time:          Sun Aug  2 08:03:47 2015
Mount count:              0
Maximum mount count:      -1
Last checked:             Sun Aug  2 07:44:35 2015
Check interval:           0 (<none>)
Lifetime writes:          19 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      6bb07dee-8871-4b62-aa92-20080e16cb8c
Journal backup:           inode blocks
Journal superblock magic number invalid!

Some possibly relevant pieces from /etc/mke2fs.conf:
[defaults]
         base_features = sparse_super,large_file,filetype,resize_inode,dir_index,ext_attr
         default_mntopts = acl,user_xattr
         enable_periodic_fsck = 0
         blocksize = 4096
         inode_size = 256
         inode_ratio = 16384

[fs_types]
         ext4 = {
                 features = has_journal,extent,huge_file,flex_bg,uninit_bg,dir_nlink,extra_isize
                 auto_64-bit_support = 1
                 inode_size = 256
         }
Note that this is what that file looks like right now; I cannot think of a way
of telling what it looked like when the filesystem was initially created.

What I can come up with is a best guess since another ext4fs on that same
machine created around the same time (and therefore likely with the same
mke2fs.conf) does not have the resize_inode flag set, which my corrupt
fs has. I have no idea how that got enabled on my corrupt fs.

###
How the md and ext4fs were created and expanded
---
# mdadm --create --verbose --chunk=512 /dev/md0 --level=5 --raid-devices=5 /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: /dev/sdm appears to be part of a raid array:
        level=raid6 devices=8 ctime=Wed Jan 25 23:49:02 2012
mdadm: size set to 3906887168K
mdadm: automatically enabling write-intent bitmap on large array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
---
# mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit
mke2fs 1.42.10 (18-May-2014)
Creating filesystem with 3906887168 4k blocks and 61045248 inodes
Filesystem UUID: 13c2eb37-e951-4ad1-b194-21f0880556db
Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
         2560000000, 3855122432

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
---
# mdadm --add /dev/md0 /dev/sdo
mdadm: added /dev/sdo
# mdadm --grow /dev/md0 --level=6 --raid-devices=6 --backup-file=/mnt/md100/md0_backup
mdadm: level of /dev/md0 changed to raid6
---
# mdadm --add /dev/md0 /dev/sdq
mdadm: added /dev/sdq
# mdadm --grow /dev/md0 --raid-devices=7
---
# umount /dev/md0
# fsck.ext4 -f /dev/md0
# resize2fs /dev/md0

* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-08-11 18:15 resize2fs: Should never happen: resize inode corrupt! - lost key inodes Johan Harvyl
@ 2015-08-11 22:47 ` Theodore Ts'o
  2015-08-12 22:00   ` Johan Harvyl
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2015-08-11 22:47 UTC
  To: Johan Harvyl; +Cc: linux-ext4

On Tue, Aug 11, 2015 at 08:15:58PM +0200, Johan Harvyl wrote:
> 
> I recently attempted an operation I have done many many times before, add a
> drive to a raid array followed by offline resize2fs to expand the ext4fs on
> it.

If you've read the old threads, you'll note that online resize is
actually safer (has proven to have had fewer bugs) than offline
resize, at least with respect to big ext4 file systems.  :-/

I'm not aware of any offline resize bugs with 1.42.13, but it sounds like
you were originally using mke2fs and resize2fs 1.42.10, which did have
some bugs, and so the question is what sort of state it might have
left things in.

It looks like you were resizing the file system from 18TB to 22TB.
There shouldn't have been a resize inode if the file system was larger
than 16TB, and so it sounds like that was what tickled this error message:

> # resize2fs /dev/md0
> Should never happen: resize inode corrupt!

This was after most of the resize work has been done, so the question
is what we need to do to get your file system up and running again.

What does "e2fsck -fn /dev/md0" report?

Hopefully "e2fsck -fy /dev/md0" will fix things for you, but if you
haven't made backups, we should be careful before we move forward.

							- Ted

* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-08-11 22:47 ` Theodore Ts'o
@ 2015-08-12 22:00   ` Johan Harvyl
  2015-08-13 13:27     ` Theodore Ts'o
  0 siblings, 1 reply; 15+ messages in thread
From: Johan Harvyl @ 2015-08-12 22:00 UTC
  To: Theodore Ts'o; +Cc: linux-ext4

On 2015-08-12 00:47, Theodore Ts'o wrote:
> On Tue, Aug 11, 2015 at 08:15:58PM +0200, Johan Harvyl wrote:
>> I recently attempted an operation I have done many many times before, add a
>> drive to a raid array followed by offline resize2fs to expand the ext4fs on
>> it.
> If you've read the old threads, you'll note that online resize is
> actually safer (has proven to have had less bugs) than offline
> resize, at least with respect to big ext4 file systems.  :-/
Hi,

Thank you for your quick response.

I did notice posts about online resizes being safer, which frankly surprised
me; I would have expected the opposite.

I would like to try my best, with your guidance, to track down how I ended up
in this state, if nothing else to avoid others ending up in the same situation.

The filesystem was originally created (with -O 64bit) on a 4*4TB device, so
slightly under 16 TiB, using mke2fs 1.42.10.

I did not manually add the resize_inode feature flag at that time, but it is
possible that I could have added it later with tune2fs, although I can neither
remember doing so nor think of a reason I would have. Could any of the
e2fsprogs have added the resize_inode flag automatically, for instance when it
was expanded the first time, from just below 16 TB to just below 20 TB?

When should this incompatibility of feature flags have been discovered? Was it
wrong to even end up in a state where it was enabled on a >16 TB filesystem?
Should it have been caught in a sanity check before performing the offline
resize?

> I'm not aware of any offline resize with 1.42.13, but it sounds like
> you were originally using mke2fs and resize2fs 1.42.10, which did have
> some bugs, and so the question is what sort of might it might have
> left things.
What kind of bugs are we talking about, mke2fs? resize2fs? e2fsck? Any
specific commits of interest?
I scanned the git log -p --full-history v1.42.10..v1.42.13 -- resize/ and
nothing really jumped out at me...

Are you thinking the fs was actually put into a bad state already when it was
first expanded from 16 TB to 20 TB with resize2fs 1.42.10, although it did not
show at the time?

Can you think of why it would zero out the first thousands of inodes, like the
root inode, lost+found and so on? I am thinking that would help me assess the
potential damage to the files. Could I perhaps expect the same kind of zeroed
out blocks at regular intervals all over the device?

> It looks like you were resizing the file system from 18TB to 22TB.
perhaps not important, but to be clear, 20 TB -> 24 TB
> There shouldn't have been a resize inode if the file system was larger
> than 16TB, and so it sounds like is that was what tickled this error message:
And judging by the error and the code leading up to that error, my guess is
there never was a resize inode on that filesystem even though the feature flag
was for some reason set.
>> # resize2fs /dev/md0
>> Should never happen: resize inode corrupt!
> This was after most of the resize work has been done, so the question
> is what we need to do to get your file system up and running again.
>
> What does "e2fsck -fn /dev/md0" report?

Since the journal inode (as well as the root inode) has been zeroed out in the
resize process, it exits immediately with:
e2fsck 1.42.13 (17-May-2015)
ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
e2fsck: Group descriptors look bad... trying backup blocks...
Superblock has an invalid journal (inode 8).
Clear? no

e2fsck: Illegal inode number while checking ext3 journal for /dev/md0

/dev/md0: ********** WARNING: Filesystem still has errors **********
#

I built the v1.42.13 tag with the fatal error removed, hoping it would
continue, and I ended up with:
# ./e2fsck/e2fsck -fn /dev/md0
e2fsck 1.42.13 (17-May-2015)
ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
./e2fsck/e2fsck: Group descriptors look bad... trying backup blocks...
Superblock has an invalid journal (inode 8).
Clear? no

./e2fsck/e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
Resize inode not valid.  Recreate? no
....

Many hours later I checked the progress, as it had not completed, and it was
still utilizing 100% of one core:
25544 root      20   0   78208  70700   2424 R  93.8  0.3 791:14.82 e2fsck

iotop/iostat indicated no significant disk activity on the device in question.
I have not had time yet to debug where it was stuck. An e2fsck -fn on that
device, when it was still healthy, would typically take an hour or two, not
10+ h.

>
> Hopefully "e2fsck -fy /dev/md0" will fix things for you, but if you
> haven't made backups, we should be careful before we move forward.
>
> 							- Ted
>
I have backups of the most important things, but before trying something that
actually modifies the fs I would like to do as thorough an analysis as I can of
what happened, in order to avoid repeats for myself and others, as I believe
there is a bug in at least one of the e2fsprogs that allowed this to happen.

thanks,
-johan

* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-08-12 22:00   ` Johan Harvyl
@ 2015-08-13 13:27     ` Theodore Ts'o
  2015-08-13 18:12       ` Johan Harvyl
  0 siblings, 1 reply; 15+ messages in thread
From: Theodore Ts'o @ 2015-08-13 13:27 UTC
  To: Johan Harvyl; +Cc: linux-ext4

On Thu, Aug 13, 2015 at 12:00:50AM +0200, Johan Harvyl wrote:

> >I'm not aware of any offline resize with 1.42.13, but it sounds like
> >you were originally using mke2fs and resize2fs 1.42.10, which did have
> >some bugs, and so the question is what sort of might it might have
> >left things.
> What kind of bugs are we talking about, mke2fs? resize2fs? e2fsck? Any
> specific commits of interest?

I suspect it was caused by a bug in resize2fs 1.42.10.  The problem is
that off-line resize2fs is much more powerful; it can handle moving
file system metadata blocks around, so it can grow file systems in
cases which aren't supported by online resize --- and it can shrink
file systems when online resize doesn't support any kind of file
system shrink.  As such, the code is a lot more complicated, whereas
the online resize code is much simpler, and ultimately, much more
robust.

> Can you think of why it would zero out the first thousands of
> inodes, like the root inode, lost+found and so on? I am thinking
> that would help me assess the potential damage to the files. Could I
> perhaps expect the same kind of zeroed out blocks at regular
> intervals all over the device?

I didn't realize that the first thousands of inodes had been zeroed;
either you didn't mention this earlier or I had missed that from your
e-mail.  I suspect the resize inode before the resize was pretty
terribly corrupted, but in a way that e2fsck didn't complain about.

I'll have to see if I can reproduce the problem based on how you
originally created and grew the file system.  Obviously e2fsck and
resize2fs should be changed to make this operation much more robust.
If you can tell me the exact original size (just under 16TB is probably
good enough, but if you know the exact starting size, that might be
helpful), and then the steps by which the file system was grown, and
which version of e2fsprogs was installed at the time, that would be
quite helpful.

Thanks,

						- Ted

* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-08-13 13:27     ` Theodore Ts'o
@ 2015-08-13 18:12       ` Johan Harvyl
  2015-09-03 22:16         ` Johan Harvyl
  0 siblings, 1 reply; 15+ messages in thread
From: Johan Harvyl @ 2015-08-13 18:12 UTC
  To: Theodore Ts'o; +Cc: linux-ext4

On 2015-08-13 15:27, Theodore Ts'o wrote:
> On Thu, Aug 13, 2015 at 12:00:50AM +0200, Johan Harvyl wrote:
>
>>> I'm not aware of any offline resize with 1.42.13, but it sounds like
>>> you were originally using mke2fs and resize2fs 1.42.10, which did have
>>> some bugs, and so the question is what sort of might it might have
>>> left things.
>> What kind of bugs are we talking about, mke2fs? resize2fs? e2fsck? Any
>> specific commits of interest?
> I suspect it was caused by a bug in resize2fs 1.42.10.  The problem is
> that off-line resize2fs is much more powerful; it can handle moving
> file system metadata blocks around, so it can grow file systems in
> cases which aren't supported by online resize --- and it can shrink
> file systems when online resize doesn't support any kind of file
> system shrink.  As such, the code is a lot more complicated, whereas
> the online resize code is much simpler, and ultimately, much more
> robust.
Understood, so would it have been possible to move from my 20 TB -> 24 TB fs
with online resize? I am confused by the threads I see on the net with regards
to this.
>> Can you think of why it would zero out the first thousands of
>> inodes, like the root inode, lost+found and so on? I am thinking
>> that would help me assess the potential damage to the files. Could I
>> perhaps expect the same kind of zeroed out blocks at regular
>> intervals all over the device?
> I didn't realize that the first thousands of inodes had been zeroed;
> either you didn't mention this earier or I had missed that from your
> e-mail.  I suspect the resize inode before the resize was pretty
> terribly corrupted, but in a way that e2fsck didn't complain.

Hi,

I may not have been clear that it was not just the first handful of inodes.

When I manually sampled some inodes with debugfs and a disk editor, the first
group I found valid inodes in was:
  Group 48: block bitmap at 1572864, inode bitmap at 1572880, inode table at 1572896

With 512 inodes per group, that would mean at least some 24k inodes are blanked
out, but I did not check them all. I just sampled groups manually, so there
could be some valid inodes in some of the groups below group 48, or a lot more
invalid ones afterwards.

> I'll have to try to reproduce the problem based how you originally
> created and grew the file system and see if I can somehow reproduce
> the problem.  Obviously e2fsck and resize2fs should be changed to make
> this operation much more robust.  If you can tell me the exact
> original size (just under 16TB is probably good enough, but if you
> know the exact starting size, that might be helpful), and then steps
> by which the file system was grown, and which version of e2fsprogs was
> installed at the time, that would be quite helpful.
>
> Thanks,
>
> 						- Ted

Cool, I will try to go through its history in some detail below.

If you have ideas on what I could look for, such as whether there is a
particular periodicity to the corruption, I can write some Python to explore
such theories.
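
One way such a scan could look, as a rough sketch only: it assumes the caller already has a mapping from group number to inode table start block (for example parsed from dumpe2fs output), and that each table is 32 blocks of 4096 bytes as reported earlier in the thread. `zeroed_inode_tables` is a hypothetical helper. An all-zero table is not proof of damage by itself, since with uninit_bg a group that never held inodes can legitimately read back as zeros.

```python
import io

def zeroed_inode_tables(dev, table_blocks, block_size=4096, blocks_per_table=32):
    """Return the group numbers whose inode table reads back as all zeros.

    `dev` is any seekable binary file object opened read-only;
    `table_blocks` maps group number -> inode table start block.
    """
    size = block_size * blocks_per_table
    zero = bytes(size)
    result = []
    for group in sorted(table_blocks):
        dev.seek(table_blocks[group] * block_size)
        if dev.read(size) == zero:
            result.append(group)
    return result

# Tiny in-memory demonstration: three tables, the last one holding one inode.
table_bytes = 4096 * 32
image = io.BytesIO(bytes(2 * table_bytes) + b"\xed\x41" + bytes(table_bytes - 2))
print(zeroed_inode_tables(image, {0: 0, 1: 32, 2: 64}))
```

Pointed at the read-only md device, the resulting list of group numbers could then be inspected for runs or regular spacing.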


The filesystem was originally created with e2fsprogs 1.42.10-1 and most likely
linux-image 3.14 from Debian.

# mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit
mke2fs 1.42.10 (18-May-2014)
Creating filesystem with 3906887168 4k blocks and 61045248 inodes
Filesystem UUID: 13c2eb37-e951-4ad1-b194-21f0880556db
Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
         2560000000, 3855122432

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
#

It was expanded with 4 TB (another 976721792 4k blocks). As best I can tell from
my logs, this was done with either e2fsprogs:amd64 1.42.12-1 or 1.42.12-1.1
(Debian packages) and Linux 3.16. Everything was running fine after this.
NOTE #1: It does *not* look like this filesystem was ever touched by resize2fs
1.42.10.
NOTE #2: The diff between Debian packages 1.42.12-1 and 1.42.12-1.1 appears to
be this:
49d0fe2 libext2fs: fix potential buffer overflow in closefs()

Then came the final 4 TB, for a total of 5860330752 4k blocks, which was done
with e2fsprogs:amd64 1.42.13-1 and Linux 4.0. This is where the:
"Should never happen: resize inode corrupt"
was seen.

In both cases the same offline resize was done, with no exotic options:
# umount /dev/md0
# fsck.ext4 -f /dev/md0
# resize2fs /dev/md0
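
The block counts quoted in this history can be checked against the 2^32-block mark (16 TiB with 4k blocks) with a short calculation. This is only arithmetic on numbers from the thread; the constant names are illustrative, and it does not by itself establish what went wrong.

```python
BLOCK_SIZE = 4096
BLOCKS_CREATED = 3906887168    # from the mke2fs 1.42.10 output
BLOCKS_PER_STEP = 976721792    # one additional 4 TB drive
LIMIT_32BIT = 2**32            # block numbers from here on need more than 32 bits

# Size at creation, after the first resize, and after the second resize.
sizes = [BLOCKS_CREATED + n * BLOCKS_PER_STEP for n in range(3)]

for blocks in sizes:
    tib = blocks * BLOCK_SIZE / 2**40
    print(f"{blocks:>11} blocks  {tib:6.2f} TiB  beyond 2^32 blocks: {blocks > LIMIT_32BIT}")

# Cross-check against the mdadm "Array Size" (given in KiB) and dumpe2fs.
assert 23441323008 * 1024 // BLOCK_SIZE == sizes[2] == 5860330752
```

The sizes come out at roughly 14.55, 18.19 and 21.83 TiB, so it was the first growth step, the one that appeared to go fine, that took the filesystem past 2^32 blocks.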

thanks,
-johan

* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-08-13 18:12       ` Johan Harvyl
@ 2015-09-03 22:16         ` Johan Harvyl
  2015-09-12 10:27           ` Johan Harvyl
  0 siblings, 1 reply; 15+ messages in thread
From: Johan Harvyl @ 2015-09-03 22:16 UTC
  To: Theodore Ts'o; +Cc: linux-ext4

Hello again,

I finally got around to digging some more into this and made what I consider
some good progress, as I am now able to mount the filesystem read-only, so
I thought I would update this thread a bit.

Short one-sentence recap since it's been a while since the original
post: I am trying to recover a filesystem that was quite badly damaged
by an offline resize2fs of a fairly modern ext4fs from 20 TB to 24 TB.

I spent a lot of time trying to get something meaningful out of 
e2fsck/debugfs and learned quite a bit in the process and I would like 
to briefly share some observations.

1) The first hurdle when running e2fsck -fnv is that the "Superblock has an
invalid journal (inode 8)" error is considered fatal and cannot be fixed, at
least not in r/o mode, so e2fsck just stops. This check needed to go away.

2) e2fsck gets utterly confused by the "bad block inode", which is
incorrectly identified as having something worth looking at, and it
spends days iterating through blocks (before I cancelled it). Removing
the handling of ino == EXT2_BAD_INO in pass1 and pass1b made things a bit
better.

3) e2fsck using a backup superblock
ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
e2fsck: Group descriptors look bad... trying backup blocks...
This is bad, as it means using a superblock that has not been updated 
with the +4TB. Consequently it gets the location of the first block 
group wrong, or at the very least the first inode table that houses the 
root inode.
Forcing it to use the master superblock again makes things a bit better.
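
As background for item 3, the backup superblock locations that mke2fs printed earlier in the thread follow the sparse_super rule: backups live in group 1 and in groups that are powers of 3, 5 and 7. The sketch below reproduces that list from the block count alone, assuming 32768 blocks per group and first block 0 as reported by dumpe2fs; `backup_superblock_blocks` is a hypothetical helper name.

```python
def backup_superblock_blocks(total_blocks, blocks_per_group=32768):
    """Block numbers of sparse_super backup superblocks (first block 0)."""
    groups = -(-total_blocks // blocks_per_group)   # ceiling division
    backup_groups = {1}
    for base in (3, 5, 7):
        g = base
        while g < groups:
            backup_groups.add(g)
            g *= base
    return [g * blocks_per_group for g in sorted(backup_groups) if g < groups]

at_creation = backup_superblock_blocks(3906887168)
after_resize = backup_superblock_blocks(5860330752)

print(len(at_creation), "backups at creation, last at block", at_creation[-1])
print("added by growing to 5860330752 blocks:",
      sorted(set(after_resize) - set(at_creation)))
```

Knowing these locations makes it possible to compare what each backup copy records (for example with dumpe2fs and its superblock= option) against the primary before deciding which one to trust.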

I have some logs from various e2fsck runs with various amounts of hacks 
applied if they are of any interest to developers? I will also likely 
have the filesystem in this state for a week or two more if any other 
information I can extract is of interest to figure out what made 
resize2fs screw things up.

In the end, the only actual change I have made to the filesystem to make 
it mountable is that I borrowed a root inode from a different filesystem 
and updated the i_block pointer to point to the extent tree 
corresponding to the root inode of my broken filesystem which was quite 
easy to find by just looking for the string "lost+found".
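
The search for the root directory block could be automated along these lines. This is a sketch, not the procedure actually used: it assumes the root directory block begins with ".", ".." and "lost+found" (inode 11) exactly as in both dumps from the first message, and `find_root_dir_blocks` is a hypothetical helper. A pure-Python scan of a 24 TB device would be slow, so `max_blocks` lets the search be limited.

```python
import io

# Leading bytes of a root directory block: "." and ".." both pointing at
# inode 2, then inode 11 with rec_len 20 and the 10-byte name "lost+found".
ROOT_DIR_PREFIX = bytes.fromhex(
    "02000000 0C000102 2E000000"
    "02000000 0C000202 2E2E0000"
    "0B000000 14000A02"
) + b"lost+found"

def find_root_dir_blocks(dev, block_size=4096, max_blocks=None):
    """Yield block numbers whose contents start like a root directory."""
    block = 0
    while max_blocks is None or block < max_blocks:
        data = dev.read(block_size)
        if len(data) < len(ROOT_DIR_PREFIX):
            break  # end of device (or a short trailing read)
        if data.startswith(ROOT_DIR_PREFIX):
            yield block
        block += 1

# In-memory demonstration: the matching block is the fourth one (index 3).
image = (bytes(3 * 4096)
         + ROOT_DIR_PREFIX + bytes(4096 - len(ROOT_DIR_PREFIX))
         + bytes(4096))
print(list(find_root_dir_blocks(io.BytesIO(image))))
```

On the broken filesystem the directory block shown earlier sits at byte offset 0x01DE8000, which is block 0x1DE8 with 4k blocks, so a correct scan should report that block among its hits.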

# mount -o ro,noload /dev/md0 /mnt/loop
[2815465.034803] EXT4-fs (md0): mounted filesystem without journal. Opts: noload

# df -h /dev/md0
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0         22T -382T  404T    - /mnt/loop

Uh oh, that does not look too good... But hey, I am doing some checks on the
data contents and so far the results are very promising. An "ls /" looks good,
and so does a lot of the data that I can verify checksums on; checks are
still running...

I really do not know how to move on with trying to repair the filesystem
with e2fsck. I do not feel brave enough to let it run r/w, given how many
hacks that I consider very dirty were required to even get it this far. At
this point, letting it make changes to the filesystem may actually make it
worse, so I see no other way forward than extracting all the contents and
recreating the filesystem from scratch.

Question is though, what is the recommended way to create the 
filesystem? 64bit is clearly necessary, but what about the other feature 
flags like flex_bg/meta_bg/resize_inode...? I do not care much about 
slight gains in performance, robustness is more important, and that it 
can be resized in the future.

Only online resize from now on, never offline, I learned that lesson...

Will it be possible to expand from 24 TB to 28 TB online?

thanks,
-johan

On 2015-08-13 20:12, Johan Harvyl wrote:
> On 2015-08-13 15:27, Theodore Ts'o wrote:
>> On Thu, Aug 13, 2015 at 12:00:50AM +0200, Johan Harvyl wrote:
>>
>>>> I'm not aware of any offline resize with 1.42.13, but it sounds like
>>>> you were originally using mke2fs and resize2fs 1.42.10, which did have
>>>> some bugs, and so the question is what sort of might it might have
>>>> left things.
>>> What kind of bugs are we talking about, mke2fs? resize2fs? e2fsck? Any
>>> specific commits of interest?
>> I suspect it was caused by a bug in resize2fs 1.42.10.  The problem is
>> that off-line resize2fs is much more powerful; it can handle moving
>> file system metadata blocks around, so it can grow file systems in
>> cases which aren't supported by online resize --- and it can shrink
>> file systems when online resize doesn't support any kind of file
>> system shrink.  As such, the code is a lot more complicated, whereas
>> the online resize code is much simpler, and ultimately, much more
>> robust.
> Understood, so would it have been possible to move from my 20 TB -> 24 
> TB fs with
> online resize? I am confused by the threads I see on the net with 
> regards to this.
>>> Can you think of why it would zero out the first thousands of
>>> inodes, like the root inode, lost+found and so on? I am thinking
>>> that would help me assess the potential damage to the files. Could I
>>> perhaps expect the same kind of zeroed out blocks at regular
>>> intervals all over the device?
>> I didn't realize that the first thousands of inodes had been zeroed;
>> either you didn't mention this earier or I had missed that from your
>> e-mail.  I suspect the resize inode before the resize was pretty
>> terribly corrupted, but in a way that e2fsck didn't complain.
>
> Hi,
>
> I may not have been clear on that it was not just the first handful of 
> inodes.
>
> When I manually sampled some inodes with debugfs and a disk editor, 
> the first group
> I found valid inodes in was:
>  Group 48: block bitmap at 1572864, inode bitmap at 1572880, inode 
> table at 1572896
>
> With 512 inodes per group that would mean at least some 24k inodes are 
> blanked out,
> but I did not check them all, I just sampled groups manually so there 
> could be some
> valid in some of the groups below group 48 or a lot more invalid 
> afterwards.
>
>> I'll have to try to reproduce the problem based how you originally
>> created and grew the file system and see if I can somehow reproduce
>> the problem.  Obviously e2fsck and resize2fs should be changed to make
>> this operation much more robust.  If you can tell me the exact
>> original size (just under 16TB is probably good enough, but if you
>> know the exact starting size, that might be helpful), and then steps
>> by which the file system was grown, and which version of e2fsprogs was
>> installed at the time, that would be quite helpful.
>>
>> Thanks,
>>
>>                         - Ted
>
> Cool, I will try to go through its history in some detail below.
>
> If you have ideas on what I could look for, like ideas on if there is 
> a particular periodicity
> to the corruption I can write some python to explore such theories.
>
>
> The filesystem was originally created with e2fsprogs 1.42.10-1 and 
> most likely linux-image
> 3.14 from Debian.
>
> # mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit
> mke2fs 1.42.10 (18-May-2014)
> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
> Filesystem UUID: 13c2eb37-e951-4ad1-b194-21f0880556db
> Superblock backups stored on blocks:
>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 
> 2654208,
>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 
> 78675968,
>         102400000, 214990848, 512000000, 550731776, 644972544, 
> 1934917632,
>         2560000000, 3855122432
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
> #
>
> It was expanded with 4 TB (another 976721792 4k blocks). Best I can 
> tell from my logs this
> was done with either e2fsprogs:amd64 1.42.12-1 or 1.42.12-1.1 (debian 
> packages) and
> Linux 3.16. Everything was running fine after this.
> NOTE #1: It does *not* look like this filesystem was ever touched by 
> resize2fs 1.42.10.
> NOTE #2: The diff between debian packages 1.42.12-1 and 1.42.12-1.1 
> appear to be this:
> 49d0fe2 libext2fs: fix potential buffer overflow in closefs()
>
> Then for the final 4 TB for a total of 5860330752 4k blocks which was 
> done with
> e2fsprogs:amd64 1.42.13-1 and Linux 4.0. This is where the:
> "Should never happen: resize inode corrupt"
> was seen.
>
> In both cases the same offline resize was done, with no exotic options:
> # umount /dev/md0
> # fsck.ext4 -f /dev/md0
> # resize2fs /dev/md0
>
> thanks,
> -johan


* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-09-03 22:16         ` Johan Harvyl
@ 2015-09-12 10:27           ` Johan Harvyl
  2015-09-14 21:35             ` Johan Harvyl
  0 siblings, 1 reply; 15+ messages in thread
From: Johan Harvyl @ 2015-09-12 10:27 UTC (permalink / raw)
  To: Theodore Ts'o, linux-ext4

Hi,

I have now evacuated the data on the filesystem and I *did* manage to
recreate the "Should never happen: resize inode corrupt!" error using the
versions of e2fsprogs I believe I was using at the time.

The vast majority of the data that I was able to checksum was ok.

For me I guess the way forward is to recreate the fs with 1.42.13 and
stick to online resize from now on, correct?

Are there any feature flags that I should avoid, or any that I must use,
if I want to be able to expand the filesystem later?

-johan


Here is a step-by-step account of what I did to reproduce it.

I have built the following two versions of e2fsprogs (configure, make,
make install, nothing else).

1.42.12 plus one commit on top:
421d693 (HEAD) libext2fs: fix potential buffer overflow in closefs()
6a3741a (tag: v1.42.12) Update release notes, etc. for final 1.42.12 release

1.42.10:
9779e29 (HEAD, tag: v1.42.10) Update release notes, etc. for final 1.42.10 release

===

First, create the fs with 1.42.10, using the exact number of blocks I
originally had.

# MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf \
    /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
mke2fs 1.42.10 (18-May-2014)
/dev/md0 contains a ext4 file system
         created on Sat Sep 12 11:23:02 2015
Proceed anyway? (y,n) y
Creating filesystem with 3906887168 4k blocks and 61045248 inodes
Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
         2560000000, 3855122432

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

From dumpe2fs I observe:
1) the fs features match what I had on my broken fs
2) the number of free blocks is 512088558484167, which is clearly wrong.

# e2fsck -fnv /dev/md0
e2fsck 1.42.13 (17-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (512088558484167, counted=3902749383).
Fix? no

So the initial fs created by 1.42.10 appears to be bad. Here is the
dumpe2fs output for it:

Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              61045248
Block count:              3906887168
Reserved block count:     0
Free blocks:              512088558484167
Free inodes:              61045237
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      185
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         512
Inode blocks per group:   32
Flex block group size:    16
Filesystem created:       Sat Sep 12 11:27:55 2015
Last mount time:          n/a
Last write time:          Sat Sep 12 11:27:55 2015
Mount count:              0
Maximum mount count:      -1
Last checked:             Sat Sep 12 11:27:55 2015
Check interval:           0 (<none>)
Lifetime writes:          158 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed: f252a723-7016-43d1-97f8-579062a215e1
Journal backup:           inode blocks
Journal features:         (none)
Journal size:             128M
Journal length:           32768
Journal sequence:         0x00000001
Journal start:            0
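
A side note on the bogus free block count: splitting it into two 32-bit
halves is suggestive. The sketch below is plain arithmetic on the numbers
printed above. Reading it as "only the upper 32 bits of the 64-bit counter
are wrong" is an observation, not a confirmed root cause.

```python
# Split the free block count reported by dumpe2fs into 32-bit halves
# and compare the halves with values taken from the output above.
reported = 512088558484167      # "Free blocks" as printed by dumpe2fs
counted = 3902749383            # value e2fsck counted in pass 5
total_blocks = 3906887168       # "Block count"
blocks_per_group = 32768        # "Blocks per group"

high, low = divmod(reported, 1 << 32)
groups = -(-total_blocks // blocks_per_group)   # ceiling division

print("high 32 bits:", high)
print("low 32 bits: ", low)
print("low half equals e2fsck's count:", low == counted)
print("high half equals number of block groups:", high == groups)
```

Both comparisons print True: the low half is exactly the count e2fsck
arrives at, and the high half equals the number of block groups (119229).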



The next step is growing the fs by 4 TB with 1.42.12:
# MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf \
    /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
resize2fs 1.42.12 (29-Aug-2014)
<and nothing more>
It did *not* print the "Resizing the filesystem on /dev/md0 to 4883608960
(4k) blocks." line that it should have.

I let it run for 90+ minutes, sampling CPU and I/O usage with iotop from
time to time. It was using more or less 100% CPU with no visible I/O.

So, I let e2fsck fix the free block count and re-did the resize:
# e2fsck -f /dev/md0
e2fsck 1.42.13 (17-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (512088558484167, counted=3902749383).
Fix<y>? yes

/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md0: 11/61045248 files (0.0% non-contiguous), 4137785/3906887168 blocks

# MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf \
    /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
resize2fs 1.42.12 (29-Aug-2014)
Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks.
Begin pass 2 (max = 6)
Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 119229)
Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 5 (max = 8)
Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/md0 is now 4883608960 (4k) blocks long.

dumpe2fs 1.42.13 (17-May-2015)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              76306432
Block count:              4883608960
Reserved block count:     0
Free blocks:              4878450712
Free inodes:              76306421
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         512
Inode blocks per group:   32
RAID stride:              32752
Flex block group size:    16
Filesystem created:       Sat Sep 12 11:41:10 2015
Last mount time:          n/a
Last write time:          Sat Sep 12 11:56:20 2015
Mount count:              0
Maximum mount count:      -1
Last checked:             Sat Sep 12 11:49:28 2015
Check interval:           0 (<none>)
Lifetime writes:          279 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
Journal backup:           inode blocks
Journal features:         (none)
Journal size:             128M
Journal length:           32768
Journal sequence:         0x00000001
Journal start:            0
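
As a cross-check, the block, group and inode counts in these dumps can be
derived from the sizes given on the command lines. The sketch below uses a
deliberately simplified model (inodes per group = bytes per group divided
by the -i ratio). That is an assumption made for illustration; mke2fs
itself applies more rounding rules, which just happen not to matter for
these numbers. The function name is made up for this example.

```python
def geometry(blocks, block_size=4096, blocks_per_group=32768,
             bytes_per_inode=262144):
    """Return (groups, inodes) for a filesystem of `blocks` blocks.

    Simplified model: every group, including a partial last one,
    gets the same number of inodes.
    """
    groups = -(-blocks // blocks_per_group)              # ceiling division
    inodes_per_group = blocks_per_group * block_size // bytes_per_inode
    return groups, groups * inodes_per_group

# Sizes as passed to mkfs.ext4 / resize2fs (in KiB), converted to 4k blocks.
created = 15627548672 // 4      # 3906887168 blocks
grown_once = 19534435840 // 4   # 4883608960 blocks
grown_twice = 5860330752        # target of the final resize, already in blocks

for label, blocks in [("created", created), ("+4 TB", grown_once),
                      ("+8 TB", grown_twice)]:
    print(label, blocks, geometry(blocks))
```

With these inputs the model gives 119229 groups / 61045248 inodes and
149036 groups / 76306432 inodes for the first two sizes, matching the
"Inode count" fields and the "pass 3 (max = ...)" values printed by
resize2fs.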

Looking good so far, and now for the final resize to 24 TB using 1.42.13:
# resize2fs -p /dev/md0
resize2fs 1.42.13 (17-May-2015)
Resizing the filesystem on /dev/md0 to 5860330752 (4k) blocks.
Begin pass 2 (max = 6)
Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 149036)
Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 5 (max = 14)
Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Should never happen: resize inode corrupt!

# dumpe2fs -h /dev/md0
dumpe2fs 1.42.13 (17-May-2015)
Filesystem volume name:   <none>
Last mounted on:          <not available>
Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              91568128
Block count:              5860330752
Reserved block count:     0
Free blocks:              5853069550
Free inodes:              91568117
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         512
Inode blocks per group:   32
RAID stride:              32752
Flex block group size:    16
Filesystem created:       Sat Sep 12 11:41:10 2015
Last mount time:          n/a
Last write time:          Sat Sep 12 12:03:55 2015
Mount count:              0
Maximum mount count:      -1
Last checked:             Sat Sep 12 11:49:28 2015
Check interval:           0 (<none>)
Lifetime writes:          279 MB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
Journal backup:           inode blocks
Journal superblock magic number invalid!
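
Comparing the dumps by eye is error-prone, so here is a small sketch for
diffing two saved `dumpe2fs -h` outputs field by field. The helper names
are made up for this example, and the two sample strings are short
excerpts of the dumps above rather than complete output. Among other
things, such a diff shows that the "Reserved GDT blocks" line is present
in the first dump but absent from the later ones, and that "RAID stride"
only appears after the first resize. Whether either difference is
significant has not been established here.

```python
def parse_header(text):
    """Parse 'Key: value' lines of dumpe2fs -h output into a dict.

    Lines without a colon (banner, wrapped continuation lines) are skipped.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields

def diff_headers(before, after):
    """Return {key: (old, new)} for fields that differ or are missing."""
    a, b = parse_header(before), parse_header(after)
    return {k: (a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)}

before = """\
Block count:              3906887168
Reserved GDT blocks:      185
Inodes per group:         512
"""
after = """\
Block count:              4883608960
Inodes per group:         512
RAID stride:              32752
"""
for key, (old, new) in diff_headers(before, after).items():
    print(f"{key}: {old} -> {new}")
```

A field that is missing on one side shows up as None in the diff.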


On 2015-09-04 00:16, Johan Harvyl wrote:
> Hello again,
>
> I finally got around to digging some more into this and made what I 
> consider some good progress as I am now able to mount the filesystem 
> read-only so I thought I would update this thread a bit.
>
> Short one sentence recap since it's been a while since the original 
> post: I am trying to recover a filesystem that was quite badly damaged 
> by an offline resize2fs of a fairly modern ext4fs from 20 TB to 24 TB.
>
> I spent a lot of time trying to get something meaningful out of 
> e2fsck/debugfs and learned quite a bit in the process and I would like 
> to briefly share some observations.
>
> 1) The first hurdle running e2fsck -fnv is that the "Superblock has an 
> invalid journal (inode 8)" is considered fatal and cannot be fixed, at 
> least not in r/o mode so e2fsck just stops, this check needed to go away.
>
> 2) e2fsck gets utterly confused by the "bad block inode" that 
> incorrectly gets identified as having something worth looking at and 
> spends days iterating through blocks (before I cancelled it). Removing 
> handling if ino == EXT2_BAD_INO in pass1 and pass1b made things a bit 
> better.
>
> 3) e2fsck using a backup superblock
> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
> e2fsck: Group descriptors look bad... trying backup blocks...
> This is bad, as it means using a superblock that has not been updated 
> with the +4TB. Consequently it gets the location of the first block 
> group wrong, or at the very least the first inode table that houses 
> the root inode.
> Forcing it to use the master superblock again makes things a bit better.
>
> I have some logs from various e2fsck runs with various amounts of 
> hacks applied if they are of any interest to developers? I will also 
> likely have the filesystem in this state for a week or two more if any 
> other information I can extract is of interest to figure out what made 
> resize2fs screw things up.
>
>
>
> In the end, the only actual change I have made to the filesystem to 
> make it mountable is that I borrowed a root inode from a different 
> filesystem and updated the i_block pointer to point to the extent tree 
> corresponding to the root inode of my broken filesystem which was 
> quite easy to find by just looking for the string "lost+found".
>
> # mount -o ro,noload /dev/md0 /mnt/loop
> [2815465.034803] EXT4-fs (md0): mounted filesystem without journal. 
> Opts: noload
>
> # df -h /dev/md0
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/md0         22T -382T  404T    - /mnt/loop
>
> Uh oh, does not look too good... But hey, doing some checks on the data 
> contents and so far results are very promising. An "ls /" looks good 
> and so does a lot of the data that I can verify checksums on, checks 
> are still running...
>
> I really do not know how to move on with trying to repair the 
> filesystem with e2fsck. I do not feel brave enough to let it run r/w 
> on it, given how many hacks that I consider very dirty were required 
> to even get it this far. At this point letting it make changes to the 
> filesystem may actually make it worse so I see no other way forward 
> than extracting all the contents and recreating the filesystem from 
> scratch.
>
> Question is though, what is the recommended way to create the 
> filesystem? 64bit is clearly necessary, but what about the other 
> feature flags like flex_bg/meta_bg/resize_inode...? I do not care much 
> about slight gains in performance, robustness is more important, and 
> that it can be resized in the future.
>
> Only online resize from now on, never offline, I learned that lesson...
>
> Will it be possible to expand from 24 TB to 28 TB online?
>
> thanks,
> -johan
>
>
> On 2015-08-13 20:12, Johan Harvyl wrote:
>> On 2015-08-13 15:27, Theodore Ts'o wrote:
>>> On Thu, Aug 13, 2015 at 12:00:50AM +0200, Johan Harvyl wrote:
>>>
>>>>> I'm not aware of any offline resize with 1.42.13, but it sounds like
>>>>> you were originally using mke2fs and resize2fs 1.42.10, which did 
>>>>> have
>>>>> some bugs, and so the question is what sort of state it might have
>>>>> left things in.
>>>> What kind of bugs are we talking about, mke2fs? resize2fs? e2fsck? Any
>>>> specific commits of interest?
>>> I suspect it was caused by a bug in resize2fs 1.42.10.  The problem is
>>> that off-line resize2fs is much more powerful; it can handle moving
>>> file system metadata blocks around, so it can grow file systems in
>>> cases which aren't supported by online resize --- and it can shrink
>>> file systems when online resize doesn't support any kind of file
>>> system shrink.  As such, the code is a lot more complicated, whereas
>>> the online resize code is much simpler, and ultimately, much more
>>> robust.
>> Understood, so would it have been possible to move from my 20 TB -> 
>> 24 TB fs with
>> online resize? I am confused by the threads I see on the net with 
>> regards to this.
>>>> Can you think of why it would zero out the first thousands of
>>>> inodes, like the root inode, lost+found and so on? I am thinking
>>>> that would help me assess the potential damage to the files. Could I
>>>> perhaps expect the same kind of zeroed out blocks at regular
>>>> intervals all over the device?
>>> I didn't realize that the first thousands of inodes had been zeroed;
>>> either you didn't mention this earlier or I had missed that from your
>>> e-mail.  I suspect the resize inode before the resize was pretty
>>> terribly corrupted, but in a way that e2fsck didn't complain.
>>
>> Hi,
>>
>> I may not have been clear on that it was not just the first handful 
>> of inodes.
>>
>> When I manually sampled some inodes with debugfs and a disk editor, 
>> the first group
>> I found valid inodes in was:
>>  Group 48: block bitmap at 1572864, inode bitmap at 1572880, inode 
>> table at 1572896
>>
>> With 512 inodes per group that would mean at least some 24k inodes 
>> are blanked out,
>> but I did not check them all, I just sampled groups manually so there 
>> could be some
>> valid in some of the groups below group 48 or a lot more invalid 
>> afterwards.
>>
>>> I'll have to try to reproduce the problem based how you originally
>>> created and grew the file system and see if I can somehow reproduce
>>> the problem.  Obviously e2fsck and resize2fs should be changed to make
>>> this operation much more robust.  If you can tell me the exact
>>> original size (just under 16TB is probably good enough, but if you
>>> know the exact starting size, that might be helpful), and then steps
>>> by which the file system was grown, and which version of e2fsprogs was
>>> installed at the time, that would be quite helpful.
>>>
>>> Thanks,
>>>
>>>                         - Ted
>>
>> Cool, I will try to go through its history in some detail below.
>>
>> If you have ideas on what I could look for, like ideas on if there is 
>> a particular periodicity
>> to the corruption I can write some python to explore such theories.
>>
>>
>> The filesystem was originally created with e2fsprogs 1.42.10-1 and 
>> most likely linux-image
>> 3.14 from Debian.
>>
>> # mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit
>> mke2fs 1.42.10 (18-May-2014)
>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>> Filesystem UUID: 13c2eb37-e951-4ad1-b194-21f0880556db
>> Superblock backups stored on blocks:
>>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 
>> 1605632, 2654208,
>>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 
>> 78675968,
>>         102400000, 214990848, 512000000, 550731776, 644972544, 
>> 1934917632,
>>         2560000000, 3855122432
>>
>> Allocating group tables: done
>> Writing inode tables: done
>> Creating journal (32768 blocks): done
>> Writing superblocks and filesystem accounting information: done
>> #
>>
>> It was expanded with 4 TB (another 976721792 4k blocks). Best I can 
>> tell from my logs this
>> was done with either e2fsprogs:amd64 1.42.12-1 or 1.42.12-1.1 (debian 
>> packages) and
>> Linux 3.16. Everything was running fine after this.
>> NOTE #1: It does *not* look like this filesystem was ever touched by 
>> resize2fs 1.42.10.
>> NOTE #2: The diff between debian packages 1.42.12-1 and 1.42.12-1.1 
>> appear to be this:
>> 49d0fe2 libext2fs: fix potential buffer overflow in closefs()
>>
>> Then for the final 4 TB for a total of 5860330752 4k blocks which was 
>> done with
>> e2fsprogs:amd64 1.42.13-1 and Linux 4.0. This is where the:
>> "Should never happen: resize inode corrupt"
>> was seen.
>>
>> In both cases the same offline resize was done, with no exotic options:
>> # umount /dev/md0
>> # fsck.ext4 -f /dev/md0
>> # resize2fs /dev/md0
>>
>> thanks,
>> -johan
>



* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-09-12 10:27           ` Johan Harvyl
@ 2015-09-14 21:35             ` Johan Harvyl
  2015-09-15 17:55               ` Johan Harvyl
  0 siblings, 1 reply; 15+ messages in thread
From: Johan Harvyl @ 2015-09-14 21:35 UTC (permalink / raw)
  To: Theodore Ts'o, linux-ext4

In an attempt to narrow down, at a commit level, which versions of
e2fsprogs are needed to reproduce the bad behavior, I repeated my own
step-by-step, initially with a much higher -i 16777216 to mkfs.ext4 in the
hope that fewer inodes would make all the operations run faster.

When I was unable to reproduce with -i 16777216, I switched back to
exactly the sequence that reproduced it the first time, and I *still* did
not get the "Should never happen: resize inode corrupt!" message.

The only reasonable explanation I can come up with is that something
resize2fs expects to be initialized is not being initialized properly. I
have no indications of any issues with the hardware or the underlying md
block device.

What I did notice, however, is that I can get the same kind of filesystem
corruption *without* seeing the "Should never happen: resize inode
corrupt!" message using the following sequence, and this *is* reproducible
time after time:

# MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf \
    /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
# e2fsck -fy /dev/md0 (using 1.42.13)
# resize2fs -p /dev/md0 19534435840k (using 1.42.13)
# resize2fs -p /dev/md0 (using 1.42.13)
# e2fsck -fn /dev/md0
e2fsck 1.42.13 (17-May-2015)
ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
e2fsck: Group descriptors look bad... trying backup blocks...
Superblock has an invalid journal (inode 8).
Clear? no

e2fsck: Illegal inode number while checking ext3 journal for /dev/md0

At this point the root inode is also bad and this fails:
# mount /dev/md0 /mnt/loop -o ro,noload
mount: mount /dev/md0 on /mnt/loop failed: Stale file handle
[3766493.732188] EXT4-fs (md0): get root inode failed
[3766493.732190] EXT4-fs (md0): mount failed
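
To see why the root inode and the journal inode tend to go bad together,
it helps to work out where they live. The sketch below uses the standard
ext2/ext4 inode numbering (root = 2, resize = 7, journal = 8) and the
geometry from the dumpe2fs output (512 inodes per group, 256-byte inodes,
4k blocks). The function name is made up for this example. It only gives
the position relative to the start of a group's inode table; where that
table sits on disk has to be read from the group descriptor.

```python
def locate_inode(ino, inodes_per_group=512, inode_size=256, block_size=4096):
    """Return (group, block within that group's inode table, byte offset)."""
    index = ino - 1                       # inode numbers start at 1
    group, slot = divmod(index, inodes_per_group)
    block, byte = divmod(slot * inode_size, block_size)
    return group, block, byte

for name, ino in [("root", 2), ("resize", 7), ("journal", 8)]:
    print(name, ino, locate_inode(ino))
```

All three land in block 0 of the inode table of group 0, so a single bad
or zeroed inode table block would take out the root, resize and journal
inodes at once.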

Note that only versions 1.42.10 and 1.42.13 are involved now; 1.42.12 is
not needed.

The kernel is the Debian one:
ii  linux-image-4.0.0-2-amd64      4.0.8-2 amd64                Linux 4.0 for 64-bit PCs

For the record, I also tried a more recent e2fsprogs for the resize
(instead of 1.42.13), locally built from:
956b0f1 Merge branch 'maint' into next
and I could still reproduce it on the first attempt.

More verbose logs follow.

Does anyone else have a testbed where they could run the same sequence of
commands?

===

# MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf \
    /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
mke2fs 1.42.10 (18-May-2014)
/dev/md0 contains a ext4 file system
         last mounted on Sun Sep 13 22:19:28 2015
Proceed anyway? (y,n) y
Creating filesystem with 3906887168 4k blocks and 61045248 inodes
Filesystem UUID: e263356e-4fe4-4e9b-bd0c-8edc2c411735
Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
         2560000000, 3855122432

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

# e2fsck -fy /dev/md0
e2fsck 1.42.13 (17-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (512088558484167, counted=3902749383).
Fix? yes


/dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
/dev/md0: 11/61045248 files (0.0% non-contiguous), 4137785/3906887168 blocks

# resize2fs -p /dev/md0 19534435840k
resize2fs 1.42.13 (17-May-2015)
Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks.
Begin pass 2 (max = 6)
Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 119229)
Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 5 (max = 8)
Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/md0 is now 4883608960 (4k) blocks long.

# resize2fs -p /dev/md0
resize2fs 1.42.13 (17-May-2015)
Resizing the filesystem on /dev/md0 to 5860330752 (4k) blocks.
Begin pass 2 (max = 6)
Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 149036)
Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 5 (max = 14)
Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/md0 is now 5860330752 (4k) blocks long.

# e2fsck -fn /dev/md0
e2fsck 1.42.13 (17-May-2015)
ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
e2fsck: Group descriptors look bad... trying backup blocks...
Superblock has an invalid journal (inode 8).
Clear? no

e2fsck: Illegal inode number while checking ext3 journal for /dev/md0

On 2015-09-12 12:27, Johan Harvyl wrote:
> Hi,
>
> I have now evacuated the data on the filesystem and I *did* manage to 
> recreate the
> "Should never happen: resize inode corrupt!" using the versions of 
> e2fsprogs I believe I was using at the time.
>
> The vast majority of the data that I was able to checksum was ok.
>
> For me I guess the way forward should be to recreate the fs with 
> 1.42.13 and stick to online resize
> from now on, correct?
>
> Are there any feature flags that I should not use when expanding file 
> systems or any that I must use?
>
> -johan
>
>
> Here is a step by step of what I did to reproduce
>
> I have built the following two versions of e2fsprogs (configure, make, 
> make install, nothing else):
> 421d693 (HEAD) libext2fs: fix potential buffer overflow in closefs()
> 6a3741a (tag: v1.42.12) Update release notes, etc. for final 1.42.12 
> release
>
> 9779e29 (HEAD, tag: v1.42.10) Update release notes, etc. for final 
> 1.42.10 release
>
> ===
>
> First build the fs with 1.42.10 with the exact number of blocks I 
> originally had.
>
> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf 
> /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 
> 15627548672k
> mke2fs 1.42.10 (18-May-2014)
> /dev/md0 contains a ext4 file system
>         created on Sat Sep 12 11:23:02 2015
> Proceed anyway? (y,n) y
> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
> Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
> Superblock backups stored on blocks:
>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 
> 2654208,
>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 
> 78675968,
>         102400000, 214990848, 512000000, 550731776, 644972544, 
> 1934917632,
>         2560000000, 3855122432
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
>
> From dumpe2fs I observe:
> 1) the fs features match what I had on my broken fs
> 2) the number of free blocks is 512088558484167 which is clearly wrong.
>
> # e2fsck -fnv /dev/md0
> e2fsck 1.42.13 (17-May-2015)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Free blocks count wrong (512088558484167, counted=3902749383).
> Fix? no
>
> So the initial fs created by 1.42.10 appear to be bad.
>
> Filesystem volume name:   <none>
> Last mounted on:          <not available>
> Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index 
> filetype extent 64bit flex_bg sparse_super large_file huge_file 
> uninit_bg dir_nlink extra_isize
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              61045248
> Block count:              3906887168
> Reserved block count:     0
> Free blocks:              512088558484167
> Free inodes:              61045237
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Group descriptor size:    64
> Reserved GDT blocks:      185
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         512
> Inode blocks per group:   32
> Flex block group size:    16
> Filesystem created:       Sat Sep 12 11:27:55 2015
> Last mount time:          n/a
> Last write time:          Sat Sep 12 11:27:55 2015
> Mount count:              0
> Maximum mount count:      -1
> Last checked:             Sat Sep 12 11:27:55 2015
> Check interval:           0 (<none>)
> Lifetime writes:          158 MB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:               256
> Required extra isize:     28
> Desired extra isize:      28
> Journal inode:            8
> Default directory hash:   half_md4
> Directory Hash Seed: f252a723-7016-43d1-97f8-579062a215e1
> Journal backup:           inode blocks
> Journal features:         (none)
> Journal size:             128M
> Journal length:           32768
> Journal sequence:         0x00000001
> Journal start:            0
>
>
>
> The next step is resizing + 4 TB with 1.42.12.
> # MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf 
> /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
> resize2fs 1.42.12 (29-Aug-2014)
> <and nothing more>
> It did *not* print the "Resizing the filesystem on /dev/md0 to 
> 4883608960 (4k) blocks." that it should have.
>
> I let it run for 90+ minutes sampling CPU and IO usage with iotop from 
> time to time. It was using more or less 100% CPU and no visible io.
>
> So, I let e2fsck fix the free block count and re-did the resize:
> # e2fsck -f /dev/md0
> e2fsck 1.42.13 (17-May-2015)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Free blocks count wrong (512088558484167, counted=3902749383).
> Fix<y>? yes
>
> /dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/md0: 11/61045248 files (0.0% non-contiguous), 4137785/3906887168 
> blocks
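
A quick arithmetic check on the bogus free-block count above, offered as an observation and not a diagnosis: splitting 512088558484167 into 32-bit halves gives exactly the counted value in the low half and the number of block groups in the high half. The helper name is made up for illustration.

```python
def split_u64(value):
    """Return (high 32 bits, low 32 bits) of a 64-bit counter."""
    return value >> 32, value & 0xFFFFFFFF

reported = 512088558484167      # "Free blocks" as reported before the fix
counted = 3902749383            # value e2fsck counted in pass 5
block_count = 3906887168
blocks_per_group = 32768

high, low = split_u64(reported)
groups = -(-block_count // blocks_per_group)   # ceiling division

print(high, low, groups)
# prints: 119229 3902749383 119229
```

So the low word was already correct and only the high word was polluted. Why the group count would end up there is not something these numbers can show.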
>
> # MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
> resize2fs 1.42.12 (29-Aug-2014)
> Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks.
> Begin pass 2 (max = 6)
> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> Begin pass 3 (max = 119229)
> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> Begin pass 5 (max = 8)
> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> The filesystem on /dev/md0 is now 4883608960 (4k) blocks long.
>
> dumpe2fs 1.42.13 (17-May-2015)
> Filesystem volume name:   <none>
> Last mounted on:          <not available>
> Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index 
> filetype extent 64bit flex_bg sparse_super large_file huge_file 
> uninit_bg dir_nlink extra_isize
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              76306432
> Block count:              4883608960
> Reserved block count:     0
> Free blocks:              4878450712
> Free inodes:              76306421
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Group descriptor size:    64
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         512
> Inode blocks per group:   32
> RAID stride:              32752
> Flex block group size:    16
> Filesystem created:       Sat Sep 12 11:41:10 2015
> Last mount time:          n/a
> Last write time:          Sat Sep 12 11:56:20 2015
> Mount count:              0
> Maximum mount count:      -1
> Last checked:             Sat Sep 12 11:49:28 2015
> Check interval:           0 (<none>)
> Lifetime writes:          279 MB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:               256
> Required extra isize:     28
> Desired extra isize:      28
> Journal inode:            8
> Default directory hash:   half_md4
> Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
> Journal backup:           inode blocks
> Journal features:         (none)
> Journal size:             128M
> Journal length:           32768
> Journal sequence:         0x00000001
> Journal start:            0
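
The two superblock dumps differ in one line that is easy to miss: "Reserved GDT blocks: 185" is present before the resize and absent after it. A rough sketch of why, assuming the usual layout of one 64-byte descriptor per group packed into 4096-byte blocks (the function name is illustrative only):

```python
def gdt_blocks(block_count, blocks_per_group=32768, block_size=4096, desc_size=64):
    """Number of blocks needed to hold the group descriptor table."""
    groups = -(-block_count // blocks_per_group)      # ceiling division
    descs_per_block = block_size // desc_size         # 64 with these values
    return -(-groups // descs_per_block)

reserved = 185                            # "Reserved GDT blocks" after mkfs
at_mkfs = gdt_blocks(3906887168)          # 16 TB
after_first = gdt_blocks(4883608960)      # 20 TB
after_second = gdt_blocks(5860330752)     # 24 TB

print(at_mkfs, at_mkfs + reserved, after_first, after_second)
# prints: 1863 2048 2329 2795
```

If that reading is right, the first resize already needed more descriptor blocks (2329) than the 2048 set aside at mkfs time, so both resizes had to relocate metadata, which would fit the "Moving inode table" pass in the logs.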
>
> Looking good so far, and now for the final resize to 24 TB using 1.42.13:
> # resize2fs -p /dev/md0
> resize2fs 1.42.13 (17-May-2015)
> Resizing the filesystem on /dev/md0 to 5860330752 (4k) blocks.
> Begin pass 2 (max = 6)
> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> Begin pass 3 (max = 149036)
> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> Begin pass 5 (max = 14)
> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> Should never happen: resize inode corrupt!
>
> # dumpe2fs -h /dev/md0
> dumpe2fs 1.42.13 (17-May-2015)
> Filesystem volume name:   <none>
> Last mounted on:          <not available>
> Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      has_journal ext_attr resize_inode dir_index 
> filetype extent 64bit flex_bg sparse_super large_file huge_file 
> uninit_bg dir_nlink extra_isize
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         clean with errors
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              91568128
> Block count:              5860330752
> Reserved block count:     0
> Free blocks:              5853069550
> Free inodes:              91568117
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Group descriptor size:    64
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         512
> Inode blocks per group:   32
> RAID stride:              32752
> Flex block group size:    16
> Filesystem created:       Sat Sep 12 11:41:10 2015
> Last mount time:          n/a
> Last write time:          Sat Sep 12 12:03:55 2015
> Mount count:              0
> Maximum mount count:      -1
> Last checked:             Sat Sep 12 11:49:28 2015
> Check interval:           0 (<none>)
> Lifetime writes:          279 MB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group root)
> First inode:              11
> Inode size:               256
> Required extra isize:     28
> Desired extra isize:      28
> Journal inode:            8
> Default directory hash:   half_md4
> Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
> Journal backup:           inode blocks
> Journal superblock magic number invalid!
>
>
> On 2015-09-04 00:16, Johan Harvyl wrote:
>> Hello again,
>>
>> I finally got around to digging some more into this and made what I 
>> consider some good progress, as I am now able to mount the filesystem 
>> read-only, so I thought I would update this thread a bit.
>>
>> Short one sentence recap since it's been a while since the original 
>> post: I am trying to recover a filesystem that was quite badly 
>> damaged by an offline resize2fs of a fairly modern ext4fs from 20 TB 
>> to 24 TB.
>>
>> I spent a lot of time trying to get something meaningful out of 
>> e2fsck/debugfs and learned quite a bit in the process and I would 
>> like to briefly share some observations.
>>
>> 1) The first hurdle running e2fsck -fnv is that "Superblock has 
>> an invalid journal (inode 8)" is considered fatal and cannot be 
>> fixed, at least not in r/o mode, so e2fsck just stops. This check 
>> needed to go away.
>>
>> 2) e2fsck gets utterly confused by the "bad block inode", which is 
>> incorrectly identified as having something worth looking at, and 
>> spends days iterating through blocks (before I cancelled it). 
>> Removing the handling of ino == EXT2_BAD_INO in pass1 and pass1b made 
>> things a bit better.
>>
>> 3) e2fsck using a backup superblock
>> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
>> e2fsck: Group descriptors look bad... trying backup blocks...
>> This is bad, as it means using a superblock that has not been updated 
>> with the +4TB. Consequently it gets the location of the first block 
>> group wrong, or at the very least the first inode table that houses 
>> the root inode.
>> Forcing it to use the master superblock again makes things a bit better.
>>
>> I have some logs from various e2fsck runs with various amounts of 
>> hacks applied if they are of any interest to developers? I will also 
>> likely have the filesystem in this state for a week or two more if 
>> any other information I can extract is of interest to figure out what 
>> made resize2fs screw things up.
>>
>>
>>
>> In the end, the only actual change I have made to the filesystem to 
>> make it mountable is that I borrowed a root inode from a different 
>> filesystem and updated the i_block pointer to point to the extent 
>> tree corresponding to the root inode of my broken filesystem which 
>> was quite easy to find by just looking for the string "lost+found".
>>
>> # mount -o ro,noload /dev/md0 /mnt/loop
>> [2815465.034803] EXT4-fs (md0): mounted filesystem without journal. 
>> Opts: noload
>>
>> # df -h /dev/md0
>> Filesystem      Size  Used Avail Use% Mounted on
>> /dev/md0         22T -382T  404T    - /mnt/loop
>>
>> Uh oh, that does not look too good... But hey, I am doing some checks 
>> on the data contents and so far the results are very promising. An 
>> "ls /" looks good and so does a lot of the data that I can verify 
>> checksums on; checks are still running...
>>
>> I really do not know how to move on with trying to repair the 
>> filesystem with e2fsck. I do not feel brave enough to let it run r/w 
>> on the filesystem, given how many hacks that I consider very dirty 
>> were required to even get it this far. At this point letting it make 
>> changes to the filesystem may actually make it worse, so I see no 
>> other way forward than extracting all the contents and recreating 
>> the filesystem from scratch.
>>
>> Question is though, what is the recommended way to create the 
>> filesystem? 64bit is clearly necessary, but what about the other 
>> feature flags like flex_bg/meta_bg/resize_inode...? I do not care 
>> much about slight gains in performance, robustness is more important, 
>> and that it can be resized in the future.
>>
>> Only online resize from now on, never offline, I learned that lesson...
>>
>> Will it be possible to expand from 24 TB to 28 TB online?
>>
>> thanks,
>> -johan
>>
>>
>> On 2015-08-13 20:12, Johan Harvyl wrote:
>>> On 2015-08-13 15:27, Theodore Ts'o wrote:
>>>> On Thu, Aug 13, 2015 at 12:00:50AM +0200, Johan Harvyl wrote:
>>>>
>>>>>> I'm not aware of any offline resize with 1.42.13, but it sounds like
>>>>>> you were originally using mke2fs and resize2fs 1.42.10, which did have
>>>>>> some bugs, and so the question is what sort of state it might have
>>>>>> left things in.
>>>>> What kind of bugs are we talking about, mke2fs? resize2fs? e2fsck? Any
>>>>> specific commits of interest?
>>>> I suspect it was caused by a bug in resize2fs 1.42.10.  The problem is
>>>> that off-line resize2fs is much more powerful; it can handle moving
>>>> file system metadata blocks around, so it can grow file systems in
>>>> cases which aren't supported by online resize --- and it can shrink
>>>> file systems when online resize doesn't support any kind of file
>>>> system shrink.  As such, the code is a lot more complicated, whereas
>>>> the online resize code is much simpler, and ultimately, much more
>>>> robust.
>>> Understood, so would it have been possible to move from my 20 TB -> 
>>> 24 TB fs with
>>> online resize? I am confused by the threads I see on the net with 
>>> regards to this.
>>>>> Can you think of why it would zero out the first thousands of
>>>>> inodes, like the root inode, lost+found and so on? I am thinking
>>>>> that would help me assess the potential damage to the files. Could I
>>>>> perhaps expect the same kind of zeroed out blocks at regular
>>>>> intervals all over the device?
>>>> I didn't realize that the first thousands of inodes had been zeroed;
>>>> either you didn't mention this earlier or I had missed that from your
>>>> e-mail.  I suspect the resize inode before the resize was pretty
>>>> terribly corrupted, but in a way that e2fsck didn't complain.
>>>
>>> Hi,
>>>
>>> I may not have been clear that it was not just the first handful 
>>> of inodes.
>>>
>>> When I manually sampled some inodes with debugfs and a disk editor, 
>>> the first group
>>> I found valid inodes in was:
>>>  Group 48: block bitmap at 1572864, inode bitmap at 1572880, inode 
>>> table at 1572896
>>>
>>> With 512 inodes per group that would mean at least some 24k inodes 
>>> are blanked out,
>>> but I did not check them all, I just sampled groups manually so 
>>> there could be some
>>> valid in some of the groups below group 48 or a lot more invalid 
>>> afterwards.
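
To make the "24k inodes" estimate concrete, here is a small sketch of the inode-number-to-group arithmetic, using the values from the superblock (512 inodes per group, 32768 blocks per group, 256-byte inodes). The function names are illustrative.

```python
INODES_PER_GROUP = 512
BLOCKS_PER_GROUP = 32768
INODE_SIZE = 256

def locate_inode(ino):
    """Return (group, index within group, byte offset in that group's inode table)."""
    group, index = divmod(ino - 1, INODES_PER_GROUP)   # inode numbers start at 1
    return group, index, index * INODE_SIZE

def first_inode_of_group(group):
    """Inode number of the first inode slot in the given group."""
    return group * INODES_PER_GROUP + 1

print(first_inode_of_group(48) - 1)   # inodes in groups 0..47: 24576
print(48 * BLOCKS_PER_GROUP)          # first block of group 48: 1572864
print(locate_inode(2))                # root inode: (0, 1, 256)
```

The second value matches the block bitmap location quoted for group 48, so the group numbering and the on-disk offsets agree at least that far.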
>>>
>>>> I'll have to try to reproduce the problem based on how you originally
>>>> created and grew the file system and see if I can somehow reproduce
>>>> the problem.  Obviously e2fsck and resize2fs should be changed to make
>>>> this operation much more robust.  If you can tell me the exact
>>>> original size (just under 16TB is probably good enough, but if you
>>>> know the exact starting size, that might be helpful), and then steps
>>>> by which the file system was grown, and which version of e2fsprogs was
>>>> installed at the time, that would be quite helpful.
>>>>
>>>> Thanks,
>>>>
>>>>                         - Ted
>>>
>>> Cool, I will try to go through its history in some detail below.
>>>
>>> If you have ideas on what I could look for, like ideas on if there 
>>> is a particular periodicity
>>> to the corruption I can write some python to explore such theories.
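
A minimal sketch of the kind of Python that could test a periodicity theory: given the raw bytes of one inode table (read separately, for example from a read-only image), list which 256-byte slots are entirely zero. How the bytes are obtained is left out on purpose, and the function name is hypothetical.

```python
def zeroed_slots(table_bytes, inode_size=256):
    """Return indexes of inode slots that contain only zero bytes."""
    zero = bytes(inode_size)
    return [
        offset // inode_size
        for offset in range(0, len(table_bytes) - inode_size + 1, inode_size)
        if table_bytes[offset:offset + inode_size] == zero
    ]

# Synthetic example: four slots, the second and fourth wiped.
sample = b"\x01" * 256 + bytes(256) + b"\x02" * 256 + bytes(256)
print(zeroed_slots(sample))   # [1, 3]
```

Unused inodes are legitimately all zero, so a result like this only means something for groups that are known to have held files.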
>>>
>>>
>>> The filesystem was originally created with e2fsprogs 1.42.10-1 and 
>>> most likely linux-image
>>> 3.14 from Debian.
>>>
>>> # mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit
>>> mke2fs 1.42.10 (18-May-2014)
>>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>>> Filesystem UUID: 13c2eb37-e951-4ad1-b194-21f0880556db
>>> Superblock backups stored on blocks:
>>>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 
>>> 1605632, 2654208,
>>>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 
>>> 78675968,
>>>         102400000, 214990848, 512000000, 550731776, 644972544, 
>>> 1934917632,
>>>         2560000000, 3855122432
>>>
>>> Allocating group tables: done
>>> Writing inode tables: done
>>> Creating journal (32768 blocks): done
>>> Writing superblocks and filesystem accounting information: done
>>> #
>>>
>>> It was expanded by 4 TB (another 976721792 4k blocks). As best I can 
>>> tell from my logs, this was done with either e2fsprogs:amd64 1.42.12-1 
>>> or 1.42.12-1.1 (debian packages) and Linux 3.16. Everything was 
>>> running fine after this.
>>> NOTE #1: It does *not* look like this filesystem was ever touched by 
>>> resize2fs 1.42.10.
>>> NOTE #2: The diff between debian packages 1.42.12-1 and 1.42.12-1.1 
>>> appears to be this:
>>> 49d0fe2 libext2fs: fix potential buffer overflow in closefs()
>>>
>>> Then came the final 4 TB, for a total of 5860330752 4k blocks, which 
>>> was done with e2fsprogs:amd64 1.42.13-1 and Linux 4.0. This is where the:
>>> "Should never happen: resize inode corrupt"
>>> was seen.
>>>
>>> In both cases the same offline resize was done, with no exotic options:
>>> # umount /dev/md0
>>> # fsck.ext4 -f /dev/md0
>>> # resize2fs /dev/md0
>>>
>>> thanks,
>>> -johan
>>
>



* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-09-14 21:35             ` Johan Harvyl
@ 2015-09-15 17:55               ` Johan Harvyl
  2015-09-17  1:21                 ` Andreas Dilger
  0 siblings, 1 reply; 15+ messages in thread
From: Johan Harvyl @ 2015-09-15 17:55 UTC (permalink / raw)
  To: Theodore Ts'o, linux-ext4

I have now been able to reproduce the issue where resize2fs corrupts at
least the root, resize and journal inodes, both with version 1.42.13 and
with the more recent commit 956b0f1 of e2fsprogs.

Note that older versions of e2fsprogs need *not* be involved; 1.42.13
and newer also have issues.

Please advise on things I can try to narrow down the root cause of what
has to be an e2fsprogs bug. In particular it would be very useful to
reproduce it faster: running through the mkfs and two resize steps takes
around ten minutes, so iterative testing is slow, and I do not really
have much of a clue which steps would be more likely to overwrite the
inodes.

At some point I would like to return this array to service, but I am not
really comfortable creating a new ext4 filesystem on it without first
understanding how it can become corrupted without even mounting the
file system.

For 1.42.13:
# mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
# resize2fs -p /dev/md0 19534435840k
# resize2fs -p /dev/md0
# e2fsck -fn /dev/md0
e2fsck 1.42.13 (17-May-2015)
ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
e2fsck: Group descriptors look bad... trying backup blocks...
Superblock has an invalid journal (inode 8).
Clear? no

e2fsck: Illegal inode number while checking ext3 journal for /dev/md0

/dev/md0: ********** WARNING: Filesystem still has errors **********


or for 956b0f1:
# MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
# MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/resize2fs -p /dev/md0 19534435840k
# MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/resize2fs -p /dev/md0
# MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/e2fsck -fn /dev/md0
e2fsck 1.43-WIP (18-May-2015)
ext2fs_open2: Superblock checksum does not match superblock
/root/elatest/out/sbin/e2fsck: Superblock invalid, trying backup blocks...
Superblock has an invalid journal (inode 8).
Clear? no

/root/elatest/out/sbin/e2fsck: Illegal inode number while checking ext3 journal for /dev/md0

/dev/md0: ********** WARNING: Filesystem still has errors **********
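
Since the same four-step sequence is repeated for each e2fsprogs build, it can help to generate it instead of retyping it. The sketch below only builds and prints the commands from this message (it does not run anything); the `prefix` argument mirrors the /root/elatest/out install directory used above, and the function name is made up.

```python
def repro_commands(device="/dev/md0", prefix=None):
    """Build the mkfs / resize / resize / fsck sequence as shell strings."""
    def tool(name):
        if prefix is None:
            return name                   # use whatever is on PATH
        return f"MKE2FS_CONFIG={prefix}/etc/mke2fs.conf {prefix}/sbin/{name}"

    return [
        f"{tool('mkfs.ext4')} {device} -i 262144 -m 0 -O 64bit 15627548672k",
        f"{tool('resize2fs')} -p {device} 19534435840k",
        f"{tool('resize2fs')} -p {device}",
        f"{tool('e2fsck')} -fn {device}",
    ]

for line in repro_commands(prefix="/root/elatest/out"):
    print("#", line)
```

Calling `repro_commands()` with no prefix gives the plain 1.42.13 sequence.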

# MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/debugfs -c /dev/md0
debugfs 1.43-WIP (18-May-2015)
/dev/md0: Superblock checksum does not match superblock while opening filesystem
debugfs:  stat <2>
stat: Filesystem not open

# debugfs -c /dev/md0
debugfs 1.42.13 (17-May-2015)
/dev/md0: catastrophic mode - not reading inode or group bitmaps
debugfs:  stat <2>
Inode: 2   Type: bad type    Mode:  0004   Flags: 0x1
Generation: 1    Version: 0x00000001
User:  9440   Group:     0   Size: 618659860
File ACL: 1    Directory ACL: 0
Links: 0   Blockcount: 724107776
Fragment:  Address: 0    Number: 0    Size: 0
ctime: 0x02008000 -- Sun Jan 24 18:46:40 1971
atime: 0x24e000a0 -- Wed Aug  9 12:00:00 1989
mtime: 0x00030000 -- Sat Jan  3 07:36:48 1970
Size of extra inode fields: 6
BLOCKS:
(0):1, (6):618659845 .... and it goes on...
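
One possible reading of the `stat <2>` output, offered as an observation and not a confirmed diagnosis: the fields look like a 64-bit group descriptor interpreted as an inode. The sketch below packs the first 40 bytes of a descriptor in the ext4 on-disk field order and unpacks the same bytes in the inode field order. The descriptor values are not taken from any dump; they are what I would expect for slot 4 of a 16-group flex group whose first block is (1 << 32) + 0x24e00000, with the checksum 0x2b29 back-derived from the Blockcount field.

```python
import struct

# First 40 bytes of struct ext4_group_desc and struct ext4_inode, little-endian.
DESC = struct.Struct("<IIIHHHHIHHHHII")
INODE = struct.Struct("<HHIIIIIHHIII")
assert DESC.size == INODE.size == 40

base = 0x24E00000   # low 32 bits of the flex group's first block (assumed)
slot = 4            # position of the group inside its flex group (assumed)

raw = DESC.pack(
    base + slot,             # bg_block_bitmap_lo
    base + 16 + slot,        # bg_inode_bitmap_lo
    base + 32 + 32 * slot,   # bg_inode_table_lo (32 inode-table blocks per group)
    32768, 512,              # free blocks, free inodes (group completely unused)
    0, 0x0003,               # used dirs, flags (INODE_UNINIT | BLOCK_UNINIT)
    0, 0, 0,                 # exclude bitmap, two bitmap checksums
    512, 0x2B29,             # unused inodes, descriptor checksum (back-derived)
    1, 1,                    # bg_block_bitmap_hi, bg_inode_bitmap_hi
)

(mode, uid, size, atime, ctime, mtime, dtime,
 gid, links, blocks, flags, version) = INODE.unpack(raw)

first_group = ((1 << 32) + base) // 32768   # first group of that flex group

print(f"Mode: {mode & 0o7777:04o}  User: {uid}  Size: {size}")
print(f"ctime: {ctime:#010x}  atime: {atime:#010x}  mtime: {mtime:#010x}")
print(f"Links: {links}  Blockcount: {blocks}  Flags: {flags:#x}")
print(first_group)
```

Every field printed matches the debugfs output above, and the flex group works out to start at group 149952, which is beyond the 149036 groups that existed before the second resize. That would be consistent with descriptors for newly added groups having been written over the start of the first inode table, but it is an inference from the numbers only.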

On 2015-09-14 23:35, Johan Harvyl wrote:
> In an attempt to further isolate which versions of e2fsprogs, at a 
> commit level, are needed to reproduce the bad behavior, I tried my own 
> step-by-step, initially with a much higher -i 16777216 to mkfs.ext4 in 
> the hope that fewer inodes would make all the operations run faster.
>
> When I was unable to reproduce with -i 16777216 instead, I switched 
> back to exactly
> what I reproduced with the first time, and I *still* did not get the 
> "Should never happen:
> resize inode corrupt!".
>
> The only reasonable explanation I can come up with for this is that 
> something that resize2fs expects to be initialized is not being 
> initialized properly. I have no indications of any issues with any 
> hardware or the underlying md block.
>
> What I did however notice is that I can have the same kind of 
> filesystem corruption
> *without* seeing the "Should never happen: resize inode corrupt!" 
> message using the
> following sequence, and this *is* reproducible one time after another:
>
> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
> # e2fsck -fy /dev/md0 (using 1.42.13)
> # resize2fs -p /dev/md0 19534435840k (using 1.42.13)
> # resize2fs -p /dev/md0 (using 1.42.13)
> # e2fsck -fn /dev/md0
> e2fsck 1.42.13 (17-May-2015)
> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
> e2fsck: Group descriptors look bad... trying backup blocks...
> Superblock has an invalid journal (inode 8).
> Clear? no
>
> e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
>
> At this point the root inode is also bad and this fails:
> # mount /dev/md0 /mnt/loop -o ro,noload
> mount: mount /dev/md0 on /mnt/loop failed: Stale file handle
> [3766493.732188] EXT4-fs (md0): get root inode failed
> [3766493.732190] EXT4-fs (md0): mount failed
>
> Note that only versions 1.42.10 and 1.42.13 are involved now, 1.42.12 
> is not needed.
>
> Kernel is the debian:
> ii  linux-image-4.0.0-2-amd64      4.0.8-2 amd64 Linux 4.0 for 64-bit PCs
>
> For the record I also tried a more recent e2fsprogs for the resize 
> (instead of 1.42.13),
> locally built from:
> 956b0f1 Merge branch 'maint' into next
> and I could still reproduce it on the first attempt.
>
> More verbose logs follow.
>
> Does anyone else have some kind of testbed to test the same sequence 
> of commands?
>
> ===
>
> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
> mke2fs 1.42.10 (18-May-2014)
> /dev/md0 contains a ext4 file system
>         last mounted on Sun Sep 13 22:19:28 2015
> Proceed anyway? (y,n) y
> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
> Filesystem UUID: e263356e-4fe4-4e9b-bd0c-8edc2c411735
> Superblock backups stored on blocks:
>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 
> 2654208,
>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 
> 78675968,
>         102400000, 214990848, 512000000, 550731776, 644972544, 
> 1934917632,
>         2560000000, 3855122432
>
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (32768 blocks): done
> Writing superblocks and filesystem accounting information: done
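
The backup locations listed by mke2fs follow the sparse_super rule: backups live in group 1 and in groups whose number is a power of 3, 5 or 7. A short sketch that regenerates the list for this geometry (the function name is illustrative), which may help when choosing a specific backup to hand to e2fsck:

```python
def backup_superblock_blocks(block_count, blocks_per_group=32768):
    """First block of every group that holds a backup superblock (sparse_super)."""
    groups = -(-block_count // blocks_per_group)   # ceiling division
    backup_groups = {1}
    for base in (3, 5, 7):
        power = base
        while power < groups:
            backup_groups.add(power)
            power *= base
    return [g * blocks_per_group for g in sorted(backup_groups) if g < groups]

blocks = backup_superblock_blocks(3906887168)
print(len(blocks), blocks[:5], blocks[-1])
# prints: 24 [32768, 98304, 163840, 229376, 294912] 3855122432
```

The 24 values agree with the mke2fs listing above. Backups in groups created by a later resize would only show up when the function is given the larger block count.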
>
> # e2fsck -fy /dev/md0
> e2fsck 1.42.13 (17-May-2015)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Free blocks count wrong (512088558484167, counted=3902749383).
> Fix? yes
>
>
> /dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
> /dev/md0: 11/61045248 files (0.0% non-contiguous), 4137785/3906887168 
> blocks
>
> # resize2fs -p /dev/md0 19534435840k
> resize2fs 1.42.13 (17-May-2015)
> Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks.
> Begin pass 2 (max = 6)
> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> Begin pass 3 (max = 119229)
> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> Begin pass 5 (max = 8)
> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> The filesystem on /dev/md0 is now 4883608960 (4k) blocks long.
>
> # resize2fs -p /dev/md0
> resize2fs 1.42.13 (17-May-2015)
> Resizing the filesystem on /dev/md0 to 5860330752 (4k) blocks.
> Begin pass 2 (max = 6)
> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> Begin pass 3 (max = 149036)
> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> Begin pass 5 (max = 14)
> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
> The filesystem on /dev/md0 is now 5860330752 (4k) blocks long.
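
The sizes in these logs are internally consistent, which can be checked with a few lines. The sketch assumes 4 KiB blocks, 32768 blocks per group and 512 inodes per group, all taken from the dumpe2fs output; the helper name is made up.

```python
def geometry(size_kib, block_kib=4, blocks_per_group=32768, inodes_per_group=512):
    """Return (blocks, groups, inodes) for a filesystem of size_kib KiB."""
    blocks = size_kib // block_kib
    groups = -(-blocks // blocks_per_group)    # ceiling division
    return blocks, groups, groups * inodes_per_group

print(geometry(15627548672))       # (3906887168, 119229, 61045248)
print(geometry(19534435840))       # (4883608960, 149036, 76306432)
print(geometry(5860330752 * 4))    # (5860330752, 178844, 91568128)
```

The group counts also line up with the "Begin pass 3 (max = ...)" lines: 119229 for the first resize and 149036 for the second, that is, the group count before each resize.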
>
> # e2fsck -fn /dev/md0
> e2fsck 1.42.13 (17-May-2015)
> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
> e2fsck: Group descriptors look bad... trying backup blocks...
> Superblock has an invalid journal (inode 8).
> Clear? no
>
> e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
>
> On 2015-09-12 12:27, Johan Harvyl wrote:
>> Hi,
>>
>> I have now evacuated the data on the filesystem and I *did* manage to 
>> recreate the
>> "Should never happen: resize inode corrupt!" using the versions of 
>> e2fsprogs I believe I was using at the time.
>>
>> The vast majority of the data that I was able to checksum was ok.
>>
>> For me I guess the way forward should be to recreate the fs with 
>> 1.42.13 and stick to online resize
>> from now on, correct?
>>
>> Are there any feature flags that I should not use when expanding file 
>> systems or any that I must use?
>>
>> -johan
>>
>>
>> Here is a step by step of what I did to reproduce
>>
>> I have built the following two versions of e2fsprogs (configure, 
>> make, make install, nothing else):
>> 421d693 (HEAD) libext2fs: fix potential buffer overflow in closefs()
>> 6a3741a (tag: v1.42.12) Update release notes, etc. for final 1.42.12 
>> release
>>
>> 9779e29 (HEAD, tag: v1.42.10) Update release notes, etc. for final 
>> 1.42.10 release
>>
>> ===
>>
>> First build the fs with 1.42.10 with the exact number of blocks I 
>> originally had.
>>
>> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
>> mke2fs 1.42.10 (18-May-2014)
>> /dev/md0 contains a ext4 file system
>>         created on Sat Sep 12 11:23:02 2015
>> Proceed anyway? (y,n) y
>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>> Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
>> Superblock backups stored on blocks:
>>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 
>> 1605632, 2654208,
>>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 
>> 78675968,
>>         102400000, 214990848, 512000000, 550731776, 644972544, 
>> 1934917632,
>>         2560000000, 3855122432
>>
>> Allocating group tables: done
>> Writing inode tables: done
>> Creating journal (32768 blocks): done
>> Writing superblocks and filesystem accounting information: done
>>
>> From dumpe2fs I observe:
>> 1) the fs features match what I had on my broken fs
>> 2) the number of free blocks is 512088558484167 which is clearly wrong.
>>
>> # e2fsck -fnv /dev/md0
>> e2fsck 1.42.13 (17-May-2015)
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 2: Checking directory structure
>> Pass 3: Checking directory connectivity
>> Pass 4: Checking reference counts
>> Pass 5: Checking group summary information
>> Free blocks count wrong (512088558484167, counted=3902749383).
>> Fix? no
>>
>> So the initial fs created by 1.42.10 appears to be bad.
>>
>> Filesystem volume name:   <none>
>> Last mounted on:          <not available>
>> Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:    1 (dynamic)
>> Filesystem features:      has_journal ext_attr resize_inode dir_index 
>> filetype extent 64bit flex_bg sparse_super large_file huge_file 
>> uninit_bg dir_nlink extra_isize
>> Filesystem flags:         signed_directory_hash
>> Default mount options:    user_xattr acl
>> Filesystem state:         clean
>> Errors behavior:          Continue
>> Filesystem OS type:       Linux
>> Inode count:              61045248
>> Block count:              3906887168
>> Reserved block count:     0
>> Free blocks:              512088558484167
>> Free inodes:              61045237
>> First block:              0
>> Block size:               4096
>> Fragment size:            4096
>> Group descriptor size:    64
>> Reserved GDT blocks:      185
>> Blocks per group:         32768
>> Fragments per group:      32768
>> Inodes per group:         512
>> Inode blocks per group:   32
>> Flex block group size:    16
>> Filesystem created:       Sat Sep 12 11:27:55 2015
>> Last mount time:          n/a
>> Last write time:          Sat Sep 12 11:27:55 2015
>> Mount count:              0
>> Maximum mount count:      -1
>> Last checked:             Sat Sep 12 11:27:55 2015
>> Check interval:           0 (<none>)
>> Lifetime writes:          158 MB
>> Reserved blocks uid:      0 (user root)
>> Reserved blocks gid:      0 (group root)
>> First inode:              11
>> Inode size:               256
>> Required extra isize:     28
>> Desired extra isize:      28
>> Journal inode:            8
>> Default directory hash:   half_md4
>> Directory Hash Seed: f252a723-7016-43d1-97f8-579062a215e1
>> Journal backup:           inode blocks
>> Journal features:         (none)
>> Journal size:             128M
>> Journal length:           32768
>> Journal sequence:         0x00000001
>> Journal start:            0
>>
>>
>>
>> The next step is resizing + 4 TB with 1.42.12.
>> # MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
>> resize2fs 1.42.12 (29-Aug-2014)
>> <and nothing more>
>> It did *not* print the "Resizing the filesystem on /dev/md0 to 
>> 4883608960 (4k) blocks." that it should have.
>>
>> I let it run for 90+ minutes sampling CPU and IO usage with iotop 
>> from time to time. It was using more or less 100% CPU and no visible io.
>>
>> So, I let e2fsck fix the free block count and re-did the resize:
>> # e2fsck -f /dev/md0
>> e2fsck 1.42.13 (17-May-2015)
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 2: Checking directory structure
>> Pass 3: Checking directory connectivity
>> Pass 4: Checking reference counts
>> Pass 5: Checking group summary information
>> Free blocks count wrong (512088558484167, counted=3902749383).
>> Fix<y>? yes
>>
>> /dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
>> /dev/md0: 11/61045248 files (0.0% non-contiguous), 4137785/3906887168 
>> blocks
>>
>> # MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
>> resize2fs 1.42.12 (29-Aug-2014)
>> Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks.
>> Begin pass 2 (max = 6)
>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Begin pass 3 (max = 119229)
>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Begin pass 5 (max = 8)
>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> The filesystem on /dev/md0 is now 4883608960 (4k) blocks long.
>>
>> dumpe2fs 1.42.13 (17-May-2015)
>> Filesystem volume name:   <none>
>> Last mounted on:          <not available>
>> Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:    1 (dynamic)
>> Filesystem features:      has_journal ext_attr resize_inode dir_index 
>> filetype extent 64bit flex_bg sparse_super large_file huge_file 
>> uninit_bg dir_nlink extra_isize
>> Filesystem flags:         signed_directory_hash
>> Default mount options:    user_xattr acl
>> Filesystem state:         clean
>> Errors behavior:          Continue
>> Filesystem OS type:       Linux
>> Inode count:              76306432
>> Block count:              4883608960
>> Reserved block count:     0
>> Free blocks:              4878450712
>> Free inodes:              76306421
>> First block:              0
>> Block size:               4096
>> Fragment size:            4096
>> Group descriptor size:    64
>> Blocks per group:         32768
>> Fragments per group:      32768
>> Inodes per group:         512
>> Inode blocks per group:   32
>> RAID stride:              32752
>> Flex block group size:    16
>> Filesystem created:       Sat Sep 12 11:41:10 2015
>> Last mount time:          n/a
>> Last write time:          Sat Sep 12 11:56:20 2015
>> Mount count:              0
>> Maximum mount count:      -1
>> Last checked:             Sat Sep 12 11:49:28 2015
>> Check interval:           0 (<none>)
>> Lifetime writes:          279 MB
>> Reserved blocks uid:      0 (user root)
>> Reserved blocks gid:      0 (group root)
>> First inode:              11
>> Inode size:               256
>> Required extra isize:     28
>> Desired extra isize:      28
>> Journal inode:            8
>> Default directory hash:   half_md4
>> Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
>> Journal backup:           inode blocks
>> Journal features:         (none)
>> Journal size:             128M
>> Journal length:           32768
>> Journal sequence:         0x00000001
>> Journal start:            0
>>
>> Looking good so far, and now for the final resize to 24 TB using 
>> 1.42.13:
>> # resize2fs -p /dev/md0
>> resize2fs 1.42.13 (17-May-2015)
>> Resizing the filesystem on /dev/md0 to 5860330752 (4k) blocks.
>> Begin pass 2 (max = 6)
>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Begin pass 3 (max = 149036)
>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Begin pass 5 (max = 14)
>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Should never happen: resize inode corrupt!
>>
>> # dumpe2fs -h /dev/md0
>> dumpe2fs 1.42.13 (17-May-2015)
>> Filesystem volume name:   <none>
>> Last mounted on:          <not available>
>> Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
>> Filesystem magic number:  0xEF53
>> Filesystem revision #:    1 (dynamic)
>> Filesystem features:      has_journal ext_attr resize_inode dir_index 
>> filetype extent 64bit flex_bg sparse_super large_file huge_file 
>> uninit_bg dir_nlink extra_isize
>> Filesystem flags:         signed_directory_hash
>> Default mount options:    user_xattr acl
>> Filesystem state:         clean with errors
>> Errors behavior:          Continue
>> Filesystem OS type:       Linux
>> Inode count:              91568128
>> Block count:              5860330752
>> Reserved block count:     0
>> Free blocks:              5853069550
>> Free inodes:              91568117
>> First block:              0
>> Block size:               4096
>> Fragment size:            4096
>> Group descriptor size:    64
>> Blocks per group:         32768
>> Fragments per group:      32768
>> Inodes per group:         512
>> Inode blocks per group:   32
>> RAID stride:              32752
>> Flex block group size:    16
>> Filesystem created:       Sat Sep 12 11:41:10 2015
>> Last mount time:          n/a
>> Last write time:          Sat Sep 12 12:03:55 2015
>> Mount count:              0
>> Maximum mount count:      -1
>> Last checked:             Sat Sep 12 11:49:28 2015
>> Check interval:           0 (<none>)
>> Lifetime writes:          279 MB
>> Reserved blocks uid:      0 (user root)
>> Reserved blocks gid:      0 (group root)
>> First inode:              11
>> Inode size:               256
>> Required extra isize:     28
>> Desired extra isize:      28
>> Journal inode:            8
>> Default directory hash:   half_md4
>> Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
>> Journal backup:           inode blocks
>> Journal superblock magic number invalid!
>>
>>
>> On 2015-09-04 00:16, Johan Harvyl wrote:
>>> Hello again,
>>>
>>> I finally got around to digging some more into this and made what I
>>> consider some good progress, as I am now able to mount the filesystem
>>> read-only, so I thought I would update this thread a bit.
>>>
>>> Short one sentence recap since it's been a while since the original 
>>> post: I am trying to recover a filesystem that was quite badly 
>>> damaged by an offline resize2fs of a fairly modern ext4fs from 20 TB 
>>> to 24 TB.
>>>
>>> I spent a lot of time trying to get something meaningful out of 
>>> e2fsck/debugfs and learned quite a bit in the process and I would 
>>> like to briefly share some observations.
>>>
>>> 1) The first hurdle running e2fsck -fnv is that the "Superblock has
>>> an invalid journal (inode 8)" is considered fatal and cannot be
>>> fixed, at least not in r/o mode, so e2fsck just stops; this check
>>> needed to go away.
>>>
>>> 2) e2fsck gets utterly confused by the "bad block inode" that 
>>> incorrectly gets identified as having something worth looking at and 
>>> spends days iterating through blocks (before I cancelled it). 
>>> Removing handling if ino == EXT2_BAD_INO in pass1 and pass1b made 
>>> things a bit better.
>>>
>>> 3) e2fsck using a backup superblock
>>> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>> This is bad, as it means using a superblock that has not been 
>>> updated with the +4TB. Consequently it gets the location of the 
>>> first block group wrong, or at the very least the first inode table 
>>> that houses the root inode.
>>> Forcing it to use the master superblock again makes things a bit 
>>> better.
>>>
>>> I have some logs from various e2fsck runs with various amounts of 
>>> hacks applied if they are of any interest to developers? I will also 
>>> likely have the filesystem in this state for a week or two more if 
>>> any other information I can extract is of interest to figure out 
>>> what made resize2fs screw things up.
>>>
>>>
>>>
>>> In the end, the only actual change I have made to the filesystem to 
>>> make it mountable is that I borrowed a root inode from a different 
>>> filesystem and updated the i_block pointer to point to the extent 
>>> tree corresponding to the root inode of my broken filesystem which 
>>> was quite easy to find by just looking for the string "lost+found".
>>>
>>> # mount -o ro,noload /dev/md0 /mnt/loop
>>> [2815465.034803] EXT4-fs (md0): mounted filesystem without journal. 
>>> Opts: noload
>>>
>>> # df -h /dev/md0
>>> Filesystem      Size  Used Avail Use% Mounted on
>>> /dev/md0         22T -382T  404T    - /mnt/loop
>>>
>>> Uh oh, that does not look too good. But hey, I am doing some checks on
>>> the data contents and so far the results are very promising. An "ls /"
>>> looks good and so does a lot of the data that I can verify checksums on;
>>> checks are still running...
>>>
>>> I really do not know how to move on with trying to repair the
>>> filesystem with e2fsck. I do not feel brave enough to let it run r/w,
>>> given how many hacks that I consider very dirty were required
>>> to even get it this far. At this point letting it make changes to
>>> the filesystem may actually make it worse, so I see no other way
>>> forward than extracting all the contents and recreating the
>>> filesystem from scratch.
>>>
>>> Question is though, what is the recommended way to create the 
>>> filesystem? 64bit is clearly necessary, but what about the other 
>>> feature flags like flex_bg/meta_bg/resize_inode...? I do not care 
>>> much about slight gains in performance, robustness is more 
>>> important, and that it can be resized in the future.
>>>
>>> Only online resize from now on, never offline, I learned that
>>> lesson...
>>>
>>> Will it be possible to expand from 24 TB to 28 TB online?
>>>
>>> thanks,
>>> -johan
>>>
>>>
>>> On 2015-08-13 20:12, Johan Harvyl wrote:
>>>> On 2015-08-13 15:27, Theodore Ts'o wrote:
>>>>> On Thu, Aug 13, 2015 at 12:00:50AM +0200, Johan Harvyl wrote:
>>>>>
>>>>>>> I'm not aware of any offline resize with 1.42.13, but it sounds like
>>>>>>> you were originally using mke2fs and resize2fs 1.42.10, which did have
>>>>>>> some bugs, and so the question is what sort of state it might have
>>>>>>> left things in.
>>>>>> What kind of bugs are we talking about, mke2fs? resize2fs? 
>>>>>> e2fsck? Any
>>>>>> specific commits of interest?
>>>>> I suspect it was caused by a bug in resize2fs 1.42.10. The problem is
>>>>> that off-line resize2fs is much more powerful; it can handle moving
>>>>> file system metadata blocks around, so it can grow file systems in
>>>>> cases which aren't supported by online resize --- and it can shrink
>>>>> file systems when online resize doesn't support any kind of file
>>>>> system shrink.  As such, the code is a lot more complicated, whereas
>>>>> the online resize code is much simpler, and ultimately, much more
>>>>> robust.
>>>> Understood, so would it have been possible to move from my 20 TB -> 
>>>> 24 TB fs with
>>>> online resize? I am confused by the threads I see on the net with 
>>>> regards to this.
>>>>>> Can you think of why it would zero out the first thousands of
>>>>>> inodes, like the root inode, lost+found and so on? I am thinking
>>>>>> that would help me assess the potential damage to the files. Could I
>>>>>> perhaps expect the same kind of zeroed out blocks at regular
>>>>>> intervals all over the device?
>>>>> I didn't realize that the first thousands of inodes had been zeroed;
>>>>> either you didn't mention this earlier or I had missed that from your
>>>>> e-mail.  I suspect the resize inode before the resize was pretty
>>>>> terribly corrupted, but in a way that e2fsck didn't complain.
>>>>
>>>> Hi,
>>>>
>>>> I may not have been clear that it was not just the first handful
>>>> of inodes.
>>>>
>>>> When I manually sampled some inodes with debugfs and a disk editor, 
>>>> the first group
>>>> I found valid inodes in was:
>>>>  Group 48: block bitmap at 1572864, inode bitmap at 1572880, inode 
>>>> table at 1572896
>>>>
>>>> With 512 inodes per group that would mean at least some 24k inodes 
>>>> are blanked out,
>>>> but I did not check them all, I just sampled groups manually so 
>>>> there could be some
>>>> valid in some of the groups below group 48 or a lot more invalid 
>>>> afterwards.
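
The figures quoted above can be checked with a little arithmetic. The sketch below, in Python, uses only numbers from the quoted message and the dumpe2fs output (512 inodes per group, 32768 blocks per group, first valid group 48); the variable names are illustrative.

```python
# Values taken from the dumpe2fs output and the group line quoted in the thread.
INODES_PER_GROUP = 512
BLOCKS_PER_GROUP = 32768
FIRST_VALID_GROUP = 48  # first group in which valid inodes were found

# Groups 0..47 precede group 48, so this many inodes sit in those groups.
inodes_before = FIRST_VALID_GROUP * INODES_PER_GROUP

# First block of group 48 (the filesystem's first block is 0).
group_start = FIRST_VALID_GROUP * BLOCKS_PER_GROUP

print(f"inodes in groups 0..{FIRST_VALID_GROUP - 1}: {inodes_before}")
print(f"first block of group {FIRST_VALID_GROUP}: {group_start}")
```

The first result is the "some 24k inodes" estimate (24576), and the second matches the block bitmap location 1572864 reported for group 48.
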
>>>>
>>>>> I'll have to try to reproduce the problem based on how you originally
>>>>> created and grew the file system and see if I can somehow reproduce
>>>>> the problem.  Obviously e2fsck and resize2fs should be changed to 
>>>>> make
>>>>> this operation much more robust.  If you can tell me the exact
>>>>> original size (just under 16TB is probably good enough, but if you
>>>>> know the exact starting size, that might be helpful), and then steps
>>>>> by which the file system was grown, and which version of e2fsprogs 
>>>>> was
>>>>> installed at the time, that would be quite helpful.
>>>>>
>>>>> Thanks,
>>>>>
>>>>>                         - Ted
>>>>
>>>> Cool, I will try to go through its history in some detail below.
>>>>
>>>> If you have ideas on what I could look for, like ideas on if there 
>>>> is a particular periodicity
>>>> to the corruption I can write some python to explore such theories.
>>>>
>>>>
>>>> The filesystem was originally created with e2fsprogs 1.42.10-1 and 
>>>> most likely linux-image
>>>> 3.14 from Debian.
>>>>
>>>> # mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit
>>>> mke2fs 1.42.10 (18-May-2014)
>>>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>>>> Filesystem UUID: 13c2eb37-e951-4ad1-b194-21f0880556db
>>>> Superblock backups stored on blocks:
>>>>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 
>>>> 1605632, 2654208,
>>>>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 
>>>> 78675968,
>>>>         102400000, 214990848, 512000000, 550731776, 644972544, 
>>>> 1934917632,
>>>>         2560000000, 3855122432
>>>>
>>>> Allocating group tables: done
>>>> Writing inode tables: done
>>>> Creating journal (32768 blocks): done
>>>> Writing superblocks and filesystem accounting information: done
>>>> #
>>>>
>>>> It was expanded with 4 TB (another 976721792 4k blocks). Best I can 
>>>> tell from my logs this
>>>> was done with either e2fsprogs:amd64 1.42.12-1 or 1.42.12-1.1 
>>>> (debian packages) and
>>>> Linux 3.16. Everything was running fine after this.
>>>> NOTE #1: It does *not* look like this filesystem was ever touched 
>>>> by resize2fs 1.42.10.
>>>> NOTE #2: The diff between debian packages 1.42.12-1 and 1.42.12-1.1
>>>> appears to be this:
>>>> 49d0fe2 libext2fs: fix potential buffer overflow in closefs()
>>>>
>>>> Then came the final 4 TB, for a total of 5860330752 4k blocks, which
>>>> was done with e2fsprogs:amd64 1.42.13-1 and Linux 4.0. This is where the:
>>>> "Should never happen: resize inode corrupt"
>>>> was seen.
>>>>
>>>> In both cases the same offline resize was done, with no exotic 
>>>> options:
>>>> # umount /dev/md0
>>>> # fsck.ext4 -f /dev/md0
>>>> # resize2fs /dev/md0
>>>>
>>>> thanks,
>>>> -johan
>>>
>>
>



* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-09-15 17:55               ` Johan Harvyl
@ 2015-09-17  1:21                 ` Andreas Dilger
  2015-09-18 18:26                   ` Johan Harvyl
  2015-09-19  2:47                   ` Dave Chinner
  0 siblings, 2 replies; 15+ messages in thread
From: Andreas Dilger @ 2015-09-17  1:21 UTC (permalink / raw)
  To: Johan Harvyl; +Cc: Theodore Ts'o, linux-ext4@vger.kernel.org

If you add "-b 1024" to the mke2fs command line to use 1KB instead of 4KB blocks, and reduce the sizes by a factor of 4, does the problem still happen? That would make it easier for someone else to test, since it would only need a 4-5TB disk instead of a 19TB array.

Cheers, Andreas
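
As a rough sketch of what the scaled-down reproduction would involve, the Python below divides the sizes from the quoted recipe by 4. The size values come from the commands and resize2fs output quoted below; the function and dictionary names are illustrative, and whether the -i ratio should be scaled as well is not stated in the thread.

```python
# Sizes in KiB, as passed to mkfs.ext4/resize2fs in the quoted recipe.
# The final size is the full device: 5860330752 blocks of 4 KiB.
ORIGINAL_KIB = {
    "mkfs": 15627548672,
    "first_resize": 19534435840,
    "final_resize": 5860330752 * 4,
}
SCALE = 4  # going from 4 KiB blocks to 1 KiB blocks


def scaled_sizes(sizes_kib, scale=SCALE):
    """Divide every size by `scale`, refusing any inexact division."""
    out = {}
    for name, kib in sizes_kib.items():
        if kib % scale:
            raise ValueError(f"{name}: {kib} KiB is not divisible by {scale}")
        out[name] = kib // scale
    return out


scaled = scaled_sizes(ORIGINAL_KIB)
for name, kib in scaled.items():
    # With -b 1024 the size in KiB is also the block count.
    crosses = "above" if kib > 2**32 else "below"
    print(f"{name}: {kib}k ({kib / 2**30:.2f} TiB, {crosses} 2**32 blocks)")
```

With 1 KiB blocks the scaled sizes give the same block counts as the original 4 KiB runs (3906887168, 4883608960 and 5860330752 blocks), so the first resize still crosses 2**32 blocks. Whether that boundary matters for the bug is not established here.
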

> On Sep 15, 2015, at 11:55, Johan Harvyl <johan@harvyl.se> wrote:
> 
> I have now been able to reproduce the issue that resize2fs corrupts at least the root, resize and journal
> inodes with versions 1.42.13 and the more recent commit 956b0f1 of e2fsprogs.
> 
> Note that older versions of e2fsprogs need *not* be involved, 1.42.13 and newer also have issues.
> 
> Please advise on things I can try to narrow down the root cause of what has to be an e2fsprogs bug. In
> particular it would be very useful to reproduce it faster; running through the mkfs and two resize steps
> takes around ten minutes, so iterative testing is slow and I do not really have much of a clue what steps
> would be more likely to overwrite the inodes.
> 
> At some point I would like to return this array to service but I am not really comfortable creating a
> new ext4 filesystem on it without first understanding how it can become corrupted without even
> mounting the file system.
> 
> For 1.42.13:
> # mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
> # resize2fs -p /dev/md0 19534435840k
> # resize2fs -p /dev/md0
> # e2fsck -fn /dev/md0
> e2fsck 1.42.13 (17-May-2015)
> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
> e2fsck: Group descriptors look bad... trying backup blocks...
> Superblock has an invalid journal (inode 8).
> Clear? no
> 
> e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
> 
> /dev/md0: ********** WARNING: Filesystem still has errors **********
> 
> 
> or for 956b0f1:
> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/resize2fs -p /dev/md0 19534435840k
> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/resize2fs -p /dev/md0
> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/e2fsck -fn /dev/md0
> e2fsck 1.43-WIP (18-May-2015)
> ext2fs_open2: Superblock checksum does not match superblock
> /root/elatest/out/sbin/e2fsck: Superblock invalid, trying backup blocks...
> Superblock has an invalid journal (inode 8).
> Clear? no
> 
> /root/elatest/out/sbin/e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
> 
> /dev/md0: ********** WARNING: Filesystem still has errors **********
> 
> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/debugfs -c /dev/md0
> debugfs 1.43-WIP (18-May-2015)
> /dev/md0: Superblock checksum does not match superblock while opening filesystem
> debugfs:  stat <2>
> stat: Filesystem not open
> 
> # debugfs -c /dev/md0
> debugfs 1.42.13 (17-May-2015)
> /dev/md0: catastrophic mode - not reading inode or group bitmaps
> debugfs:  stat <2>
> Inode: 2   Type: bad type    Mode:  0004   Flags: 0x1
> Generation: 1    Version: 0x00000001
> User:  9440   Group:     0   Size: 618659860
> File ACL: 1    Directory ACL: 0
> Links: 0   Blockcount: 724107776
> Fragment:  Address: 0    Number: 0    Size: 0
> ctime: 0x02008000 -- Sun Jan 24 18:46:40 1971
> atime: 0x24e000a0 -- Wed Aug  9 12:00:00 1989
> mtime: 0x00030000 -- Sat Jan  3 07:36:48 1970
> Size of extra inode fields: 6
> BLOCKS:
> (0):1, (6):618659845 .... and it goes on...
> 
>> On 2015-09-14 23:35, Johan Harvyl wrote:
>> In an attempt to further isolate which versions of e2fsprogs, at a commit level, are
>> needed to reproduce the bad behavior, I tried my own step-by-step, initially with a much
>> higher -i 16777216 to mkfs.ext4 in the hope that fewer inodes would make all the
>> operations run faster.
>> 
>> When I was unable to reproduce with -i 16777216, I switched back to exactly
>> what I reproduced with the first time, and I *still* did not get the "Should never happen:
>> resize inode corrupt!".
>> 
>> The only reasonable explanation I can come up with for this is that something is not being
>> initialized properly that resize2fs expects to be initialized. I have no indications of any
>> issues with any hardware or the underlying md block.
>> 
>> What I did however notice is that I can have the same kind of filesystem corruption
>> *without* seeing the "Should never happen: resize inode corrupt!" message using the
>> following sequence, and this *is* reproducible one time after another:
>> 
>> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
>> # e2fsck -fy /dev/md0 (using 1.42.13)
>> # resize2fs -p /dev/md0 19534435840k (using 1.42.13)
>> # resize2fs -p /dev/md0 (using 1.42.13)
>> # e2fsck -fn /dev/md0
>> e2fsck 1.42.13 (17-May-2015)
>> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
>> e2fsck: Group descriptors look bad... trying backup blocks...
>> Superblock has an invalid journal (inode 8).
>> Clear? no
>> 
>> e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
>> 
>> At this point the root inode is also bad and this fails:
>> # mount /dev/md0 /mnt/loop -o ro,noload
>> mount: mount /dev/md0 on /mnt/loop failed: Stale file handle
>> [3766493.732188] EXT4-fs (md0): get root inode failed
>> [3766493.732190] EXT4-fs (md0): mount failed
>> 
>> Note that only versions 1.42.10 and 1.42.13 are involved now, 1.42.12 is not needed.
>> 
>> Kernel is the debian:
>> ii  linux-image-4.0.0-2-amd64      4.0.8-2 amd64 Linux 4.0 for 64-bit PCs
>> 
>> For the record I also tried a more recent e2fsprogs for the resize (instead of 1.42.13),
>> locally built from:
>> 956b0f1 Merge branch 'maint' into next
>> and I could still reproduce it on the first attempt.
>> 
>> More verbose logs follow.
>> 
>> Does anyone else have some kind of testbed to test the same sequence of commands?
>> 
>> ===
>> 
>> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
>> mke2fs 1.42.10 (18-May-2014)
>> /dev/md0 contains a ext4 file system
>>        last mounted on Sun Sep 13 22:19:28 2015
>> Proceed anyway? (y,n) y
>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>> Filesystem UUID: e263356e-4fe4-4e9b-bd0c-8edc2c411735
>> Superblock backups stored on blocks:
>>        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>>        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>>        102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>>        2560000000, 3855122432
>> 
>> Allocating group tables: done
>> Writing inode tables: done
>> Creating journal (32768 blocks): done
>> Writing superblocks and filesystem accounting information: done
>> 
>> # e2fsck -fy /dev/md0
>> e2fsck 1.42.13 (17-May-2015)
>> Pass 1: Checking inodes, blocks, and sizes
>> Pass 2: Checking directory structure
>> Pass 3: Checking directory connectivity
>> Pass 4: Checking reference counts
>> Pass 5: Checking group summary information
>> Free blocks count wrong (512088558484167, counted=3902749383).
>> Fix? yes
>> 
>> 
>> /dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
>> /dev/md0: 11/61045248 files (0.0% non-contiguous), 4137785/3906887168 blocks
>> 
>> # resize2fs -p /dev/md0 19534435840k
>> resize2fs 1.42.13 (17-May-2015)
>> Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks.
>> Begin pass 2 (max = 6)
>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Begin pass 3 (max = 119229)
>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Begin pass 5 (max = 8)
>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> The filesystem on /dev/md0 is now 4883608960 (4k) blocks long.
>> 
>> # resize2fs -p /dev/md0
>> resize2fs 1.42.13 (17-May-2015)
>> Resizing the filesystem on /dev/md0 to 5860330752 (4k) blocks.
>> Begin pass 2 (max = 6)
>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Begin pass 3 (max = 149036)
>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> Begin pass 5 (max = 14)
>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>> The filesystem on /dev/md0 is now 5860330752 (4k) blocks long.
>> 
>> # e2fsck -fn /dev/md0
>> e2fsck 1.42.13 (17-May-2015)
>> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
>> e2fsck: Group descriptors look bad... trying backup blocks...
>> Superblock has an invalid journal (inode 8).
>> Clear? no
>> 
>> e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
>> 
>>> On 2015-09-12 12:27, Johan Harvyl wrote:
>>> Hi,
>>> 
>>> I have now evacuated the data on the filesystem and I *did* manage to recreate the
>>> "Should never happen: resize inode corrupt!" using the versions of e2fsprogs I believe I was using at the time.
>>> 
>>> The vast majority of the data that I was able to checksum was ok.
>>> 
>>> For me I guess the way forward should be to recreate the fs with 1.42.13 and stick to online resize
>>> from now on, correct?
>>> 
>>> Are there any feature flags that I should not use when expanding file systems or any that I must use?
>>> 
>>> -johan
>>> 
>>> 
>>> Here is a step by step of what I did to reproduce
>>> 
>>> I have built the following two versions of e2fsprogs (configure, make, make install, nothing else):
>>> 421d693 (HEAD) libext2fs: fix potential buffer overflow in closefs()
>>> 6a3741a (tag: v1.42.12) Update release notes, etc. for final 1.42.12 release
>>> 
>>> 9779e29 (HEAD, tag: v1.42.10) Update release notes, etc. for final 1.42.10 release
>>> 
>>> ===
>>> 
>>> First build the fs with 1.42.10 with the exact number of blocks I originally had.
>>> 
>>> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
>>> mke2fs 1.42.10 (18-May-2014)
>>> /dev/md0 contains a ext4 file system
>>>        created on Sat Sep 12 11:23:02 2015
>>> Proceed anyway? (y,n) y
>>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>>> Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
>>> Superblock backups stored on blocks:
>>>        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>>>        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>>>        102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>>>        2560000000, 3855122432
>>> 
>>> Allocating group tables: done
>>> Writing inode tables: done
>>> Creating journal (32768 blocks): done
>>> Writing superblocks and filesystem accounting information: done
>>> 
>>> From dumpe2fs I observe:
>>> 1) the fs features match what I had on my broken fs
>>> 2) the number of free blocks is 512088558484167 which is clearly wrong.
>>> 
>>> # e2fsck -fnv /dev/md0
>>> e2fsck 1.42.13 (17-May-2015)
>>> Pass 1: Checking inodes, blocks, and sizes
>>> Pass 2: Checking directory structure
>>> Pass 3: Checking directory connectivity
>>> Pass 4: Checking reference counts
>>> Pass 5: Checking group summary information
>>> Free blocks count wrong (512088558484167, counted=3902749383).
>>> Fix? no
>>> 
>>> So the initial fs created by 1.42.10 appears to be bad.
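
The bogus free block count has a regular shape, which a short Python sketch makes visible. It uses only numbers from the output above; the constant names are illustrative, and nothing in the thread confirms that this pattern identifies the underlying bug.

```python
# Numbers copied from the e2fsck and mke2fs output quoted in the thread.
REPORTED_FREE = 512088558484167  # "Free blocks count wrong" value
COUNTED_FREE = 3902749383        # value e2fsck counted itself
BLOCK_COUNT = 3906887168         # blocks in the freshly created filesystem
BLOCKS_PER_GROUP = 32768

# Split the reported 64-bit value into its high and low 32-bit halves.
high, low = divmod(REPORTED_FREE, 2**32)

# Number of block groups, rounding up for the partial last group.
groups = -(-BLOCK_COUNT // BLOCKS_PER_GROUP)

print(f"high 32 bits: {high}")
print(f"low 32 bits:  {low}")
print(f"block groups: {groups}")
```

The low 32 bits are exactly the count e2fsck arrives at (3902749383), and the high 32 bits equal the number of block groups (119229), the same figure resize2fs later prints as the pass 3 maximum.
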
>>> 
>>> Filesystem volume name:   <none>
>>> Last mounted on:          <not available>
>>> Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
>>> Filesystem magic number:  0xEF53
>>> Filesystem revision #:    1 (dynamic)
>>> Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
>>> Filesystem flags:         signed_directory_hash
>>> Default mount options:    user_xattr acl
>>> Filesystem state:         clean
>>> Errors behavior:          Continue
>>> Filesystem OS type:       Linux
>>> Inode count:              61045248
>>> Block count:              3906887168
>>> Reserved block count:     0
>>> Free blocks:              512088558484167
>>> Free inodes:              61045237
>>> First block:              0
>>> Block size:               4096
>>> Fragment size:            4096
>>> Group descriptor size:    64
>>> Reserved GDT blocks:      185
>>> Blocks per group:         32768
>>> Fragments per group:      32768
>>> Inodes per group:         512
>>> Inode blocks per group:   32
>>> Flex block group size:    16
>>> Filesystem created:       Sat Sep 12 11:27:55 2015
>>> Last mount time:          n/a
>>> Last write time:          Sat Sep 12 11:27:55 2015
>>> Mount count:              0
>>> Maximum mount count:      -1
>>> Last checked:             Sat Sep 12 11:27:55 2015
>>> Check interval:           0 (<none>)
>>> Lifetime writes:          158 MB
>>> Reserved blocks uid:      0 (user root)
>>> Reserved blocks gid:      0 (group root)
>>> First inode:              11
>>> Inode size:               256
>>> Required extra isize:     28
>>> Desired extra isize:      28
>>> Journal inode:            8
>>> Default directory hash:   half_md4
>>> Directory Hash Seed: f252a723-7016-43d1-97f8-579062a215e1
>>> Journal backup:           inode blocks
>>> Journal features:         (none)
>>> Journal size:             128M
>>> Journal length:           32768
>>> Journal sequence:         0x00000001
>>> Journal start:            0
>>> 
>>> 
>>> 
>>> The next step is resizing + 4 TB with 1.42.12.
>>> # MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
>>> resize2fs 1.42.12 (29-Aug-2014)
>>> <and nothing more>
>>> It did *not* print the "Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks." that it should have.
>>> 
>>> I let it run for 90+ minutes sampling CPU and IO usage with iotop from time to time. It was using more or less 100% CPU and no visible io.
>>> 
>>> So, I let e2fsck fix the free block count and re-did the resize:
>>> # e2fsck -f /dev/md0
>>> e2fsck 1.42.13 (17-May-2015)
>>> Pass 1: Checking inodes, blocks, and sizes
>>> Pass 2: Checking directory structure
>>> Pass 3: Checking directory connectivity
>>> Pass 4: Checking reference counts
>>> Pass 5: Checking group summary information
>>> Free blocks count wrong (512088558484167, counted=3902749383).
>>> Fix<y>? yes
>>> 
>>> /dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
>>> /dev/md0: 11/61045248 files (0.0% non-contiguous), 4137785/3906887168 blocks
>>> 
>>> # MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
>>> resize2fs 1.42.12 (29-Aug-2014)
>>> Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks.
>>> Begin pass 2 (max = 6)
>>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> Begin pass 3 (max = 119229)
>>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> Begin pass 5 (max = 8)
>>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> The filesystem on /dev/md0 is now 4883608960 (4k) blocks long.
>>> 
>>> dumpe2fs 1.42.13 (17-May-2015)
>>> Filesystem volume name:   <none>
>>> Last mounted on:          <not available>
>>> Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
>>> Filesystem magic number:  0xEF53
>>> Filesystem revision #:    1 (dynamic)
>>> Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
>>> Filesystem flags:         signed_directory_hash
>>> Default mount options:    user_xattr acl
>>> Filesystem state:         clean
>>> Errors behavior:          Continue
>>> Filesystem OS type:       Linux
>>> Inode count:              76306432
>>> Block count:              4883608960
>>> Reserved block count:     0
>>> Free blocks:              4878450712
>>> Free inodes:              76306421
>>> First block:              0
>>> Block size:               4096
>>> Fragment size:            4096
>>> Group descriptor size:    64
>>> Blocks per group:         32768
>>> Fragments per group:      32768
>>> Inodes per group:         512
>>> Inode blocks per group:   32
>>> RAID stride:              32752
>>> Flex block group size:    16
>>> Filesystem created:       Sat Sep 12 11:41:10 2015
>>> Last mount time:          n/a
>>> Last write time:          Sat Sep 12 11:56:20 2015
>>> Mount count:              0
>>> Maximum mount count:      -1
>>> Last checked:             Sat Sep 12 11:49:28 2015
>>> Check interval:           0 (<none>)
>>> Lifetime writes:          279 MB
>>> Reserved blocks uid:      0 (user root)
>>> Reserved blocks gid:      0 (group root)
>>> First inode:              11
>>> Inode size:               256
>>> Required extra isize:     28
>>> Desired extra isize:      28
>>> Journal inode:            8
>>> Default directory hash:   half_md4
>>> Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
>>> Journal backup:           inode blocks
>>> Journal features:         (none)
>>> Journal size:             128M
>>> Journal length:           32768
>>> Journal sequence:         0x00000001
>>> Journal start:            0
>>> 
>>> Looking good so far, and now for the final resize to 24 TB using 1.42.13:
>>> # resize2fs -p /dev/md0
>>> resize2fs 1.42.13 (17-May-2015)
>>> Resizing the filesystem on /dev/md0 to 5860330752 (4k) blocks.
>>> Begin pass 2 (max = 6)
>>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> Begin pass 3 (max = 149036)
>>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> Begin pass 5 (max = 14)
>>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> Should never happen: resize inode corrupt!
>>> 
>>> # dumpe2fs -h /dev/md0
>>> dumpe2fs 1.42.13 (17-May-2015)
>>> Filesystem volume name:   <none>
>>> Last mounted on:          <not available>
>>> Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
>>> Filesystem magic number:  0xEF53
>>> Filesystem revision #:    1 (dynamic)
>>> Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
>>> Filesystem flags:         signed_directory_hash
>>> Default mount options:    user_xattr acl
>>> Filesystem state:         clean with errors
>>> Errors behavior:          Continue
>>> Filesystem OS type:       Linux
>>> Inode count:              91568128
>>> Block count:              5860330752
>>> Reserved block count:     0
>>> Free blocks:              5853069550
>>> Free inodes:              91568117
>>> First block:              0
>>> Block size:               4096
>>> Fragment size:            4096
>>> Group descriptor size:    64
>>> Blocks per group:         32768
>>> Fragments per group:      32768
>>> Inodes per group:         512
>>> Inode blocks per group:   32
>>> RAID stride:              32752
>>> Flex block group size:    16
>>> Filesystem created:       Sat Sep 12 11:41:10 2015
>>> Last mount time:          n/a
>>> Last write time:          Sat Sep 12 12:03:55 2015
>>> Mount count:              0
>>> Maximum mount count:      -1
>>> Last checked:             Sat Sep 12 11:49:28 2015
>>> Check interval:           0 (<none>)
>>> Lifetime writes:          279 MB
>>> Reserved blocks uid:      0 (user root)
>>> Reserved blocks gid:      0 (group root)
>>> First inode:              11
>>> Inode size:               256
>>> Required extra isize:     28
>>> Desired extra isize:      28
>>> Journal inode:            8
>>> Default directory hash:   half_md4
>>> Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
>>> Journal backup:           inode blocks
>>> Journal superblock magic number invalid!
>>> 
>>> 
>>>> On 2015-09-04 00:16, Johan Harvyl wrote:
>>>> Hello again,
>>>> 
>>>> I finally got around to digging some more into this and made what I consider some good progress, as I am now able to mount the filesystem read-only, so I thought I would update this thread a bit.
>>>> 
>>>> Short one sentence recap since it's been a while since the original post: I am trying to recover a filesystem that was quite badly damaged by an offline resize2fs of a fairly modern ext4fs from 20 TB to 24 TB.
>>>> 
>>>> I spent a lot of time trying to get something meaningful out of e2fsck/debugfs and learned quite a bit in the process and I would like to briefly share some observations.
>>>> 
>>>> 1) The first hurdle running e2fsck -fnv is that the "Superblock has an invalid journal (inode 8)" is considered fatal and cannot be fixed, at least not in r/o mode, so e2fsck just stops; this check needed to go away.
>>>> 
>>>> 2) e2fsck gets utterly confused by the "bad block inode" that incorrectly gets identified as having something worth looking at and spends days iterating through blocks (before I cancelled it). Removing handling if ino == EXT2_BAD_INO in pass1 and pass1b made things a bit better.
>>>> 
>>>> 3) e2fsck using a backup superblock
>>>> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
>>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>>> This is bad, as it means using a superblock that has not been updated with the +4TB. Consequently it gets the location of the first block group wrong, or at the very least the first inode table that houses the root inode.
>>>> Forcing it to use the master superblock again makes things a bit better.
>>>> 
>>>> I have some logs from various e2fsck runs with various amounts of hacks applied if they are of any interest to developers? I will also likely have the filesystem in this state for a week or two more if any other information I can extract is of interest to figure out what made resize2fs screw things up.
>>>> 
>>>> 
>>>> 
>>>> In the end, the only actual change I have made to the filesystem to make it mountable is that I borrowed a root inode from a different filesystem and updated the i_block pointer to point to the extent tree corresponding to the root inode of my broken filesystem which was quite easy to find by just looking for the string "lost+found".
>>>> 
>>>> # mount -o ro,noload /dev/md0 /mnt/loop
>>>> [2815465.034803] EXT4-fs (md0): mounted filesystem without journal. Opts: noload
>>>> 
>>>> # df -h /dev/md0
>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>> /dev/md0         22T -382T  404T    - /mnt/loop
>>>> 
>>>> Uh oh, that does not look too good... But hey, I am doing some checks on the data contents and so far the results are very promising. An "ls /" looks good and so does a lot of the data that I can verify checksums on; checks are still running...
>>>> 
>>>> I really do not know how to move on with trying to repair the filesystem with e2fsck. I do not feel brave enough to let it run r/w on the filesystem, given how many hacks that I consider very dirty were required to even get it this far. At this point letting it make changes to the filesystem may actually make it worse, so I see no other way forward than extracting all the contents and recreating the filesystem from scratch.
>>>> 
>>>> Question is though, what is the recommended way to create the filesystem? 64bit is clearly necessary, but what about the other feature flags like flex_bg/meta_bg/resize_inode...? I do not care much about slight gains in performance, robustness is more important, and that it can be resized in the future.
>>>> 
>>>> Only online resize from now on, never offline, I learned that lesson...
>>>> 
>>>> Will it be possible to expand from 24 TB to 28 TB online?
>>>> 
>>>> thanks,
>>>> -johan
>>>> 
>>>> 
>>>>> On 2015-08-13 20:12, Johan Harvyl wrote:
>>>>>> On 2015-08-13 15:27, Theodore Ts'o wrote:
>>>>>> On Thu, Aug 13, 2015 at 12:00:50AM +0200, Johan Harvyl wrote:
>>>>>> 
>>>>>>>> I'm not aware of any offline resize with 1.42.13, but it sounds like
>>>>>>>> you were originally using mke2fs and resize2fs 1.42.10, which did have
>>>>>>>> some bugs, and so the question is what sort of state it might have
>>>>>>>> left things in.
>>>>>>> What kind of bugs are we talking about, mke2fs? resize2fs? e2fsck? Any
>>>>>>> specific commits of interest?
>>>>>> I suspect it was caused by a bug in resize2fs 1.42.10. The problem is
>>>>>> that off-line resize2fs is much more powerful; it can handle moving
>>>>>> file system metadata blocks around, so it can grow file systems in
>>>>>> cases which aren't supported by online resize --- and it can shrink
>>>>>> file systems when online resize doesn't support any kind of file
>>>>>> system shrink.  As such, the code is a lot more complicated, whereas
>>>>>> the online resize code is much simpler, and ultimately, much more
>>>>>> robust.
>>>>> Understood, so would it have been possible to move from my 20 TB -> 24 TB fs with
>>>>> online resize? I am confused by the threads I see on the net with regards to this.
>>>>>>> Can you think of why it would zero out the first thousands of
>>>>>>> inodes, like the root inode, lost+found and so on? I am thinking
>>>>>>> that would help me assess the potential damage to the files. Could I
>>>>>>> perhaps expect the same kind of zeroed out blocks at regular
>>>>>>> intervals all over the device?
>>>>>> I didn't realize that the first thousands of inodes had been zeroed;
>>>>>> either you didn't mention this earlier or I had missed that from your
>>>>>> e-mail.  I suspect the resize inode before the resize was pretty
>>>>>> terribly corrupted, but in a way that e2fsck didn't complain.
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> I may not have been clear that it was not just the first handful of inodes.
>>>>> 
>>>>> When I manually sampled some inodes with debugfs and a disk editor, the first group
>>>>> I found valid inodes in was:
>>>>> Group 48: block bitmap at 1572864, inode bitmap at 1572880, inode table at 1572896
>>>>> 
>>>>> With 512 inodes per group that would mean at least some 24k inodes are blanked out,
>>>>> but I did not check them all, I just sampled groups manually so there could be some
>>>>> valid in some of the groups below group 48 or a lot more invalid afterwards.
>>>>> 
>>>>>> I'll have to try to reproduce the problem based on how you originally
>>>>>> created and grew the file system and see if I can somehow reproduce
>>>>>> the problem.  Obviously e2fsck and resize2fs should be changed to make
>>>>>> this operation much more robust.  If you can tell me the exact
>>>>>> original size (just under 16TB is probably good enough, but if you
>>>>>> know the exact starting size, that might be helpful), and then steps
>>>>>> by which the file system was grown, and which version of e2fsprogs was
>>>>>> installed at the time, that would be quite helpful.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>>                        - Ted
>>>>> 
>>>>> Cool, I will try to go through its history in some detail below.
>>>>> 
>>>>> If you have ideas on what I could look for, like ideas on if there is a particular periodicity
>>>>> to the corruption I can write some python to explore such theories.
>>>>> 
>>>>> 
>>>>> The filesystem was originally created with e2fsprogs 1.42.10-1 and most likely linux-image
>>>>> 3.14 from Debian.
>>>>> 
>>>>> # mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit
>>>>> mke2fs 1.42.10 (18-May-2014)
>>>>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>>>>> Filesystem UUID: 13c2eb37-e951-4ad1-b194-21f0880556db
>>>>> Superblock backups stored on blocks:
>>>>>        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>>>>>        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>>>>>        102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>>>>>        2560000000, 3855122432
>>>>> 
>>>>> Allocating group tables: done
>>>>> Writing inode tables: done
>>>>> Creating journal (32768 blocks): done
>>>>> Writing superblocks and filesystem accounting information: done
>>>>> #
>>>>> 
>>>>> It was expanded by 4 TB (another 976721792 4k blocks). As best I can tell from my logs, this
>>>>> was done with either e2fsprogs:amd64 1.42.12-1 or 1.42.12-1.1 (debian packages) and
>>>>> Linux 3.16. Everything was running fine after this.
>>>>> NOTE #1: It does *not* look like this filesystem was ever touched by resize2fs 1.42.10.
>>>>> NOTE #2: The diff between debian packages 1.42.12-1 and 1.42.12-1.1 appears to be this:
>>>>> 49d0fe2 libext2fs: fix potential buffer overflow in closefs()
>>>>> 
>>>>> Then came the final 4 TB, for a total of 5860330752 4k blocks, which was done with
>>>>> e2fsprogs:amd64 1.42.13-1 and Linux 4.0. This is where the:
>>>>> "Should never happen: resize inode corrupt"
>>>>> was seen.
>>>>> 
>>>>> In both cases the same offline resize was done, with no exotic options:
>>>>> # umount /dev/md0
>>>>> # fsck.ext4 -f /dev/md0
>>>>> # resize2fs /dev/md0
>>>>> 
>>>>> thanks,
>>>>> -johan


* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-09-17  1:21                 ` Andreas Dilger
@ 2015-09-18 18:26                   ` Johan Harvyl
  2015-09-19  2:47                   ` Dave Chinner
  1 sibling, 0 replies; 15+ messages in thread
From: Johan Harvyl @ 2015-09-18 18:26 UTC (permalink / raw)
  To: Andreas Dilger, Theodore Ts'o; +Cc: linux-ext4@vger.kernel.org

Hi,

I should have thought of that, but unfortunately it will not allow me to 
do so.

# MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/mkfs.ext4 /dev/md0 -m 0 -b 1024 -O 64bit 3906887168k
mke2fs 1.43-WIP (18-May-2015)
Warning: specified blocksize 1024 is less than device physical sectorsize 4096
/dev/md0: Cannot create filesystem with requested number of inodes while setting up superblock
#

Instead, I stuck to the 1k blocks and divided by four again...
# MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/mkfs.ext4 /dev/md0 -m 0 -b 1024 -i 262144 -O 64bit 976721792k
mke2fs 1.43-WIP (18-May-2015)
Warning: specified blocksize 1024 is less than device physical sectorsize 4096
Creating filesystem with 976721792 1k blocks and 3815328 inodes
Filesystem UUID: 2626eb2a-0691-48b2-a64c-2f4802437166
Superblock backups stored on blocks:
         8193, 24577, 40961, 57345, 73729, 204801, 221185, 401409, 663553,
         1024001, 1990657, 2809857, 5120001, 5971969, 17915905, 19668993,
         25600001, 53747713, 128000001, 137682945, 161243137, 483729409,
         640000001, 963780609

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
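
As a side check on the output above: the backup superblock list is what the sparse_super layout predicts, i.e. group 1 plus every group whose number is a power of 3, 5 or 7. A minimal Python sketch, assuming 8192 blocks per group and a first data block of 1 for the 1k-block filesystem (the usual values, not shown in the output above); the function name is made up for illustration:

```python
def sparse_super_backups(total_blocks, blocks_per_group, first_data_block):
    """Start blocks of the groups that hold a backup superblock under
    sparse_super: group 1 and every power of 3, 5 and 7."""
    # Number of block groups, rounding the last partial group up.
    groups = -(-(total_blocks - first_data_block) // blocks_per_group)
    backup_groups = set()
    if groups > 1:
        backup_groups.add(1)
    for base in (3, 5, 7):
        g = base
        while g < groups:
            backup_groups.add(g)
            g *= base
    return [g * blocks_per_group + first_data_block
            for g in sorted(backup_groups)]


# 1k-block filesystem from the mke2fs run above (assumed geometry).
backups_1k = sparse_super_backups(976721792, 8192, 1)
# Original 4k-block filesystem: 32768 blocks per group, first block 0.
backups_4k = sparse_super_backups(3906887168, 32768, 0)

print(len(backups_1k), backups_1k[:5], backups_1k[-1])
print(len(backups_4k), backups_4k[:5], backups_4k[-1])
```

Both lists come out with the same group numbers, only the block offsets differ, so the 1k-block test filesystem has the same group layout as the original 4k-block one at a quarter of the blocks per group.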

...e2fsck is ok here...

Now for a proportional resize, i.e. + 25 %:
# MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/resize2fs -p /dev/md0 1220902240k
resize2fs 1.43-WIP (18-May-2015)
Resizing the filesystem on /dev/md0 to 1220902240 (1k) blocks.
Begin pass 2 (max = 14)
Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 119229)
Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 5 (max = 16)
Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on /dev/md0 is now 1220902240 (1k) blocks long.
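
For anyone repeating this on a smaller device, the sizes used here are the original 4k-block sizes divided by 16 (by four for the first attempt, then by four again), which keeps the 25 % growth step exact. A small Python sketch of that arithmetic, using only sizes quoted in this thread; the helper name is illustrative:

```python
from fractions import Fraction

ORIG_MKFS_KIB = 15627548672    # size given to mkfs.ext4 in the 4k-block runs
ORIG_RESIZE_KIB = 19534435840  # first resize2fs target in the 4k-block runs


def shrink(size_kib, factor):
    """Divide a size in KiB by factor, insisting on an exact result."""
    if size_kib % factor:
        raise ValueError(f"{size_kib} is not divisible by {factor}")
    return size_kib // factor


first_try_kib = shrink(ORIG_MKFS_KIB, 4)   # the size mke2fs rejected above
mkfs_kib = shrink(ORIG_MKFS_KIB, 16)       # "divided by four again"
resize_kib = shrink(ORIG_RESIZE_KIB, 16)   # the +25 % resize target
growth = Fraction(resize_kib, mkfs_kib)    # ratio between the two sizes

print(first_try_kib, mkfs_kib, resize_kib, growth)
```

With 1k blocks the block count equals the size in KiB, so the test filesystem has a quarter of the block count of the original at each step.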

Already after the first resize the fs seems much more corrupted, and in a 
different way than in my original report.
Below are a few of the errors; there are many, many pages of them.

This appears to be completely reproducible. I'll try to shrink things 
further. With 4k blocks instead of 1k, it does not reproduce.

-johan

# MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/e2fsck -fnv /dev/md0 2>&1 |less
e2fsck 1.43-WIP (18-May-2015)
Pass 1: Checking inodes, blocks, and sizes
Inode 13 passes checks, but checksum does not match inode.  Fix? no

Deleted inode 14 has zero dtime.  Fix? no
...
Inode 1024 passes checks, but checksum does not match inode.  Fix? no

Inode 1437 seems to contain garbage.  Clear? no

Inode 1437 is in use, but has dtime set.  Fix? no

Inode 1437 has a extra size (1656) which is invalid
Fix? no

Inode 1437 has INDEX_FL flag set but is not a directory.
Clear HTree index? no
...
Illegal block #11 (2674298790) in inode 1442.  IGNORED.
Illegal block number passed to ext2fs_test_block_bitmap #1906002301 for 
metadata block map
Too many illegal blocks in inode 1442.
Clear inode? no

Suppress messages? no

Illegal indirect block (1906002301) in inode 1442.  IGNORED.
Illegal block number passed to ext2fs_test_block_bitmap #3316469983 for 
metadata block map
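
One thing that can be read off this excerpt: every block number e2fsck complains about lies beyond the end of the resized filesystem, which would be consistent with the inodes holding unrelated data rather than pointers to misplaced blocks. A quick Python check, assuming a first data block of 1 (the usual value for 1k blocks, not shown above); the names are illustrative:

```python
FS_BLOCKS = 1220902240   # block count reported by resize2fs above
FIRST_DATA_BLOCK = 1     # assumption: usual value for a 1k-block filesystem


def in_range(block):
    """True if the block number could belong to this filesystem."""
    return FIRST_DATA_BLOCK <= block < FS_BLOCKS


# Block numbers taken from the e2fsck messages in the excerpt.
reported = [2674298790, 1906002301, 3316469983]

for block in reported:
    status = "in range" if in_range(block) else "out of range"
    print(block, status)
```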




On 2015-09-17 03:21, Andreas Dilger wrote:
> If you add "-b 1024" to the mke2fs command line to use 1KB instead of 4KB blocks, and reduce the sizes by a factor of 4, does the problem still happen? That would make it easier for someone else to test, since it would only need a 4-5TB disk instead of a 19TB array.
>
> Cheers, Andreas
>
>> On Sep 15, 2015, at 11:55, Johan Harvyl <johan@harvyl.se> wrote:
>>
>> I have now been able to reproduce the issue that resize2fs corrupts at least the root, resize and journal
>> inodes with versions 1.42.13 and the more recent commit 956b0f1 of e2fsprogs.
>>
>> Note that older versions of e2fsprogs need *not* be involved, 1.42.13 and newer also have issues.
>>
>> Please advise on things I can try to narrow down the root cause of what has to be an e2fsprogs bug. In
>> particular it would be very useful to reproduce it faster; running through the mkfs and two resize steps
>> takes around ten minutes, so iterative testing is slow, and I do not really have much of a clue what steps
>> would be more likely to overwrite the inodes.
>>
>> At some point I would like to return this array to service but I am not really comfortable creating a
>> new ext4 filesystem on it without first understanding how it can become corrupted without even
>> mounting the file system.
>>
>> For 1.42.13:
>> # mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
>> # resize2fs -p /dev/md0 19534435840k
>> # resize2fs -p /dev/md0
>> # e2fsck -fn /dev/md0
>> e2fsck 1.42.13 (17-May-2015)
>> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
>> e2fsck: Group descriptors look bad... trying backup blocks...
>> Superblock has an invalid journal (inode 8).
>> Clear? no
>>
>> e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
>>
>> /dev/md0: ********** WARNING: Filesystem still has errors **********
>>
>>
>> or for 956b0f1:
>> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
>> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/resize2fs -p /dev/md0 19534435840k
>> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/resize2fs -p /dev/md0
>> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/e2fsck -fn /dev/md0
>> e2fsck 1.43-WIP (18-May-2015)
>> ext2fs_open2: Superblock checksum does not match superblock
>> /root/elatest/out/sbin/e2fsck: Superblock invalid, trying backup blocks...
>> Superblock has an invalid journal (inode 8).
>> Clear? no
>>
>> /root/elatest/out/sbin/e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
>>
>> /dev/md0: ********** WARNING: Filesystem still has errors **********
>>
>> # MKE2FS_CONFIG=/root/elatest/out/etc/mke2fs.conf /root/elatest/out/sbin/debugfs -c /dev/md0
>> debugfs 1.43-WIP (18-May-2015)
>> /dev/md0: Superblock checksum does not match superblock while opening filesystem
>> debugfs:  stat <2>
>> stat: Filesystem not open
>>
>> # debugfs -c /dev/md0
>> debugfs 1.42.13 (17-May-2015)
>> /dev/md0: catastrophic mode - not reading inode or group bitmaps
>> debugfs:  stat <2>
>> Inode: 2   Type: bad type    Mode:  0004   Flags: 0x1
>> Generation: 1    Version: 0x00000001
>> User:  9440   Group:     0   Size: 618659860
>> File ACL: 1    Directory ACL: 0
>> Links: 0   Blockcount: 724107776
>> Fragment:  Address: 0    Number: 0    Size: 0
>> ctime: 0x02008000 -- Sun Jan 24 18:46:40 1971
>> atime: 0x24e000a0 -- Wed Aug  9 12:00:00 1989
>> mtime: 0x00030000 -- Sat Jan  3 07:36:48 1970
>> Size of extra inode fields: 6
>> BLOCKS:
>> (0):1, (6):618659845 .... and it goes on...
>>
>>> On 2015-09-14 23:35, Johan Harvyl wrote:
>>> In an attempt to further isolate which versions of e2fsprogs, at a commit level, are
>>> needed to reproduce the bad behavior, I tried my own step-by-step, initially with a much
>>> higher -i 16777216 to mkfs.ext4 in the hope that fewer inodes would make all the
>>> operations run faster.
>>>
>>> When I was unable to reproduce with -i 16777216 instead, I switched back to exactly
>>> what I reproduced with the first time, and I *still* did not get the "Should never happen:
>>> resize inode corrupt!".
>>>
>>> The only reasonable explanation I can come up with for this is that something is not being
>>> initialized properly that resize2fs expects to be initialized. I have no indications of any
>>> issues with any hardware or the underlying md block.
>>>
>>> What I did however notice is that I can have the same kind of filesystem corruption
>>> *without* seeing the "Should never happen: resize inode corrupt!" message using the
>>> following sequence, and this *is* reproducible one time after another:
>>>
>>> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
>>> # e2fsck -fy /dev/md0 (using 1.42.13)
>>> # resize2fs -p /dev/md0 19534435840k (using 1.42.13)
>>> # resize2fs -p /dev/md0 (using 1.42.13)
>>> # e2fsck -fn /dev/md0
>>> e2fsck 1.42.13 (17-May-2015)
>>> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>> Superblock has an invalid journal (inode 8).
>>> Clear? no
>>>
>>> e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
>>>
>>> At this point the root inode is also bad and this fails:
>>> # mount /dev/md0 /mnt/loop -o ro,noload
>>> mount: mount /dev/md0 on /mnt/loop failed: Stale file handle
>>> [3766493.732188] EXT4-fs (md0): get root inode failed
>>> [3766493.732190] EXT4-fs (md0): mount failed
>>>
>>> Note that only versions 1.42.10 and 1.42.13 are involved now, 1.42.12 is not needed.
>>>
>>> Kernel is the debian:
>>> ii  linux-image-4.0.0-2-amd64      4.0.8-2 amd64 Linux 4.0 for 64-bit PCs
>>>
>>> For the record I also tried a more recent e2fsprogs for the resize (instead of 1.42.13),
>>> locally built from:
>>> 956b0f1 Merge branch 'maint' into next
>>> and I could still reproduce it on the first attempt.
>>>
>>> More verbose logs follow.
>>>
>>> Does anyone else have some kind of testbed to test the same sequence of commands?
>>>
>>> ===
>>>
>>> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
>>> mke2fs 1.42.10 (18-May-2014)
>>> /dev/md0 contains a ext4 file system
>>>         last mounted on Sun Sep 13 22:19:28 2015
>>> Proceed anyway? (y,n) y
>>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>>> Filesystem UUID: e263356e-4fe4-4e9b-bd0c-8edc2c411735
>>> Superblock backups stored on blocks:
>>>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>>>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>>>         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>>>         2560000000, 3855122432
>>>
>>> Allocating group tables: done
>>> Writing inode tables: done
>>> Creating journal (32768 blocks): done
>>> Writing superblocks and filesystem accounting information: done
>>>
>>> # e2fsck -fy /dev/md0
>>> e2fsck 1.42.13 (17-May-2015)
>>> Pass 1: Checking inodes, blocks, and sizes
>>> Pass 2: Checking directory structure
>>> Pass 3: Checking directory connectivity
>>> Pass 4: Checking reference counts
>>> Pass 5: Checking group summary information
>>> Free blocks count wrong (512088558484167, counted=3902749383).
>>> Fix? yes
>>>
>>>
>>> /dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
>>> /dev/md0: 11/61045248 files (0.0% non-contiguous), 4137785/3906887168 blocks
>>>
>>> # resize2fs -p /dev/md0 19534435840k
>>> resize2fs 1.42.13 (17-May-2015)
>>> Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks.
>>> Begin pass 2 (max = 6)
>>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> Begin pass 3 (max = 119229)
>>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> Begin pass 5 (max = 8)
>>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> The filesystem on /dev/md0 is now 4883608960 (4k) blocks long.
>>>
>>> # resize2fs -p /dev/md0
>>> resize2fs 1.42.13 (17-May-2015)
>>> Resizing the filesystem on /dev/md0 to 5860330752 (4k) blocks.
>>> Begin pass 2 (max = 6)
>>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> Begin pass 3 (max = 149036)
>>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> Begin pass 5 (max = 14)
>>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>> The filesystem on /dev/md0 is now 5860330752 (4k) blocks long.
>>>
>>> # e2fsck -fn /dev/md0
>>> e2fsck 1.42.13 (17-May-2015)
>>> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>> Superblock has an invalid journal (inode 8).
>>> Clear? no
>>>
>>> e2fsck: Illegal inode number while checking ext3 journal for /dev/md0
>>>
>>>> On 2015-09-12 12:27, Johan Harvyl wrote:
>>>> Hi,
>>>>
>>>> I have now evacuated the data on the filesystem and I *did* manage to recreate the
>>>> "Should never happen: resize inode corrupt!" using the versions of e2fsprogs I believe I was using at the time.
>>>>
>>>> The vast majority of the data that I was able to checksum was ok.
>>>>
>>>> For me I guess the way forward should be to recreate the fs with 1.42.13 and stick to online resize
>>>> from now on, correct?
>>>>
>>>> Are there any feature flags that I should not use when expanding file systems or any that I must use?
>>>>
>>>> -johan
>>>>
>>>>
>>>> Here is a step by step of what I did to reproduce
>>>>
>>>> I have built the following two versions of e2fsprogs (configure, make, make install, nothing else):
>>>> 421d693 (HEAD) libext2fs: fix potential buffer overflow in closefs()
>>>> 6a3741a (tag: v1.42.12) Update release notes, etc. for final 1.42.12 release
>>>>
>>>> 9779e29 (HEAD, tag: v1.42.10) Update release notes, etc. for final 1.42.10 release
>>>>
>>>> ===
>>>>
>>>> First build the fs with 1.42.10 with the exact number of blocks I originally had.
>>>>
>>>> # MKE2FS_CONFIG=/root/e10/out/etc/mke2fs.conf /root/e10/out/sbin/mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit 15627548672k
>>>> mke2fs 1.42.10 (18-May-2014)
>>>> /dev/md0 contains a ext4 file system
>>>>         created on Sat Sep 12 11:23:02 2015
>>>> Proceed anyway? (y,n) y
>>>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>>>> Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
>>>> Superblock backups stored on blocks:
>>>>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>>>>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>>>>         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>>>>         2560000000, 3855122432
>>>>
>>>> Allocating group tables: done
>>>> Writing inode tables: done
>>>> Creating journal (32768 blocks): done
>>>> Writing superblocks and filesystem accounting information: done
>>>>
>>>>  From dumpe2fs I observe:
>>>> 1) the fs features match what I had on my broken fs
>>>> 2) the number of free blocks is 512088558484167 which is clearly wrong.
>>>>
>>>> # e2fsck -fnv /dev/md0
>>>> e2fsck 1.42.13 (17-May-2015)
>>>> Pass 1: Checking inodes, blocks, and sizes
>>>> Pass 2: Checking directory structure
>>>> Pass 3: Checking directory connectivity
>>>> Pass 4: Checking reference counts
>>>> Pass 5: Checking group summary information
>>>> Free blocks count wrong (512088558484167, counted=3902749383).
>>>> Fix? no
>>>>
>>>> So the initial fs created by 1.42.10 appears to be bad.
>>>>
>>>> Filesystem volume name:   <none>
>>>> Last mounted on:          <not available>
>>>> Filesystem UUID: d00e9e59-3756-4e59-9539-bc00fe2446b5
>>>> Filesystem magic number:  0xEF53
>>>> Filesystem revision #:    1 (dynamic)
>>>> Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
>>>> Filesystem flags:         signed_directory_hash
>>>> Default mount options:    user_xattr acl
>>>> Filesystem state:         clean
>>>> Errors behavior:          Continue
>>>> Filesystem OS type:       Linux
>>>> Inode count:              61045248
>>>> Block count:              3906887168
>>>> Reserved block count:     0
>>>> Free blocks:              512088558484167
>>>> Free inodes:              61045237
>>>> First block:              0
>>>> Block size:               4096
>>>> Fragment size:            4096
>>>> Group descriptor size:    64
>>>> Reserved GDT blocks:      185
>>>> Blocks per group:         32768
>>>> Fragments per group:      32768
>>>> Inodes per group:         512
>>>> Inode blocks per group:   32
>>>> Flex block group size:    16
>>>> Filesystem created:       Sat Sep 12 11:27:55 2015
>>>> Last mount time:          n/a
>>>> Last write time:          Sat Sep 12 11:27:55 2015
>>>> Mount count:              0
>>>> Maximum mount count:      -1
>>>> Last checked:             Sat Sep 12 11:27:55 2015
>>>> Check interval:           0 (<none>)
>>>> Lifetime writes:          158 MB
>>>> Reserved blocks uid:      0 (user root)
>>>> Reserved blocks gid:      0 (group root)
>>>> First inode:              11
>>>> Inode size:               256
>>>> Required extra isize:     28
>>>> Desired extra isize:      28
>>>> Journal inode:            8
>>>> Default directory hash:   half_md4
>>>> Directory Hash Seed: f252a723-7016-43d1-97f8-579062a215e1
>>>> Journal backup:           inode blocks
>>>> Journal features:         (none)
>>>> Journal size:             128M
>>>> Journal length:           32768
>>>> Journal sequence:         0x00000001
>>>> Journal start:            0
>>>>
>>>>
>>>>
>>>> The next step is resizing + 4 TB with 1.42.12.
>>>> # MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
>>>> resize2fs 1.42.12 (29-Aug-2014)
>>>> <and nothing more>
>>>> It did *not* print the "Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks." that it should have.
>>>>
>>>> I let it run for 90+ minutes sampling CPU and IO usage with iotop from time to time. It was using more or less 100% CPU and no visible io.
>>>>
>>>> So, I let e2fsck fix the free block count and re-did the resize:
>>>> # e2fsck -f /dev/md0
>>>> e2fsck 1.42.13 (17-May-2015)
>>>> Pass 1: Checking inodes, blocks, and sizes
>>>> Pass 2: Checking directory structure
>>>> Pass 3: Checking directory connectivity
>>>> Pass 4: Checking reference counts
>>>> Pass 5: Checking group summary information
>>>> Free blocks count wrong (512088558484167, counted=3902749383).
>>>> Fix<y>? yes
>>>>
>>>> /dev/md0: ***** FILE SYSTEM WAS MODIFIED *****
>>>> /dev/md0: 11/61045248 files (0.0% non-contiguous), 4137785/3906887168 blocks
>>>>
>>>> # MKE2FS_CONFIG=/root/e12/out/etc/mke2fs.conf /root/e12/out/sbin/resize2fs -p /dev/md0 19534435840k
>>>> resize2fs 1.42.12 (29-Aug-2014)
>>>> Resizing the filesystem on /dev/md0 to 4883608960 (4k) blocks.
>>>> Begin pass 2 (max = 6)
>>>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>>> Begin pass 3 (max = 119229)
>>>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>>> Begin pass 5 (max = 8)
>>>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>>> The filesystem on /dev/md0 is now 4883608960 (4k) blocks long.
>>>>
>>>> dumpe2fs 1.42.13 (17-May-2015)
>>>> Filesystem volume name:   <none>
>>>> Last mounted on:          <not available>
>>>> Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
>>>> Filesystem magic number:  0xEF53
>>>> Filesystem revision #:    1 (dynamic)
>>>> Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
>>>> Filesystem flags:         signed_directory_hash
>>>> Default mount options:    user_xattr acl
>>>> Filesystem state:         clean
>>>> Errors behavior:          Continue
>>>> Filesystem OS type:       Linux
>>>> Inode count:              76306432
>>>> Block count:              4883608960
>>>> Reserved block count:     0
>>>> Free blocks:              4878450712
>>>> Free inodes:              76306421
>>>> First block:              0
>>>> Block size:               4096
>>>> Fragment size:            4096
>>>> Group descriptor size:    64
>>>> Blocks per group:         32768
>>>> Fragments per group:      32768
>>>> Inodes per group:         512
>>>> Inode blocks per group:   32
>>>> RAID stride:              32752
>>>> Flex block group size:    16
>>>> Filesystem created:       Sat Sep 12 11:41:10 2015
>>>> Last mount time:          n/a
>>>> Last write time:          Sat Sep 12 11:56:20 2015
>>>> Mount count:              0
>>>> Maximum mount count:      -1
>>>> Last checked:             Sat Sep 12 11:49:28 2015
>>>> Check interval:           0 (<none>)
>>>> Lifetime writes:          279 MB
>>>> Reserved blocks uid:      0 (user root)
>>>> Reserved blocks gid:      0 (group root)
>>>> First inode:              11
>>>> Inode size:               256
>>>> Required extra isize:     28
>>>> Desired extra isize:      28
>>>> Journal inode:            8
>>>> Default directory hash:   half_md4
>>>> Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
>>>> Journal backup:           inode blocks
>>>> Journal features:         (none)
>>>> Journal size:             128M
>>>> Journal length:           32768
>>>> Journal sequence:         0x00000001
>>>> Journal start:            0
>>>>
>>>> Looking good so far, and now for the final resize to 24 TB using 1.42.13:
>>>> # resize2fs -p /dev/md0
>>>> resize2fs 1.42.13 (17-May-2015)
>>>> Resizing the filesystem on /dev/md0 to 5860330752 (4k) blocks.
>>>> Begin pass 2 (max = 6)
>>>> Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>>> Begin pass 3 (max = 149036)
>>>> Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>>> Begin pass 5 (max = 14)
>>>> Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
>>>> Should never happen: resize inode corrupt!
>>>>
>>>> # dumpe2fs -h /dev/md0
>>>> dumpe2fs 1.42.13 (17-May-2015)
>>>> Filesystem volume name:   <none>
>>>> Last mounted on:          <not available>
>>>> Filesystem UUID: 159d3929-1842-4f8d-907f-7509c16f06df
>>>> Filesystem magic number:  0xEF53
>>>> Filesystem revision #:    1 (dynamic)
>>>> Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
>>>> Filesystem flags:         signed_directory_hash
>>>> Default mount options:    user_xattr acl
>>>> Filesystem state:         clean with errors
>>>> Errors behavior:          Continue
>>>> Filesystem OS type:       Linux
>>>> Inode count:              91568128
>>>> Block count:              5860330752
>>>> Reserved block count:     0
>>>> Free blocks:              5853069550
>>>> Free inodes:              91568117
>>>> First block:              0
>>>> Block size:               4096
>>>> Fragment size:            4096
>>>> Group descriptor size:    64
>>>> Blocks per group:         32768
>>>> Fragments per group:      32768
>>>> Inodes per group:         512
>>>> Inode blocks per group:   32
>>>> RAID stride:              32752
>>>> Flex block group size:    16
>>>> Filesystem created:       Sat Sep 12 11:41:10 2015
>>>> Last mount time:          n/a
>>>> Last write time:          Sat Sep 12 12:03:55 2015
>>>> Mount count:              0
>>>> Maximum mount count:      -1
>>>> Last checked:             Sat Sep 12 11:49:28 2015
>>>> Check interval:           0 (<none>)
>>>> Lifetime writes:          279 MB
>>>> Reserved blocks uid:      0 (user root)
>>>> Reserved blocks gid:      0 (group root)
>>>> First inode:              11
>>>> Inode size:               256
>>>> Required extra isize:     28
>>>> Desired extra isize:      28
>>>> Journal inode:            8
>>>> Default directory hash:   half_md4
>>>> Directory Hash Seed: feeea566-bb38-44c6-a4d5-f97aa78001d4
>>>> Journal backup:           inode blocks
>>>> Journal superblock magic number invalid!
>>>>
>>>>
>>>>> On 2015-09-04 00:16, Johan Harvyl wrote:
>>>>> Hello again,
>>>>>
>>>>> I finally got around to digging some more into this and made what I consider good progress: I am now able to mount the filesystem read-only, so I thought I would update this thread a bit.
>>>>>
>>>>> Short one sentence recap since it's been a while since the original post: I am trying to recover a filesystem that was quite badly damaged by an offline resize2fs of a fairly modern ext4fs from 20 TB to 24 TB.
>>>>>
>>>>> I spent a lot of time trying to get something meaningful out of e2fsck/debugfs and learned quite a bit in the process and I would like to briefly share some observations.
>>>>>
>>>>> 1) The first hurdle running e2fsck -fnv is that "Superblock has an invalid journal (inode 8)" is considered fatal and cannot be fixed, at least not in r/o mode, so e2fsck just stops. This check needed to go away.
>>>>>
>>>>> 2) e2fsck gets utterly confused by the "bad block inode", which is incorrectly identified as having something worth looking at, and spends days iterating through blocks (before I cancelled it). Removing the handling for ino == EXT2_BAD_INO in pass1 and pass1b made things a bit better.
>>>>>
>>>>> 3) e2fsck using a backup superblock
>>>>> ext2fs_check_desc: Corrupt group descriptor: bad block for inode table
>>>>> e2fsck: Group descriptors look bad... trying backup blocks...
>>>>> This is bad, as it means using a superblock that has not been updated with the +4TB. Consequently it gets the location of the first block group wrong, or at the very least the first inode table that houses the root inode.
>>>>> Forcing it to use the master superblock again makes things a bit better.
>>>>>
>>>>> I have some logs from various e2fsck runs with various amounts of hacks applied if they are of any interest to developers? I will also likely have the filesystem in this state for a week or two more if any other information I can extract is of interest to figure out what made resize2fs screw things up.
>>>>>
>>>>>
>>>>>
>>>>> In the end, the only actual change I have made to the filesystem to make it mountable is that I borrowed a root inode from a different filesystem and updated the i_block pointer to point to the extent tree corresponding to the root inode of my broken filesystem which was quite easy to find by just looking for the string "lost+found".
>>>>>
>>>>> # mount -o ro,noload /dev/md0 /mnt/loop
>>>>> [2815465.034803] EXT4-fs (md0): mounted filesystem without journal. Opts: noload
>>>>>
>>>>> # df -h /dev/md0
>>>>> Filesystem      Size  Used Avail Use% Mounted on
>>>>> /dev/md0         22T -382T  404T    - /mnt/loop
>>>>>
>>>>> Uh oh, that does not look too good... But hey, I am doing some checks on the data contents and so far the results are very promising. An "ls /" looks good and so does a lot of the data that I can verify checksums on; checks are still running...
>>>>>
>>>>> I really do not know how to move on with trying to repair the filesystem with e2fsck. I do not feel brave enough to let it run r/w on the filesystem, given how many hacks that I consider very dirty were required to even get it this far. At this point letting it make changes to the filesystem may actually make it worse, so I see no other way forward than extracting all the contents and recreating the filesystem from scratch.
>>>>>
>>>>> Question is though, what is the recommended way to create the filesystem? 64bit is clearly necessary, but what about the other feature flags like flex_bg/meta_bg/resize_inode...? I do not care much about slight gains in performance, robustness is more important, and that it can be resized in the future.
>>>>>
>>>>> Only online resize from now on, never offline, I learned that lesson...
>>>>>
>>>>> Will it be possible to expand from 24 TB to 28 TB online?
>>>>>
>>>>> thanks,
>>>>> -johan
>>>>>
>>>>>
>>>>>> On 2015-08-13 20:12, Johan Harvyl wrote:
>>>>>>> On 2015-08-13 15:27, Theodore Ts'o wrote:
>>>>>>> On Thu, Aug 13, 2015 at 12:00:50AM +0200, Johan Harvyl wrote:
>>>>>>>
>>>>>>>>> I'm not aware of any offline resize with 1.42.13, but it sounds like
>>>>>>>>> you were originally using mke2fs and resize2fs 1.42.10, which did have
>>>>>>>>> some bugs, and so the question is what sort of state it might have
>>>>>>>>> left things in.
>>>>>>>> What kind of bugs are we talking about, mke2fs? resize2fs? e2fsck? Any
>>>>>>>> specific commits of interest?
>>>>>>> I suspect it was caused by a bug in resize2fs 1.42.10. The problem is
>>>>>>> that off-line resize2fs is much more powerful; it can handle moving
>>>>>>> file system metadata blocks around, so it can grow file systems in
>>>>>>> cases which aren't supported by online resize --- and it can shrink
>>>>>>> file systems when online resize doesn't support any kind of file
>>>>>>> system shrink.  As such, the code is a lot more complicated, whereas
>>>>>>> the online resize code is much simpler, and ultimately, much more
>>>>>>> robust.
>>>>>> Understood, so would it have been possible to move from my 20 TB -> 24 TB fs with
>>>>>> online resize? I am confused by the threads I see on the net with regards to this.
>>>>>>>> Can you think of why it would zero out the first thousands of
>>>>>>>> inodes, like the root inode, lost+found and so on? I am thinking
>>>>>>>> that would help me assess the potential damage to the files. Could I
>>>>>>>> perhaps expect the same kind of zeroed out blocks at regular
>>>>>>>> intervals all over the device?
>>>>>>> I didn't realize that the first thousands of inodes had been zeroed;
>>>>>>> either you didn't mention this earlier or I had missed that from your
>>>>>>> e-mail.  I suspect the resize inode before the resize was pretty
>>>>>>> terribly corrupted, but in a way that e2fsck didn't complain.
>>>>>> Hi,
>>>>>>
>>>>>> I may not have been clear that it was not just the first handful of inodes.
>>>>>>
>>>>>> When I manually sampled some inodes with debugfs and a disk editor, the first group
>>>>>> I found valid inodes in was:
>>>>>> Group 48: block bitmap at 1572864, inode bitmap at 1572880, inode table at 1572896
>>>>>>
>>>>>> With 512 inodes per group that would mean at least some 24k inodes are blanked out,
>>>>>> but I did not check them all, I just sampled groups manually so there could be some
>>>>>> valid in some of the groups below group 48 or a lot more invalid afterwards.
>>>>>>
>>>>>>> I'll have to try to reproduce the problem based on how you originally
>>>>>>> created and grew the file system and see if I can somehow reproduce
>>>>>>> the problem.  Obviously e2fsck and resize2fs should be changed to make
>>>>>>> this operation much more robust.  If you can tell me the exact
>>>>>>> original size (just under 16TB is probably good enough, but if you
>>>>>>> know the exact starting size, that might be helpful), and then steps
>>>>>>> by which the file system was grown, and which version of e2fsprogs was
>>>>>>> installed at the time, that would be quite helpful.
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>>                         - Ted
>>>>>> Cool, I will try to go through its history in some detail below.
>>>>>>
>>>>>> If you have ideas on what I could look for, like ideas on if there is a particular periodicity
>>>>>> to the corruption I can write some python to explore such theories.
>>>>>>
>>>>>>
>>>>>> The filesystem was originally created with e2fsprogs 1.42.10-1 and most likely linux-image
>>>>>> 3.14 from Debian.
>>>>>>
>>>>>> # mkfs.ext4 /dev/md0 -i 262144 -m 0 -O 64bit
>>>>>> mke2fs 1.42.10 (18-May-2014)
>>>>>> Creating filesystem with 3906887168 4k blocks and 61045248 inodes
>>>>>> Filesystem UUID: 13c2eb37-e951-4ad1-b194-21f0880556db
>>>>>> Superblock backups stored on blocks:
>>>>>>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>>>>>>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>>>>>>         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
>>>>>>         2560000000, 3855122432
>>>>>>
>>>>>> Allocating group tables: done
>>>>>> Writing inode tables: done
>>>>>> Creating journal (32768 blocks): done
>>>>>> Writing superblocks and filesystem accounting information: done
>>>>>> #
>>>>>>
>>>>>> It was expanded by 4 TB (another 976721792 4k blocks). Best I can tell from my logs this
>>>>>> was done with either e2fsprogs:amd64 1.42.12-1 or 1.42.12-1.1 (debian packages) and
>>>>>> Linux 3.16. Everything was running fine after this.
>>>>>> NOTE #1: It does *not* look like this filesystem was ever touched by resize2fs 1.42.10.
>>>>>> NOTE #2: The diff between debian packages 1.42.12-1 and 1.42.12-1.1 appears to be this:
>>>>>> 49d0fe2 libext2fs: fix potential buffer overflow in closefs()
>>>>>>
>>>>>> Then came the final 4 TB, for a total of 5860330752 4k blocks, which was done with
>>>>>> e2fsprogs:amd64 1.42.13-1 and Linux 4.0. This is where the:
>>>>>> "Should never happen: resize inode corrupt"
>>>>>> was seen.
>>>>>>
>>>>>> In both cases the same offline resize was done, with no exotic options:
>>>>>> # umount /dev/md0
>>>>>> # fsck.ext4 -f /dev/md0
>>>>>> # resize2fs /dev/md0
>>>>>>
>>>>>> thanks,
>>>>>> -johan
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
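
The figures scattered through the quoted history above can be cross-checked with a little arithmetic. The Python sketch below is only an illustration: the helper names (`group_count`, `inodes_before_group`, `group_first_block`) are made up for this note, and it assumes the geometry reported by dumpe2fs above (32768 blocks per group, 512 inodes per group, first block 0).

```python
# Geometry as reported by dumpe2fs -h in the quoted message (assumed constant).
BLOCKS_PER_GROUP = 32768
INODES_PER_GROUP = 512

def group_count(total_blocks):
    """Number of block groups needed to cover total_blocks (rounded up)."""
    return -(-total_blocks // BLOCKS_PER_GROUP)

def inodes_before_group(group):
    """How many inodes live in the groups preceding the given group."""
    return group * INODES_PER_GROUP

def group_first_block(group):
    """First block of a group, assuming the filesystem starts at block 0."""
    return group * BLOCKS_PER_GROUP

if __name__ == "__main__":
    initial = 3906887168   # blocks at mkfs time
    step = 976721792       # blocks added by each 4 TB expansion
    final = initial + 2 * step
    print("final block count:", final)
    print("groups:", group_count(final))
    print("total inodes:", group_count(final) * INODES_PER_GROUP)
    print("inodes below group 48:", inodes_before_group(48))
    print("first block of group 48:", group_first_block(48))
```

Under these assumptions the final block count and inode count agree with the dumpe2fs header (5860330752 and 91568128), group 48 starts at block 1572864 where its block bitmap was reported, and the groups below it hold 24576 inodes, which is the "some 24k inodes" mentioned above.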



* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-09-17  1:21                 ` Andreas Dilger
  2015-09-18 18:26                   ` Johan Harvyl
@ 2015-09-19  2:47                   ` Dave Chinner
  2015-09-19  5:23                     ` Darrick J. Wong
  2015-09-19 14:11                     ` Johan Harvyl
  1 sibling, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2015-09-19  2:47 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Johan Harvyl, Theodore Ts'o, linux-ext4@vger.kernel.org

On Wed, Sep 16, 2015 at 07:21:59PM -0600, Andreas Dilger wrote:
> If you add "-b 1024" to the mke2fs command line to use 1KB instead of 4KB blocks, and reduce the sizes by a factor of 4, does the problem still happen? That would make it easier for someone else to test, since it would only need a 4-5TB disk instead of a 19TB array.

Sparse files on XFS using loopback will allow you to simulate
devices larger than 16TB easily. You can turtle it all the way down,
too, to create the xfs filesystem on a loopback device on a sparse
file on ext4....

Doing this sort of thing lets me know, for example, that the
mkfs.ext4 defaults fail on a 500TB device...

# xfs_io -f -c 'truncate 500t' /mnt/xfs/fs.img
# ls -lh /mnt/xfs
total 0
-rw------- 1 root root 500T Sep 19 12:41 fs.img
# mkfs.ext4 /mnt/xfs/fs.img
mke2fs 1.42.13 (17-May-2015)
/mnt/xfs/fs.img: Cannot create filesystem with requested number of inodes while setting up superblock
#

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
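
The sparse-file trick described above can be sketched in Python as well. This is a minimal illustration, not what `xfs_io` does internally: the helper name `make_sparse_file` is hypothetical, it relies on `st_blocks` (available on Unix-like systems) being counted in 512-byte units, and it assumes the underlying filesystem supports sparse files.

```python
import os
import tempfile

def make_sparse_file(path, size_bytes):
    """Create (or extend) a file to size_bytes without writing any data.

    Returns (apparent_size, allocated_bytes). On a filesystem with sparse
    file support the allocated size stays far below the apparent size.
    """
    with open(path, "wb") as f:
        f.truncate(size_bytes)          # sets the length, writes no data
    st = os.stat(path)
    return st.st_size, st.st_blocks * 512  # st_blocks is in 512-byte units

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        apparent, allocated = make_sparse_file(os.path.join(d, "fs.img"), 256 * 2**20)
        print("apparent:", apparent, "allocated:", allocated)
```

The maximum apparent size is limited by the host filesystem, which is why the thread suggests XFS (or tmpfs) for images beyond 16TB.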


* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-09-19  2:47                   ` Dave Chinner
@ 2015-09-19  5:23                     ` Darrick J. Wong
  2015-09-19 14:11                     ` Johan Harvyl
  1 sibling, 0 replies; 15+ messages in thread
From: Darrick J. Wong @ 2015-09-19  5:23 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Andreas Dilger, Johan Harvyl, Theodore Ts'o,
	linux-ext4@vger.kernel.org

On Sat, Sep 19, 2015 at 12:47:25PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 07:21:59PM -0600, Andreas Dilger wrote:
> > If you add "-b 1024" to the mke2fs command line to use 1KB instead of 4KB blocks, and reduce the sizes by a factor of 4, does the problem still happen? That would make it easier for someone else to test, since it would only need a 4-5TB disk instead of a 19TB array.
> 
> Sparse files on XFS using loopback will allow you to simulate
> devices larger than 16TB easily. You can turtle it all the way down,
> too, to create the xfs filesystem on a loopback device on a sparse
> file on ext4....
> 
> Doing this sort of thing lets me know, for example, that the
> mkfs.ext4 defaults fail on a 500TB device...
> 
> # xfs_io -f -c 'truncate 500t' /mnt/xfs/fs.img
> # ls -lh /mnt/xfs
> total 0
> -rw------- 1 root root 500T Sep 19 12:41 fs.img
> # mkfs.ext4 /mnt/xfs/fs.img
> mke2fs 1.42.13 (17-May-2015)
> /mnt/xfs/fs.img: Cannot create filesystem with requested number of inodes while setting up superblock

Whee.  I guess one would need to turn on meta_bg at mkfs time (which scatters
the group descriptors across the disk instead of (failing to) sandwich them in
a single blockgroup)... and fix the overhead calculation in ext2fs_initialize to
calculate the maximum BG overhead correctly, since it doesn't seem to know
about meta_bg.

Of course there's the question of whether or not we really /want/ people
formatting 500T ext4 filesystems.  meta_bg is not turned on by default, so
the defaults will still fail unless they know to pass that option.

(Frankly, doing so is probably insane.)

--D
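
A rough back-of-the-envelope check of the "sandwich them in a single blockgroup" remark, as a Python sketch. The assumptions are not taken from the thread's 500T run itself: 4096-byte blocks, 32768 blocks per group and 64-byte group descriptors (the values dumpe2fs shows for the 64bit filesystem earlier in the thread), and "500t" read as 500 TiB. The function name `gdt_blocks` is made up for this illustration.

```python
BLOCK_SIZE = 4096          # bytes per block (assumed default)
BLOCKS_PER_GROUP = 32768   # blocks per block group (assumed default)
DESC_SIZE = 64             # bytes per group descriptor with 64bit (assumed)

def ceil_div(a, b):
    return -(-a // b)

def gdt_blocks(device_bytes):
    """Blocks needed to hold one full copy of the group descriptor table."""
    blocks = device_bytes // BLOCK_SIZE
    groups = ceil_div(blocks, BLOCKS_PER_GROUP)
    return ceil_div(groups * DESC_SIZE, BLOCK_SIZE)

if __name__ == "__main__":
    for label, size in (("24 TB fs", 5860330752 * BLOCK_SIZE),
                        ("500 TiB image", 500 * 2**40)):
        need = gdt_blocks(size)
        print(label, "needs", need, "descriptor blocks;",
              "fits in one group" if need < BLOCKS_PER_GROUP else "does not fit in one group")
```

With these numbers the 500 TiB image would need 64000 descriptor blocks, more than the 32768 blocks in a group, which is consistent with the description above; whether this is exactly what triggers the mke2fs error message is not established here.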

> #
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-09-19  2:47                   ` Dave Chinner
  2015-09-19  5:23                     ` Darrick J. Wong
@ 2015-09-19 14:11                     ` Johan Harvyl
  2015-09-19 15:02                       ` Theodore Ts'o
  1 sibling, 1 reply; 15+ messages in thread
From: Johan Harvyl @ 2015-09-19 14:11 UTC (permalink / raw)
  To: Dave Chinner, Andreas Dilger, Theodore Ts'o
  Cc: linux-ext4@vger.kernel.org

Thanks for the tip about XFS Dave, I have never used it before but I 
decided to give it a try and managed to reproduce my original issue 
there quite quickly.

I took an old 1 TB disk, put it in a USB cradle and attached it to a 
Linux box running Linux 4.1.0-2-amd64, put XFS on it and created a 24T 
sparse file.

# mkfs.xfs /dev/sda1
# truncate test.img -s 24T

Note that this setup shares no hardware components with the box I 
originally noticed the issue on.
The USB cradle is attached to a different box.

Should this be reported in a bug tracker rather than here?

# mkfs.ext4 test.img -i 262144 -m 0 -O 64bit 15627548672k
mke2fs 1.42.13 (17-May-2015)
Discarding device blocks: done
Creating filesystem with 3906887168 4k blocks and 61045248 inodes
Filesystem UUID: 53b8a330-beba-4bc4-ab34-5d57c0f457fb
Superblock backups stored on blocks:
         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
         102400000, 214990848, 512000000, 550731776, 644972544, 1934917632,
         2560000000, 3855122432

Allocating group tables: done
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

# resize2fs -p test.img 19534435840k
resize2fs 1.42.13 (17-May-2015)
Resizing the filesystem on test.img to 4883608960 (4k) blocks.
Begin pass 2 (max = 6)
Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 119229)
Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 5 (max = 8)
Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
The filesystem on test.img is now 4883608960 (4k) blocks long.

# resize2fs -p test.img 23441323008k
resize2fs 1.42.13 (17-May-2015)
Resizing the filesystem on test.img to 5860330752 (4k) blocks.
Begin pass 2 (max = 6)
Relocating blocks XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 3 (max = 149036)
Scanning inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Begin pass 5 (max = 14)
Moving inode table XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Should never happen: resize inode corrupt!

# debugfs -c test.img
debugfs 1.42.13 (17-May-2015)
test.img: catastrophic mode - not reading inode or group bitmaps
debugfs:  stat <2>
Inode: 2   Type: bad type    Mode:  0000   Flags: 0x0

So, again the root inode is trashed.

-johan
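
For anyone following along, the size arguments in the transcript above map onto the reported block counts as sketched below in Python. The helper names are hypothetical, and the 32768 blocks per group figure is assumed from the dumpe2fs output earlier in the thread.

```python
BLOCK_SIZE_KIB = 4         # 4k blocks, as printed by mke2fs and resize2fs
BLOCKS_PER_GROUP = 32768   # assumed from the earlier dumpe2fs output

def kib_to_blocks(size_kib):
    """Convert a size given with the 'k' suffix into 4k filesystem blocks."""
    return size_kib // BLOCK_SIZE_KIB

def group_count(blocks):
    """Number of block groups covering the given block count (rounded up)."""
    return -(-blocks // BLOCKS_PER_GROUP)

# Sizes passed to mkfs.ext4 and the two resize2fs runs, in KiB.
SIZES_KIB = [15627548672, 19534435840, 23441323008]

if __name__ == "__main__":
    for size in SIZES_KIB:
        blocks = kib_to_blocks(size)
        print(f"{size}k -> {blocks} blocks, {group_count(blocks)} groups")
```

The block counts come out as 3906887168, 4883608960 and 5860330752, matching the tool output. The group counts before each resize (119229 and 149036) also coincide with the "max" values printed for pass 3; that is an observation from these numbers, not something stated in the thread.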

On 2015-09-19 04:47, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 07:21:59PM -0600, Andreas Dilger wrote:
>> If you add "-b 1024" to the mke2fs command line to use 1KB instead of 4KB blocks, and reduce the sizes by a factor of 4, does the problem still happen? That would make it easier for someone else to test, since it would only need a 4-5TB disk instead of a 19TB array.
> Sparse files on XFS using loopback will allow you to simulate
> devices larger than 16TB easily. You can turtle it all the way down,
> too, to create the xfs filesystem on a loopback device on a sparse
> file on ext4....
>
> Doing this sort of thing lets me know, for example, that the
> mkfs.ext4 defaults fail on a 500TB device...
>
> # xfs_io -f -c 'truncate 500t' /mnt/xfs/fs.img
> # ls -lh /mnt/xfs
> total 0
> -rw------- 1 root root 500T Sep 19 12:41 fs.img
> # mkfs.ext4 /mnt/xfs/fs.img
> mke2fs 1.42.13 (17-May-2015)
> /mnt/xfs/fs.img: Cannot create filesystem with requested number of inodes while setting up superblock
> #
>
> Cheers,
>
> Dave.



* Re: resize2fs: Should never happen: resize inode corrupt! - lost key inodes
  2015-09-19 14:11                     ` Johan Harvyl
@ 2015-09-19 15:02                       ` Theodore Ts'o
  0 siblings, 0 replies; 15+ messages in thread
From: Theodore Ts'o @ 2015-09-19 15:02 UTC (permalink / raw)
  To: Johan Harvyl; +Cc: Dave Chinner, Andreas Dilger, linux-ext4@vger.kernel.org

On Sat, Sep 19, 2015 at 04:11:50PM +0200, Johan Harvyl wrote:
> 
> Should this be reported in a bug tracker rather than here?

Yes, please do.  I can reproduce this using e2fsprogs 1.43's next
branch, so there is definitely a real bug in e2fsprogs's resize2fs.

The fastest way to reproduce this is using tmpfs (it only requires
275MB of RAM):

#!/bin/bash
FS=/tmp/foo.img
touch $FS
mkfs.ext4 $FS -i 262144 -m 0 -O 64bit 15627548672k
resize2fs -p $FS 19534435840k
resize2fs -p $FS 23441323008k
debugfs -c $FS -R "stat <2>"

Cheers,

						- Ted
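
The same reproduction steps can be driven from Python, which may be convenient when trying several size combinations. This is only a sketch mirroring the shell script above: the function names `repro_commands` and `run_repro` are made up here, the command lines use exactly the options from the script, and nothing is executed unless `run_repro` is called on a machine with e2fsprogs installed.

```python
import subprocess

# mkfs options copied from the script above.
MKFS_OPTS = ["-i", "262144", "-m", "0", "-O", "64bit"]

def repro_commands(image, initial_kib, grow_kib):
    """Build the mkfs/resize/debugfs command lines without running them."""
    cmds = [["mkfs.ext4", image, *MKFS_OPTS, f"{initial_kib}k"]]
    cmds += [["resize2fs", "-p", image, f"{size}k"] for size in grow_kib]
    cmds.append(["debugfs", "-c", image, "-R", "stat <2>"])
    return cmds

def run_repro(image, initial_kib, grow_kib):
    """Run the commands in order and return their exit codes."""
    open(image, "a").close()   # equivalent of `touch`
    # check=False: keep going after a failing resize2fs so debugfs still runs
    return [subprocess.run(cmd, check=False).returncode
            for cmd in repro_commands(image, initial_kib, grow_kib)]

if __name__ == "__main__":
    for cmd in repro_commands("/tmp/foo.img", 15627548672,
                              [19534435840, 23441323008]):
        print(" ".join(cmd))
```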


Thread overview: 15+ messages
2015-08-11 18:15 resize2fs: Should never happen: resize inode corrupt! - lost key inodes Johan Harvyl
2015-08-11 22:47 ` Theodore Ts'o
2015-08-12 22:00   ` Johan Harvyl
2015-08-13 13:27     ` Theodore Ts'o
2015-08-13 18:12       ` Johan Harvyl
2015-09-03 22:16         ` Johan Harvyl
2015-09-12 10:27           ` Johan Harvyl
2015-09-14 21:35             ` Johan Harvyl
2015-09-15 17:55               ` Johan Harvyl
2015-09-17  1:21                 ` Andreas Dilger
2015-09-18 18:26                   ` Johan Harvyl
2015-09-19  2:47                   ` Dave Chinner
2015-09-19  5:23                     ` Darrick J. Wong
2015-09-19 14:11                     ` Johan Harvyl
2015-09-19 15:02                       ` Theodore Ts'o
