* e4defrag: Corrupt file after running e4defrag
From: Marc Thomas @ 2017-06-08 14:12 UTC (permalink / raw)
To: linux-ext4
Hello All,
After running "e4defrag" on an idle filesystem, I found that a single
file out of 893084 regular files had become corrupt (its md5sum changed).
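For reference, the checksum list had been generated beforehand along
these lines (exact invocation from memory):

# cd /old && find . -type f -print0 | xargs -0 md5sum > CHECKSUMS_home_050617.md5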
I have repeated the process using an almost identical copy of the
original filesystem, and a different single file became corrupt.
In both cases the corruption took the same form: the first 4096 bytes
of the file were replaced with different data. The remainder of each
file was intact.
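I confirmed this by comparing each corrupt file against the source copy
with something like (paths illustrative):

# cmp -l /old/path/to/file /source/path/to/file | head

Every differing byte offset reported fell within the first 4096 bytes.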
I don't think it's a hardware issue. The machine has ECC RAM, and no I/O
errors are being reported.
The ext4 filesystem is inside an LVM volume, and the underlying storage
is RAID1 md-raid.
The system is running vanilla kernel 4.11.3, with e2fsprogs-1.43.4 on an
8 core / 16 thread x86_64 CPU.
After the first e4defrag run:
# time md5sum --quiet -c CHECKSUMS_home_050617.md5
./marc/LinuxStuff/Linux4.x/patch-4.10.12.xz: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
real 62m30.816s
user 10m15.144s
sys 2m29.436s
After the second run:
# time md5sum --quiet -c CHECKSUMS_home_050617.md5
./marc/.mozilla/firefox/jh6wc1il.default/formhistory.sqlite: FAILED
md5sum: WARNING: 1 computed checksum did NOT match
real 61m45.874s
user 10m24.339s
sys 2m12.107s
Can anyone help pin down the cause of the corruption? I'm happy to
repeat the process (I have an image copy of the original source
filesystem), but the "e4defrag" run takes about 4 hours and the md5sum
check a further hour, so it's not quick to reproduce.
Thanks & Kind Regards,
Marc
* Re: e4defrag: Corrupt file after running e4defrag
From: Theodore Ts'o @ 2017-06-08 22:35 UTC (permalink / raw)
To: Marc Thomas; +Cc: linux-ext4
On Thu, Jun 08, 2017 at 03:12:27PM +0100, Marc Thomas wrote:
> Hello All,
>
> After running "e4defrag" on an idle filesystem, I found that a single
> file out of 893084 regular files had become corrupt (its md5sum
> changed). I have repeated the process using an almost identical copy
> of the original filesystem, and a different single file became corrupt.
How big is the file system, and what file system features are enabled?
Can you send me the output of "dumpe2fs -h" on the file system?
Something else that would be useful: after seeing the md5sum failure,
unmount the file system and run "e2fsck -fn" to see whether e2fsck
notices any file system corruption.
I don't know if this will be doable, since it depends on how big the
file system is and how much extra space you have in your LVM volume
group, but what would be *wonderful* would be if you could do a full
image backup, or failing that, a compressed raw e2image backup (see
the e2image man page, or the "REPORTING BUGS" section of the e2fsck
man page). The compressed raw e2image backup contains only the file
system metadata, not the data blocks, so it takes much less space.
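I.e., something like the example in the e2image man page (device name
adjusted to yours, of course):

e2image -r /dev/hda1 - | bzip2 > hda1.e2i.bz2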
The idea then is that after you discover which file is getting
corrupted, you can look at that inode both on the "before" file system
image (looking only at the metadata blocks) and the "after" file
system image, using the "stat" command of debugfs, and see if there
are any clues about how to make a reproducible test case. Getting full
dumpe2fs output of the filesystem before and after might also give us
a clue.
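Concretely, something like (the path being whichever file turns up
corrupted):

# debugfs -c -R "stat /path/to/corrupted/file" before.e2i
# debugfs -c -R "stat /path/to/corrupted/file" after.e2i

The -c opens the image in catastrophic mode, skipping the block and
inode bitmaps, which is handy on a metadata-only image.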
Thanks,
- Ted
* Re: e4defrag: Corrupt file after running e4defrag
From: Marc Thomas @ 2017-06-09 1:31 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
Hi Ted,
Many thanks for the quick reply.
On 08/06/17 23:35, Theodore Ts'o wrote:
> How big is the file system, and what file system features are enabled?
> Can you send me the output of "dumpe2fs -h" on the file system?
> Something else that would be useful: after seeing the md5sum failure,
> unmount the file system and run "e2fsck -fn" to see whether e2fsck
> notices any file system corruption.
Filesystem size is:
root@deepthought:~# df -h /old
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/storage_vg-old_home_lv 345G 240G 101G 71% /old
dumpe2fs output:
root@deepthought:~# umount /old
root@deepthought:~# dumpe2fs -h /dev/storage_vg/old_home_lv
dumpe2fs 1.43.4 (31-Jan-2017)
Filesystem volume name: home
Last mounted on: /old
Filesystem UUID: 29b7a949-8f67-454b-8d76-99bf656b8ffb
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index
filetype extent sparse_super large_file
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 22937600
Block count: 91750400
Reserved block count: 917504
Free blocks: 27373220
Free inodes: 21856957
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1002
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Filesystem created: Wed Oct 15 22:34:49 2008
Last mount time: Wed Jun 7 01:56:45 2017
Last write time: Thu Jun 8 12:12:09 2017
Mount count: 0
Maximum mount count: 21
Last checked: Thu Jun 8 12:12:09 2017
Check interval: 15552000 (6 months)
Next check after: Tue Dec 5 11:12:09 2017
Lifetime writes: 1553 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Journal inode: 8
Default directory hash: tea
Directory Hash Seed: d863c54f-8fd7-40ee-b43a-119ebee8c4f0
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 128M
Journal length: 32768
Journal sequence: 0x009ddb9b
Journal start: 0
e2fsck after corruption detected:
root@deepthought:~# time e2fsck -fn /dev/storage_vg/old_home_lv
e2fsck 1.43.5-WIP (17-Feb-2017)
Pass 1: Checking inodes, blocks, and sizes
Inode 141148 extent tree (at level 2) could be narrower. Fix? no
Inode 362394 extent tree (at level 2) could be narrower. Fix? no
Inode 394645 extent tree (at level 1) could be narrower. Fix? no
Inode 412793 extent tree (at level 1) could be narrower. Fix? no
[...]
Inode 16228394 extent tree (at level 1) could be narrower. Fix? no
Inode 16318845 extent tree (at level 1) could be narrower. Fix? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
home: 1080643/22937600 files (0.4% non-contiguous), 64379509/91750400 blocks
real 1m47.814s
user 0m13.810s
sys 0m3.478s
If it helps, here's a debugfs "stat" of a file post-corruption. What I
don't have at present is the "before" case.
debugfs: stat ./marc/.mozilla/firefox/jh6wc1il.default/formhistory.sqlite
Inode: 2074844 Type: regular Mode: 0644 Flags: 0x80000
Generation: 196400461 Version: 0x00000000:00000000
User: 510 Group: 100 Project: 0 Size: 92160
File ACL: 0 Directory ACL: 0
Links: 1 Blockcount: 184
Fragment: Address: 0 Number: 0 Size: 0
ctime: 0x59308cb7:aa7df8c0 -- Thu Jun 1 22:52:55 2017
atime: 0x59308cb7:e6cfad44 -- Thu Jun 1 22:52:55 2017
mtime: 0x59308cb7:aa7df8c0 -- Thu Jun 1 22:52:55 2017
crtime: 0x00000000:00000000 -- Thu Jan 1 01:00:00 1970
Size of extra inode fields: 32
EXTENTS:
(0-22):38929313-38929335
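Given the extent mapping above, the corrupted first block can be
pulled straight off the (unmounted) device for inspection with
something like this (output filename illustrative):

# dd if=/dev/storage_vg/old_home_lv bs=4096 skip=38929313 count=1 of=/tmp/formhistory_block0.bin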
> I don't know if this will be doable, since it depends on how big the
> file system is and how much extra space you have in your LVM volume
> group, but what would be *wonderful* would be if you could do a full
> image backup, or failing that, a compressed raw e2image backup (see
> the e2image man page, or the "REPORTING BUGS" section of the e2fsck
> man page). The compressed raw e2image backup contains only the file
> system metadata, not the data blocks, so it takes much less space.
>
> The idea then is that after you discover which file is getting
> corrupted, you can look at that inode both on the "before" file system
> image (looking only at the metadata blocks) and the "after" file
> system image, using the "stat" command of debugfs, and see if there
> are any clues about how to make a reproducible test case. Getting full
> dumpe2fs output of the filesystem before and after might also give us
> a clue.
I have around 3TB of unallocated space in the LVM volume group, so I
should be able to hold both a pre-defrag and a post-defrag copy of the
filesystem.
What I'll do is re-copy the source filesystem (ext3) and do the
conversion to ext4. I'll then make a block-level copy of that and
e4defrag the copy.
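The block-level copy will probably be something along these lines (LV
name and size illustrative):

# lvcreate -L 350G -n defrag_test_lv storage_vg
# dd if=/dev/storage_vg/old_home_lv of=/dev/storage_vg/defrag_test_lv bs=64M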
I'm currently running:
# e2image -rs /dev/storage_vg/old_home_lv - | bzip2 > /tmp/old_home_lv.e2i.bz2
...but it looks like that might take a while. I'll report back tomorrow.
Thanks & Kind Regards,
Marc
* Re: e4defrag: Corrupt file after running e4defrag
From: Theodore Ts'o @ 2017-06-09 15:27 UTC (permalink / raw)
To: Marc Thomas; +Cc: linux-ext4
On Fri, Jun 09, 2017 at 02:31:08AM +0100, Marc Thomas wrote:
>
> I have around 3TB of unallocated space in the LVM volume group, so I
> should be able to hold both a pre-defrag and a post-defrag copy of the
> filesystem.
> What I'll do is re-copy the source filesystem (ext3) and do the
> conversion to ext4. I'll then make a block-level copy of that and
> e4defrag the copy.
*Ah*. So this is a file system that was originally ext3, of which you
made an image copy, then enabled various ext4 features using tune2fs,
and then ran e4defrag, right? That's useful to know, thanks. I assume
that means the "before" file was mapped using the old-style ext2/ext3
indirect block map scheme (and not extent-mapped).
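Easy to check, by the way; either of these will tell you which mapping
scheme a given file is using:

# lsattr /path/to/file       (extent-mapped files show the 'e' attribute)
# filefrag -v /path/to/file  (prints the extent list)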
I will say that e4defrag is something that wasn't well supported, and
the distributions decided not to support it. In fact, Red Hat doesn't
support using tune2fs to add ext4 features at all, because they didn't
want to deal with the QA test matrix and support effort that this
would involve.
At Google we did take file systems that were indirect-block-mapped
(ext2, specifically) and add extent maps and a few other ext4
features, so I know that works. I can also tell you that for our data
center workload at the time, a file system converted using tune2fs got
about half of the performance improvement of switching to a "native"
ext4 file system.
But we never used e4defrag because it burns a lot of disk bandwidth,
and even after the defrag, the low-level layout of the inode table
blocks, block allocation bitmaps, etc., of an ext2/ext3 file system is
different enough from a native ext4 file system that we didn't think
it would be worth it. That is, even after converting a file system to
have some (but not all) of the ext4 features by using tune2fs, the
incremental improvement from running e4defrag was never going to match
a fully native ext4 file system, and to be honest, if you have the
disk space, reformatting and copying would probably be faster in the
end *and* result in a more performant file system.
So that doesn't mean we shouldn't fix the bug if we can find the root
cause, but understand that in the end you may find that all of this
effort isn't worth it. (But thank you if you decide to help gather
the information so we can try to fix the bug anyway. :-)
Cheers,
- Ted
P.S. This is also why companies often decide to deprecate features
that very few people are using. It may not make business sense to
keep a feature alive for the very few people using said feature,
especially if you're looking at it from a cold-hearted business
perspective, or from a "but what about the careers of the engineers
stuck maintaining a backwater feature?" perspective. But if you are
one of the people using the feature, especially if you are a
journalist or someone with a blog that gets a lot of traffic, you tend
to lash out about how you can't trust that company because it
deprecated your favorite feature. (Even if the numbers show the
monthly active user count was pathetically low.)
In the open source world we're less likely to do that unless the
feature is actively harmful, but we will sometimes just post "danger,
this may corrupt your data" signs if we can't get people to volunteer
to support or fix bugs in that low-use feature. E4defrag is right on
the cusp of that, especially now that I know it can possibly corrupt
data. If we can fix the bug, well and good; but if not, and no one
else wants (or has time) to fix it, we may just put a "you probably
don't want to use this" sign in the man page, and perhaps stop
building it by default. Hopefully, though, the fix will be obvious
once we get a bit more data.
* Re: e4defrag: Corrupt file after running e4defrag
From: Marc Thomas @ 2017-07-07 18:02 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
Hi Ted,
Many thanks for your reply, and apologies for the delay in responding.
The good news is I have not been able to reproduce the e4defrag-induced
file corruption since upgrading from kernel 4.11.3 to 4.11.5 or above,
so I suspect it was a kernel issue after all.
I note there were some extent manipulation fixes included in 4.11.5.
I'm going to have another attempt at the full data migration process
I've been working on, and will report back if I run into any further
problems.
For anyone else, TL;DR: you can stop reading at this point.
On 09/06/17 16:27, Theodore Ts'o wrote:
> *Ah*. So this is a file system that was originally ext3, of which you
> made an image copy, then enabled various ext4 features using tune2fs,
> and then ran e4defrag, right? That's useful to know, thanks. I assume
> that means the "before" file was mapped using the old-style ext2/ext3
> indirect block map scheme (and not extent-mapped).
Yes, that's correct. The migrated filesystem was also expanded. I
verified the md5sums after each step, so I knew when the corruption
occurred.
JFYI - I was also able to reproduce the corruption issue on a native
ext4 fs at kernel 4.11.3.
Prior to e4defrag, the "before" files were extent-mapped, as I'd also
run "e2fsck -E bmap2extent" with a patched e2fsck containing Darrick
Wong's fix "e2fsck: fix sparse bmap to extent conversion" (commit
855c2ecb21d1556c26d61df9b014e1c79dbbc956).
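That conversion step was (from memory) just:

# e2fsck -f -E bmap2extent /dev/storage_vg/old_home_lv

run against the unmounted device.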
> I will say that e4defrag is something that wasn't well supported, and
> the distributions decided not to support it. In fact, Red Hat doesn't
> support using tune2fs to add ext4 features at all, because they didn't
> want to deal with the QA test matrix and support effort that this
> would involve.
>
> At Google we did take file systems that were indirect-block-mapped
> (ext2, specifically) and add extent maps and a few other ext4
> features, so I know that works. I can also tell you that for our data
> center workload at the time, a file system converted using tune2fs got
> about half of the performance improvement of switching to a "native"
> ext4 file system.
>
> But we never used e4defrag because it burns a lot of disk bandwidth,
> and even after the defrag, the low-level layout of the inode table
> blocks, block allocation bitmaps, etc., of an ext2/ext3 file system is
> different enough from a native ext4 file system that we didn't think
> it would be worth it. That is, even after converting a file system to
> have some (but not all) of the ext4 features by using tune2fs, the
> incremental improvement from running e4defrag was never going to match
> a fully native ext4 file system, and to be honest, if you have the
> disk space, reformatting and copying would probably be faster in the
> end *and* result in a more performant file system.
Understood. I would like to keep the original filesystems intact if
possible, because some metadata (ctime, for example) is lost with a
backup/restore or file copy.
As regards filesystem performance, I think e4defrag does have some
value. For example, with the data I'm migrating, it takes approximately
78 minutes to verify all the md5sums on the converted ext4 filesystem;
after defragging it takes 62 minutes to do the same thing. That is
around a 20% improvement, which is good enough for me. The defrag
itself takes around 4 hours.
It does use a lot of disk bandwidth, but these are enterprise-class
drives, so hopefully they can take it.
I don't propose to use e4defrag on a regular basis, but as a one-off
post-migration task it makes sense to me.
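For the record, the invocation is nothing exotic; roughly:

# e4defrag -c /old
# e4defrag -v /old

with "-c" reporting the fragmentation score first, so I can check it's
worth committing to the full 4-hour run.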
> So that doesn't mean we shouldn't fix the bug if we can find the root
> cause, but understand that in the end you may find that all of this
> effort isn't worth it. (But thank you if you decide to help gather
> the information so we can try to fix the bug anyway. :-)
Again, understood. I'll report back if I encounter any further issues.
> P.S. This is also why companies often decide to deprecate features
> that very few people are using. It may not make business sense to
> keep a feature alive for the very few people using said feature,
> especially if you're looking at it from a cold-hearted business
> perspective, or from a "but what about the careers of the engineers
> stuck maintaining a backwater feature?" perspective. But if you are
> one of the people using the feature, especially if you are a
> journalist or someone with a blog that gets a lot of traffic, you tend
> to lash out about how you can't trust that company because it
> deprecated your favorite feature. (Even if the numbers show the
> monthly active user count was pathetically low.)
>
> In the open source world we're less likely to do that unless the
> feature is actively harmful, but we will sometimes just post "danger,
> this may corrupt your data" signs if we can't get people to volunteer
> to support or fix bugs in that low-use feature. E4defrag is right on
> the cusp of that, especially now that I know it can possibly corrupt
> data. If we can fix the bug, well and good; but if not, and no one
> else wants (or has time) to fix it, we may just put a "you probably
> don't want to use this" sign in the man page, and perhaps stop
> building it by default. Hopefully, though, the fix will be obvious
> once we get a bit more data.
That's fair enough. Hopefully e4defrag can have a stay of execution for
a while yet.
Finally, apologies for the malformed patches I sent, and thanks for
fixing them up and applying them anyway.
Kind Regards,
Marc