* e4defrag: Corrupt file after running e4defrag
From: Marc Thomas
Date: 2017-06-08 14:12 UTC
To: linux-ext4

Hello All,

After running "e4defrag" on an idle filesystem, I found that a single file out of 893084 regular files had become corrupt (its md5sum changed). I repeated the process using an almost identical copy of the original filesystem, and a different single file became corrupt. In both cases the corruption took the same form: the first 4096 bytes of the file were replaced with different data, while the remainder of the file was intact.

I don't think it's a hardware issue. The machine has ECC RAM, and no I/O errors are being reported. The ext4 filesystem is inside an LVM volume, and the underlying storage is RAID1 md-raid. The system is running vanilla kernel 4.11.3 with e2fsprogs-1.43.4 on an 8-core / 16-thread x86_64 CPU.

After the first e4defrag run:

# time md5sum --quiet -c CHECKSUMS_home_050617.md5
./marc/LinuxStuff/Linux4.x/patch-4.10.12.xz: FAILED
md5sum: WARNING: 1 computed checksum did NOT match

real    62m30.816s
user    10m15.144s
sys     2m29.436s

After the second run:

# time md5sum --quiet -c CHECKSUMS_home_050617.md5
./marc/.mozilla/firefox/jh6wc1il.default/formhistory.sqlite: FAILED
md5sum: WARNING: 1 computed checksum did NOT match

real    61m45.874s
user    10m24.339s
sys     2m12.107s

Can anyone help pin down the cause of the corruption? I'm happy to repeat the process again (I have an image copy of the original source filesystem), but the e4defrag run takes about 4 hours and the md5sum check a further hour, so it's not quick to reproduce.

Thanks & Kind Regards,
Marc
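[A minimal sketch of the verification workflow described above, for anyone trying to reproduce this. The checksum filename matches Marc's; the known-good copy path is illustrative, and GNU coreutils is assumed.]

    # Baseline checksums before the defrag:
    cd /old
    find . -type f -print0 | xargs -0 md5sum > /root/CHECKSUMS_home_050617.md5

    # Defragment, then re-verify from the same directory:
    e4defrag /old
    md5sum --quiet -c /root/CHECKSUMS_home_050617.md5

    # For a failed file, locate the differing bytes against a known-good
    # copy; cmp -l prints 1-based byte offsets, so corruption confined to
    # the first 4096 bytes shows up as offsets 1..4096 only:
    cmp -l /backup/patch-4.10.12.xz ./marc/LinuxStuff/Linux4.x/patch-4.10.12.xz | head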
* Re: e4defrag: Corrupt file after running e4defrag
From: Theodore Ts'o
Date: 2017-06-08 22:35 UTC
To: Marc Thomas; +Cc: linux-ext4

On Thu, Jun 08, 2017 at 03:12:27PM +0100, Marc Thomas wrote:
> Hello All,
>
> After running "e4defrag" on an idle filesystem, I found that a single
> file out of 893084 regular files had become corrupt (its md5sum
> changed). I repeated the process using an almost identical copy of
> the original filesystem, and a different single file became corrupt.

How big is the file system, and what file system features are enabled? Can you send me a copy of "dumpe2fs -h" on the file system? Something else that would be useful would be to unmount the file system after seeing the md5sum failure and run "e2fsck -fn" to see whether e2fsck notices any file system corruption.

I don't know if this will be doable, since it depends on how big the file system is and how much extra space you have in your LVM volume group, but what would be *wonderful* would be a full image backup, or failing that, a compressed raw e2image backup (see the e2image man page, or the "REPORTING BUGS" section of the e2fsck man page). The compressed raw e2image backup contains only the file system metadata, not the data blocks, so it takes much less space.

The idea is that after you discover which file is getting corrupted, you can look at that inode on both the "before" file system image (looking only at the metadata blocks) and the "after" file system image, using the "stat" command of debugfs, and see if there are any clues about how to make a reproducible test case. Getting a full dumpe2fs output of the filesystem before and after might also give us a clue.

Thanks,

- Ted
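[Sketched out, the capture workflow Ted suggests looks roughly like this. The device name is taken from later in the thread; these are not his exact commands.]

    # After a md5sum failure, unmount and run a read-only check:
    umount /old
    e2fsck -fn /dev/storage_vg/old_home_lv

    # Metadata-only backup per the "REPORTING BUGS" section of e2fsck(8):
    # -r writes a raw image, -s scrambles directory names for privacy;
    # data blocks are not copied, so the compressed image stays small.
    e2image -rs /dev/storage_vg/old_home_lv - | bzip2 > old_home_lv.e2i.bz2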
* Re: e4defrag: Corrupt file after running e4defrag
From: Marc Thomas
Date: 2017-06-09 1:31 UTC
To: Theodore Ts'o; +Cc: linux-ext4

Hi Ted,

Many thanks for the quick reply.

On 08/06/17 23:35, Theodore Ts'o wrote:
> How big is the file system, and what file system features are enabled?
> Can you send me a copy of "dumpe2fs -h" on the file system? Something
> else that would be useful would be to unmount the file system after
> seeing the md5sum failure and run "e2fsck -fn" to see whether e2fsck
> notices any file system corruption.

Filesystem size is:

root@deepthought:~# df -h /old
Filesystem                          Size  Used Avail Use% Mounted on
/dev/mapper/storage_vg-old_home_lv  345G  240G  101G  71% /old

dumpe2fs output:

root@deepthought:~# umount /old
root@deepthought:~# dumpe2fs -h /dev/storage_vg/old_home_lv
dumpe2fs 1.43.4 (31-Jan-2017)
Filesystem volume name:   home
Last mounted on:          /old
Filesystem UUID:          29b7a949-8f67-454b-8d76-99bf656b8ffb
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype extent sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              22937600
Block count:              91750400
Reserved block count:     917504
Free blocks:              27373220
Free inodes:              21856957
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1002
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
Filesystem created:       Wed Oct 15 22:34:49 2008
Last mount time:          Wed Jun  7 01:56:45 2017
Last write time:          Thu Jun  8 12:12:09 2017
Mount count:              0
Maximum mount count:      21
Last checked:             Thu Jun  8 12:12:09 2017
Check interval:           15552000 (6 months)
Next check after:         Tue Dec  5 11:12:09 2017
Lifetime writes:          1553 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Journal inode:            8
Default directory hash:   tea
Directory Hash Seed:      d863c54f-8fd7-40ee-b43a-119ebee8c4f0
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke
Journal size:             128M
Journal length:           32768
Journal sequence:         0x009ddb9b
Journal start:            0

e2fsck after corruption detected:

root@deepthought:~# time e2fsck -fn /dev/storage_vg/old_home_lv
e2fsck 1.43.5-WIP (17-Feb-2017)
Pass 1: Checking inodes, blocks, and sizes
Inode 141148 extent tree (at level 2) could be narrower.  Fix? no
Inode 362394 extent tree (at level 2) could be narrower.  Fix? no
Inode 394645 extent tree (at level 1) could be narrower.  Fix? no
Inode 412793 extent tree (at level 1) could be narrower.  Fix? no
[...]
Inode 16228394 extent tree (at level 1) could be narrower.  Fix? no
Inode 16318845 extent tree (at level 1) could be narrower.  Fix? no
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
home: 1080643/22937600 files (0.4% non-contiguous), 64379509/91750400 blocks

real    1m47.814s
user    0m13.810s
sys     0m3.478s

If it helps, here's a debugfs "stat" of a file post-corruption. What I don't have at present is the "before" case.

debugfs: stat ./marc/.mozilla/firefox/jh6wc1il.default/formhistory.sqlite
Inode: 2074844   Type: regular    Mode:  0644   Flags: 0x80000
Generation: 196400461    Version: 0x00000000:00000000
User:   510   Group:   100   Project:     0   Size: 92160
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 184
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x59308cb7:aa7df8c0 -- Thu Jun  1 22:52:55 2017
 atime: 0x59308cb7:e6cfad44 -- Thu Jun  1 22:52:55 2017
 mtime: 0x59308cb7:aa7df8c0 -- Thu Jun  1 22:52:55 2017
crtime: 0x00000000:00000000 -- Thu Jan  1 01:00:00 1970
Size of extra inode fields: 32
EXTENTS:
(0-22):38929313-38929335

> I don't know if this will be doable, since it depends on how big the
> file system is and how much extra space you have in your LVM volume
> group, but what would be *wonderful* would be a full image backup, or
> failing that, a compressed raw e2image backup (see the e2image man
> page, or the "REPORTING BUGS" section of the e2fsck man page). The
> compressed raw e2image backup contains only the file system metadata,
> not the data blocks, so it takes much less space.
>
> The idea is that after you discover which file is getting corrupted,
> you can look at that inode on both the "before" file system image
> (looking only at the metadata blocks) and the "after" file system
> image, using the "stat" command of debugfs, and see if there are any
> clues about how to make a reproducible test case. Getting a full
> dumpe2fs output of the filesystem before and after might also give us
> a clue.

I have around 3TB unallocated space in the LVM group, so I should be able to hold a pre-defrag and post-defrag copy of the filesystem. What I'll do is re-copy the source filesystem (ext3) and do the conversion to ext4. I'll then make a block-level copy of that and e4defrag the copy.

I'm currently running:

# e2image -rs /dev/storage_vg/old_home_lv - | bzip2 > /tmp/old_home_lv.e2i.bz2

...but it looks like that might take a while. I'll report back tomorrow.

Thanks & Kind Regards,
Marc
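[Once "before" and "after" images exist, the comparison Ted describes could look like the following sketch. Image filenames are illustrative; note that the decompressed raw image occupies its full nominal size on disk unless re-sparsified.]

    # Restore the compressed raw images:
    bunzip2 -c before.e2i.bz2 > before.img
    bunzip2 -c after.e2i.bz2  > after.img
    # Optionally re-sparsify, e.g.: fallocate --dig-holes before.img

    # Inspect the corrupted file's inode in each image; with "e2image -s"
    # directory names are scrambled, so look it up by inode number rather
    # than by path:
    debugfs -R 'stat <2074844>' before.img
    debugfs -R 'stat <2074844>' after.img

    # Full metadata dumps for diffing:
    dumpe2fs before.img > before.txt
    dumpe2fs after.img  > after.txt
    diff -u before.txt after.txt | less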
* Re: e4defrag: Corrupt file after running e4defrag
From: Theodore Ts'o
Date: 2017-06-09 15:27 UTC
To: Marc Thomas; +Cc: linux-ext4

On Fri, Jun 09, 2017 at 02:31:08AM +0100, Marc Thomas wrote:
>
> I have around 3TB unallocated space in the LVM group, so I should be
> able to hold a pre-defrag and post-defrag copy of the filesystem.
> What I'll do is re-copy the source filesystem (ext3) and do the
> conversion to ext4. I'll then make a block-level copy of that and
> e4defrag the copy.

*Ah*. So this is a file system that was originally ext3, of which you made an image copy, then enabled various ext4 features using tune2fs, and then ran e4defrag, right? That's useful to know, thanks. I assume that means the "before" file was mapped using the old-style ext2/ext3 indirect block map scheme (and not extent-mapped).

I will say that e4defrag is something that was never well supported, and the distributions decided not to support it. In fact, Red Hat doesn't support using tune2fs to add ext4 features at all, because they didn't want to deal with the QA test matrix and support effort that this would involve.

At Google we did take file systems that were indirect-block-mapped (ext2, specifically) and add extent maps and a few other ext4 features, so I know that works. I can also tell you that for our data center workload at the time, a file system converted using tune2fs got about half of the performance improvement of switching to a "native" ext4 file system.

But we never used e4defrag, because it burns a lot of disk bandwidth, and even after the defrag, the low-level layout of the inode table blocks, allocation bitmaps, etc. of an ext2/ext3 file system is different enough from a native ext4 file system that we didn't think it would be worth it. That is, even after converting a file system to have some (but not all) of the ext4 features using tune2fs, the incremental improvement from running e4defrag was never going to match a fully native ext4 file system, and to be honest, if you have the disk space, reformatting and copying would probably be faster in the end *and* result in a more performant file system.

So that doesn't mean we shouldn't fix the bug if we can find the root cause, but understand that in the end you may find that all of this effort wasn't worth it. (But thank you if you decide to help gather the information so we can try to fix the bug anyway. :-)

Cheers,

- Ted

P.S. This is also why companies often decide to deprecate features that very few people are using. It may not make business sense to keep a feature alive for a very small set of users, especially if you're looking at it from a cold-hearted business perspective, or from the "but what about the careers of the engineers stuck maintaining a backwater feature?" angle. But if you are one of the people using the feature, especially if you are a journalist or someone with a blog that gets a lot of traffic, you tend to lash out about how you can't trust that company because it deprecated your favorite feature (even if the monthly active user count was pathetically low).

In the open source world we're less likely to do that unless the feature is actively harmful, but we will sometimes just post "danger, this may corrupt your data" signs if we can't get people to volunteer to support or fix bugs on a low-use feature. E4defrag is right on the cusp of that, especially now that I know it can possibly corrupt data. If we can fix the bug, well and good; but if not, and no one else wants to (or has time to) fix the bug, we may just put a "you probably don't want to use this" sign in the man page, and perhaps stop building it by default. Hopefully, though, the fix will be obvious once we get a bit more data.
* Re: e4defrag: Corrupt file after running e4defrag
From: Marc Thomas
Date: 2017-07-07 18:02 UTC
To: Theodore Ts'o; +Cc: linux-ext4

Hi Ted,

Many thanks for your reply, and apologies for the delay in responding.

The good news is that I have not been able to reproduce the e4defrag-induced file corruption since upgrading from kernel 4.11.3 to 4.11.5 or above, so I suspect it was a kernel issue after all. I note there were some extent-manipulation fixes included in 4.11.5. I'm going to have another attempt at the full data migration process I've been working on, and will report back if I run into any further problems. For anyone else, TL;DR: you can stop reading at this point.

On 09/06/17 16:27, Theodore Ts'o wrote:
> *Ah*. So this is a file system that was originally ext3, of which you
> made an image copy, then enabled various ext4 features using tune2fs,
> and then ran e4defrag, right? That's useful to know, thanks. I assume
> that means the "before" file was mapped using the old-style ext2/ext3
> indirect block map scheme (and not extent-mapped).

Yes, that's correct. The migrated filesystem was also expanded. I verified the md5sums after each step, so I knew when the corruption occurred. JFYI - I was also able to reproduce the corruption issue on a native ext4 fs at kernel 4.11.3.

Prior to e4defrag, the "before" files were extent-mapped, as I'd also run "e2fsck -E bmap2extent" using a patched e2fsck containing Darrick Wong's fix "e2fsck: fix sparse bmap to extent conversion" (commit 855c2ecb21d1556c26d61df9b014e1c79dbbc956).

> I will say that e4defrag is something that was never well supported,
> and the distributions decided not to support it. In fact, Red Hat
> doesn't support using tune2fs to add ext4 features at all, because
> they didn't want to deal with the QA test matrix and support effort
> that this would involve.
>
> At Google we did take file systems that were indirect-block-mapped
> (ext2, specifically) and add extent maps and a few other ext4
> features, so I know that works. I can also tell you that for our data
> center workload at the time, a file system converted using tune2fs
> got about half of the performance improvement of switching to a
> "native" ext4 file system.
>
> But we never used e4defrag, because it burns a lot of disk bandwidth,
> and even after the defrag, the low-level layout of the inode table
> blocks, allocation bitmaps, etc. of an ext2/ext3 file system is
> different enough from a native ext4 file system that we didn't think
> it would be worth it. That is, even after converting a file system to
> have some (but not all) of the ext4 features using tune2fs, the
> incremental improvement from running e4defrag was never going to
> match a fully native ext4 file system, and to be honest, if you have
> the disk space, reformatting and copying would probably be faster in
> the end *and* result in a more performant file system.

Understood. I would like to keep the original filesystems intact if possible, because there is some metadata (ctime, for example) which is lost with a backup/restore or file copy.

As regards filesystem performance, I think e4defrag does have some value. For example, with the data I'm migrating it takes approx. 78 minutes to verify all the md5sums on the converted ext4 filesystem; after defragging it takes 62 minutes to do the same thing. This is around a 20% improvement, which is good enough for me. The defrag itself takes around 4 hours. It does use a lot of disk bandwidth, but these are enterprise-class drives, so hopefully they can take it. I don't propose to use e4defrag on a regular basis, but as a one-off post-migration task it makes sense to me.

> So that doesn't mean we shouldn't fix the bug if we can find the root
> cause, but understand that in the end you may find that all of this
> effort wasn't worth it. (But thank you if you decide to help gather
> the information so we can try to fix the bug anyway. :-)

Again, understood. I'll report back if I encounter any further issues.

> P.S. This is also why companies often decide to deprecate features
> that very few people are using. [...]
>
> In the open source world we're less likely to do that unless the
> feature is actively harmful, but we will sometimes just post "danger,
> this may corrupt your data" signs if we can't get people to volunteer
> to support or fix bugs on a low-use feature. E4defrag is right on the
> cusp of that, especially now that I know it can possibly corrupt
> data. If we can fix the bug, well and good; but if not, and no one
> else wants to (or has time to) fix the bug, we may just put a "you
> probably don't want to use this" sign in the man page, and perhaps
> stop building it by default. Hopefully, though, the fix will be
> obvious once we get a bit more data.

That's fair enough. Hopefully e4defrag can have a stay of execution for a while yet.

Finally, apologies for the malformed patches I sent, and thanks for fixing them up and applying them anyway.

Kind Regards,
Marc
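[For reference, the ext3-to-ext4 migration sequence Marc describes amounts to something like the following sketch. This is not his exact command history; the resize amount is illustrative, the feature list is inferred from the dumpe2fs output earlier in the thread, and the bmap2extent step needs an e2fsck containing the fix he mentions.]

    # Enable extent mapping on the (unmounted) ext3 filesystem,
    # then force a full check:
    tune2fs -O extent /dev/storage_vg/old_home_lv
    e2fsck -f /dev/storage_vg/old_home_lv

    # Convert existing block-mapped files to extent-mapped:
    e2fsck -f -E bmap2extent /dev/storage_vg/old_home_lv

    # Expand the LV and grow the filesystem:
    lvextend -L +50G /dev/storage_vg/old_home_lv
    resize2fs /dev/storage_vg/old_home_lv

    # Mount and defragment (e4defrag works on a mounted filesystem),
    # then verify the checksums again:
    mount /dev/storage_vg/old_home_lv /old
    e4defrag /old
    cd /old && md5sum --quiet -c /root/CHECKSUMS_home_050617.md5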