* Filesystem corruption on Fedora 17
From: Adam Huffman @ 2012-11-27 13:31 UTC
To: linux-ext4

Hello

On two machines now I've had severe filesystem corruption. They are both Fedora 17 machines, and they both have, at some point, run the kernels that have been mentioned recently as possibly suffering from ext4 corruption problems.

In the worst case, fsck is unable to fix the problems:

fsck from util-linux 2.20.1
e2fsck 1.42.4 (12-June-2012)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
fsck.ext4: Group descriptors look bad... trying backup blocks...
/dev/mapper/heppc128-lv_home: recovering journal
fsck.ext4: unable to set superblock flags on /dev/mapper/heppc128-lv_home

/dev/mapper/heppc128-lv_home: ***** FILE SYSTEM WAS MODIFIED *****

/dev/mapper/heppc128-lv_home: ********** WARNING: Filesystem still has errors **********

Here's the output of dumpe2fs:

dumpe2fs 1.42.4 (12-June-2012)
Filesystem volume name: <none>
Last mounted on: /home
Filesystem UUID: b0b53537-bcc0-4006-bc32-5b55e13a4b94
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 3670016
Block count: 14680064
Reserved block count: 670950
Free blocks: 2150657
Free inodes: 2544162
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1020
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Mon Apr 2 10:45:35 2012
Last mount time: Fri May 11 10:05:54 2012
Last write time: Tue Nov 27 13:18:35 2012
Mount count: 7
Maximum mount count: 35
Last checked: Mon Apr 2 10:45:35 2012
Check interval: 15552000 (6 months)
Next check after: Sat Sep 29 10:45:35 2012
Lifetime writes: 56 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 1574824
Default directory hash: half_md4
Directory Hash Seed: 32f30e91-a55b-4e69-b95e-3bb3f04f34a9
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 128M
Journal length: 32768
Journal sequence: 0x003506be
Journal start: 0

During various other repair attempts, I've seen this message:

e2fsck 1.42.4 (12-June-2012)
/dev/mapper/vg0majh-lv_root contains a file system with errors, check forced.
Resize inode not valid. Recreate? yes

Pass 1: Checking inodes, blocks, and sizes
Inode 4122234 has illegal block(s). Clear? yes

Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
Error storing directory block information (inode=4122234, block=0, num=78646612): Memory allocation failed

/dev/mapper/vg0majh-lv_root: ***** FILE SYSTEM WAS MODIFIED *****
e2fsck: aborted

/dev/mapper/vg0majh-lv_root: ***** FILE SYSTEM WAS MODIFIED *****

Both machines are running the most recent Fedora kernel, which is 3.6.7-4.

I just tried mounting the /home LV, which seemed to succeed, but any file accesses didn't work:

[ 1176.385418] EXT4-fs (dm-8): warning: checktime reached, running e2fsck is recommended
[ 1176.403296] EXT4-fs warning (device dm-8): ext4_orphan_get:1014: bad orphan inode 1574824! e2fsck was run?
[ 1176.403299] ext4_test_bit(bit=1959, block=6291472) = 0
[ 1176.403301] inode= (null)
[ 1176.403304] EXT4-fs (dm-8): recovery complete
[ 1176.403308] EXT4-fs (dm-8): mounted filesystem with ordered data mode. Opts: (null)
[ 1250.457438] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1572865: comm rsync: deleted inode referenced: 2621441
[ 1250.578786] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1671420: comm rsync: deleted inode referenced: 2229739
[ 1250.654595] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1572894: comm rsync: deleted inode referenced: 2228725
[ 1250.654703] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1572894: comm rsync: deleted inode referenced: 2621702
[ 1250.683319] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1576085: comm rsync: deleted inode referenced: 2621449
[ 1250.695378] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1576085: comm rsync: deleted inode referenced: 2621450

Any help greatly appreciated...

Best Wishes,
Adam
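The bit=1959 and block=6291472 values in the ext4_test_bit line of the kernel log can be cross-checked against the dumpe2fs figures. What follows is a minimal sketch, assuming the usual ext4 layout in which inode numbers start at 1 and block group g begins at block g * blocks_per_group (first block 0). The function names are illustrative only and do not come from e2fsprogs or the kernel.

```python
from typing import Tuple

# Figures copied from the dumpe2fs output in this message.
INODES_PER_GROUP = 8192
BLOCKS_PER_GROUP = 32768
FIRST_BLOCK = 0


def locate_inode(ino: int) -> Tuple[int, int]:
    """Return (block group, bit index in that group's inode bitmap).

    Inode numbers are 1-based, hence the ino - 1.
    """
    return divmod(ino - 1, INODES_PER_GROUP)


def locate_block(block: int) -> Tuple[int, int]:
    """Return (block group, offset of the block within that group)."""
    return divmod(block - FIRST_BLOCK, BLOCKS_PER_GROUP)


if __name__ == "__main__":
    group, bit = locate_inode(1574824)      # "First orphan inode"
    print(f"inode 1574824: group {group}, bitmap bit {bit}")
    bgroup, offset = locate_block(6291472)  # block from ext4_test_bit
    print(f"block 6291472: group {bgroup}, offset {offset}")
```

Both values land in the same block group, and the computed bit index equals the bit=1959 in the log. That suggests the kernel tested the orphan inode's bit in that group's inode bitmap and found it clear, which would be consistent with the "bad orphan inode" warning.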
* Re: Filesystem corruption on Fedora 17
From: Theodore Ts'o @ 2012-11-27 16:47 UTC
To: Adam Huffman
Cc: linux-ext4

On Tue, Nov 27, 2012 at 01:31:18PM +0000, Adam Huffman wrote:
>
> On two machines now I've had severe filesystem corruption. They are
> both Fedora 17 machines, and they both have, at some point, run the
> kernels that have been mentioned recently as possibly suffering from
> ext4 corruption problems.

I don't know if you followed the story that closely, but the hysteria over the "ext4 corruption problems" was caused by users who were using non-standard mount options or other ext4 features....

> In the worst case, fsck is unable to fix the problems:
>
> fsck from util-linux 2.20.1
> e2fsck 1.42.4 (12-June-2012)
> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> /dev/mapper/heppc128-lv_home: recovering journal
> fsck.ext4: unable to set superblock flags on /dev/mapper/heppc128-lv_home

Furthermore, this doesn't look like any of the problems that people have reported. The corruption pattern looks most like what you would see if the blocks in the beginning (low numbered blocks) part of the file system have been overwritten with garbage.

So first of all, if there is critical data that you want to preserve, the first thing I'd suggest doing is to make an image copy of the partition; it's only 56 GB, so hopefully you have space to make a copy before you do any further experimentation to try to recover things.

As far as the "unable to set superblock flags" error, I think I can see how that can happen (and in fact I've created a short test case which demonstrates the problem --- see attached), but that appears to be a one shot failure. That is, the second time you run e2fsck, it should be able to make progress. Is that the case for you?

(It's also possible that there are hardware bugs which are triggering this problem, however, and if in fact you're seeing this happen repeatably, I'd have to seriously suspect some kind of hardware failure.)

                                        - Ted

P.S. In order to get this failure I had to basically use a block editor, since there are software safeguards which prevent e2fsprogs or ext4 from setting the needs_recovery bit on backup superblocks, and this is what was necessary to trigger the bug. I'll fix this for the next release of e2fsprogs. The reason why we hadn't noticed was because (a) it basically requires a very specific hardware-induced bit-flip to trigger, and (b) even when it does, the second run of e2fsck makes the problem go away, so typically it gets noticed when the system fails to boot due to e2fsck blowing out, and then when the system administrator runs fsck a second time on the file system, forward progress gets made.

[-- Attachment #2: testcase.img.gz --]
[-- Type: application/octet-stream, Size: 37512 bytes --]
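On the "56 GB" figure: the dumpe2fs output earlier in the thread reports a block count and a block size, and their product gives the file system size, which is roughly the space an image copy would need. A small sketch of that arithmetic follows; the helper name is made up for illustration.

```python
BLOCK_COUNT = 14680064  # "Block count" from the dumpe2fs output
BLOCK_SIZE = 4096       # "Block size" from the dumpe2fs output, in bytes


def fs_size_bytes(block_count: int, block_size: int) -> int:
    """Size of the file system in bytes."""
    return block_count * block_size


if __name__ == "__main__":
    size = fs_size_bytes(BLOCK_COUNT, BLOCK_SIZE)
    print(f"{size} bytes = {size / 2**30:.1f} GiB")
```

The result agrees with the 56 GB mentioned above when read as binary gigabytes.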
* Re: Filesystem corruption on Fedora 17
From: Adam Huffman @ 2012-11-27 16:59 UTC
To: Theodore Ts'o
Cc: linux-ext4

On Tue, Nov 27, 2012 at 4:47 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 27, 2012 at 01:31:18PM +0000, Adam Huffman wrote:
>>
>> On two machines now I've had severe filesystem corruption. They are
>> both Fedora 17 machines, and they both have, at some point, run the
>> kernels that have been mentioned recently as possibly suffering from
>> ext4 corruption problems.
>
> I don't know if you followed the story that closely, but the hysteria
> over the "ext4 corruption problems" was caused by users who were
> using non-standard mount options or other ext4 features....
>

Yes, I only mentioned that "just in case". I certainly don't have any exotic mount options.

>> In the worst case, fsck is unable to fix the problems:
>>
>> fsck from util-linux 2.20.1
>> e2fsck 1.42.4 (12-June-2012)
>> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> /dev/mapper/heppc128-lv_home: recovering journal
>> fsck.ext4: unable to set superblock flags on /dev/mapper/heppc128-lv_home
>
> Furthermore, this doesn't look like any of the problems that people
> have reported. The corruption pattern looks most like what you would
> see if the blocks in the beginning (low numbered blocks) part of the
> file system have been overwritten with garbage.
>
> So first of all, if there is critical data that you want to preserve,
> the first thing I'd suggest doing is to make an image copy of the
> partition; it's only 56 GB, so hopefully you have space to make a copy
> before you do any further experimentation to try to recover things.
>

I took a copy using dd_rescue yesterday, and that's what I've been running fsck against.
(After that I tried mkfs.ext4 -S on the disk itself, which wasn't successful...)

The image comprises an LVM PV and VG, so I've used kpartx to make it available, if that makes a difference.

There is one person claiming that it does:

http://j-b.livejournal.com/334065.html

> As far as the "unable to set superblock flags" error, I think I can
> see how that can happen (and in fact I've created a short test case
> which demonstrates the problem --- see attached), but that appears to
> be a one shot failure. That is, the second time you run e2fsck, it
> should be able to make progress. Is that the case for you?
>

No, I see the same error no matter how many times I run e2fsck.

> (It's also possible that there are hardware bugs which are triggering
> this problem, however, and if in fact you're seeing this happen
> repeatably, I'd have to seriously suspect some kind of hardware failure.)
>

While I did suspect hardware problems, there hasn't been any sign of them in the system logs so far.

Do you have any ideas about this error, with a different LV from the same disk?:

Pass 1: Checking inodes, blocks, and sizes
Inode 4122234 has illegal block(s). Clear? yes

Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
Error storing directory block information (inode=4122234, block=0, num=78646612): Memory allocation failed

Many thanks for taking a look.

Best Wishes,
Adam

> - Ted
>
> P.S. In order to get this failure I had to basically use a block
> editor, since there are software safeguards which prevent e2fsprogs or
> ext4 from setting the needs_recovery bit on backup superblocks, and
> this is what was necessary to trigger the bug. I'll fix this for the
> next release of e2fsprogs. The reason why we hadn't noticed was
> because (a) it basically requires a very specific hardware-induced
> bit-flip to trigger, and (b) even when it does, the second run of
> e2fsck makes the problem go away, so typically it gets noticed when
> the system fails to boot due to e2fsck blowing out, and then when the
> system administrator runs fsck a second time on the file system,
> forward progress gets made.
>
* Re: Filesystem corruption on Fedora 17
From: Theodore Ts'o @ 2012-11-27 17:31 UTC
To: Adam Huffman
Cc: linux-ext4

On Tue, Nov 27, 2012 at 04:59:05PM +0000, Adam Huffman wrote:
>
> I took a copy using dd_rescue yesterday, and that's what I've been
> running fsck against.
> (After that I tried mkfs.ext4 -S on the disk itself, which wasn't successful...)

On the disk itself? Instead of another copy of the disk? That was unfortunate.... mke2fs -S is very destructive when it doesn't work out.... and what happened after you tried that, BTW? What were the e2fsck failures that you were seeing? If you're seeing the same repeated journal failures, you might as well go for broke and see if zapping the journal helps:

    debugfs -w /dev/XXXX -R "clri <8>"

Again, I always recommend issuing these sorts of commands on copies, and to never tamper with the initial image backup of the file system....

> The image comprises an LVM PV and VG, so I've used kpartx to make it
> available, if that makes a difference.
>
> There is one person claiming that it does:
>
> http://j-b.livejournal.com/334065.html

Hmm... I don't see why that would make a difference. At this point what I'd really need is an e2image dump of the file system. Please read the e2image man page, especially the sections regarding a raw e2image dump and a qcow e2image dump. If you are willing to send me a copy of your metadata blocks, please send me a qcow e2image dump and I'll take a look at it.

> Do you have any ideas about this error, with a different LV from the same disk?:
>
> Pass 1: Checking inodes, blocks, and sizes
> Inode 4122234 has illegal block(s). Clear? yes
>
> Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
> Error storing directory block information (inode=4122234, block=0,
> num=78646612): Memory allocation failed

That's the sign of a very badly corrupted inode data structure. We should do a better job of handling this case automatically.

Can you send me a copy of the output of:

    debugfs -w /dev/XXXX
    debugfs: stat <4122234>

Then what I'd recommend doing is to use the debugfs command "clri <4122234>" to zap the corrupted inode, and then rerunning e2fsck. This is a relatively safe thing to try as these things go, so I won't strongly recommend that you take an image backup of the file system image in question before proceeding --- but in general, it's still a good idea if you are paranoid. :-)

The fact that you are seeing multiple errors like this really makes me wonder.... what kind of storage device is this? An external USB drive? A SATA drive? A software raid device? Something else?

Thanks,

                                        - Ted
* Re: Filesystem corruption on Fedora 17
From: Adam Huffman @ 2012-11-27 18:40 UTC
To: Theodore Ts'o
Cc: linux-ext4

On Tue, Nov 27, 2012 at 5:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 27, 2012 at 04:59:05PM +0000, Adam Huffman wrote:
>>
>> I took a copy using dd_rescue yesterday, and that's what I've been
>> running fsck against.
>> (After that I tried mkfs.ext4 -S on the disk itself, which wasn't successful...)
>
> On the disk itself? Instead of another copy of the disk? That was
> unfortunate.... mke2fs -S is very destructive when it doesn't work
> out.... and what happened after you tried that, BTW? What were the

It's fair to say I was desperate at that point...

Effectively, that filesystem was wiped.

> e2fsck failures that you were seeing? If you're seeing the same
> repeated journal failures, you might as well go for broke and see if
> zapping the journal helps:
>
>     debugfs -w /dev/XXXX -R "clri <8>"
>
> Again, I always recommend issuing these sorts of commands on copies,
> and to never tamper with the initial image backup of the file
> system....
>

Indeed.

>> The image comprises an LVM PV and VG, so I've used kpartx to make it
>> available, if that makes a difference.
>>
>> There is one person claiming that it does:
>>
>> http://j-b.livejournal.com/334065.html
>
> Hmm... I don't see why that would make a difference. At this point
> what I'd really need is an e2image dump of the file system. Please
> read the e2image man page, especially the sections regarding a raw
> e2image dump and a qcow e2image dump. If you are willing to send me a
> copy of your metadata blocks, please send me a qcow e2image dump and
> I'll take a look at it.
>
>> Do you have any ideas about this error, with a different LV from the same disk?:
>>
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 4122234 has illegal block(s). Clear? yes
>>
>> Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
>> Error storing directory block information (inode=4122234, block=0,
>> num=78646612): Memory allocation failed
>
> That's the sign of a very badly corrupted inode data structure. We
> should do a better job of handling this case automatically.
>
> Can you send me a copy of the output of:
>
>     debugfs -w /dev/XXXX
>     debugfs: stat <4122234>
>
> Then what I'd recommend doing is to use the debugfs command "clri
> <4122234>" to zap the corrupted inode, and then rerunning e2fsck.
> This is a relatively safe thing to try as these things go, so I won't
> strongly recommend that you take an image backup of the file system
> image in question before proceeding --- but in general, it's still a
> good idea if you are paranoid. :-)
>
> The fact that you are seeing multiple errors like this really makes me
> wonder.... what kind of storage device is this? An external USB
> drive? A SATA drive? A software raid device? Something else?
>

I tried to mount the image once more, and this time it worked. There were system log errors about a specific inode, but everything else copied to a different disk intact.

Hence I'll try to get the machine back up and running. Once I've done that, I'll send you the information and files you asked for, if you're still interested.

Thanks again,
Adam
* Re: Filesystem corruption on Fedora 17
From: Adam Huffman @ 2012-11-28 18:16 UTC
To: Theodore Ts'o
Cc: linux-ext4

On Tue, Nov 27, 2012 at 5:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 27, 2012 at 04:59:05PM +0000, Adam Huffman wrote:
>>
>> I took a copy using dd_rescue yesterday, and that's what I've been
>> running fsck against.
>> (After that I tried mkfs.ext4 -S on the disk itself, which wasn't successful...)
>
> On the disk itself? Instead of another copy of the disk? That was
> unfortunate.... mke2fs -S is very destructive when it doesn't work
> out.... and what happened after you tried that, BTW? What were the
> e2fsck failures that you were seeing? If you're seeing the same
> repeated journal failures, you might as well go for broke and see if
> zapping the journal helps:
>
>     debugfs -w /dev/XXXX -R "clri <8>"
>
> Again, I always recommend issuing these sorts of commands on copies,
> and to never tamper with the initial image backup of the file
> system....
>
>> The image comprises an LVM PV and VG, so I've used kpartx to make it
>> available, if that makes a difference.
>>
>> There is one person claiming that it does:
>>
>> http://j-b.livejournal.com/334065.html
>
> Hmm... I don't see why that would make a difference. At this point
> what I'd really need is an e2image dump of the file system. Please
> read the e2image man page, especially the sections regarding a raw
> e2image dump and a qcow e2image dump. If you are willing to send me a
> copy of your metadata blocks, please send me a qcow e2image dump and
> I'll take a look at it.
>

I'll send you that off-list.

>> Do you have any ideas about this error, with a different LV from the same disk?:
>>
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 4122234 has illegal block(s). Clear? yes
>>
>> Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
>> Error storing directory block information (inode=4122234, block=0,
>> num=78646612): Memory allocation failed
>
> That's the sign of a very badly corrupted inode data structure. We
> should do a better job of handling this case automatically.
>
> Can you send me a copy of the output of:
>
>     debugfs -w /dev/XXXX
>     debugfs: stat <4122234>
>

Here you go:

debugfs: stat 4122234
4122234: File not found by ext2_lookup

> Then what I'd recommend doing is to use the debugfs command "clri
> <4122234>" to zap the corrupted inode, and then rerunning e2fsck.
> This is a relatively safe thing to try as these things go, so I won't
> strongly recommend that you take an image backup of the file system
> image in question before proceeding --- but in general, it's still a
> good idea if you are paranoid. :-)
>
> The fact that you are seeing multiple errors like this really makes me
> wonder.... what kind of storage device is this? An external USB
> drive? A SATA drive? A software raid device? Something else?
>

It was a simple internal SATA disk - no RAID.

I ran a memory tester over the weekend in case bad RAM was causing the corruption, and in 32 passes no errors were found.

As I said in the other reply, I was able to mount the image in the end. Perhaps one of those fsck invocations made a difference, even though the same error appeared each time?

Thanks,
Adam

> Thanks,
>
> - Ted
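On the "Illegal block" message quoted above: as far as I understand it, e2fsck treats a block number as illegal when it falls outside the range the superblock describes, that is, below the first data block or at or beyond the block count. The block count of lv_root does not appear anywhere in this thread, so the sketch below only works out how far into a device block 1313286244 would sit, under the unconfirmed assumption that lv_root uses the same 4096-byte block size as lv_home. The function names are hypothetical.

```python
# Assumption: lv_root's block size is not shown in the thread; 4096 is
# borrowed from the lv_home dumpe2fs output.
ASSUMED_BLOCK_SIZE = 4096


def is_legal_block(block: int, first_data_block: int, block_count: int) -> bool:
    """True if the block number lies inside the file system."""
    return first_data_block <= block < block_count


def byte_offset(block: int, block_size: int = ASSUMED_BLOCK_SIZE) -> int:
    """Byte offset at which the given block would start."""
    return block * block_size


if __name__ == "__main__":
    offset = byte_offset(1313286244)
    print(f"block 1313286244 would start {offset / 2**40:.2f} TiB into the device")
```

An offset of several TiB is far more than an LV on a single internal SATA disk of that era would be expected to hold, which fits the reading that the inode's block map contains garbage rather than a value that is only slightly off.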
* Re: Filesystem corruption on Fedora 17
From: Theodore Ts'o @ 2012-11-28 21:15 UTC
To: Adam Huffman
Cc: linux-ext4

On Wed, Nov 28, 2012 at 06:16:40PM +0000, Adam Huffman wrote:
> > Can you send me a copy of the output of:
> >
> >     debugfs -w /dev/XXXX
> >     debugfs: stat <4122234>
>
> debugfs: stat 4122234
> 4122234: File not found by ext2_lookup

You need the angle brackets. A number in angle brackets is interpreted as an inode number. Without the angle brackets, debugfs tries to look the name up as a path in its current working directory.

> As I said in the other reply, I was able to mount the image in the
> end. Perhaps one of those fsck invocations made a difference, even
> though the same error appeared each time?

Well, if e2fsck doesn't fix a corruption in a single pass, barring hardware failures, it's a bug in e2fsck by definition (at least in my book).

If the same error is appearing each time, that doesn't mean that the file system can't be mounted. Unless you actually try to reference the corrupted inode in question, you might never know about the corruption.

You can use the ncheck command in debugfs if you want to map an inode number to a pathname. ("ncheck 4122234" --- no angle brackets since ncheck only takes inode numbers and maps them to pathnames, just as icheck takes block numbers and maps them to inode numbers).

                                        - Ted
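The bracket rule is easy to get wrong when scripting debugfs. The sketch below builds command lines without running them. It relies on the -R option of debugfs (already used earlier in the thread) to pass a single request; the helper names are hypothetical, and /dev/XXXX is the same placeholder used above.

```python
from typing import List


def inode_arg(ino: int) -> str:
    """Wrap an inode number in angle brackets so debugfs does not
    treat it as a pathname."""
    return f"<{ino}>"


def debugfs_cmd(device: str, request: str) -> List[str]:
    """argv for a debugfs run (no -w, so read-only) executing one request."""
    return ["debugfs", "-R", request, device]


if __name__ == "__main__":
    dev = "/dev/XXXX"  # placeholder, as in the thread
    # stat accepts either a path or <inode>, so the brackets are required.
    print(debugfs_cmd(dev, f"stat {inode_arg(4122234)}"))
    # ncheck takes bare inode numbers only.
    print(debugfs_cmd(dev, "ncheck 4122234"))
```

Passing the request as a single list element avoids shell quoting problems with the angle brackets if the list is later handed to something like subprocess.run.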