* Filesystem corruption on Fedora 17
From: Adam Huffman @ 2012-11-27 13:31 UTC (permalink / raw)
To: linux-ext4
Hello
On two machines now I've had severe filesystem corruption. They are
both Fedora 17 machines, and they both have, at some point, run the
kernels that have been mentioned recently as possibly suffering from
ext4 corruption problems.
In the worst case, fsck is unable to fix the problems:
fsck from util-linux 2.20.1
e2fsck 1.42.4 (12-June-2012)
ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
fsck.ext4: Group descriptors look bad... trying backup blocks...
/dev/mapper/heppc128-lv_home: recovering journal
fsck.ext4: unable to set superblock flags on /dev/mapper/heppc128-lv_home
/dev/mapper/heppc128-lv_home: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/heppc128-lv_home: ********** WARNING: Filesystem still has errors **********
Here's the output of dumpe2fs:
dumpe2fs 1.42.4 (12-June-2012)
Filesystem volume name: <none>
Last mounted on: /home
Filesystem UUID: b0b53537-bcc0-4006-bc32-5b55e13a4b94
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 3670016
Block count: 14680064
Reserved block count: 670950
Free blocks: 2150657
Free inodes: 2544162
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 1020
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Mon Apr 2 10:45:35 2012
Last mount time: Fri May 11 10:05:54 2012
Last write time: Tue Nov 27 13:18:35 2012
Mount count: 7
Maximum mount count: 35
Last checked: Mon Apr 2 10:45:35 2012
Check interval: 15552000 (6 months)
Next check after: Sat Sep 29 10:45:35 2012
Lifetime writes: 56 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
First orphan inode: 1574824
Default directory hash: half_md4
Directory Hash Seed: 32f30e91-a55b-4e69-b95e-3bb3f04f34a9
Journal backup: inode blocks
Journal features: journal_incompat_revoke
Journal size: 128M
Journal length: 32768
Journal sequence: 0x003506be
Journal start: 0
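The geometry fields in the dumpe2fs output above can be cross-checked with a little arithmetic. The following is only a sketch in Python: the numbers are copied from the output, and the variable names are made up for illustration. It derives the filesystem size, the number of block groups, the inode count those groups imply, and the check interval in days.

```python
# Values copied from the dumpe2fs output above.
block_count = 14680064
block_size = 4096            # bytes
blocks_per_group = 32768
inodes_per_group = 8192
inode_count = 3670016
check_interval = 15552000    # seconds

size_gib = block_count * block_size / 2**30
groups = -(-block_count // blocks_per_group)   # ceiling division
days = check_interval / 86400

print(f"size: {size_gib:.0f} GiB")
print(f"block groups: {groups}")
print(f"inodes implied by the group count: {groups * inodes_per_group}")
print(f"inode count matches: {groups * inodes_per_group == inode_count}")
print(f"check interval: {days:.0f} days")
```

The size works out to 56 GiB across 448 block groups, the implied inode count agrees with the reported "Inode count", and the check interval is 180 days, so the superblock fields are at least internally consistent.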
During various other repair attempts, I've seen this message:
e2fsck 1.42.4 (12-June-2012)
/dev/mapper/vg0majh-lv_root contains a file system with errors, check forced.
Resize inode not valid. Recreate? yes
Pass 1: Checking inodes, blocks, and sizes
Inode 4122234 has illegal block(s). Clear? yes
Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
Error storing directory block information (inode=4122234, block=0, num=78646612): Memory allocation failed
/dev/mapper/vg0majh-lv_root: ***** FILE SYSTEM WAS MODIFIED *****
e2fsck: aborted
/dev/mapper/vg0majh-lv_root: ***** FILE SYSTEM WAS MODIFIED *****
Both machines are running the most recent Fedora kernel, which is 3.6.7-4.
I just tried mounting the /home LV, which seemed to succeed, but any
file accesses didn't work:
[ 1176.385418] EXT4-fs (dm-8): warning: checktime reached, running e2fsck is recommended
[ 1176.403296] EXT4-fs warning (device dm-8): ext4_orphan_get:1014: bad orphan inode 1574824! e2fsck was run?
[ 1176.403299] ext4_test_bit(bit=1959, block=6291472) = 0
[ 1176.403301] inode= (null)
[ 1176.403304] EXT4-fs (dm-8): recovery complete
[ 1176.403308] EXT4-fs (dm-8): mounted filesystem with ordered data mode. Opts: (null)
[ 1250.457438] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1572865: comm rsync: deleted inode referenced: 2621441
[ 1250.578786] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1671420: comm rsync: deleted inode referenced: 2229739
[ 1250.654595] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1572894: comm rsync: deleted inode referenced: 2228725
[ 1250.654703] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1572894: comm rsync: deleted inode referenced: 2621702
[ 1250.683319] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1576085: comm rsync: deleted inode referenced: 2621449
[ 1250.695378] EXT4-fs error (device dm-8): ext4_lookup:1050: inode #1576085: comm rsync: deleted inode referenced: 2621450
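The inode numbers in the log above can be mapped to a block group and a bit position in that group's inode bitmap. This is a minimal sketch, assuming the "Inodes per group" value of 8192 from the dumpe2fs output; the function name is hypothetical and this is not the kernel's code.

```python
INODES_PER_GROUP = 8192   # "Inodes per group" from the dumpe2fs output

def locate_inode(ino, inodes_per_group=INODES_PER_GROUP):
    """Return (block group, bit index in that group's inode bitmap)."""
    # Inode numbers start at 1, so subtract one before dividing.
    return divmod(ino - 1, inodes_per_group)

# The orphan inode, a directory inode and a "deleted inode" from the log.
for ino in (1574824, 1572865, 2621441):
    group, bit = locate_inode(ino)
    print(f"inode {ino}: group {group}, bit {bit}")
```

For the orphan inode 1574824 this gives group 192, bit 1959, which agrees with the bit=1959 printed by ext4_test_bit. The test returning 0 suggests the inode bitmap has that inode marked as free.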
Any help greatly appreciated...
Best Wishes,
Adam
* Re: Filesystem corruption on Fedora 17
From: Theodore Ts'o @ 2012-11-27 16:47 UTC (permalink / raw)
To: Adam Huffman; +Cc: linux-ext4
On Tue, Nov 27, 2012 at 01:31:18PM +0000, Adam Huffman wrote:
>
> On two machines now I've had severe filesystem corruption. They are
> both Fedora 17 machines, and they both have, at some point, run the
> kernels that have been mentioned recently as possibly suffering from
> ext4 corruption problems.
I don't know if you followed the story that closely, but the hysteria
over the "ext4 corruption problems" was caused by users who were
using non-standard mount options or other ext4 features....
> In the worst case, fsck is unable to fix the problems:
>
> fsck from util-linux 2.20.1
> e2fsck 1.42.4 (12-June-2012)
> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
> fsck.ext4: Group descriptors look bad... trying backup blocks...
> /dev/mapper/heppc128-lv_home: recovering journal
> fsck.ext4: unable to set superblock flags on /dev/mapper/heppc128-lv_home
Furthermore, this doesn't look like any of the problems that people
have reported. The corruption pattern looks most like what you would
see if the blocks in the beginning (low numbered blocks) part of the
file system have been overwritten with garbage.
So first of all, if there is critical data that you want to preserve,
the first thing I'd suggest doing is to make an image copy of the
partition; it's only 56 GB, so hopefully you have space to make a copy
before you do any further experimentation to try to recover things.
As far as the "unable to set superblock flags" error, I think I can
see how that can happen (and in fact I've created a short test case
which demonstrates the problem --- see attached), but that appears to
be a one-shot failure. That is, the second time you run e2fsck, it
should be able to make progress. Is that the case for you?
(It's also possible that there are hardware bugs which are triggering
this problem, however, and if in fact you're seeing this happen
repeatably, I'd seriously suspect some kind of hardware failure.)
- Ted
P.S. In order to get this failure I had to basically use a block
editor, since there are software safeguards which prevent e2fsprogs or
ext4 from setting the needs_recovery bit on backup superblocks, and
this is what was necessary to trigger the bug. I'll fix this for the
next release of e2fsprogs. The reason why we hadn't noticed was
because (a) it basically requires a very specific hardware-induced
bit-flip to trigger, and (b) even when it does, the second run of
e2fsck makes the problem go away, so typically it gets noticed when
the system fails to boot due to e2fsck blowing out, and then when the
system administrator runs fsck a second time on the file system,
forward progress gets made.
[-- Attachment #2: testcase.img.gz --]
[-- Type: application/octet-stream, Size: 37512 bytes --]
* Re: Filesystem corruption on Fedora 17
From: Adam Huffman @ 2012-11-27 16:59 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
On Tue, Nov 27, 2012 at 4:47 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 27, 2012 at 01:31:18PM +0000, Adam Huffman wrote:
>>
>> On two machines now I've had severe filesystem corruption. They are
>> both Fedora 17 machines, and they both have, at some point, run the
>> kernels that have been mentioned recently as possibly suffering from
>> ext4 corruption problems.
>
> I don't know if you followed the story that closely, but the hysteria
> over the "ext4 corruption problems" was caused by users who were
> using non-standard mount options or other ext4 features....
>
Yes, I only mentioned that "just in case". I certainly don't have any
exotic mount options.
>> In the worst case, fsck is unable to fix the problems:
>>
>> fsck from util-linux 2.20.1
>> e2fsck 1.42.4 (12-June-2012)
>> ext2fs_check_desc: Corrupt group descriptor: bad block for block bitmap
>> fsck.ext4: Group descriptors look bad... trying backup blocks...
>> /dev/mapper/heppc128-lv_home: recovering journal
>> fsck.ext4: unable to set superblock flags on /dev/mapper/heppc128-lv_home
>
> Furthermore, this doesn't look like any of the problems that people
> have reported. The corruption pattern looks most like what you would
> see if the blocks in the beginning (low numbered blocks) part of the
> file system have been overwritten with garbage.
>
> So first of all, if there is critical data that you want to preserve,
> the first thing I'd suggest doing is to make an image copy of the
> partition; it's only 56 GB, so hopefully you have space to make a copy
> before you do any further experimentation to try to recover things.
>
I took a copy using dd_rescue yesterday, and that's what I've been
running fsck against.
(After that I tried mkfs.ext4 -S on the disk itself, which wasn't successful...)
The image comprises an LVM PV and VG, so I've used kpartx to make it
available, if that makes a difference.
There is one person claiming that it does:
http://j-b.livejournal.com/334065.html
> As far as the "unable to set superblock flags" error, I think I can
> see how that can happen (and in fact I've created a short test case
> which demonstrates the problem --- see attached), but that appears to
> be a one shot failure. That is, the second time you run e2fsck, it
> should be able to make progress. Is that the case for you?
>
No, I see the same error no matter how many times I run e2fsck.
> (It's also possible that there are hardware bugs which are triggering
> this problem, however, and if in fact you're seeing this happen
> repeatably, I'd seriously suspect some kind of hardware failure.)
>
While I did suspect hardware problems, there hasn't been any sign of
them in the system logs so far.
Do you have any ideas about this error, with a different LV from the same disk?:
Pass 1: Checking inodes, blocks, and sizes
Inode 4122234 has illegal block(s). Clear? yes
Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
Error storing directory block information (inode=4122234, block=0, num=78646612): Memory allocation failed
Many thanks for taking a look.
Best Wishes,
Adam
* Re: Filesystem corruption on Fedora 17
From: Theodore Ts'o @ 2012-11-27 17:31 UTC (permalink / raw)
To: Adam Huffman; +Cc: linux-ext4
On Tue, Nov 27, 2012 at 04:59:05PM +0000, Adam Huffman wrote:
>
> I took a copy using dd_rescue yesterday, and that's what I've been
> running fsck against.
> (After that I tried mkfs.ext4 -S on the disk itself, which wasn't successful...)
On the disk itself? Instead of another copy of the disk? That was
unfortunate.... mke2fs -S is very destructive when it doesn't work
out.... and what happened after you tried that, BTW? What were the
e2fsck failures that you were seeing? If you're seeing the same
repeated journal failures, you might as well go for broke and see if
zapping the journal helps:
debugfs -w /dev/XXXX -R "clri <8>"
Again, I always recommend issuing these sorts of commands on copies,
and to never tamper with the initial image backup of the file
system....
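The "work on copies" advice can be sketched in a few lines of Python. This is an illustration only: the function names and the ".scratch" suffix are made up, and the command is built but deliberately not executed. The debugfs options used (-w for read-write, -R for a single request) are the ones from the command above.

```python
import shutil
from pathlib import Path

def scratch_copy(pristine, workdir):
    """Copy the pristine image into workdir and return the copy's path."""
    pristine, workdir = Path(pristine), Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    copy = workdir / (pristine.name + ".scratch")
    shutil.copyfile(pristine, copy)   # the pristine image is only read
    return copy

def clri_command(image, inode):
    """Build (but do not run) the debugfs call that clears one inode."""
    return ["debugfs", "-w", "-R", f"clri <{inode}>", str(image)]
```

A caller would run scratch_copy() once per experiment and pass only the returned path to clri_command() or e2fsck, so the original backup image is never opened for writing.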
> The image comprises an LVM PV and VG, so I've used kpartx to make it
> available, if that makes a difference.
>
> There is one person claiming that it does:
>
> http://j-b.livejournal.com/334065.html
Hmm... I don't see why that would make a difference. At this point
what I'd really need is an e2image dump of the file system. Please
read the e2image man page, especially the sections regarding a raw
e2image dump and a qcow e2image dump. If you are willing to send me a
copy of your metadata blocks, please send me a qcow e2image dump and
I'll take a look at it.
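As a rough sketch of what producing such a dump might look like, the Python helper below only assembles the e2image command line. The -Q (QCOW2) and -r (raw) flags are as I understand them from the e2image man page, so please check them against your e2fsprogs version; the function name and the device and output paths are placeholders.

```python
def e2image_command(device, out_file, fmt="qcow2"):
    """Build (but do not run) an e2image metadata-dump command."""
    flags = {"qcow2": "-Q", "raw": "-r"}   # image formats named in the man page
    if fmt not in flags:
        raise ValueError(f"unsupported image format: {fmt}")
    return ["e2image", flags[fmt], device, out_file]

# Placeholder device and output names, for illustration only.
print(" ".join(e2image_command("/dev/XXXX", "home.qcow2")))
```

Building the argument list separately makes it easy to review exactly what will be run before pointing it at a damaged file system.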
> Do you have any ideas about this error, with a different LV from the same disk?:
>
> Pass 1: Checking inodes, blocks, and sizes
> Inode 4122234 has illegal block(s). Clear? yes
>
> Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
> Error storing directory block information (inode=4122234, block=0,
> num=78646612): Memory allocation failed
That's the sign of a very badly corrupted inode data structure. We
should do a better job of handling this case automatically.
Can you send me a copy of the output of:
debugfs -w /dev/XXXX
debugfs: stat <4122234>
Then what I'd recommend doing is to use the debugfs command "clri
<4122234>" to zap the corrupted inode, and then rerun e2fsck.
This is a relatively safe thing to try as these things go, so I won't
strongly recommend that you take an image backup of the file system
image in question before proceeding --- but in general, it's still a
good idea if you are paranoid. :-)
The fact that you are seeing multiple errors like this really makes me
wonder.... what kind of storage device is this? An external USB
drive? A SATA drive? A software raid device? Something else?
Thanks,
- Ted
* Re: Filesystem corruption on Fedora 17
From: Adam Huffman @ 2012-11-27 18:40 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
On Tue, Nov 27, 2012 at 5:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 27, 2012 at 04:59:05PM +0000, Adam Huffman wrote:
>>
>> I took a copy using dd_rescue yesterday, and that's what I've been
>> running fsck against.
>> (After that I tried mkfs.ext4 -S on the disk itself, which wasn't successful...)
>
> On the disk itself? Instead of another copy of the disk? That was
> unfortunate.... mke2fs -S is very destructive when it doesn't work
> out.... and what happened after you tried that, BTW? What were the
It's fair to say I was desperate at that point...
Effectively, that filesystem was wiped.
> e2fsck failures that you were seeing? If you're seeing the same
> repeated journal failures, you might as well go for broke and see if
> zapping the journal helps:
>
> debugfs -w /dev/XXXX -R "clri <8>"
>
> Again, I always recommend issuing these sorts of commands on copies,
> and to never tamper with the initial image backup of the file
> system....
>
Indeed.
>> The image comprises an LVM PV and VG, so I've used kpartx to make it
>> available, if that makes a difference.
>>
>> There is one person claiming that it does:
>>
>> http://j-b.livejournal.com/334065.html
>
> Hmm... I don't see why that would make a difference. At this point
> what I'd really need is an e2image dump of the file system. Please
> read the e2image man page, especially the sections regarding a raw
> e2image dump and a qcow e2image dump. If you are willing to send me a
> copy of your metadata blocks, please send me a qcow e2image dump and
> I'll take a look at it.
>
>> Do you have any ideas about this error, with a different LV from the same disk?:
>>
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 4122234 has illegal block(s). Clear? yes
>>
>> Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
>> Error storing directory block information (inode=4122234, block=0,
>> num=78646612): Memory allocation failed
>
> That's the sign of a very badly corrupted inode data structure. We
> should do a better job of handling this case automatically.
>
> Can you send me a copy of the output of:
>
> debugfs -w /dev/XXXX
> debugfs: stat <4122234>
>
> Then what I'd recommend doing is to use the debugfs command "clri
> <4122234>" to zap the corrupted inode, and then rerun e2fsck.
> This is a relatively safe thing to try as these things go, so I won't
> strongly recommend that you take an image backup of the file system
> image in question before proceeding --- but in general, it's still a
> good idea if you are paranoid. :-)
>
> The fact that you are seeing multiple errors like this really makes me
> wonder.... what kind of storage device is this? An external USB
> drive? A SATA drive? A software raid device? Something else?
>
I tried to mount the image once more, and this time it worked. There
were system log errors about a specific inode, but everything else
copied to a different disk intact. Hence I'll try to get the machine
back up and running. Once I've done that, I'll send you the
information and files you asked for, if you're still interested.
Thanks again,
Adam
* Re: Filesystem corruption on Fedora 17
From: Adam Huffman @ 2012-11-28 18:16 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: linux-ext4
On Tue, Nov 27, 2012 at 5:31 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Tue, Nov 27, 2012 at 04:59:05PM +0000, Adam Huffman wrote:
>>
>> I took a copy using dd_rescue yesterday, and that's what I've been
>> running fsck against.
>> (After that I tried mkfs.ext4 -S on the disk itself, which wasn't successful...)
>
> On the disk itself? Instead of another copy of the disk? That was
> unfortunate.... mke2fs -S is very destructive when it doesn't work
> out.... and what happened after you tried that, BTW? What were the
> e2fsck failures that you were seeing? If you're seeing the same
> repeated journal failures, you might as well go for broke and see if
> zapping the journal helps:
>
> debugfs -w /dev/XXXX -R "clri <8>"
>
> Again, I always recommend issuing these sorts of commands on copies,
> and to never tamper with the initial image backup of the file
> system....
>
>> The image comprises an LVM PV and VG, so I've used kpartx to make it
>> available, if that makes a difference.
>>
>> There is one person claiming that it does:
>>
>> http://j-b.livejournal.com/334065.html
>
> Hmm... I don't see why that would make a difference. At this point
> what I'd really need is an e2image dump of the file system. Please
> read the e2image man page, especially the sections regarding a raw
> e2image dump and a qcow e2image dump. If you are willing to send me a
> copy of your metadata blocks, please send me a qcow e2image dump and
> I'll take a look at it.
>
I'll send you that off-list.
>> Do you have any ideas about this error, with a different LV from the same disk?:
>>
>> Pass 1: Checking inodes, blocks, and sizes
>> Inode 4122234 has illegal block(s). Clear? yes
>>
>> Illegal block #256918621 (1313286244) in inode 4122234. CLEARED.
>> Error storing directory block information (inode=4122234, block=0,
>> num=78646612): Memory allocation failed
>
> That's the sign of a very badly corrupted inode data structure. We
> should do a better job of handling this case automatically.
>
> Can you send me a copy of the output of:
>
> debugfs -w /dev/XXXX
> debugfs: stat <4122234>
>
Here you go:
debugfs: stat 4122234
4122234: File not found by ext2_lookup
> Then what I'd recommend doing is to use the debugfs command "clri
> <4122234>" to zap the corrupted inode, and then rerun e2fsck.
> This is a relatively safe thing to try as these things go, so I won't
> strongly recommend that you take an image backup of the file system
> image in question before proceeding --- but in general, it's still a
> good idea if you are paranoid. :-)
>
> The fact that you are seeing multiple errors like this really makes me
> wonder.... what kind of storage device is this? An external USB
> drive? A SATA drive? A software raid device? Something else?
>
It was a simple internal SATA disk - no RAID.
I ran a memory tester over the weekend in case bad RAM was causing the
corruption, and in 32 passes no errors were found.
As I said in the other reply, I was able to mount the image in the
end. Perhaps one of those fsck invocations made a difference, even
though the same error appeared each time?
Thanks,
Adam
* Re: Filesystem corruption on Fedora 17
From: Theodore Ts'o @ 2012-11-28 21:15 UTC (permalink / raw)
To: Adam Huffman; +Cc: linux-ext4
On Wed, Nov 28, 2012 at 06:16:40PM +0000, Adam Huffman wrote:
> > Can you send me a copy of the output of:
> >
> > debugfs -w /dev/XXXX
> > debugfs: stat <4122234>
>
> debugfs: stat 4122234
> 4122234: File not found by ext2_lookup
You need the angle brackets. A number in angle brackets is
interpreted as an inode number. Without the angle brackets,
debugfs treats the argument as a name and tries to look it up in its
current working directory.
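To make the distinction concrete, here is a small Python illustration of the convention just described. It is not debugfs's actual parser, and the function name is made up; it only shows how "<N>" and a bare "N" end up meaning different things.

```python
import re

_INODE_SPEC = re.compile(r"^<(\d+)>$")

def classify_filespec(arg):
    """Return ("inode", n) for "<n>", otherwise ("path", arg)."""
    match = _INODE_SPEC.match(arg)
    if match:
        return ("inode", int(match.group(1)))
    # Anything else is treated as a name to look up in a directory.
    return ("path", arg)

print(classify_filespec("<4122234>"))  # ('inode', 4122234)
print(classify_filespec("4122234"))    # ('path', '4122234')
```

The second call corresponds to the "File not found by ext2_lookup" message: the bare number is looked up as a file name and no such file exists.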
> As I said in the other reply, I was able to mount the image in the
> end. Perhaps one of those fsck invocations made a difference, even
> though the same error appeared each time?
Well, if e2fsck doesn't fix a corruption in a single pass, barring
hardware failures, it's a bug in e2fsck by definition (at least in my
book). If the same error is appearing each time, that doesn't mean
that the file system can't be mounted. Unless you actually try to
reference the corrupted inode in question, you might never know about
the corruption.
You can use the ncheck command in debugfs if you want to map an inode
number to a pathname. ("ncheck 4122234" --- no angle brackets since
ncheck only takes inode numbers and maps them to pathnames, just as
icheck takes block numbers and maps them to inode numbers).
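The bracket rules for these three commands can be captured in a short helper. Again this is only a sketch in Python with hypothetical function names: it formats the request string and the argument list, without running anything, and it leaves out -w so that debugfs opens the image read-only.

```python
def debugfs_request(command, number):
    """Format a debugfs request for one inode or block number."""
    if command == "stat":
        return f"stat <{number}>"       # file spec: needs angle brackets
    if command in ("ncheck", "icheck"):
        return f"{command} {number}"    # bare inode (ncheck) or block (icheck) number
    raise ValueError(f"unsupported command: {command}")

def debugfs_argv(image, request):
    """Build a read-only debugfs invocation for a single request."""
    return ["debugfs", "-R", request, image]

# "home.img" is a placeholder image name.
print(debugfs_argv("home.img", debugfs_request("ncheck", 4122234)))
```

Keeping the inspection commands read-only means they are safe to run even against the only copy of an image.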
- Ted