* corrupted ext4 1000gb filesystem (2.6.32, debian stable)
From: Luke Kenneth Casson Leighton @ 2011-07-25 12:15 UTC
To: linux-kernel

folks, hi,

i set up a 1000gb RAID1 mirrored filesystem, possibly in a foolish way (multiple external USB drives) and had not added a fsck check to the script which dynamically assembles the drives. naively and yet happily i followed this advice, here:
http://lists.debian.org/debian-user/2011/06/msg02164.html

the problem was that uh, yeah, the ext4 filesystem _did_ end up getting massively corrupted, very very quickly (within weeks), and i don't entirely know how. at approx 50% filled, i carried out another copy operation of several tens of thousands of files (backup of a system with about 10gb usage) and it was these files and directories that had the worst level of filesystem corruption.

so my question is: has anyone else encountered significant filesystem corruption of large (1000gb+) ext4 filesystems, with any *recent* 2.6 linux kernels?

tia,
l.

p.s. yes fsck has now been added to the drive assembly script, but i am not yet filled with confidence that everything will be hunky-dory.

* Re: corrupted ext4 1000gb filesystem (2.6.32, debian stable)
From: Luke Kenneth Casson Leighton @ 2011-08-02 0:15 UTC
To: linux-kernel

ok, um... i have a bit more information about this situation, to report. two consecutive runs of fsck.ext4, and the filesystem still reports errors after the first run "corrected" all errors. i'd say that was a bit serious.

l.

* Re: corrupted ext4 1000gb filesystem (2.6.32, debian stable)
From: Luke Kenneth Casson Leighton @ 2011-08-02 13:14 UTC
To: linux-kernel

On Tue, Aug 2, 2011 at 1:15 AM, Luke Kenneth Casson Leighton <luke.leighton@gmail.com> wrote:
> ok, um... i have a bit more information about this situation, to
> report. two consecutive runs of fsck.ext4, and the filesystem still
> reports errors after the first run "corrected" all errors. i'd say
> that was a bit serious.

rright - apologies but i've located the likely source of the problem - e2fsck. the issue is that the bitmaps for the 3-way RAID1 mirror were corrupted. thus, the filesystem would be fixed by e2fsck, only to be completely buggered up by picking wildly inappropriate sections of the drive... that presumably by either bad luck or by a powercut and writes occurring at the time happened to be on inode blocks.

whilst it is not entirely possible, due to this being a complex interdependent set of empirical observations, to rule out ext4, i believe it's safe to say that ext4 is not a high candidate, and i apologise for taking up peoples' time.

now all i can hope for is that the data isn't completely buggered.. :)

l.

* Re: corrupted ext4 1000gb filesystem (2.6.32, debian stable)
From: Ted Ts'o @ 2011-08-02 14:27 UTC
To: Luke Kenneth Casson Leighton
Cc: linux-kernel

On Tue, Aug 02, 2011 at 02:14:19PM +0100, Luke Kenneth Casson Leighton wrote:
> On Tue, Aug 2, 2011 at 1:15 AM, Luke Kenneth Casson Leighton
> <luke.leighton@gmail.com> wrote:
>
> > ok, um... i have a bit more information about this situation, to
> > report. two consecutive runs of fsck.ext4, and the filesystem still
> > reports errors after the first run "corrected" all errors. i'd say
> > that was a bit serious.
>
> rright - apologies but i've located the likely source of the problem
> - e2fsck. the issue is that the bitmaps for the 3-way RAID1 mirror
> were corrupted. thus, the filesystem would be fixed by e2fsck, only
> to be completely buggered up by picking wildly inappropriate sections
> of the drive... that presumably by either bad luck or by a powercut
> and writes occurring at the time happened to be on inode blocks.

E2fsck doesn't depend on the bitmaps; those are regenerated based on the information from the inode tables. Assuming that the disks are stable --- that is, a read from a block returns the same contents all the time, and writes are not lost (i.e., after a write, reads to that block return the written data consistently), then there should not be any corruptions found after the first run of e2fsck fixes all errors. That being said, there have been cases where that's not true, and I consider that a bug in e2fsck.

*However*, if you have a RAID1 setup where the data on the disks are inconsistent, this can be the cause of much mischief. Depending on which disk you read from the mirror, you might get different results. Once that's the case, all bets with e2fsck are off.

I suggest you make sure that your RAID1 mirror is stable first of all; in general, you *have* to fix problems with the storage stack from the lowest level on up. First make sure the hard drives are all sane; then make sure the partition table and/or LVM setups are sane; then make sure any RAID setups are sane; and only *then* run a filesystem-level checker. This is true regardless of what file system you use.

Finally, I strongly recommend that when you are doing this kind of repair work, you save a copy of everything you do using a program like "script". A transcript of the e2fsck output can be critically useful. Reviewing the transcript can also be useful in identifying mistakes that you might have made during the recovery process.

Regards,

- Ted

P.S. Note that if you are running e2fsck, and you haven't mounted the disk yet, if you are seeing failures after a second run of e2fsck, then it obviously can't be a failing in the ext4 kernel code.
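
Ted's condition that every member of the mirror returns the same data can be spot-checked before any filesystem-level repair is attempted. What follows is only a minimal Python sketch and is not something from the thread: `chunk_digests` and `mismatched_offsets` are hypothetical helper names, and the two paths are assumed to be readable images or copies of the mirror members. Comparing raw md member devices this way is naive, since any per-device RAID metadata will legitimately differ; on a live array, md's own check (writing `check` to the array's `sync_action` file in sysfs and then reading `mismatch_cnt`) is probably the better tool.

```python
import hashlib
import itertools

CHUNK = 1 << 20  # compare 1 MiB at a time (arbitrary choice for this sketch)


def chunk_digests(path, chunk_size=CHUNK):
    """Yield (offset, sha256 hex digest) for each chunk read from path."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            yield offset, hashlib.sha256(data).hexdigest()
            offset += len(data)


def mismatched_offsets(path_a, path_b, chunk_size=CHUNK):
    """Return the byte offsets of chunks that differ between the two paths.

    If one input is longer than the other, each extra chunk is reported
    as a mismatch as well.
    """
    bad = []
    pairs = itertools.zip_longest(
        chunk_digests(path_a, chunk_size),
        chunk_digests(path_b, chunk_size),
        fillvalue=(None, None),
    )
    for (off_a, dig_a), (off_b, dig_b) in pairs:
        if dig_a != dig_b:
            # One side may be exhausted; use whichever offset is known.
            bad.append(off_a if off_a is not None else off_b)
    return bad
```

An empty list from `mismatched_offsets(image_a, image_b)` would suggest the two copies agree chunk for chunk; any offsets returned point at regions where a read could give different results depending on which disk serves it, which is the situation in which, as Ted puts it, all bets with e2fsck are off.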

* Re: corrupted ext4 1000gb filesystem (2.6.32, debian stable)
From: Luke Kenneth Casson Leighton @ 2011-08-02 14:54 UTC
To: Ted Ts'o, Luke Kenneth Casson Leighton, linux-kernel

On Tue, Aug 2, 2011 at 3:27 PM, Ted Ts'o <tytso@mit.edu> wrote:

> *However*, if you have a RAID1 setup where the data on the disks are
> inconsistent, this can be the cause of much mischief. Depending on
> which disk you read from the mirror, you might get different results.
> Once that's the case, all bets with e2fsck are off.

ok - that would explain why the problem has appeared to go away when i reduced the RAID1 to a single drive. i'm presently waiting for the two 1.5Tb drives to sync up (estimated another 24 hrs)

> Finally, I strongly recommend that when you are doing this kind of
> repair work, you save a copy of everything you do using a
> program like "script". A transcript of the e2fsck output can be
> critically useful. Reviewing the transcript can also be useful in
> identifying mistakes that you might have made during the recovery
> process.

ok i'll bear that in mind.

> Regards,
>
> - Ted
>
> P.S. Note that if you are running e2fsck, and you haven't mounted
> the disk yet, if you are seeing failures after a second run of e2fsck,
> then it obviously can't be a failing in the ext4 kernel code.

ok, thank you ted. based on that, i'll keep doing reports (and make sure i use typescript).

l.
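
The transcript advice above refers to the interactive `script` program. For unattended runs (for example from a drive-assembly script), something along the following lines could keep a comparable record. This is a hedged Python sketch, not the method used in the thread: `run_logged` is a hypothetical helper name and the log format is an arbitrary choice. It appends the command line, the combined stdout and stderr, and the exit status to a log file, and returns the exit status so the caller can still act on it.

```python
import datetime
import subprocess


def run_logged(cmd, log_path):
    """Run cmd (a list of strings), append a transcript to log_path,
    and return the command's exit status."""
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        text=True,
        errors="replace",  # do not fail on undecodable output bytes
    )
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(f"=== {started} $ {' '.join(cmd)}\n")
        log.write(result.stdout)
        log.write(f"=== exit status {result.returncode}\n")
    return result.returncode
```

A call might look like `run_logged(["e2fsck", "-n", "/dev/md0"], "recovery.log")`, where `/dev/md0` and `recovery.log` are placeholders rather than names from the thread, and `-n` keeps the check read-only. Because the log is opened in append mode, consecutive runs end up in one file, which makes it easier to compare a first and a second e2fsck pass.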