* ext4: media error but where?
From: Pavel Machek @ 2014-07-04 10:23 UTC
To: Theodore Ts'o, kernel list, adilger.kernel, linux-ext4

Hi!

(Note that this drive is in a thinkpad x60, and has never met olpc nor
had any problems.)

pavel@duo:~$ uname -a
Linux duo 3.15.0-rc8+ #365 SMP Mon Jun 9 09:18:29 CEST 2014 i686
GNU/Linux

EXT4-fs (sda3): error count: 11
EXT4-fs (sda3): initial error at 1401714179: ext4_mb_generate_buddy:756
EXT4-fs (sda3): last error at 1401714179: ext4_reserve_inode_write:4877

That sounds like a media error to me? But there's nothing in smart:

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

I rebooted (into 3.14), and fsck claims the filesystem is marked as
clean...? I did fsck -f, no problems.

Heh, now fsck -cf runs, and I got the same kernel messages. fsck says:
"updating bad block inode", but it does not say how many badblocks it
found (if any). At the end it says "filesystem was modified" and
"reboot linux", so I assume it found something?

OTOH dumpe2fs -b /dev/sda3 does not report anything.

What is going on there?

Best regards,
	Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: ext4: media error but where?
From: Theodore Ts'o @ 2014-07-04 12:11 UTC
To: Pavel Machek; +Cc: kernel list, adilger.kernel, linux-ext4

On Fri, Jul 04, 2014 at 12:23:07PM +0200, Pavel Machek wrote:
>
> pavel@duo:~$ uname -a
> Linux duo 3.15.0-rc8+ #365 SMP Mon Jun 9 09:18:29 CEST 2014 i686
> GNU/Linux
>
> EXT4-fs (sda3): error count: 11
> EXT4-fs (sda3): initial error at 1401714179: ext4_mb_generate_buddy:756
> EXT4-fs (sda3): last error at 1401714179: ext4_reserve_inode_write:4877
>
> That sounds like media error to me?

If you search your system logs since the last fsck, you should find 11
instances of the "EXT4-fs error" message, which means that some file
system inconsistencies were detected. The first error was detected at:

% date -d @1401714179
Mon Jun  2 09:02:59 EDT 2014

... which means that you haven't rebooted in a month, or your boot
scripts aren't automatically running fsck, or your clock is
incorrect.

The first inconsistency was detected in the function
ext4_mb_generate_buddy(), at line 756. This means there's an
inconsistency between the number of blocks marked as in use in a block
allocation bitmap and the summary statistics in the block group
descriptor.

This can be caused by a hardware hiccup, or some kind of kernel bug.
People have been reporting an increased incidence rate of this bug
since 3.15, so it's something we're trying to track down.

There have been some reports of eMMC bugs in 3.15 (see one such report
at: https://lkml.org/lkml/2014/6/12/19). But other people are
reporting this on SSD's such as the Samsung 840 PRO, which is a SATA
attached device. (See some of the messages on ext4 with the subject
line: "ext4: journal has aborted".)

At this point I suspect we have multiple causes that result in the
same symptom that have all appeared at about the same time, which has
made tracking down the root cause(s) very difficult.

It does seem to happen more often after an unclean shutdown, and there
does seem to be a very high correlation with eMMC devices. It's
possible there is a jbd2 bug that got introduced recently, where ext4
is modifying some field outside of a journal transaction. But I
haven't been able to reproduce this yet in controlled circumstances.

What I need from people reporting problems:

* What is the HDD/SSD/eMMC device involved
* What kernel version were you running
* What distribution are you running (more so I know what the init
  scripts might or might not have been doing vis-a-vis running fsck
  after a crash)
* Was there an unclean shutdown / power drop / hard reset involved?
  If so, did the HDD/SSD/eMMC lose power, or was the reset button hit
  on the machine?
* What sort of workload / application / test program was running
  before the crash, if any?

I really need all of this information, especially since at this point
I suspect there may be more than one cause with similar symptoms. So
it's important that folks not assume, just because someone else has
reported a similar symptom with one set of hardware / software
details, that it's the same problem as theirs and that they don't need
to report any more info. I need as many data points as possible at
this point.

					- Ted
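The timestamp decoding described in the message above can be sketched as a small shell helper. This is only an illustration: it assumes GNU date (for "-d @SECONDS") and the log format quoted in this thread, and the function name ext4_error_time is made up for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of e2fsprogs or the kernel): pull the epoch
# timestamp out of an "EXT4-fs ... error at <seconds>:" log line and print
# it as a UTC date. Assumes GNU date, which accepts "-d @SECONDS".
ext4_error_time() {
    # Keep only the digits between "error at " and the following colon.
    secs=$(printf '%s\n' "$1" | sed -n 's/.*error at \([0-9][0-9]*\):.*/\1/p')
    [ -n "$secs" ] || return 1
    date -u -d "@$secs" '+%Y-%m-%d %H:%M:%S UTC'
}

ext4_error_time 'EXT4-fs (sda3): initial error at 1401714179: ext4_mb_generate_buddy:756'
```

For the line above this prints the date in UTC, four hours ahead of the EDT time shown in the message. A line without an "error at" timestamp makes the helper return a non-zero status instead of printing anything.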
* Re: ext4: media error but where?
From: Pavel Machek @ 2014-07-04 17:21 UTC
To: Theodore Ts'o, kernel list, adilger.kernel, linux-ext4

Hi!

> > pavel@duo:~$ uname -a
> > Linux duo 3.15.0-rc8+ #365 SMP Mon Jun 9 09:18:29 CEST 2014 i686
> > GNU/Linux
> >
> > EXT4-fs (sda3): error count: 11
> > EXT4-fs (sda3): initial error at 1401714179: ext4_mb_generate_buddy:756
> > EXT4-fs (sda3): last error at 1401714179: ext4_reserve_inode_write:4877
> >
> > That sounds like media error to me?
>
> If you search your system logs since the last fsck, you should find 11
> instances of "EXT4-fs error" message, which means that there was some
> file system inconsistencies detected. The first error was detected at:
>
> % date -d @1401714179
> Mon Jun  2 09:02:59 EDT 2014

Interesting. I always assumed 140... was a block number.

> ... which means that you haven't rebooted in a month, or your boot
> scripts aren't automatically running fsck, or your clock is
> incorrect.

I suspect something is wrong with the reporting. I got this in the
kernel log _while running fsck_. fsck was clean (take a look at the
original email). I got a weird report with fsck -c: it told me the
filesystem was modified, but I don't think I got bad blocks there.

I believe my scripts are running fsck automatically, and yes, I
rebooted a lot in the last month.

It _may_ be possible that last month this x60 had a different hard
drive, and I copied it bit-by-bit.

> It does seem to happen more often after an unclean shutdown, and there
> does seem to be a very high correlation with eMMC devices. It's
> possible there is a jbd2 bug that got introduced recently, where ext4
> is modifying some field outside of a journal transaction. But I
> haven't been able to reproduce this yet in controlled circumstances.
>
> What I need from people reporting problems:
>
> * What is the HDD/SSD/eMMC device involved

SATA hdd, will get you exact data.

> * What kernel version were you running

For last month? Various, 3.10 to 3.16-rc, mostly 3.15+.

> * What distribution are you running (more so I know what the init
>   scripts might or might not have been doing vis-a-vis running fsck
>   after a crash)

Debian 6.

> * Was there an unclean shutdown / power drop / hard reset involved?
>   If so, did the HDD/SSD/eMMC lose power, or was the reset button hit
>   on the machine?

Crash in last month? Probably yes.

> * What sort of workload / application / test program running before
>   the crash, if any?

Just usual desktop / kernel development.

> and so they don't need to report anymore info. I need as many data
> points as possible at this point.

You'll get them.

Is it possible that my fsck is so old it does not clear this "filesystem
had error in past" flag? Because I strongly suspect I'll boot into
init=/bin/bash, run fsck, it will tell me "all clean", and the messages
will repeat in the middle of the fsck run.

Best regards,
	Pavel
* Re: ext4: media error but where?
From: Pavel Machek @ 2014-07-04 18:06 UTC
To: Theodore Ts'o, kernel list, adilger.kernel, linux-ext4

Hi!

> > What I need from people reporting problems:
> >
> > * What is the HDD/SSD/eMMC device involved
>
> SATA hdd, will get you exact data.

Hitachi HTS545050A7E380, got it from a ps/3 on April 25, 2014, never
had problems according to smart.

> > * What kernel version were you running
>
> For last month? Various, 3.10 to 3.16-rc, mostly 3.15+.
>
> > * What distribution are you running (more so I know what the init
> >   scripts might or might not have been doing vis-a-vis running fsck
> >   after a crash)
>
> Debian 6.

6.0.9

> Is it possible that my fsck is so old it does not clear this "filesystem
> had error in past" flag? Because I strongly suspect I'll boot into
> init=/bin/bash, run fsck, it will tell me "all clean", and the messages
> will repeat in the middle of fsck run.

And indeed:

..init=/bin/bash
# fsck /dev/sda3
e2fsck 1.41.12
rootfs: clean
# date +%s
1404496...
#
EXT4-fs (sda3): error count: 11
EXT4-fs (sda3): initial error at 1401741...
...

(hand copied)

Thanks,
	Pavel
* Re: ext4: media error but where?
From: Theodore Ts'o @ 2014-07-04 18:56 UTC
To: Pavel Machek; +Cc: kernel list, adilger.kernel, linux-ext4

On Fri, Jul 04, 2014 at 07:21:04PM +0200, Pavel Machek wrote:
>
> Is it possible that my fsck is so old it does not clear this "filesystem
> had error in past" flag? Because I strongly suspect I'll boot into
> init=/bin/bash, run fsck, it will tell me "all clean", and the messages
> will repeat in the middle of fsck run.

Yes, that's what's going on. E2fsprogs v1.41.12 does not have the
code to clear those fields in the superblock; that code was added in
v1.41.13.

(There have also been a ****huge**** number of bug fixes since May
2010, which is when 1.41.12 was released, so I'd strongly suggest that
you upgrade to a newer version of e2fsprogs. In particular, DON'T try
resizing an ext4 file system, either on-line or off-line, with a
version of e2fsprogs that ancient; there is a very good chance you
will badly corrupt the file system.)

Cheers,

					- Ted
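The version threshold in the message above (error fields are cleared from v1.41.13 on) could be checked from a script along these lines. This is a rough sketch, not something from e2fsprogs: the function name is invented, the version string is passed in by hand (for example copied from the "e2fsck 1.41.12" banner), and it assumes GNU sort with the -V option.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given e2fsprogs version is at least
# 1.41.13, the release named above as the first one that clears the
# superblock error fields. Assumes GNU sort, whose -V orders version strings.
e2fsprogs_clears_error_info() {
    min=1.41.13
    # If the minimum sorts first (or the two are equal), the version is new enough.
    lowest=$(printf '%s\n%s\n' "$min" "$1" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

for v in 1.41.12 1.41.13 1.42.9; do
    if e2fsprogs_clears_error_info "$v"; then
        echo "$v: new enough"
    else
        echo "$v: too old, upgrade first"
    fi
done
```

Taking the version as an argument keeps the sketch independent of how a particular e2fsck build prints its banner.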
* Re: ext4: media error but where?
From: Pavel Machek @ 2014-07-06 13:32 UTC
To: Theodore Ts'o, kernel list, adilger.kernel, linux-ext4

On Fri 2014-07-04 14:56:26, Theodore Ts'o wrote:
> On Fri, Jul 04, 2014 at 07:21:04PM +0200, Pavel Machek wrote:
> >
> > Is it possible that my fsck is so old it does not clear this "filesystem
> > had error in past" flag? Because I strongly suspect I'll boot into
> > init=/bin/bash, run fsck, it will tell me "all clean", and the messages
> > will repeat in the middle of fsck run.
>
> Yes, that's what's going on. E2fsprogs v1.41.12 does not have the
> code to clear those fields in the superblock; that code was added in
> v1.41.13.
>
> (There have also been a ****huge**** number of bug fixes since May
> 2010, which is when 1.41.12 was released, so I'd strongly suggest that
> you upgrade to a newer version of e2fsprogs. In particular DON'T try
> resizing an an ext4 file system, either on-line or off-line with a
> version of e2fsprogs that ancient; there is a very good chance you
> will badly corrupt the file system.)

Ok, I have compiled fsck from git; it calls itself 1.43-WIP
(18-May-2014).

If I run it on my /dev/sda3, it still calls it clean and quits (even
though it should still have the "filesystem had error in past" flag).

I ran it with -f, and it said all clean. It did not mention modifying
the filesystem.

Now I'm running fsck.new -cf. I don't think this filesystem has any
bad blocks. Still, it says "rootfs: Updating bad block inode."
... "FILE SYSTEM WAS MODIFIED", "REBOOT LINUX".

While looking at e2fsck sources:

	sprintf(buf, "badblocks -b %d -X %s%s%s %llu", fs->blocksize,
		(ctx->options & E2F_OPT_PREEN) ? "" : "-s ",
		(ctx->options & E2F_OPT_WRITECHECK) ? "-n " : "",
		fs->device_name, ext2fs_blocks_count(fs->super)-1);
	f = popen(buf, "r");

...is that really a good idea? I think it will do the wrong thing in a
(crazy) setup such as this one, or in any setup with a space in the
filename:

root@duo:/dev# ls -al | grep echo
brw-rw----  1 root disk      8,   3 Jul  6 14:56 `echo ownered`
root@duo:/dev# /usr/local/bin/
e2fsck.new  unrar2
root@duo:/dev# /usr/local/bin/e2fsck.new '`echo ownered`'
e2fsck 1.43-WIP (18-May-2014)
`echo ownered` is mounted.
e2fsck: Cannot continue, aborting.

Best regards,
	Pavel
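The concern raised above is that a device name gets pasted into a command string which a shell then parses. The sketch below shows that effect in plain sh rather than in e2fsck's C code, using the file name from the session above; it is an illustration of the quoting problem only, not a proposed fix, and the two function names are invented.

```shell
#!/bin/sh
# Illustration only; not e2fsck code. The device name is the one from the
# session above: a file name that contains a backquoted command.
dev='`echo ownered`'

# Unsafe: the name is pasted into a command string that a shell then parses,
# much like a string handed to popen(). The backquotes get executed.
unsafe_show() {
    sh -c "printf '%s\n' $1"
}

# Safer: the command text is fixed and the name travels as a separate
# argument, so the inner shell never parses it as code.
safe_show() {
    sh -c 'printf "%s\n" "$1"' sh "$1"
}

unsafe_show "$dev"
safe_show "$dev"
```

The first call prints the output of the embedded command, the second prints the name unchanged. In C the analogous approach would be to skip the shell altogether, for instance fork() followed by execvp() with an argument vector, so that the device name is never interpreted.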
* Re: ext4: media error but where?
From: Pavel Machek @ 2014-07-06 13:43 UTC
To: Theodore Ts'o, kernel list, adilger.kernel, linux-ext4

Hi!

> Now I'm running fsck.new -cf. I don't think this filesystem has any
> bad blocks. Still, it says "rootfs: Updating bad block inode."
> ... "FILE SYSTEM WAS MODIFIED", "REBOOT LINUX".

And here's a patch to fix this ugliness. Unfortunately, it makes
e2fsck always read the bad blocks inode... but perhaps that is a good
idea, as we are then able to print before/after bad block counts...?

Signed-off-by: Pavel Machek <pavel@ucw.cz>

Thanks,
	Pavel

diff --git a/e2fsck/badblocks.c b/e2fsck/badblocks.c
index 7f3641b..32e08bf 100644
--- a/e2fsck/badblocks.c
+++ b/e2fsck/badblocks.c
@@ -30,6 +30,7 @@ void read_bad_blocks_file(e2fsck_t ctx, const char *bad_blocks_file,
 	ext2_filsys fs = ctx->fs;
 	errcode_t	retval;
 	badblocks_list	bb_list = 0;
+	int old_bb_count = -1;
 	FILE		*f;
 	char		buf[1024];
 
@@ -51,14 +52,16 @@ void read_bad_blocks_file(e2fsck_t ctx, const char *bad_blocks_file,
 	 * If we're appending to the bad blocks inode, read in the
 	 * current bad blocks.
 	 */
-	if (!replace_bad_blocks) {
-		retval = ext2fs_read_bb_inode(fs, &bb_list);
-		if (retval) {
-			com_err("ext2fs_read_bb_inode", retval, "%s",
-				_("while reading the bad blocks inode"));
-			goto fatal;
-		}
+	retval = ext2fs_read_bb_inode(fs, &bb_list);
+	if (retval) {
+		com_err("ext2fs_read_bb_inode", retval, "%s",
+			_("while reading the bad blocks inode"));
+		goto fatal;
 	}
+	old_bb_count = ext2fs_u32_list_count(bb_list);
+	printf("%s: Currently %d bad blocks.\n", ctx->device_name, old_bb_count);
+	if (replace_bad_blocks)
+		bb_list = 0;
 
 	/*
 	 * Now read in the bad blocks from the file; if
@@ -95,10 +98,16 @@ void read_bad_blocks_file(e2fsck_t ctx, const char *bad_blocks_file,
 		goto fatal;
 	}
 
+	if ((ext2fs_u32_list_count(bb_list) == 0) &&
+	    ((!replace_bad_blocks) || (!old_bb_count))) {
+		printf("%s: No bad blocks found, no update needed.\n", ctx->device_name);
+		return;
+	}
+
 	/*
 	 * Finally, update the bad blocks from the bad_block_map
 	 */
-	printf("%s: Updating bad block inode.\n", ctx->device_name);
+	printf("%s: Updating bad block inode (%d bad blocks).\n", ctx->device_name, ext2fs_u32_list_count(bb_list));
 	retval = ext2fs_update_bb_inode(fs, bb_list);
 	if (retval) {
 		com_err("ext2fs_update_bb_inode", retval, "%s",
* Re: ext4: media error but where?
From: Theodore Ts'o @ 2014-07-06 18:29 UTC
To: Pavel Machek; +Cc: kernel list, adilger.kernel, linux-ext4

On Sun, Jul 06, 2014 at 03:43:25PM +0200, Pavel Machek wrote:
> Hi!
>
> > Now I'm running fsck.new -cf. I don't think this filesystem has any
> > bad blocks. Still, it says "rootfs: Updating bad block inode."
> > ... "FILE SYSTEM WAS MODIFIED", "REBOOT LINUX".
>
> And here's patch to fix this uglyness. Unfortunately, it makes it read
> the inode... but perhaps it is good idea as we are able to print
> before/after bad block counts...?
>
> Signed-off-by: Pavel Machek <pavel@ucw.cz>

Thanks, I'll take a look at these patches. Honestly, I've been half
tempted to remove the e2fsck -c option entirely. 99.9% of the time,
with modern disks, which have bad block remapping, it doesn't do any
good, and often, it's harmful.

In general, e2fsck -c is not something I recommend people use. If you
want to use badblocks by itself to see if there are any blocks that
are suffering read problems, that's fine, but if there are, in general
the safest thing to do is to mount the disk read-only, back it up, and
then either (a) reformat and see if you can restore onto it from
backups w/o any further errors, or (b) just trash the disk and get a
new one, since in general the contents are way more valuable than the
disk itself. Certainly if, after trying (a), you get any further
errors, (b) is definitely the way to go.

					- Ted
* Re: ext4: media error but where?
From: Pavel Machek @ 2014-07-06 21:37 UTC
To: Theodore Ts'o, kernel list, adilger.kernel, linux-ext4

Hi!

> > > Now I'm running fsck.new -cf. I don't think this filesystem has any
> > > bad blocks. Still, it says "rootfs: Updating bad block inode."
> > > ... "FILE SYSTEM WAS MODIFIED", "REBOOT LINUX".
> >
> > And here's patch to fix this uglyness. Unfortunately, it makes it read
> > the inode... but perhaps it is good idea as we are able to print
> > before/after bad block counts...?
> >
> > Signed-off-by: Pavel Machek <pavel@ucw.cz>
>
> Thanks, I'll take a look at these patches. Honestly, I've been half
> tempted to remove the e2fsck -c option entirely. 99.9% of the time,
> with modern disks, which has bad block remapping, it doesn't do any
> good, and often, it's harmful.

Well, when I got report about hw problems, badblocks -c was my first
instinct. On the usb hdd, most of the errors were due to a 3.16-rc1
kernel bug, not real problems.

> In general, e2fsck -c is not something I recommend people use. If you
> want to use badblocks by itself to see if there are any blocks that
> are suffering read problems, that's fine, but if there is, in
> general

Actually, badblocks is really tricky to use, I'd not trust myself to
get the parameters right.

> the safest thing to do is to mount the disk read-only, back it up, and
> then either (a) reformat and see if you can restore onto it with
> backups w/o any further errors, or (b) just trash the disk, and get a
> new one, since in general the contents are way more valuable than the
> disk itself. Certainly after trying (a), you get any further errors,
> (b) is defintely the way to go.

Well, a 500GB disk takes a while to back up, plus you need the
space. a) will take a few hours...

And sometimes, the data are much less valuable than the HDD. I do have
2 copies of the data I care about, using unison to keep them in sync,
and I plan to add a 3rd, encrypted copy on the Seagate Momentus 5400.6
series drive that failed (a).

It seems that Seagate just got their firmware wrong; while in the
thinkpad, the drive worked very much ok, with the exception of a few
sectors that could not be remapped. Now, a USB envelope seems to be a
much harsher environment for a HDD, and it has a few more bad sectors
now, but that's somewhat expected. I was not treating the hdd as if it
had valuable data.

So... please keep fsck -c :-).

[Actually, the badblocks documentation leaves something to be desired.
Is ^C safe w.r.t. badblocks -n? Is hard poweroff safe?]

Thanks,
	Pavel

(Actually it looks like I forgot to free the badlist. Incremental patch:)

diff --git a/e2fsck/badblocks.c b/e2fsck/badblocks.c
index 32e08bf..7ae7a61 100644
--- a/e2fsck/badblocks.c
+++ b/e2fsck/badblocks.c
@@ -60,8 +60,10 @@ void read_bad_blocks_file(e2fsck_t ctx, const char *bad_blocks_file,
 	}
 	old_bb_count = ext2fs_u32_list_count(bb_list);
 	printf("%s: Currently %d bad blocks.\n", ctx->device_name, old_bb_count);
-	if (replace_bad_blocks)
+	if (replace_bad_blocks) {
+		ext2fs_badblocks_list_free(bb_list);
 		bb_list = 0;
+	}
 
 	/*
 	 * Now read in the bad blocks from the file; if
* Re: ext4: media error but where?
From: Theodore Ts'o @ 2014-07-07 1:00 UTC
To: Pavel Machek; +Cc: kernel list, adilger.kernel, linux-ext4

On Sun, Jul 06, 2014 at 11:37:11PM +0200, Pavel Machek wrote:
>
> Well, when I got report about hw problems, badblocks -c was my first
> instinct. On the usb hdd, the most errors were due to 3.16-rc1 kernel
> bug, not real problems.

The problem is that with modern disk drives, this is a *wrong*
instinct. That's my point. In general, trying to mess with the bad
blocks list in the ext2/3/4 file system is just not the right thing to
do with modern disk drives. That's because with modern disk drives,
the hard drives will do bad block remapping.

Basically, with modern disks, if the HDD has a hard ECC error, it will
return an error --- but if you write to the sector, it will either
rewrite onto that location on the platter, or if that part of the
platter is truly gone, it will remap to the bad block spare pool. So
telling the file system to never use that block again isn't going to
be the right answer.

The badblocks approach to dealing with hardware problems made sense
back when we had IDE disks. But that was over a decade ago. These
days, it's horribly obsolete.

					- Ted
* Re: ext4: media error but where?
From: Pavel Machek @ 2014-07-07 18:55 UTC
To: Theodore Ts'o, kernel list, adilger.kernel, linux-ext4

On Sun 2014-07-06 21:00:02, Theodore Ts'o wrote:
> On Sun, Jul 06, 2014 at 11:37:11PM +0200, Pavel Machek wrote:
> >
> > Well, when I got report about hw problems, badblocks -c was my first
> > instinct. On the usb hdd, the most errors were due to 3.16-rc1 kernel
> > bug, not real problems.
>
> The problem is with modern disk drives, this is a *wrong* instinct.
> That's my point. In general, trying to mess with the bad blocks list
> in the ext2/3/4 file system is just not the right thing to do with
> modern disk drives. That's because with modern disk drives, the hard
> drives will do bad block remapping.

Actually... I believe it was the right instinct.

If I wanted to recover the data... remount-r would be the way to
go. Then back it up using dd_rescue. ... But that way I'd turn bad
sectors into silent data corruption.

If I wanted to recover data from that partition, fsck -c (or
badblocks, but that's trickier) and then dd_rescue would be the way to
go.

> Basically, with modern disks, if the HDD has a hard ECC error, it will
> return an error --- but if you write to the sector, it will either
> rewrite onto that location on the platter, or if that part of the
> platter is truly gone, it will remap to the bad block spare pool. So
> telling the disk to never use that block again isn't going to be the
> right answer.

Actually -- a tool to do relocations would be nice. It is not exactly
easy to do it right by hand.

I know the theory. I had 5 read-error incidents this year.

#1: Seagate refuses to reallocate sectors. Not sure why, I tried
pretty much everything.

#2: 3.16-rc1 produces incorrect errors every 4GB, leading to "bad
sectors" that disappear with other kernels.

#3: Some more bad sectors appear on the Seagate.

#4: Kernel on thinkpad reports errors in daily check. Which is strange
because there's nothing in SMART.

#5: Some old IDE hdd has bad sectors in unused or unimportant areas.

In #5 the theory might match the reality (I did not check, I trashed
the disks).

> The badblocks approach to dealing with hardware problems made sense
> back when we had IDE disks. But that's been over a decade ago. These
> days, it's horribly obsolete.

Forcing reallocation is hard & tricky. You may want to simply mark the
block bad and lose a tiny bit of disk space... And even if you want to
force reallocation, you want to do fsck -c first, and restore the
affected files from backup.

	Pavel
* 3.16-rc, ext4: oopses, OOMs after hard powerdown
From: Pavel Machek @ 2014-07-07 23:18 UTC
To: Theodore Ts'o, kernel list, adilger.kernel, linux-ext4

[-- Attachment #1: Type: text/plain, Size: 1482 bytes --]

Hi!

With 3.16-rc3, I did a deliberate powerdown by holding down the power
key (not a clean shutdown). On the next boot, I got some scary
messages about data corruption, "filesystem has errors, check forced",
"reboot linux". Unfortunately, the reboot made the scary messages
disappear forever (I tried ^S, was not fast enough).

But it seems I have more of the bad stuff coming:

Mounting local filesystems threw an oops and then mount was killed due
to out-of-memory. I lost the sda2 (or /data) filesystem.

Then both sda3 (root) and sda2 gave me errors. But there's no disk
error either in smart or in syslog.

Jul  8 01:03:18 duo kernel: EXT4-fs (sda3): error count: 2
Jul  8 01:03:18 duo kernel: EXT4-fs (sda3): initial error at 1404773782: ext4_mb_generate_buddy:757
Jul  8 01:03:18 duo kernel: EXT4-fs (sda3): last error at 1404773782: ext4_mb_generate_buddy:757
Jul  8 01:05:44 duo kernel: EXT4-fs (sda2): error count: 12
Jul  8 01:05:44 duo kernel: EXT4-fs (sda2): initial error at 1404773906: ext4_mb_generate_buddy:757
Jul  8 01:05:44 duo kernel: EXT4-fs (sda2): last error at 1404774058: ext4_journal_check_start:56

(Thinkpad x60 with Hitachi HTS... SATA disk).

I attach the complete syslog from the boot up... it should have
everything relevant.

I'm running fsck -f on sda3 now. I'd like to repair sda2 tomorrow.

Best regards,
	Pavel

[-- Attachment #2: delme.gz --]
[-- Type: application/octet-stream, Size: 23593 bytes --]
* Re: ext4: media error but where?
From: Theodore Ts'o @ 2014-07-07 23:21 UTC
To: Pavel Machek; +Cc: kernel list, adilger.kernel, linux-ext4

On Mon, Jul 07, 2014 at 08:55:43PM +0200, Pavel Machek wrote:
> If I wanted to recover the data... remount-r would be the way to
> go. Then back it up using dd_rescue. ... But that way I'd turn bad
> sectors into silent data corruption.
>
> If I wanted to recover data from that partition, fsck -c (or
> badblocks, but that's trickier) and then dd_rescue would be the way to go.

Ah, if that's what you're worried about, just do the following:

	badblocks -b 4096 -o /tmp/badblocks.sdXX /dev/sdXX
	debugfs -R "icheck $(cat /tmp/badblocks.sdXX)" /dev/sdXX > /tmp/bad-inodes
	debugfs -R "ncheck $(sed -e 1d /tmp/bad-inodes | awk '{print $2}' | sort -nu)" /dev/sdXX > /tmp/bad-files

This will give you a list of the files that contain blocks that had
I/O errors. So now you know which files have contents which have
probably been corrupted. No more silent data corruption. :-)

> Actually -- tool to do relocations would be nice. It is not exactly
> easy to do it right by hand.

It's not *that* hard. All you really need to do is:

	for i in $(cat /tmp/badblocks.sdXX) ; do
		dd if=/dev/zero of=/dev/sdXX bs=4k seek=$i count=1
	done
	e2fsck -f /dev/sdXX

For bonus points, you could write a C program which tries to read the
block one final time before doing the forced write of all zeros.

It's a bit harder if you are trying to interpret the device-driver
dependent error messages, and translate the absolute sector number
into a partition-relative block number. (Except sometimes, depending
on the block device, the number which is given is either a relative
sector number, or a relative block number.)

For disks that do bad block remapping, an even simpler thing to do is
to just delete the corrupted files. When the blocks get reallocated
for some other purpose, the HDD should automatically remap the block
on write, and if the write fails, such that you are getting an I/O
error on the write, it's time to replace the disk.

> Forcing reallocation is hard & tricky. You may want to simply mark it
> bad and lose a tiny bit of disk space... And even if you want to force
> reallocation, you want to do fsck -c, first, and restore affected
> files from backup.

Trying to force reallocation isn't that hard, so long as you have
resigned yourself to having lost the data in the blocks in question.
And if it doesn't work, for whatever reason, I would simply not trust
the disk any longer.

For me at least, it's all about the value of the disk versus the value
of my time and the data on the disk. When I take my hourly rate into
account ($annual comp divided by 2000), the value of trying to save a
particular hard drive almost never works out in my favor.

So these days, my bias is to do what I can to save the data, but to
not fool around with trying to play fancy games with e2fsck -c. I'll
just want to save what I can, and hopefully, with regular backups,
that won't require heroic measures, and then trash and replace the
HDD.

Cheers,

					- Ted

P.S. I'm not sure why you consider running badblocks to be tricky.
The only thing you need to be careful about is passing the file system
blocksize to badblocks. And since the block size is almost always 4k
for any non-trivial file system, all you really need to do is
"badblocks -b 4096". Or, if you really like:

	badblocks -b $(dumpe2fs -h /dev/sdXX | awk -F: '/^Block size: / {print $2}') /dev/sdXX

See? Easy peasy! :-)
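The translation from an absolute sector number to a partition-relative block number, mentioned in the message above, comes down to one line of arithmetic. The sketch below assumes the reported number really is an absolute sector in 512-byte units and that the partition's first sector is known (on Linux it can be read from /sys/block/DISK/PARTITION/start); the function name and the sample numbers are made up.

```shell
#!/bin/sh
# Hypothetical helper: convert an absolute 512-byte sector number (as some
# drivers report it) into a partition-relative file system block number.
#   $1 = absolute sector
#   $2 = first sector of the partition
#   $3 = file system block size in bytes (for example 4096)
sector_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# Made-up numbers, purely for illustration: 8 sectors into the partition
# is 4096 bytes, which is block 1 with a 4k block size.
sector_to_fs_block 1000008 1000000 4096
```

The result is only meaningful when the driver reports absolute sectors; as the message notes, some devices report a relative sector or block number instead, in which case this conversion would be wrong.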
* Re: ext4: media error but where?
From: Andreas Dilger @ 2014-07-04 19:17 UTC
To: Pavel Machek; +Cc: Theodore Ts'o, Ext4 Developers List

[-- Attachment #1: Type: text/plain, Size: 901 bytes --]

On Jul 4, 2014, at 11:21 AM, Pavel Machek <pavel@ucw.cz> wrote:
>>> pavel@duo:~$ uname -a
>>> Linux duo 3.15.0-rc8+ #365 SMP Mon Jun 9 09:18:29 CEST 2014 i686
>>> GNU/Linux
>>>
>>> EXT4-fs (sda3): error count: 11
>>> EXT4-fs (sda3): initial error at 1401714179: ext4_mb_generate_buddy:756
>>> EXT4-fs (sda3): last error at 1401714179: ext4_reserve_inode_write:4877
>>>
>>> That sounds like media error to me?
>>
>> If you search your system logs since the last fsck, you should find 11
>> instances of "EXT4-fs error" message, which means that there was some
>> file system inconsisntencies detected. The first error was detected at:
>>
>> % date -d @1401714179
>> Mon Jun  2 09:02:59 EDT 2014
>
> Interesting. I always assumed 140... was block number.

Maybe it is worthwhile to improve this error message to be:

EXT4-fs (sda3): initial error at time 1401714179: ...

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
* Re: ext4: media error but where?
  2014-07-04 19:17 ` Andreas Dilger
@ 2014-07-04 20:33 ` Pavel Machek
  2014-07-04 22:18 ` Andreas Dilger
  2014-07-05 22:17 ` Theodore Ts'o
  0 siblings, 2 replies; 17+ messages in thread
From: Pavel Machek @ 2014-07-04 20:33 UTC
To: Andreas Dilger; +Cc: Theodore Ts'o, Ext4 Developers List

On Fri 2014-07-04 13:17:37, Andreas Dilger wrote:
> On Jul 4, 2014, at 11:21 AM, Pavel Machek <pavel@ucw.cz> wrote:
> >>> pavel@duo:~$ uname -a
> >>> Linux duo 3.15.0-rc8+ #365 SMP Mon Jun 9 09:18:29 CEST 2014 i686
> >>> GNU/Linux
> >>>
> >>> EXT4-fs (sda3): error count: 11
> >>> EXT4-fs (sda3): initial error at 1401714179: ext4_mb_generate_buddy:756
> >>> EXT4-fs (sda3): last error at 1401714179: ext4_reserve_inode_write:4877
> >>>
> >>> That sounds like media error to me?
> >>
> >> If you search your system logs since the last fsck, you should find 11
> >> instances of "EXT4-fs error" message, which means that there were some
> >> file system inconsistencies detected. The first error was detected at:
> >>
> >> % date -d @1401714179
> >> Mon Jun 2 09:02:59 EDT 2014
> >
> > Interesting. I always assumed 140... was block number.
>
> Maybe it is worthwhile to improve this error message to be:
>
> EXT4-fs (sda3): initial error at time 1401714179: ...

I'm glad you suggested it. I had actually done a patch before reading
this. What about:

---

Make it clear that values printed are times, and that it is error
since last fsck. Also add note about fsck version required.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b9b9aab..3423947 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2809,10 +2809,11 @@ static void print_daily_error_info(unsigned long arg)
 	es = sbi->s_es;
 
 	if (es->s_error_count)
-		ext4_msg(sb, KERN_NOTICE, "error count: %u",
+		/* fsck newer than v1.41.13 is needed to clean this condition. */
+		ext4_msg(sb, KERN_NOTICE, "error count since last fsck: %u",
 			 le32_to_cpu(es->s_error_count));
 	if (es->s_first_error_time) {
-		printk(KERN_NOTICE "EXT4-fs (%s): initial error at %u: %.*s:%d",
+		printk(KERN_NOTICE "EXT4-fs (%s): initial error at time %u: %.*s:%d",
 		       sb->s_id, le32_to_cpu(es->s_first_error_time),
 		       (int) sizeof(es->s_first_error_func),
 		       es->s_first_error_func,
@@ -2826,7 +2827,7 @@ static void print_daily_error_info(unsigned long arg)
 		printk("\n");
 	}
 	if (es->s_last_error_time) {
-		printk(KERN_NOTICE "EXT4-fs (%s): last error at %u: %.*s:%d",
+		printk(KERN_NOTICE "EXT4-fs (%s): last error at time %u: %.*s:%d",
 		       sb->s_id, le32_to_cpu(es->s_last_error_time),
 		       (int) sizeof(es->s_last_error_func),
 		       es->s_last_error_func,

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
* Re: ext4: media error but where?
  2014-07-04 20:33 ` Pavel Machek
@ 2014-07-04 22:18 ` Andreas Dilger
  2014-07-05 22:17 ` Theodore Ts'o
  1 sibling, 0 replies; 17+ messages in thread
From: Andreas Dilger @ 2014-07-04 22:18 UTC
To: Pavel Machek; +Cc: Theodore Ts'o, Ext4 Developers List

On Jul 4, 2014, at 2:33 PM, Pavel Machek <pavel@ucw.cz> wrote:
> On Fri 2014-07-04 13:17:37, Andreas Dilger wrote:
>> Maybe it is worthwhile to improve this error message to be:
>>
>> EXT4-fs (sda3): initial error at time 1401714179: ...
>
> I'm glad you suggested it. I had actually done a patch before reading
> this. What about:

Looks good to me. There have been a few users confused by these
messages already, so it is nice to make them a bit more clear.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> ---
>
> Make it clear that values printed are times, and that it is error
> since last fsck. Also add note about fsck version required.
>
> Signed-off-by: Pavel Machek <pavel@ucw.cz>
>
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b9b9aab..3423947 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2809,10 +2809,11 @@ static void print_daily_error_info(unsigned long arg)
>  	es = sbi->s_es;
>  
>  	if (es->s_error_count)
> -		ext4_msg(sb, KERN_NOTICE, "error count: %u",
> +		/* fsck newer than v1.41.13 is needed to clean this condition. */
> +		ext4_msg(sb, KERN_NOTICE, "error count since last fsck: %u",
>  			 le32_to_cpu(es->s_error_count));
>  	if (es->s_first_error_time) {
> -		printk(KERN_NOTICE "EXT4-fs (%s): initial error at %u: %.*s:%d",
> +		printk(KERN_NOTICE "EXT4-fs (%s): initial error at time %u: %.*s:%d",
>  		       sb->s_id, le32_to_cpu(es->s_first_error_time),
>  		       (int) sizeof(es->s_first_error_func),
>  		       es->s_first_error_func,
> @@ -2826,7 +2827,7 @@ static void print_daily_error_info(unsigned long arg)
>  		printk("\n");
>  	}
>  	if (es->s_last_error_time) {
> -		printk(KERN_NOTICE "EXT4-fs (%s): last error at %u: %.*s:%d",
> +		printk(KERN_NOTICE "EXT4-fs (%s): last error at time %u: %.*s:%d",
>  		       sb->s_id, le32_to_cpu(es->s_last_error_time),
>  		       (int) sizeof(es->s_last_error_func),
>  		       es->s_last_error_func,
>
> --
> (english) http://www.livejournal.com/~pavelmachek
> (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Cheers, Andreas
* Re: ext4: media error but where?
  2014-07-04 20:33 ` Pavel Machek
  2014-07-04 22:18 ` Andreas Dilger
@ 2014-07-05 22:17 ` Theodore Ts'o
  1 sibling, 0 replies; 17+ messages in thread
From: Theodore Ts'o @ 2014-07-05 22:17 UTC
To: Pavel Machek; +Cc: Andreas Dilger, Ext4 Developers List

On Fri, Jul 04, 2014 at 10:33:09PM +0200, Pavel Machek wrote:
>
> Make it clear that values printed are times, and that it is error
> since last fsck. Also add note about fsck version required.
>
> Signed-off-by: Pavel Machek <pavel@ucw.cz>

Thanks, applied.

					- Ted
Thread overview: 17+ messages
[not found] <20140626202021.GA8512@xo-6d-61-c0.localdomain>
[not found] ` <20140626203052.GA9449@xo-6d-61-c0.localdomain>
[not found] ` <20140627024659.GF6826@thunk.org>
[not found] ` <20140629202516.GA11430@amd.pavel.ucw.cz>
[not found] ` <20140629210428.GD2162@thunk.org>
[not found] ` <20140630064644.GA23079@amd.pavel.ucw.cz>
[not found] ` <20140630134313.GA3753@thunk.org>
2014-07-04 10:23 ` ext4: media error but where? Pavel Machek
2014-07-04 12:11 ` Theodore Ts'o
2014-07-04 17:21 ` Pavel Machek
2014-07-04 18:06 ` Pavel Machek
2014-07-04 18:56 ` Theodore Ts'o
2014-07-06 13:32 ` Pavel Machek
2014-07-06 13:43 ` Pavel Machek
2014-07-06 18:29 ` Theodore Ts'o
2014-07-06 21:37 ` Pavel Machek
2014-07-07 1:00 ` Theodore Ts'o
2014-07-07 18:55 ` Pavel Machek
2014-07-07 23:18 ` 3.16-rc, ext4: oopses, OOMs after hard powerdown Pavel Machek
2014-07-07 23:21 ` ext4: media error but where? Theodore Ts'o
2014-07-04 19:17 ` Andreas Dilger
2014-07-04 20:33 ` Pavel Machek
2014-07-04 22:18 ` Andreas Dilger
2014-07-05 22:17 ` Theodore Ts'o