From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Markus <M4rkusXXL@web.de>
Cc: "Theodore Ts'o" <tytso@mit.edu>, linux-ext4 <linux-ext4@vger.kernel.org>
Subject: Re: Dirty ext4 blocks system startup
Date: Tue, 8 Apr 2014 12:18:34 -0700
Message-ID: <20140408191834.GA12092@birch.djwong.org>
In-Reply-To: <2164274.jmlex94sWc@web.de>
On Mon, Apr 07, 2014 at 04:06:50PM +0200, Markus wrote:
> Theodore Ts'o wrote on 07.04.2014:
> > On Mon, Apr 07, 2014 at 12:58:40PM +0200, Markus wrote:
> > >
> > > Finally e2image finished successfully. But the produced file is way too
> > > big for a mail.
:(
> > >
> > > Any other possibility?
> > > (e2image does dump everything except file data and free space. But the
> > > problem seems to be just in the bitmap and/or journal.)
Yes, it might be less work if I just turn on data=journal + metadata_csum +
journal_checksum and see if I can easily reproduce it myself. Or, I suppose it
wouldn't be too hard just to format a fresh FS and tweak the journal to
"replay" into arbitrary empty blocks, and then corrupt the journal checksums to
see what happens.
> > >
> > > Actually, when I look at the code around e2fsck/recovery.c:594
> > > The error is detected and continue is called.
> > > But tagp/tag is never changed, but the checksum is always compared to the
> > > one from tag. Intended?
I think you're right, but that function makes my eyes bleed. :(
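A minimal sketch of the loop problem described above, in C. This is a toy model under stated assumptions, not the actual e2fsck/recovery.c code: `toy_tag`, `toy_walk_tags`, `advance_on_error` and `max_iter` are hypothetical names made up for illustration. It only shows why a `continue` that bypasses the cursor advance keeps re-checking the same tag, and how jumping to the advance step avoids that.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a journal block tag (not the real jbd2 layout). */
struct toy_tag {
    uint32_t csum;  /* checksum recorded for the logged block */
    bool     last;  /* true on the final tag of the descriptor */
};

/*
 * Walk the tags and "replay" every block whose checksum matches.
 * actual[i] is the checksum computed from the block that tag i describes.
 * Returns the number of loop iterations; max_iter caps a walk that never
 * advances, so the stuck case can be observed instead of hanging.
 */
size_t toy_walk_tags(const struct toy_tag *tags, size_t ntags,
                     const uint32_t *actual, bool advance_on_error,
                     size_t max_iter, size_t *replayed)
{
    size_t i = 0;       /* plays the role of tagp */
    size_t iters = 0;

    *replayed = 0;
    while (i < ntags && iters < max_iter) {
        iters++;
        if (tags[i].csum != actual[i]) {
            if (!advance_on_error)
                continue;   /* cursor untouched: same tag is checked again */
            goto next_tag;  /* skip the replay but still advance */
        }
        (*replayed)++;      /* stands in for writing the block out */
next_tag:
        if (tags[i].last)
            break;
        i++;
    }
    return iters;
}
```

With three tags where the middle checksum does not match, calling this with `advance_on_error` set to false spins until `max_iter` is reached, while setting it to true finishes after three iterations with two blocks replayed.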
> > What mount options are you using? It appears that you have journal
> > checksums enabled, which isn't on by default, and unfortunately,
> > there's a good reason for that. The original code assumed that the
> > most common case for journal corruption would be caused by an
> > incomplete journal transaction getting written out if one were using
> > journal_async_commit. This feature has not been enabled by default
> > because the question of what to do when the journal gets corrupted in
> > other cases is not an easy one.
>
> Normally just "noatime,journal_checksum", but with the corrupted journal I use
> "ro,noload".
>
> "man mount" makes that "journal_checksum" option sound good ;)
>
>
> > If some part of a transaction which is not the very last transaction
> > in the journal gets corrupted, replaying it could do severe damage to
> > the file system. Unfortunately, simply deleting the journal and then
> > recreating it could do more damage as well. Most of the time, a
> > bad checksum happens because the last transaction hasn't fully made it
> > out to disk (especially if you use the journal_async_commit option,
> > which is a bit of a misnomer and has its own caveats[1]). But if the
> > checksum violation happens in a journal transaction that is not the
> > last transaction in the journal, right now the recovery code aborts,
> > because we don't have good automated logic to handle this case.
>
> The recovery does not seem to abort. It calls continue and is caught in an
> endless loop.
>
>
> > I suspect if you need to get your file system back on its feet, the
> > best thing to do is to create a patched e2fsck that doesn't abort when
> > it finds a checksum error, but instead continues. Then run it to
> > replay the journal, and then force a full file system check and hope
> > for the best.
>
> The code calls "continue". ;)
> So I just remove the whole if clause:
>  /* Look for block corruption */
>  if (!jbd2_block_tag_csum_verify(
>          journal, tag, obh->b_data,
>          be32_to_cpu(tmp->h_sequence))) {
> -        brelse(obh);
> -        success = -EIO;
>          printk(KERN_ERR "JBD: Invalid "
>                 "checksum recovering "
>                 "block %lld in log\n",
>                 blocknr);
> -        continue;
>  }
>
> It would then ignore the checksum and just issue a message. Right?
Umm... I think you just made it replay the corrupt block too. Granted, it
looks as though fsck made everything right anyway, so in this case nothing bad
happened.
> > What has been on my todo list to implement, but has been relatively
> > low priority because this is not a feature that we've documented or
> > encouraged people to use, is to have e2fsck skip the transaction that has a
> > bad checksum (i.e., not replay it at all), and then force a full file
> > system check. This is a bit safer, but if you make e2fsck ignore the
> > checksum, it's no worse than if journal checksums weren't enabled in
> > the first place.
> >
> > The long term thing that we need to add before we can really support
> > journal checksums is to checksum each individual data block, instead
> > of just each transaction. Then when we have a bad checksum, we can
> > skip just the one bad data block, and then force a full fsck.
I think the metadata-csum patchset added per-block checksums, but now that
we've brought it up, I think (IBM) pulled me off ext4 before I could get to
implementing a more sane strategy for replaying with bad checksums. I can git
blame that particular hunk on myself. :/
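For illustration, here is a rough sketch in C of the strategy Ted describes above: with a checksum per logged block, skip only the block that fails verification and flag the file system for a full check. Everything in it is an assumption made for the example: `toy_logged_block`, `toy_csum`, `toy_replay` and `TOY_BLOCK_SIZE` are hypothetical names, and the checksum is a placeholder rather than whatever jbd2 actually computes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK_SIZE 8    /* tiny block size, for illustration only */

/* Hypothetical journal entry: one logged block plus its own checksum. */
struct toy_logged_block {
    size_t   target;                /* destination block on the toy disk */
    uint8_t  data[TOY_BLOCK_SIZE];  /* block contents as logged */
    uint32_t csum;                  /* checksum recorded when it was logged */
};

/* Placeholder checksum (FNV-1a); an assumption, not the jbd2 checksum. */
static uint32_t toy_csum(const uint8_t *p, size_t n)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < n; i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/*
 * Replay every logged block whose checksum verifies. A bad block is
 * skipped (never copied to disk) and *need_fsck is set so the caller can
 * force a full check afterwards. Returns the number of skipped blocks.
 */
size_t toy_replay(uint8_t disk[][TOY_BLOCK_SIZE], size_t ndisk,
                  const struct toy_logged_block *log, size_t nlog,
                  bool *need_fsck)
{
    size_t skipped = 0;

    *need_fsck = false;
    for (size_t i = 0; i < nlog; i++) {
        if (log[i].target >= ndisk ||
            toy_csum(log[i].data, TOY_BLOCK_SIZE) != log[i].csum) {
            skipped++;
            *need_fsck = true;
            continue;   /* safe here: the for loop still advances i */
        }
        memcpy(disk[log[i].target], log[i].data, TOY_BLOCK_SIZE);
    }
    return skipped;
}
```

The point of the sketch is the contrast with simply dropping the check: a block that fails verification never reaches the disk, and the caller still learns that a full fsck is needed.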
Ugh, not documented in the on-disk format wiki page either. Well, I guess I'll
go update the wiki while I reread the code to figure out just what's going on
here. Sorry about that. Apparently the poweroff testing I did didn't catch
it. <groan>
--D
> > I'm sorry you ran into this. What I should do is to disable these
> > mount options for now, since users who stumble across them, as
> > apparently you have, might be tempted to use them, and then get into
> > trouble.
> >
> > - Ted
> >
> > [1] The issue with journal_async_commit is that it's possible (fairly
> > unlikely, but still possible) that the guarantees of data=ordered will
> > be violated. If the data blocks that were written out while we are
> > resolving a delayed allocation writeback haven't made it all the way
> > down to the platter, it's possible for all of the journal writes and
> > the commit block to be reordered ahead of the data blocks. In that
> > case, the checksum for the commit block would be valid, but some of
> > the data blocks might not have been written back to disk.
>
> Thanks so far,
> Markus