linux-ext4.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: TR Reardon <thomas_reardon@hotmail.com>
Cc: linux-ext4@vger.kernel.org
Subject: Re: journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*)
Date: Tue, 12 Aug 2014 10:08:26 -0700	[thread overview]
Message-ID: <20140812170826.GH2808@birch.djwong.org> (raw)
In-Reply-To: <BLU436-SMTP123C75E96D9A42B3CA66A12FDEA0@phx.gbl>

On Tue, Aug 12, 2014 at 01:39:35AM -0400, TR Reardon wrote:
> On 8/11/14, Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > Hi all,
> >
> > Mr. Reardon has discovered that due to a bug in journal_tag_bytes(), if the
> > the
> > "journal csum v2" feature flag is turned on, block tag records are being
> > written with two extra bytes of space because we don't need to execute
> > "x += sizeof(tag.t_checksum);".  In 32-bit mode, other parts of the journal
> > then perform incorrect size comparisons, leading to BUG() being called.  In
> > 64-bit mode, there's enough padding that bad things won't happen.
> >
> > This is a remnant of the days when I tried to enlarge journal_block_tag_t
> > to
> > hold the full 32-bit checksum for a data block that's stored in the
> > journal.
> > Back in 2011, we decided (though sadly I can't find the link; I think we
> > might
> > have discussed this in the concall) that it was better not to change the
> > size
> > of journal_block_tag_t than it was to make it bigger so that it could hold
> > the
> > full checksum.
> >
> > A simple fix for the problem has been proposed by Mr. Reardon which fixes
> > journal_tag_bytes() and leaves everything else unchanged.  However, that is
> > technically a disk format change since the size of journal_block_tag_t on
> > disk
> > changes, albeit only for people running experimental metadata_csum
> > filesystems.
> > Since we've been allocating disk space for the enlarged checksum this whole
> > time, if we apply that patch, anyone with an unclean 64bit FS will not be
> > able
> > to recover the journal.  (Anyone with an unclean 32-bit FS has been broken
> > the
> > whole time, and still will be.)
> >
> > The other thing we could do is actually use the two bytes to store the high
> > 16-bits of the checksum, fix the jbd2 helper functions to reflect that
> > reality
> > (so that they don't BUG()), and change the checksum verify routine to pass
> > the
> > block if either the full checksum matches, or if the lower 16 bits match
> > and
> > the upper 16 bits are zero.  With this route, anybody with an uncleanly
> > unmounted FS could still recover the journal, since we're not changing the
> > size
> > of anything.
> >
> > Fortunately, journal tag blocks are fairly ephemeral and nobody ought to be
> > using metadata_csum on a production filesystem, so at this point we can
> > probably change the disk format without too many repercussions.  If you
> > make
> > sure to cleanly unmount the filesystem when changing kernel/e2fsprogs
> > versions,
> > there will be no problems.
> >
> > So, uh... comments?  How should we proceed?
> >
> > --D
> 
> My only concern is that legacy applies to in-the-wild kernels, not
> just journals.  Any post 3.4 kernel has this problem, which will be
> exposed as soon as e2fsprogs 1.43 is released.  A common (enough) use
> case might be, say, a 2TB USB drive, being unplugged and replugged
> across machines  without proper shutdown.  Old machines will think
> they recognize the journal but don't actually, and replay something
> destructive.
> 
> In other words, this is a future-retro problem.  The problem is not so
> much with existing fs experimenting with metadata_csum, but a future
> properly-journaled drive being plugged into an legacy faulty machine.
> Those incompat flags are supposed to protect against just this
> scenario, but won't.  So perhaps you'll need make a new
> "INCOMPAT_CSUM_V3" flag to protect?  Actually, dwelling on this a bit
> today, I think whether or not you retain those 2 extra checksum bytes
> in final fix, you ought use a new flag for the format.

Definitely we need a new bit to keep the old kernels out; I wanted others to
weigh in on whether we ought to shrink the struct or put the two bytes to use.

That said... journal recovery is not so well tested, so I'm currently busy
building out debugfs to create journal transactions so we can test recovery
both with and without *_csum.

--D
> 
> +Reardon
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

  reply	other threads:[~2014-08-12 17:08 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <BLU437-SMTP3407533C3883653626BD3DFDEC0@phx.gbl>
2014-08-11  7:10 ` journal recovery problems with metadata_csum, *non-64bit* Darrick J. Wong
2014-08-11 19:25   ` journal_checksum_v2 on-disk format change? (was: Re: journal recovery problems with metadata_csum, *non-64bit*) Darrick J. Wong
2014-08-12  5:39     ` TR Reardon
2014-08-12 17:08       ` Darrick J. Wong [this message]
2014-08-13 21:35         ` Andreas Dilger
2014-08-13 22:53           ` Darrick J. Wong
2014-08-14  1:56             ` Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140812170826.GH2808@birch.djwong.org \
    --to=darrick.wong@oracle.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=thomas_reardon@hotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).