linux-ext4.vger.kernel.org archive mirror
From: Neil Brown <neilb@suse.de>
To: scjody@sun.com
Cc: linux-ext4@vger.kernel.org, linux-raid@vger.kernel.org,
	linux-kernel@vger.kernel.org, Andreas Dilger <adilger@sun.com>
Subject: Re: [patch 4/4] [ext3] Add journal guided resync (data=declared mode)
Date: Fri, 2 Oct 2009 11:51:59 +1000
Message-ID: <19141.23743.35576.466246@notabene.brown>
In-Reply-To: message from scjody@sun.com on Thursday October 1

On Thursday October 1, scjody@sun.com wrote:
> We introduce a new data write mode known as declared mode.  This is based on
> ordered mode except that a list of blocks to be written during the current
> transaction is added to the journal before the blocks themselves are written to
> the disk.  Then, if the system crashes, we can resync only those blocks during
> journal replay and skip the rest of the resync of the RAID array.
> 
> TODO: Add support to e2fsck.
> 
> TODO: The following sequence of events could cause resync to be skipped
> incorrectly:
>  - An MD array that supports RESYNC_RANGE is undergoing resync.
>  - A filesystem on that array is mounted with data=declared.
>  - The machine crashes before the resync completes.
>  - The array is restarted and the filesystem is remounted.
>  - Recovery resyncs only the blocks that were undergoing writes during
>    the crash and skips the rest.
> Addressing this requires even more communication between MD and ext and
> I need to think more about how to do this.

I have thought about this sort of thing from time to time and I have a
very different idea for how the necessary communication between the
filesystem and MD would happen.  I think my approach would completely
address this problem, and would not need any new ioctls (which I am not
keen on).

I would add two new BIO_RW_ flags to be used with WRITE requests.
The first flag would mean "don't worry about a crash in the middle of
this write; I will validate it after a crash before I rely on the
data."
The second would mean "last time I wrote data near here there might
have been a failure, be extra careful".

So the first flag would be used during normal filesystem writes for
every block that gets recorded in the journal, and for every write
to the journal.

The second flag is used after a crash to re-write every block that
could have been in-flight during the crash.  Some of those blocks will
be read from the journal and written to their proper home, others will
be read from wherever they are and written back there.
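
The recovery pass described above might look roughly like the
following userspace sketch.  It is only an illustration under stated
assumptions: the struct names, plan_recovery(), the "careful" field
and the 8-byte block size are all hypothetical and are not part of any
patch in this thread.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 8   /* toy block size, for illustration only */

/* One block that the journal says may have been in flight at crash time. */
struct declared {
    size_t home;                 /* block number of its proper home */
    const uint8_t *journal_copy; /* copy held in the journal, or NULL */
};

/* A write the filesystem would issue during recovery. */
struct write_req {
    size_t home;
    uint8_t data[BLK];
    int careful;                 /* stands in for the second flag */
};

/*
 * Build one rewrite per declared block.  Blocks with a journal copy are
 * written from the journal; the rest are read from their current location
 * and written straight back.  Every rewrite carries the "be careful" flag.
 */
static size_t plan_recovery(const uint8_t *disk, const struct declared *d,
                            size_t n, struct write_req *out)
{
    for (size_t i = 0; i < n; i++) {
        const uint8_t *src = d[i].journal_copy
                           ? d[i].journal_copy
                           : disk + d[i].home * BLK;

        out[i].home = d[i].home;
        memcpy(out[i].data, src, BLK);
        out[i].careful = 1;
    }
    return n;
}
```

The point of the sketch is that recovery only touches the declared
blocks, so its cost scales with the journal and not with the size of
the array.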

The first flag would be interpreted by MD as "don't set the bitmap
bit".  The second flag would be interpreted as "don't trust the
parity block, but do a reconstruct-write".
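
To see why the second flag asks for a reconstruct-write, here is a toy
model of a three-disk RAID5 stripe with one-byte "blocks".  It is a
sketch for illustration, not MD code; the struct and function names
are made up.

```c
#include <stdint.h>

/* Toy stripe: two data blocks and one parity block, one byte each. */
struct stripe {
    uint8_t d0;
    uint8_t d1;
    uint8_t parity;   /* equals d0 ^ d1 when the stripe is consistent */
};

/*
 * Read-modify-write: derive the new parity from the old data and the old
 * parity.  This is only correct if the old parity was already consistent.
 */
static void write_d0_rmw(struct stripe *s, uint8_t new_d0)
{
    s->parity ^= s->d0 ^ new_d0;
    s->d0 = new_d0;
}

/*
 * Reconstruct-write: ignore the old parity and recompute it from all the
 * data blocks in the stripe.
 */
static void write_d0_reconstruct(struct stripe *s, uint8_t new_d0)
{
    s->d0 = new_d0;
    s->parity = s->d0 ^ s->d1;
}

static int stripe_consistent(const struct stripe *s)
{
    return s->parity == (uint8_t)(s->d0 ^ s->d1);
}
```

If a crash left the parity stale, a read-modify-write carries the
error forward, while a reconstruct-write leaves the stripe consistent
whatever the old parity was.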

With this scheme you would still need a write-intent-bitmap on the MD
array, but no bits would ever be set if the filesystem were using the
new flags, so no performance impact.  You probably could run without a
bitmap, in which case the flag would mean "don't mark the array as
active".
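
Putting the two flags together, the decision MD would make for each
write could be modelled as below.  This is a sketch only: the flag
names WR_FS_VALIDATES and WR_BE_CAREFUL, the struct and plan_write()
are placeholders I have assumed for illustration, and it assumes that
an ordinary write on an array without a bitmap marks the array active.

```c
#include <stdbool.h>

/* Hypothetical flag names; the proposal does not name them. */
enum {
    WR_FS_VALIDATES = 1u << 0, /* "I will validate this after a crash" */
    WR_BE_CAREFUL   = 1u << 1  /* "an earlier write near here may have failed" */
};

/* What MD would do for one write request. */
struct md_write_plan {
    bool set_bitmap_bit;     /* record write intent in the bitmap */
    bool mark_array_active;  /* no bitmap: mark the whole array active */
    bool reconstruct_write;  /* recompute parity instead of trusting it */
};

static struct md_write_plan plan_write(unsigned flags, bool has_bitmap)
{
    struct md_write_plan p = { false, false, false };

    /* Without the first flag, MD tracks the write as it does today. */
    if (!(flags & WR_FS_VALIDATES)) {
        if (has_bitmap)
            p.set_bitmap_bit = true;
        else
            p.mark_array_active = true;
    }

    /* The second flag asks MD not to trust the existing parity. */
    p.reconstruct_write = (flags & WR_BE_CAREFUL) != 0;
    return p;
}
```

With this mapping a filesystem that always passes the first flag never
dirties the bitmap or the array state, which is where the "no
performance impact" claim comes from.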

I'm not entirely sure about the second flag.  Maybe it would be better
to make it a flag for READ and have it mean "validate and correct any
redundancy information (duplicates or parity) for this block before
returning it".  Then we could have just one flag, which meant different
things for READ and WRITE.
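
The single-flag variant could be modelled like this.  Again a sketch
under assumptions: IO_FS_GUIDED, the enums and interpret() are
hypothetical names chosen for illustration.

```c
enum io_dir { IO_READ, IO_WRITE };

#define IO_FS_GUIDED 1u   /* hypothetical name for the single flag */

enum md_action {
    MD_NORMAL,               /* flag not set: behave as today */
    MD_SKIP_DIRTY_TRACKING,  /* WRITE + flag: the filesystem will validate */
    MD_VERIFY_REDUNDANCY     /* READ + flag: check and fix copies or parity */
};

/* One flag, interpreted differently depending on the I/O direction. */
static enum md_action interpret(enum io_dir dir, unsigned flags)
{
    if (!(flags & IO_FS_GUIDED))
        return MD_NORMAL;
    return dir == IO_WRITE ? MD_SKIP_DIRTY_TRACKING : MD_VERIFY_REDUNDANCY;
}
```

After a crash the filesystem would then issue flagged READs for every
block that might have been in flight, instead of flagged re-writes.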

What do you think?

NeilBrown

Thread overview: 9+ messages
2009-10-01 22:39 [patch 0/4] Journal guided resync and support scjody
2009-10-01 22:39 ` [patch 1/4] [md] Add SKIP_RESYNC ioctl scjody
2009-10-01 22:39 ` [patch 2/4] [md] Add RESYNC_RANGE ioctl scjody
2009-10-01 22:39 ` [patch 3/4] [jbd] Add support for journal guided resync scjody
2009-10-01 23:39   ` Andrew Morton
2009-10-01 22:39 ` [patch 4/4] [ext3] Add journal guided resync (data=declared mode) scjody
2009-10-02  1:51   ` Neil Brown [this message]
2009-10-02 15:53     ` Jody McIntyre
2009-10-02  0:36 ` [patch 0/4] Journal guided resync and support Andi Kleen
