Re: Extra write mode to close RAID5 write hole (kind of)

public inbox for linux-bcache@vger.kernel.org

From: Vojtech Pavlik <vojtech@suse.com>
To: Kent Overstreet <kent.overstreet@gmail.com>
Cc: James Pharaoh <james@pharaoh.uk>, linux-bcache@vger.kernel.org
Subject: Re: Extra write mode to close RAID5 write hole (kind of)
Date: Fri, 28 Oct 2016 15:07:20 +0200
Message-ID: <20161028130720.GA31703@suse.com>
In-Reply-To: <20161028115249.6myzx2ae24n2w4v7@kmo-pixel>

On Fri, Oct 28, 2016 at 03:52:49AM -0800, Kent Overstreet wrote:
> On Thu, Oct 27, 2016 at 12:31:58AM +0200, Vojtech Pavlik wrote:
> > In case you're using mdraid for the RAID part on a reasonably recent
> > Linux kernel, there is no write hole. Linux mdraid implements barriers
> > properly even on RAID5, at the cost of performance - mdraid waits for a
> > barrier to complete on all drives before submitting more i/o.
> 
> That's not what the raid 5 hole is. The raid 5 hole comes from the fact that
> it's not possible to update the p/q blocks atomically with the data blocks, thus
> there is a point in time when they are _inconsistent_ with the rest of the
> stripe, and if used will lead to reconstructing incorrect data. There's no way
> to fix this with just flushes.

Indeed. However, with the write intent bitmap and with filesystems
ensuring consistency through barriers, the problem is still greatly
mitigated.

Mdraid marks areas of the disk dirty in the write intent bitmap before
writing to them. When the system comes up after a power outage, all
areas marked dirty are scanned, and the xor (parity) block is rewritten
wherever it doesn't match the rest of the stripe.
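
As a rough illustration of that mark-dirty-then-resync sequence, here is
a toy model in Python. It is only a sketch and not mdraid's code: the
ToyRaid5 class, its one-bit-per-stripe "bitmap", the chunk size and the
crash_before_parity flag are all made up for the example, and real
mdraid tracks larger regions and clears bits lazily.

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR a list of equal-length byte strings together."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


class ToyRaid5:
    """Hypothetical in-memory array: data chunks plus one xor block per stripe."""

    def __init__(self, n_stripes, n_data, chunk_size=4):
        zero = bytes(chunk_size)
        self.data = [[zero] * n_data for _ in range(n_stripes)]
        self.parity = [zero] * n_stripes
        self.dirty = set()  # stand-in for the write intent bitmap

    def write(self, stripe, idx, chunk, crash_before_parity=False):
        self.dirty.add(stripe)          # 1. record the intent first
        self.data[stripe][idx] = chunk  # 2. write the data chunk
        if crash_before_parity:
            return                      # power lost: dirty bit stays set
        self.parity[stripe] = xor_chunks(self.data[stripe])  # 3. xor block
        self.dirty.discard(stripe)      # 4. stripe is consistent again

    def resync(self):
        """After restart: re-check only the stripes marked dirty."""
        repaired = []
        for stripe in sorted(self.dirty):
            expected = xor_chunks(self.data[stripe])
            if self.parity[stripe] != expected:
                self.parity[stripe] = expected
                repaired.append(stripe)
        self.dirty.clear()
        return repaired
```

In this model a write interrupted between steps 2 and 3 leaves exactly
one stripe dirty, and resync() touches only that stripe instead of
scanning the whole array.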

Thanks to the strict ordering using barriers, the damage to the
consistency of the RAID can only be in requests issued since the last
successfully written barrier.
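
A minimal sketch of that bound, assuming a simplified model in which
every barrier in the log completed before any later request was
submitted. The function name and the log format (block numbers
interleaved with the string "barrier") are invented for the example.

```python
def writes_at_risk(requests):
    """Return the block numbers written after the last barrier.

    `requests` is the submission log up to the crash: ints are block
    writes, the string "barrier" is a barrier that completed.
    """
    at_risk = []
    for req in requests:
        if req == "barrier":
            at_risk = []         # everything before it is on stable storage
        else:
            at_risk.append(req)  # may or may not have reached the disks
    return at_risk
```

For a log such as [10, 11, "barrier", 12, 13], only blocks 12 and 13
can be in an undefined state after the crash.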

As such, the filesystem will always see a consistent state, and the raid
will also always recover to a consistent state.

The only situation where data damage can happen is a power outage that
comes together with a loss of one of the drives. In such a case, the
content of any blocks written past the last barrier is undefined. It
then depends on the filesystem whether it can revert to the last sane
state. Not sure about others, but btrfs will do so.
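
To make the power-outage-plus-lost-drive case concrete, here is a small
sketch of reconstructing a missing chunk from a stripe whose xor block
is stale. It is a toy with one-byte chunks and made-up values, not
mdraid code; the function names are hypothetical.

```python
def xor_chunks(chunks):
    """XOR a list of equal-length byte strings together."""
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            out[i] ^= byte
    return bytes(out)


def reconstruct(stripe_chunks, parity, lost):
    """Rebuild chunk `lost` from the surviving chunks and the xor block."""
    survivors = [c for i, c in enumerate(stripe_chunks) if i != lost]
    return xor_chunks(survivors + [parity])


# Consistent stripe: three data chunks and their xor block.
old = [b"\x01", b"\x02", b"\x04"]
parity = xor_chunks(old)              # b"\x07"

# Power fails after chunk 0 is rewritten but before the xor block is.
on_disk = [b"\x09", b"\x02", b"\x04"]

# All drives present: a resync could simply recompute the xor block.
# With one drive gone, the stale xor block has to be used instead.
rebuilt_0 = reconstruct(on_disk, parity, lost=0)  # b"\x01", the old content
rebuilt_1 = reconstruct(on_disk, parity, lost=1)  # b"\x0a", not b"\x02"
```

In this toy, losing the drive that was being written yields the old
content rather than the new one, while losing another drive of the same
stripe yields a value that was never written at all.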

-- 
Vojtech Pavlik
Director SuSE Labs
