From: NeilBrown <neilb@suse.de>
To: Shaohua Li <shli@fb.com>
Cc: dan.j.williams@intel.com, linux-raid@vger.kernel.org,
songliubraving@fb.com, Kernel-team@fb.com
Subject: Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
Date: Thu, 2 Apr 2015 11:19:41 +1100
Message-ID: <20150402111941.104d0633@notabene.brown>
In-Reply-To: <20150401234055.GA3375744@devbig257.prn2.facebook.com>
On Wed, 1 Apr 2015 16:40:57 -0700 Shaohua Li <shli@fb.com> wrote:
> > Your code does avoid write-hole-protection for full-stripe-writes, and this
> > would greatly reduce the number of blocks that were written multiple times.
> > However I'm not convinced that is correct.
> > A reasonable goal is that if the system crashes while writing to a storage
> > device, then reads should return the old data or the new data, not anything
> > else. A crash in the middle of a full-stripe-write to a degraded array
> > could result in some block in the stripe appearing to contain data that is
> > different to both the old and the new. If you are going to close the hole,
> > I think it should be done properly.
>
> I can do it simply. But I don't think this assumption is true. If you
> write to a disk range and there is a failure, nothing guarantees that
> you can read either the old data or the new data.
If you write a range of blocks to a normal disk and crash during the write,
each block will contain either the old data or the new data.
If you write a range to a degraded RAID5 and crash during the write, you
cannot make that same guarantee.
I don't know how important this is, but then I don't really know how
important any of this is.
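
To make the degraded case concrete, here is a minimal sketch in Python of a
three-device RAID5 stripe (two data blocks and one XOR parity block) whose
second data device is missing. The byte values and the helper names `xor` and
`read_missing_block` are made up for illustration; nothing here is md code.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def read_missing_block(d0_on_disk, parity_on_disk):
    """On a degraded array the missing block is rebuilt from the survivors."""
    return xor(d0_on_disk, parity_on_disk)

# Old and new contents of the stripe (illustrative values only).
old_d0, old_d1 = b"\x11", b"\x22"
new_d0, new_d1 = b"\x50", b"\x0f"
old_parity = xor(old_d0, old_d1)
new_parity = xor(new_d0, new_d1)

# Crash after the new d0 reached its disk but before the new parity did:
# the rebuilt d1 mixes old parity with new data.
d1_after_crash = read_missing_block(new_d0, old_parity)
```

Here `d1_after_crash` matches neither `old_d1` nor `new_d1`, which is the case
a plain disk cannot produce.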
>
> >
> > A combined log would "simply" involve writing every data block and every
> > computed parity block (with index information) to the log device.
> > Replaying the log would collect data blocks and flush out those in a stripe
> > once the parity block(s) for that stripe became available.
> >
> > I think this would actually turn into a fairly simple logging mechanism.
>
> It's not simple at all. It's unlikely we write data and parity
> continuously on disk and at the same time. This will make log checkpoint
> fairly complex.
I don't see any cause for complexity. Let me be more explicit.
I imagine that all data remains in the stripe cache, in memory, until it is
finally written to the RAID5. So the stripe cache will need to be quite a
bit bigger.
Every time we get a block that we want to write, either a new data block or
a computed parity block, we queue it to the log.
The log works like this:
- take the first (e.g.) 256 blocks in the queue, create a header to describe
them, write the header with FUA, then write all the data blocks. If there
are fewer than 256, just write what we have.
- when the header write completes, all blocks written *previously* are now
safe and we can call bio_end_io on data or unlock the stripe for parity.
- loop back and write some more blocks. If there are no blocks to write,
write a header which describes an empty set of blocks, and wait for more
blocks to appear.
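
The loop above can be modelled in a few lines. This is only an in-memory
sketch in Python, not kernel code: the class `LogWriter`, its fields and the
`step` method are hypothetical names, and the completion of the FUA header
write is modelled as happening immediately, with the rule from the list that
blocks written before a header become safe once that header completes.

```python
from collections import deque

BATCH = 256  # the "(e.g.) 256 blocks" figure from the description

class LogWriter:
    """Toy model of the log loop: header first, then the blocks it describes."""

    def __init__(self):
        self.queue = deque()  # blocks waiting to be logged
        self.log = []         # everything written to the log device, in order
        self.unacked = []     # blocks written since the most recent header
        self.safe = []        # blocks that may now be acknowledged

    def submit(self, block):
        self.queue.append(block)

    def step(self):
        """One pass of the loop; returns how many blocks were written."""
        n = min(BATCH, len(self.queue))
        batch = [self.queue.popleft() for _ in range(n)]
        # The header describes the blocks that follow (possibly none).
        self.log.append(("header", tuple(batch)))
        # Header complete: blocks written *previously* are now safe.
        self.safe.extend(self.unacked)
        self.unacked = []
        for block in batch:
            self.log.append(("block", block))
            self.unacked.append(block)
        return n
```

With 300 queued blocks, the first call writes 256 and acknowledges nothing,
the second writes 44 and acknowledges the first 256, and a third call writes
an empty header that acknowledges the remaining 44.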
Each stripe_head needs to track (roughly) where the relevant blocks were
written so it can release them when the stripe is written.
I would conceptually divide the log into 32 regions and keep a 32bit number
with each stripe. When a block is assigned to a region in the log, the
relevant bit is set for the stripe, and a per-region counter is incremented.
When a stripe completes its write, the counter of each region whose bit is
set is decremented and the bits are cleared. The log cannot progress into a
region which has a non-zero counter.
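
A rough sketch of that bookkeeping, in Python. The names `RegionTracker`,
`Stripe`, `log_block` and `complete_stripe` are hypothetical, and one detail
is an assumption: the counter is bumped only the first time a stripe touches
a region, so that releasing the stripe can undo exactly one count per set bit.

```python
NUM_REGIONS = 32

class RegionTracker:
    def __init__(self):
        self.counters = [0] * NUM_REGIONS  # stripes still pinning each region

    def can_enter(self, region):
        """The log may only advance into a region nobody is pinning."""
        return self.counters[region] == 0

class Stripe:
    def __init__(self):
        self.region_bits = 0  # 32-bit mask of regions holding our blocks

def log_block(tracker, stripe, region):
    """Record that one of this stripe's blocks was logged in `region`."""
    bit = 1 << region
    if not stripe.region_bits & bit:  # assumption: count once per stripe
        stripe.region_bits |= bit
        tracker.counters[region] += 1

def complete_stripe(tracker, stripe):
    """Stripe reached the array: drop its pin on every region it used."""
    for region in range(NUM_REGIONS):
        if stripe.region_bits & (1 << region):
            tracker.counters[region] -= 1
    stripe.region_bits = 0
```

The per-stripe cost is one 32-bit word, and the log only has to check one
counter before reusing a region.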
We choose the size of transactions so that the first block of each region is
a header block. These contain a magic number, a sequence number, and a
checksum together with the addresses of the data/parity blocks. On restart
we read all 32 of these to find out where the log starts and ends. Then we
replay all the blocks into the stripe cache - discarding any that don't come
with the required parity blocks.
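
As a sketch of the restart path, in Python: the header layout (a dict with
`magic` and `seq`), the `MAGIC` value and both function names are invented
for illustration, and checksum verification is left out. It assumes the
oldest and newest valid sequence numbers mark the start and end of the log.

```python
from collections import defaultdict

MAGIC = 0x52354C47  # hypothetical magic number

def log_bounds(headers):
    """headers: one entry per region, a dict or None if unreadable.
    Returns (oldest_region, newest_region), or None if no header is valid."""
    valid = [(h["seq"], i) for i, h in enumerate(headers)
             if h is not None and h["magic"] == MAGIC]
    if not valid:
        return None
    return min(valid)[1], max(valid)[1]

def replay(blocks):
    """blocks: (stripe, kind, payload) tuples in log order, where kind is
    "data" or "parity". Keeps only stripes whose parity was also logged."""
    stripes = defaultdict(lambda: {"data": [], "parity": []})
    for stripe, kind, payload in blocks:
        stripes[stripe][kind].append(payload)
    return {s: b for s, b in stripes.items() if b["parity"]}
```

A stripe with logged data but no logged parity never started its write to the
array, so dropping it leaves the old contents intact.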
So it is a very simple log which is never read except on crash recovery. It
commits everything ASAP so that the writeout to the array can be lazy and can
gather related blocks and sort addresses etc. with no impact on filesystem
latency.
Does that make sense?
NeilBrown