Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue

From: Shaohua Li <shli@fb.com>
To: NeilBrown <neilb@suse.de>
Cc: dan.j.williams@intel.com, linux-raid@vger.kernel.org,
	songliubraving@fb.com, Kernel-team@fb.com
Subject: Re: [RFC] raid5: add a log device to fix raid5/6 write hole issue
Date: Wed, 8 Apr 2015 17:43:11 -0700
Message-ID: <20150409004238.GA186860@devbig257.prn2.facebook.com>
In-Reply-To: <20150402040749.GA4025688@devbig257.prn2.facebook.com>

Hi,
This is what I'm working on now, and I hope to have the basic code
running next week. The new design will do caching and fix the write hole
issue too. Before I post the code, I'd like to check whether the design
has obvious issues.

Thanks,
Shaohua

The main goal is to aggregate write IO, ideally into full-stripe IO, and to fix
the write hole issue. This might speed up reads too, but it's not optimized for
reads, e.g., we don't proactively cache data for reads. The aggregation makes a
lot of sense for workloads which sequentially write to several files. Such
workloads are popular in today's datacenters.

Here cache = cache disk, generally an SSD. raid = raid array or raid disks
(excluding the cache disk)
-------------------------
The cache layout will look like this:

|super|chunk descriptor|chunk data|

We divide the cache into equal-sized chunks, and each chunk has a descriptor.
The chunk size is raid_chunk_size * raid_disks. That is, a cache chunk can
store the data and parity of a whole raid chunk (one raid chunk per raid disk).
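
To make the sizing concrete, here is a rough sketch in C. The helper names
(cache_chunk_bytes, cache_chunk_count) are made up for illustration, and the
size of the super block and descriptor area is passed in as an assumed
parameter, since the layout above does not fix it.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one cache chunk: room for one raid chunk from every raid disk,
 * so data and parity of the same raid chunk live together. */
static inline uint64_t cache_chunk_bytes(uint64_t raid_chunk_size,
					 unsigned raid_disks)
{
	assert(raid_chunk_size > 0 && raid_disks > 0);
	return raid_chunk_size * raid_disks;
}

/* Whole cache chunks that fit on the cache disk after the metadata area
 * (super + descriptors; its size is an assumption of the caller). */
static inline uint64_t cache_chunk_count(uint64_t cache_bytes,
					 uint64_t meta_bytes,
					 uint64_t chunk_bytes)
{
	assert(chunk_bytes > 0 && meta_bytes <= cache_bytes);
	return (cache_bytes - meta_bytes) / chunk_bytes;
}
```

For example, with a 512 KiB raid chunk and 6 raid disks, each cache chunk
is 3 MiB.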

Write IO is stored to cache chunks first and then flushed to raid chunks. We use
fixed-size chunks because:
-Cache space is easy to manage. We don't need a complex tree-like index.

-Flushing data from cache to raid is easy. Data and parity are in the same chunk.

-Reclaiming space is easy. When there is no free chunk in cache, we must try to
free some chunks, i.e., reclaim. We reclaim in units of chunks. Reclaiming a
chunk just means flushing the chunk from cache to raid. If we used a complex
data structure, we would need garbage collection and so on.

-The downside is that we waste space. E.g., a single 4k write will use a whole
chunk in cache. But we can reclaim chunks with low utilization quickly to
partially mitigate this issue.
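
One possible way to measure "low utilization" is to count the pages marked
present in a per-chunk bitmap with one bit per page. This is only a sketch;
the function names and the percent metric are assumptions, not part of the
design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the pages marked present in a bitmap with one bit per page. */
static inline unsigned pages_present(const uint8_t *bitmap, size_t nbits)
{
	unsigned count = 0;

	for (size_t i = 0; i < nbits; i++)
		count += (bitmap[i / 8] >> (i % 8)) & 1u;
	return count;
}

/* Utilization of a cache chunk in percent, rounded down. */
static inline unsigned chunk_utilization_pct(const uint8_t *bitmap,
					     size_t nbits)
{
	assert(nbits > 0);
	return (unsigned)(pages_present(bitmap, nbits) * 100u / nbits);
}
```

A chunk holding a single 4k page out of several hundred would score close
to 0 and be an early reclaim candidate under such a policy.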

--------------------
chunk descriptor looks like this:
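struct chunk_desc {
	u64 seq;		/* updated on every write to the chunk */
	u64 raid_chunk_index;	/* raid chunk this cache chunk flushes to */
	u32 state;		/* FREE, RUNNING or PARITY_INCORE */
	u8 bitmaps[];		/* one bit per page, data bits first */
};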

seq: seq can be used to implement an LRU-like algorithm for chunk reclaim. Every
time data is written to the chunk, we update the chunk's seq. When we flush a
chunk from cache to raid, we freeze the chunk (i.e., the chunk can't accept new
IO). If there is new IO, we write it to another chunk. The new chunk
will have a bigger seq than the original chunk. After a crash and reboot, we
can use the seq to distinguish which chunk is newer.
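
Both uses of seq can be sketched in a few lines of C. The struct desc_view
and its in_use/frozen flags are hypothetical in-memory bookkeeping, not the
on-disk descriptor, and the real reclaim policy may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct desc_view {
	uint64_t seq;
	uint64_t raid_chunk_index;
	int in_use;	/* nonzero if the chunk is not FREE */
	int frozen;	/* nonzero while the chunk is being flushed */
};

/* LRU-like victim: the in-use, not yet frozen chunk with the smallest seq.
 * Returns its index, or -1 if there is no candidate. */
static inline long pick_reclaim_victim(const struct desc_view *d, size_t n)
{
	long best = -1;

	assert(d != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!d[i].in_use || d[i].frozen)
			continue;
		if (best < 0 || d[i].seq < d[best].seq)
			best = (long)i;
	}
	return best;
}

/* After a crash: among the chunks mapping to raid_chunk_index, the one
 * with the largest seq is the newest. Returns -1 if none maps there. */
static inline long newest_chunk_for(const struct desc_view *d, size_t n,
				    uint64_t raid_chunk_index)
{
	long best = -1;

	assert(d != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!d[i].in_use || d[i].raid_chunk_index != raid_chunk_index)
			continue;
		if (best < 0 || d[i].seq > d[best].seq)
			best = (long)i;
	}
	return best;
}
```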

raid_chunk_index: the raid chunk that this cache chunk should be flushed to

state: chunk state. Currently I have defined 3 states
-FREE, the chunk is free
-RUNNING, the chunk maps to raid chunk and accepts new IO
-PARITY_INCORE, the chunk has both data and parity stored in cache

bitmaps: each page of data and parity has one bit. 1 means present. Store data
bits first.
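
As an illustration of the bitmap indexing, here is a minimal sketch. The
struct cache_geom and all helper names are hypothetical, and the ordering of
bits within the data group and the parity group (chunk by chunk, then page
by page) is an assumption; the design only says data bits come first.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry; none of these names come from the design. */
struct cache_geom {
	unsigned raid_disks;	/* data + parity disks */
	unsigned parity_disks;	/* 1 for raid5, 2 for raid6 */
	unsigned chunk_pages;	/* pages per raid chunk */
};

static inline unsigned data_bits(const struct cache_geom *g)
{
	assert(g->parity_disks < g->raid_disks);
	return (g->raid_disks - g->parity_disks) * g->chunk_pages;
}

static inline unsigned total_bits(const struct cache_geom *g)
{
	return g->raid_disks * g->chunk_pages;
}

static inline unsigned bitmap_bytes(const struct cache_geom *g)
{
	return (total_bits(g) + 7) / 8;
}

/* Bit for page 'page' of the data chunk number 'chunk'. */
static inline unsigned data_bit(const struct cache_geom *g,
				unsigned chunk, unsigned page)
{
	assert(page < g->chunk_pages);
	return chunk * g->chunk_pages + page;
}

/* Parity bits follow all data bits. */
static inline unsigned parity_bit(const struct cache_geom *g,
				  unsigned chunk, unsigned page)
{
	assert(page < g->chunk_pages);
	return data_bits(g) + chunk * g->chunk_pages + page;
}

static inline void bitmap_set(uint8_t *bm, unsigned bit)
{
	bm[bit / 8] |= (uint8_t)(1u << (bit % 8));
}

static inline int bitmap_test(const uint8_t *bm, unsigned bit)
{
	return (bm[bit / 8] >> (bit % 8)) & 1;
}
```

With 6 raid disks, one parity disk and 128 pages per raid chunk, this gives
640 data bits followed by 128 parity bits, i.e. a 96-byte bitmap.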

-----IO READ PATH------
IO READ will check each chunk desc. If the data is present in cache, dispatch
to cache; otherwise dispatch to raid.
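
A simplified sketch of that lookup follows. The linear scan, the struct
rd_desc and the rule "prefer the chunk with the largest seq when several
chunks hold the page" are my assumptions for illustration; the real code
may index descriptors differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum read_target { READ_FROM_RAID, READ_FROM_CACHE };

struct rd_desc {
	uint64_t seq;
	uint64_t raid_chunk_index;
	int in_use;		/* nonzero if the chunk is not FREE */
	const uint8_t *bitmap;	/* one bit per page, 1 means present */
};

static inline int page_present(const uint8_t *bitmap, unsigned bit)
{
	return (bitmap[bit / 8] >> (bit % 8)) & 1;
}

/* Route a one-page read. On a cache hit, *hit is the index of the newest
 * chunk (largest seq) holding the page; otherwise *hit is -1. */
static inline enum read_target route_read(const struct rd_desc *d, size_t n,
					  uint64_t raid_chunk_index,
					  unsigned page_bit, long *hit)
{
	assert(hit != NULL);
	*hit = -1;
	for (size_t i = 0; i < n; i++) {
		if (!d[i].in_use || d[i].raid_chunk_index != raid_chunk_index)
			continue;
		if (!page_present(d[i].bitmap, page_bit))
			continue;
		if (*hit < 0 || d[i].seq > d[*hit].seq)
			*hit = (long)i;
	}
	return *hit >= 0 ? READ_FROM_CACHE : READ_FROM_RAID;
}
```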

-----IO WRITE PATH------
1. find or create a chunk in cache
2. write to cache
3. write descriptor

We write the descriptor immediately, in an asynchronous way, to reduce data
loss. The chunk will be in RUNNING state.

-For a normal write, the IO returns after 2. This cuts latency too. If there is
a crash, the chunk state might be FREE or the bitmap bit might not be set. In
either case, this is the first write to the chunk, so IO READ will read the
raid and get the old data, which meets the semantics. Otherwise, if the new
data did not reach the cache, we will read the old data in the cache, which
meets the semantics too.

-For a FUA write, 2 will be a FUA write. When 2 finishes, run 3 with FUA. The
IO returns after 3. A crash after the IO returns doesn't impact the semantics.
If a crash happens before the IO returns, we will read either old or new data,
which is similar to the normal write case.

-For FLUSH, wait for all previous descriptor writes to finish and then flush
the cache disk's cache. In this way, we guarantee that all previous writes hit
the cache.
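
The difference between the normal and FUA cases above can be summarized in
a small sketch. The names write_kind, write_plan and plan_write are
hypothetical; the step numbers refer to the 3-step write path listed above.

```c
#include <assert.h>
#include <stdbool.h>

enum write_kind { WRITE_NORMAL, WRITE_FUA };

struct write_plan {
	bool data_fua;		/* step 2 (write to cache) issued with FUA */
	bool desc_fua;		/* step 3 (write descriptor) issued with FUA */
	int complete_after;	/* step after which the IO returns */
};

static inline struct write_plan plan_write(enum write_kind kind)
{
	/* Normal write: the IO returns once the data write to cache is
	 * done; the descriptor write still follows, asynchronously. */
	struct write_plan p = { false, false, 2 };

	assert(kind == WRITE_NORMAL || kind == WRITE_FUA);
	if (kind == WRITE_FUA) {
		/* FUA write: both data and descriptor go out with FUA and
		 * the IO returns only after the descriptor write. */
		p.data_fua = true;
		p.desc_fua = true;
		p.complete_after = 3;
	}
	return p;
}
```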

-----chunk reclaim--------
1. select a chunk
2. freeze the chunk
3. copy chunk data from cache to raid, so that the stripe state machine runs,
e.g., calculates parity and so on
4. Hook into raid5 run_io. We write parity to cache
5. flush cache disk cache
6. mark descriptor PARITY_INCORE, and WRITE_FUA to cache
7. raid5 run_io continues. Data and parity are written to raid disks
8. flush all raid disk caches
9. mark descriptor FREE, WRITE_FUA to cache

We will batch several chunks for reclaim for better performance. The FUA writes
can be replaced with FLUSH too.

If there is a crash before 6, the descriptor state will be RUNNING. Recovery
just needs to discard the parity bitmap. If there is a crash before 9, the
descriptor state will be PARITY_INCORE, and recovery must copy both data and
parity to raid.
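
The recovery rules can be written down as a small decision table. This is
only a sketch: the enum values, function names and the bitmap helper are
hypothetical, and "crash before step N" is taken to mean that step N did not
complete.

```c
#include <assert.h>
#include <stdint.h>

enum chunk_state { CHUNK_FREE, CHUNK_RUNNING, CHUNK_PARITY_INCORE };

enum recovery_action {
	RECOVERY_NONE,
	RECOVERY_DISCARD_PARITY_BITS,
	RECOVERY_COPY_DATA_AND_PARITY,
};

/* Descriptor state found in cache if the crash hits before reclaim step
 * 'step' (1..9) has completed. */
static inline enum chunk_state state_if_crash_before(int step)
{
	assert(step >= 1 && step <= 9);
	return step <= 6 ? CHUNK_RUNNING : CHUNK_PARITY_INCORE;
}

static inline enum recovery_action recovery_action_for(enum chunk_state s)
{
	switch (s) {
	case CHUNK_RUNNING:
		return RECOVERY_DISCARD_PARITY_BITS;
	case CHUNK_PARITY_INCORE:
		return RECOVERY_COPY_DATA_AND_PARITY;
	default:
		return RECOVERY_NONE;
	}
}

/* Clear the parity bits [data_bits, total_bits) and keep the data bits. */
static inline void discard_parity_bits(uint8_t *bitmap, unsigned data_bits,
				       unsigned total_bits)
{
	for (unsigned b = data_bits; b < total_bits; b++)
		bitmap[b / 8] &= (uint8_t)~(1u << (b % 8));
}
```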
