Linux RAID subsystem development
From: Ram Ramesh <rramesh2400@gmail.com>
To: Shaohua Li <shli@fb.com>, linux-raid@vger.kernel.org
Subject: Re: [PATCH] MD: add doc for raid5-cache
Date: Thu, 2 Feb 2017 00:33:00 -0600
Message-ID: <3d68e5aa-5c2e-4cb0-ba57-45246041ffe6@gmail.com>
In-Reply-To: <25051bd79d94b45c7be24ce466a8b6eb2fba66c0.1485890144.git.shli@fb.com>

On 01/31/2017 01:18 PM, Shaohua Li wrote:
> I'm starting a document for the raid5-cache feature. Please let me know
> what else we should put into the document. Of course, comments are
> welcome!
>
> Signed-off-by: Shaohua Li <shli@fb.com>
> ---
>   Documentation/md/raid5-cache.txt | 99 ++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 99 insertions(+)
>   create mode 100644 Documentation/md/raid5-cache.txt
>
> diff --git a/Documentation/md/raid5-cache.txt b/Documentation/md/raid5-cache.txt
> new file mode 100644
> index 0000000..17a6279
> --- /dev/null
> +++ b/Documentation/md/raid5-cache.txt
> @@ -0,0 +1,99 @@
> +RAID5 cache
> +
> +A RAID 4/5/6 array can include an extra disk that acts as a data cache.
> +The cache operates in write-through or write-back mode. mdadm has a new
> +option, '--write-journal', to create an array with a cache. By default
> +(when the array starts), the cache is in write-through mode. A user can
> +switch it to write-back mode with:
> +
> +echo "write-back" > /sys/block/md0/md/journal_mode
> +
> +and switch it back to write-through mode with:
> +
> +echo "write-through" > /sys/block/md0/md/journal_mode
> +
> +In both modes, all writes to the array hit the cache disk first, so the
> +cache disk must be fast and, if it is an SSD, have high write endurance.
> +
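A rough Python sketch of the mode switch described above, for scripting it
rather than typing the echo by hand. The helper name `set_journal_mode` and
the `VALID_MODES` tuple are made up for illustration; the only things taken
from the patch are the sysfs file name and the two mode strings. It is meant
as a sketch, not a tested admin tool.

```python
from pathlib import Path

# The two mode strings named in the patch text.
VALID_MODES = ("write-through", "write-back")


def set_journal_mode(md_sysfs_dir, mode):
    """Write `mode` to <md_sysfs_dir>/journal_mode and return the path.

    Hypothetical helper: it does the same thing as
    `echo "<mode>" > /sys/block/md0/md/journal_mode`, but rejects
    unknown mode strings before touching the file.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown journal mode: {mode!r}")
    path = Path(md_sysfs_dir) / "journal_mode"
    # Trailing newline mirrors what `echo` would write.
    path.write_text(mode + "\n")
    return path


# Example (needs root and an array created with --write-journal):
#   set_journal_mode("/sys/block/md0/md", "write-back")
```

Taking the directory as a parameter keeps the helper usable for arrays other
than md0 and lets it be tried against a scratch directory first.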
> +-------------------------------------
> +write-through mode:
> +
> +This mode mainly fixes the 'write hole' issue. For a RAID 4/5/6 array, an
> +unclean shutdown can leave some stripes in an inconsistent state, e.g.
> +data and parity don't match. The reason is that a stripe write involves
> +several raid disks, and the writes may not have hit all of them before
> +the unclean shutdown. After an unclean shutdown, MD tries to 'resync' the
> +array to put all stripes back into a consistent state. A disk failure
> +during the resync causes real data corruption. This problem is called the
> +'write hole'. The window is between the unclean shutdown and the end of
> +the 'resync', and it isn't big. On the other hand, if one disk fails,
> +other disks may fail soon after, which sometimes happens when the disks
> +are from the same vendor and were manufactured at the same time. This
> +increases the chance of a 'write hole', but overall the chance is small,
> +so don't panic even if you are not using a cache disk.
> +
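To make the write hole concrete, here is a small Python sketch with a toy
3-disk RAID5 stripe (two data chunks plus XOR parity). The chunk values and
the function names are invented for illustration; real stripes are much
larger and MD's code looks nothing like this. The point is only that stale
parity plus a failed disk corrupts a chunk that was never written.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings (RAID5-style parity)."""
    return bytes(x ^ y for x, y in zip(a, b))


def write_hole_demo():
    """Return (original d1, d1 as rebuilt after crash + disk failure)."""
    # Consistent stripe: parity = d0 XOR d1.
    d0, d1 = b"\x11\x11", b"\x22\x22"
    parity = xor_bytes(d0, d1)

    # An update to d0 starts. The crash happens after the new d0 reached
    # its disk but before the parity disk was rewritten.
    d0_new = b"\x33\x33"
    stale_parity = parity

    # Before resync can fix the parity, the disk holding d1 fails.
    # d1 is rebuilt from the surviving chunks: d0_new XOR stale parity.
    rebuilt_d1 = xor_bytes(d0_new, stale_parity)
    return d1, rebuilt_d1


original, rebuilt = write_hole_demo()
print(original.hex(), rebuilt.hex())
```

With these values the rebuilt chunk differs from the original d1, even
though d1 itself was not part of the interrupted write.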
> +The write-through cache caches all data on the cache disk first. Only
> +after the data hits the cache disk is it flushed to the RAID disks. The
> +two-step write guarantees MD can recover correct data after an unclean
> +shutdown, even with a disk failure. Thus the cache closes the 'write
> +hole'.
> +
> +In write-through mode, MD reports IO completion to the upper layer
> +(usually filesystems) only after the data hits the RAID disks, so a cache
> +disk failure doesn't cause data loss. Of course, a cache disk failure
> +means the array is exposed to the 'write hole' again.
> +
> +--------------------------------------
> +write-back mode:
> +
> +write-back mode fixes the 'write hole' issue too, since all write data is
> +cached on the cache disk. But the main goal of the 'write-back' cache is to
> +speed up writes. If a write covers all raid disks of a stripe, we call it a
> +full-stripe write. For a non-full-stripe write, MD must do a
> +read-modify-write. The extra read (for data on other disks) and write (for
> +parity) introduce a lot of overhead. Writes which are sequential but not
> +dispatched at the same time suffer from this overhead too. The write-back
> +cache aggregates the data and flushes it to the raid disks once it forms
> +a full-stripe write. This completely avoids the overhead, so it's very
> +helpful for some workloads. A typical example is a workload that does
> +sequential writes, each followed by fsync.
> +
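A back-of-the-envelope Python model of the overhead described above, for a
single-parity (RAID5) stripe. The function `stripe_write_cost` and its cost
model are assumptions made for illustration: it counts chunk-sized disk
reads and writes, and for a partial write it assumes the cheaper of two
read strategies is used (re-read the old data and old parity, or read the
untouched data chunks). It says nothing about how MD actually schedules IO.

```python
def stripe_write_cost(n_data, k):
    """Return (reads, writes) in chunks for writing k of n_data data chunks.

    Simplified single-parity model, not MD's real accounting.
    """
    if not 1 <= k <= n_data:
        raise ValueError("k must be between 1 and n_data")
    if k == n_data:
        # Full-stripe write: parity comes from the new data alone.
        return 0, n_data + 1
    reads_old = k + 1         # old copies of the k chunks plus old parity
    reads_other = n_data - k  # the data chunks that are not being written
    return min(reads_old, reads_other), k + 1  # k data chunks + parity


# Four separate one-chunk writes to a stripe with 4 data disks...
separate = [stripe_write_cost(4, 1) for _ in range(4)]
print(sum(r for r, _ in separate), sum(w for _, w in separate))
# ...versus the same data aggregated into one full-stripe write.
print(stripe_write_cost(4, 4))
```

Under this model the aggregated write needs no reads at all and fewer
writes, which is the saving the write-back cache is after.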
> +In write-back mode, MD reports IO completion to the upper layer (usually
> +filesystems) once the data hits the cache disk. Data is flushed to raid disks
> +later, under specific conditions. So cache disk failure causes data loss.
> +
> +--------------------------------------
> +The implementation:
> +
> +The write-through and write-back caches use the same disk format. The cache
> +disk is organized as a simple write log. The log consists of 'meta data' and
> +'data' pairs. The meta data describes the data. It also includes a checksum
> +and sequence ID for recovery identification. Data can be IO data or parity
> +data. Data is checksummed too. The checksum is stored in the meta data ahead
> +of the data. The checksum is an optimization because MD can write meta data
> +and data freely without worrying about the order. The MD superblock has a
> +field pointing to the valid meta data of the log head.
> +
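A toy Python sketch of a 'meta data' + 'data' pair, to show why a data
checksum stored in the meta data makes the write order irrelevant: a reader
can tell on its own whether the data that follows is complete. Everything
about the layout here is hypothetical (field order, sizes, the names
`pack_record` and `parse_record`), and `zlib.crc32` merely stands in for
whatever checksum the kernel uses. This is not the real on-disk format.

```python
import struct
import zlib

# Hypothetical meta layout: sequence ID, data length, data checksum.
META = struct.Struct("<QII")
# Checksum over the meta block itself, stored right after it.
META_CRC = struct.Struct("<I")


def pack_record(seq, data):
    """Build one meta + data pair as bytes."""
    meta = META.pack(seq, len(data), zlib.crc32(data))
    return meta + META_CRC.pack(zlib.crc32(meta)) + data


def parse_record(buf):
    """Return (seq, data) if both checksums match, else None."""
    head = META.size + META_CRC.size
    if len(buf) < head:
        return None
    meta = buf[:META.size]
    (meta_crc,) = META_CRC.unpack_from(buf, META.size)
    if zlib.crc32(meta) != meta_crc:
        return None  # meta block is torn or corrupted
    seq, length, data_crc = META.unpack(meta)
    data = buf[head:head + length]
    if len(data) != length or zlib.crc32(data) != data_crc:
        return None  # data never fully reached the log
    return seq, data
```

A record whose data was cut short by a crash fails the second check, so
recovery can reject it without knowing which part was written first.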
> +The log implementation is pretty straightforward. The difficult part is the
> +order in which MD writes data to the cache disk and raid disks. Specifically,
> +in write-through mode, MD calculates parity for IO data, writes both IO data
> +and parity to the log, writes the data and parity to raid disks after they
> +have settled in the log, and finally completes the IO. Reads just read from
> +raid disks as usual.
> +
> +In write-back mode, MD writes IO data to the log and reports IO completion.
> +The data is also fully cached in memory at that time, which means reads must
> +query the memory cache. When certain conditions are met, MD flushes the data
> +to raid disks: MD calculates parity for the data and writes the parity into
> +the log. After this is finished, MD writes both data and parity to the raid
> +disks, and then MD can release the memory cache. Flush conditions include
> +the stripe becoming a full-stripe write, free cache disk space running low,
> +or in-kernel memory cache space running low.
> +
> +After an unclean shutdown, MD does recovery. MD reads all meta data and data
> +from the log. The sequence ID and checksum help detect corrupted meta data
> +and data. If MD finds a stripe with data and valid parities (1 parity for
> +raid4/5 and 2 for raid6), MD writes the data and parities to raid disks. If
> +the parities are incomplete, they are discarded. If part of the data is
> +corrupted, it is discarded too. MD then loads the valid data and writes it
> +to raid disks in the normal way.
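
The recovery rules in the last paragraph can be sketched in Python roughly
as follows. This is a simplified reading of the text only: the entry tuple
format, the function name `recover`, and the `parities_needed` parameter are
invented here, and sequence-ID handling is left out entirely. It is not a
description of what the kernel code does.

```python
from collections import defaultdict


def recover(entries, parities_needed=1):
    """Classify logged stripes after an unclean shutdown.

    entries: iterable of (stripe, kind, payload, checksum_ok) in log
    order, where kind is "data" or "parity" (hypothetical format).
    parities_needed: 1 for raid4/5, 2 for raid6.

    Returns (replay, rewrite):
      replay  - stripes with valid data and complete parity, which can be
                written to the raid disks as logged;
      rewrite - stripes whose valid data must be written the normal way
                because parity was missing or incomplete.
    """
    data = defaultdict(list)
    parity = defaultdict(list)
    for stripe, kind, payload, checksum_ok in entries:
        if not checksum_ok:
            continue  # corrupted data or parity is discarded
        (data if kind == "data" else parity)[stripe].append(payload)

    replay, rewrite = {}, {}
    for stripe, chunks in data.items():
        logged_parity = parity.get(stripe, [])
        if len(logged_parity) == parities_needed:
            replay[stripe] = (chunks, logged_parity)
        else:
            rewrite[stripe] = chunks  # incomplete parity is dropped
    return replay, rewrite
```

For example, a stripe with one valid data chunk and one valid parity chunk
lands in `replay`, while a stripe whose parity entry failed its checksum
lands in `rewrite` with only its valid data.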

Which versions of mdadm and the kernel support this feature? Is it already
released, or still in progress?

Ramesh

