Re: Split RAID: Proposal for archival RAID using incremental batch checksum

All of lore.kernel.org
 help / color / mirror / Atom feed

From: NeilBrown <neilb@suse.de>
To: Anshuman Aggarwal <anshuman.aggarwal@gmail.com>
Cc: Mdadm <linux-raid@vger.kernel.org>
Subject: Re: Split RAID: Proposal for archival RAID using incremental batch checksum
Date: Tue, 2 Dec 2014 08:46:11 +1100	[thread overview]
Message-ID: <20141202084611.45f56d6a@notabene.brown> (raw)
In-Reply-To: <CAK-d5dZMjbhAqgs9GOvswG6crZtrV2uSidRzt_hBJAH=m5rGKw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 5522 bytes --]

On Mon, 1 Dec 2014 22:04:42 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 1 December 2014 at 21:30, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
> > On 26 November 2014 at 11:54, Anshuman Aggarwal
> > <anshuman.aggarwal@gmail.com> wrote:
> >> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
> >>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
> >>> <anshuman.aggarwal@gmail.com> wrote:
> >>>
> >>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> >>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> >>>> > <anshuman.aggarwal@gmail.com> wrote:
> >>>> >
> >>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
> >>>> >> parity be invalidated for any write to any of the disks (assuming md
> >>>> >> operates at a chunk level)...also please see my reply below
> >>>> >
> >>>> > Operating at a chunk level would be a very poor design choice.  md/raid5
> >>>> > operates in units of 1 page (4K).
> >>>>
> >>>> It appears that my requirement may be met by a partitionable md raid 4
> >>>> array where the partitions are all on individual underlying block
> >>>> devices not striped across the block devices. Is that currently
> >>>> possible with md raid? I dont' see how but such an enhancement could
> >>>> do all that I had outlined earlier
> >>>>
> >>>> Is this possible to implement using RAID4 and MD already?
> >>>
> >>> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
> >>> Rounding down the size of your drives to match that could waste nearly half
> >>> the space.  However it should work as a proof-of-concept.
> >>>
> >>> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
> >>> RAID4/5/6 would be quite possible.
> >>>
> >>>>   can the
> >>>> partitions be made to write to individual block devices such that
> >>>> parity updates don't require reading all devices?
> >>>
> >>> md/raid4 will currently tries to minimize total IO requests when performing
> >>> an update, but prefer spreading the IO over more devices if the total number
> >>> of requests is the same.
> >>>
> >>> So for a 4-drive RAID4, Updating a single block can be done by:
> >>>   read old data block, read parity, write data, write parity - 4 IO requests
> >>> or
> >>>   read other 2 data blocks, write data, write parity - 4 IO requests.
> >>>
> >>> In this case it will prefer the second, which is not what you want.
> >>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
> >>> will be chosen.
> >>> It is quite trivial to flip this default for testing
> >>>
> >>> -       if (rmw < rcw && rmw > 0) {
> >>> +       if (rmw <= rcw && rmw > 0) {
> >>>
> >>>
> >>> If you had 5 drives, you could experiment with no code changes.
> >>> Make the chunk size the largest power of 2 that fits in the device, and then
> >>> partition to align the partitions on those boundaries.
> >>
> >> If the chunk size is almost the same as the device size, I assume the
> >> entire chunk is not invalidated for parity on writing to a single
> >> block? i.e. if only 1 block is updated only that blocks parity will be
> >> read and written and not for the whole chunk? If thats' the case, what
> >> purpose does a chunk serve in md raid ? If that's not the case, it
> >> wouldn't work because a single block updation would lead to parity
> >> being written for the entire chunk, which is the size of the device
> >>
> >> I do have more than 5 drives though they are in use currently. I will
> >> create a small testing partition on each device of the same size and
> >> run the test on that after ensuring that the drives do go to sleep.
> >>
> >>>
> >>> NeilBrown
> >>>
> >
> > Wouldn't the meta data writes wake up all the disks in the cluster
> > anyways (defeating the purpose)? This idea will require metadata to
> > not be written out to each device (is that even possible or on the
> > cards?)
> >
> > I am about to try out your suggestion with the chunk sizes anyways but
> > thought about the metadata being a major stumbling block.
> >
> 
> And it seems to be confirmed that the metadata write is waking up the
> other drives. On any write to a particular drive the metadata update
> is accessing all the others.
> 
> Am I correct in assuming that all metadata is currently written as
> part of the block device itself and that the external metadata  is
> still embedded in each of the block devices (only the format of the
> metadata is defined externally?) I guess to implement this we would
> need to store metadata elsewhere which may be a major development
> work. Still that may be a flexibility desired in md raid for other
> reasons...
> 
> Neil, your thoughts.

This is exactly why I suggested testing with existing code and seeing how far
you can get.  Thanks.

For a full solution we probably do need some code changes here, but for
further testing you could:
1/ make sure there is no bitmap (mdadm --grow --bitmap=none)
2/ set the safe_mode_delay to 0
     echo 0 > /sys/block/mdXXX/md/safe_mode_delay

when it won't try to update the metadata until you stop the array, or a
device fails.

Longer term: it would probably be good to only update the bitmap on the
devices that are being written to - and to merge all bitmaps when assembling
the array.  Also when there is a bitmap, the safe_mode functionality should
probably be disabled.

NeilBrown


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

next prev parent reply	other threads:[~2014-12-01 21:46 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-29  7:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
2014-10-29  7:32 ` Roman Mamedov
2014-10-29  8:31   ` Anshuman Aggarwal
2014-10-29  9:05 ` NeilBrown
2014-10-29  9:25   ` Anshuman Aggarwal
2014-10-29 19:27     ` Ethan Wilson
2014-10-30 14:57       ` Anshuman Aggarwal
2014-10-30 17:25         ` Piergiorgio Sartor
2014-10-31 11:05           ` Anshuman Aggarwal
2014-10-31 14:25             ` Matt Garman
2014-11-01 12:55             ` Piergiorgio Sartor
2014-11-06  2:29               ` Anshuman Aggarwal
2014-10-30 15:00     ` Anshuman Aggarwal
2014-11-03  5:52       ` NeilBrown
2014-11-03 18:04         ` Piergiorgio Sartor
2014-11-06  2:24         ` Anshuman Aggarwal
2014-11-24  7:29         ` Anshuman Aggarwal
2014-11-24 22:50           ` NeilBrown
2014-11-26  6:24             ` Anshuman Aggarwal
2014-12-01 16:00               ` Anshuman Aggarwal
2014-12-01 16:34                 ` Anshuman Aggarwal
2014-12-01 21:46                   ` NeilBrown [this message]
2014-12-02 11:56                     ` Anshuman Aggarwal
2014-12-16 16:25                       ` Anshuman Aggarwal
2014-12-16 21:49                         ` NeilBrown
2014-12-17  6:40                           ` Anshuman Aggarwal
2015-01-06 11:40                             ` Anshuman Aggarwal
     [not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
2014-11-01  5:36   ` Anshuman Aggarwal
  -- strict thread matches above, loose matches on Subject: below --
2014-11-21 10:15 Anshuman Aggarwal
2014-11-21 11:41 ` Greg Freemyer
2014-11-21 18:48   ` Anshuman Aggarwal
2014-11-22 13:17     ` Greg Freemyer
2014-11-22 13:22       ` Anshuman Aggarwal
2014-11-22 14:03         ` Greg Freemyer
2014-11-22 14:43           ` Anshuman Aggarwal
2014-11-22 14:54             ` Greg Freemyer
2014-11-24  5:36               ` SandeepKsinha
2014-11-24  6:48                 ` Anshuman Aggarwal
2014-11-24 13:19                   ` Greg Freemyer
2014-11-24 17:28                     ` Anshuman Aggarwal
2014-11-24 18:10                       ` Valdis.Kletnieks at vt.edu
2014-11-25  4:56                       ` Greg Freemyer
2014-11-27 17:50                         ` Anshuman Aggarwal
2014-11-27 18:31                           ` Greg Freemyer

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141202084611.45f56d6a@notabene.brown \
    --to=neilb@suse.de \
    --cc=anshuman.aggarwal@gmail.com \
    --cc=linux-raid@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.