From: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>
To: Anshuman Aggarwal <anshuman.aggarwal@gmail.com>
Cc: Piergiorgio Sartor <piergiorgio.sartor@nexgo.de>,
Ethan Wilson <ethan.wilson@shiftmail.org>,
linux-raid@vger.kernel.org
Subject: Re: Split RAID: Proposal for archival RAID using incremental batch checksum
Date: Sat, 1 Nov 2014 13:55:13 +0100
Message-ID: <20141101125513.GB2078@lazy.lzy>
In-Reply-To: <CAK-d5dbL0wwC0jWgHG5KKLHgJfJ3t7=8Y7cEW=Qo1WHYA+jZbA@mail.gmail.com>
On Fri, Oct 31, 2014 at 04:35:11PM +0530, Anshuman Aggarwal wrote:
> Hi pg,
> Because MD RAID stripes all writes, not only does it keep ALL disks
> spinning to read/write the current content, it also leads to
> catastrophic data loss if the number of failed disks during a
> rebuild exceeds the number of parity disks.
Hi Anshuman,
yes, but do you have hard evidence that
this is a common RAID-6 problem?
Considering that we now have the bad block list,
the write-intent bitmap and proactive replacement,
a triple failure in RAID-6 does not seem
to me to be the main issue.
Considering that libraries for more than
2 parities are available, I think the multiple
failure case is quite a rarity.
Furthermore, I suspect there are other types
of catastrophic events (lightning, for example)
that can destroy an array completely.
> But more importantly, I find myself setting up multiple RAID levels
> (at least RAID6 and now thinking of more) just to make sure that MD
> RAID will recover my data and not lose the whole cluster if one more
> disk fails than the parity can cover!!! The biggest advantage of the
> scheme I have outlined is that with a single checksum I am mostly
> assured of restoring a failed disk, and in the worst case only the
> media (movies/music) on the failing disk are lost, not the whole
> cluster.
Each disk will have its own filesystem?
If this is not the case, you cannot say
if a single disk failure will lose only
some files.
> Also, in my experience of disks and usage, what you are saying was
> true a while ago, when storage capacity had not yet hit multiple TBs.
> Now, if I am buying 3-4 TB disks, they are likely to last a while,
> especially since the incremental % growth in sizes seems to be
> slowing down.
As written above, you can safely replace
disks before they fail, without compromising
the array.
bye,
pg
> Regards,
> Anshuman
>
> On 30 October 2014 22:55, Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
> > On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
> >> What you are suggesting will work for delaying writing the checksum
> >> (but it still makes 2 disks work non-stop, leading to failure, cost,
> >> etc.).
> >
> > Hi Anshuman,
> >
> > I'm a bit missing the point here.
> >
> > In my experience, with my storage systems, I change
> > disks because they're too small, long before they
> > are too old (and long before they fail).
> > That's why I end up with a collection of small HDDs,
> > which, in turn, I recycle in some custom storage
> > system (using disks of different sizes, as explained
> > in one of the links posted before).
> >
> > Honestly, the only reason to spin down the disks, still
> > in my experience, is for reducing power consumption.
> > And this can be done with a RAID-6 without problems
> > and in an extremely flexible way.
> >
> > So, the bottom line, still in my experience, is that
> > what you're describing seems like quite a nice situation.
> >
> > Or, I did not understand what you're proposing.
> >
> > Thanks,
> >
> > bye,
> >
> > pg
> >
> >> I am proposing N independent disks which are rarely accessed. When
> >> parity has to be written to the remaining 1, 2 ... X disks, it is
> >> batched up (bcache is feasible) and written out once in a while,
> >> depending on how much writing is happening. N-1 disks stay spun down
> >> and only the X disks wake up periodically to have the checksum
> >> written to them (this would be tuned by the user, trading how up to
> >> date he needs the parity to be, i.e. the tolerance for rebuilding
> >> parity after a crash, against the disk access for each parity write).
> >>
> >> It can't be done using any RAID6, because RAID5/6 will stripe all the
> >> data across the devices, making any read access wake up all of the
> >> devices. Ditto for writing the parity on every write to a single disk.
> >>
> >> The architecture being proposed is a lazy parity write for
> >> individual disks, which won't suffer from RAID's catastrophic data
> >> loss or from constant concurrent disk activity.
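To make the arithmetic behind this concrete: with each data disk acting as
one giant chunk and no parity rotation, a parity update is the usual
read-modify-write, P_new = P_old xor D_old xor D_new, so only the written
data disk and the parity disk need to spin up. Below is a minimal, untested
Python sketch under those assumptions; the device paths and block size are
placeholders, and only single XOR parity is shown (the proposal allows more
parity devices):

    BLOCK = 4096  # assumed block size, for illustration only

    def update_block(data_dev, parity_dev, offset, new_data):
        # Read-modify-write parity update: P' = P ^ D_old ^ D_new.
        assert len(new_data) == BLOCK
        with open(data_dev, "r+b") as d, open(parity_dev, "r+b") as p:
            d.seek(offset); old = d.read(BLOCK)
            p.seek(offset); par = p.read(BLOCK)
            new_par = bytes(a ^ b ^ c for a, b, c in zip(par, old, new_data))
            d.seek(offset); d.write(new_data)
            p.seek(offset); p.write(new_par)

    # Hypothetical example: only the written data disk and the parity disk
    # are touched; the other N-1 data disks can stay spun down. Any single
    # lost disk can later be reconstructed as the XOR of all the others.
    # update_block("/dev/sdb", "/dev/sde", 0, b"\0" * BLOCK)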
> >>
> >>
> >>
> >>
> >> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
> >> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
> >> >>
> >> >> Right on most counts but please see comments below.
> >> >>
> >> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >> >>>
> >> >>> Just to be sure I understand, you would have N + X devices. Each of the
> >> >>> N
> >> >>> devices contains an independent filesystem and could be accessed directly
> >> >>> if
> >> >>> needed. Each of the X devices contains some codes so that if at most X
> >> >>> devices in total died, you would still be able to recover all of the
> >> >>> data.
> >> >>> If more than X devices failed, you would still get complete data from the
> >> >>> working devices.
> >> >>>
> >> >>> Every update would only write to the particular N device on which it is
> >> >>> relevant, and all of the X devices. So N needs to be quite a bit bigger
> >> >>> than X for the spin-down to be really worth it.
> >> >>>
> >> >>> Am I right so far?
> >> >>
> >> >> Perfectly right so far. I typically have an N to X ratio of 4 (4
> >> >> data devices to 1 parity device), so spin-down is totally worth it
> >> >> for data protection, but more on that below.
> >> >>
> >> >>> For some reason the writes to X are delayed... I don't really understand
> >> >>> that part.
> >> >>
> >> >> This delay is basically designed around archival devices which are
> >> >> rarely read from and even more rarely written to. By delaying writes
> >> >> on 2 criteria (a designated cache buffer filling up, or a preset time
> >> >> since the last write expiring) we can significantly reduce the
> >> >> writes on the parity device. This assumes that we are OK with losing
> >> >> a movie or two in case the parity disk is not totally up to date, but
> >> >> are more interested in device longevity.
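The two criteria above amount to a very small flush policy. As an
illustration only, with arbitrary example values for the threshold and the
maximum age:

    import time

    MAX_DIRTY_BYTES = 64 * 1024**3   # example: flush once 64 GiB of writes are batched
    MAX_AGE_SECONDS = 24 * 3600      # example: or at the latest one day after the last write

    def should_flush_parity(dirty_bytes, last_write_time, now=None):
        # Flush the batched parity when the cache buffer is (nearly) full
        # OR when the preset time since the last write has expired.
        now = time.time() if now is None else now
        return (dirty_bytes >= MAX_DIRTY_BYTES
                or now - last_write_time >= MAX_AGE_SECONDS)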
> >> >>
> >> >>> Sounds like multi-parity RAID6 with no parity rotation and
> >> >>> chunksize == devicesize
> >> >>
> >> >> RAID6 would present us with a single joined device and currently only
> >> >> allows writes directly to that, yes? Any writes will be striped.
> >> >
> >> >
> >> > I am not totally sure I understand your design, but it seems to me that the
> >> > following solution could work for you:
> >> >
> >> > MD raid-6, maybe multi-parity (multi-parity is not implemented in MD
> >> > yet, but just do a periodic scrub and 2 parities can be fine. Wake-up
> >> > is not so expensive that you can't scrub)
> >> >
> >> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device
> >> > (those two will never spin down) in writeback mode with
> >> > writeback_running=off.
> >> > This will prevent writes to the backend and leave the backend array
> >> > spun down.
> >> > When bcache is almost full (poll dirty_data), switch to
> >> > writeback_running=on and writethrough: it will wake up the backend
> >> > raid6 array and flush all dirty data. You can then revert to
> >> > writeback and writeback_running=off.
> >> > After this you can spin down the backend array again.
> >> >
> >> > You also get read caching for free, which helps the backend array to stay
> >> > spun down as much as possible.
> >> >
> >> > Maybe you can modify bcache slightly so as to implement automatic
> >> > switching between the modes as described above, instead of polling
> >> > the state from outside.
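Until bcache grows such an automatic mode, the polling-from-outside variant
described above could be scripted roughly as follows. This is an untested
sketch, assuming a cache device named bcache0 and the sysfs attributes
mentioned above (dirty_data, writeback_running, cache_mode); the exact paths,
the human-readable size format and the threshold are assumptions that may
need adjusting:

    import time

    BCACHE = "/sys/block/bcache0/bcache"   # example device name
    THRESHOLD = 3 * 1024**4                # example: flush when ~3 TiB is dirty

    def read_size(path):
        # bcache reports sizes in human-readable form, e.g. "512.0k" or "1.2G".
        text = open(path).read().strip()
        units = {"k": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}
        if text and text[-1] in units:
            return float(text[:-1]) * units[text[-1]]
        return float(text)

    def set_attr(name, value):
        with open(f"{BCACHE}/{name}", "w") as f:
            f.write(str(value))

    while True:
        if read_size(f"{BCACHE}/dirty_data") > THRESHOLD:
            # Wake the backing RAID-6 and flush everything that is dirty.
            set_attr("cache_mode", "writethrough")
            set_attr("writeback_running", 1)
            while read_size(f"{BCACHE}/dirty_data") > 0:
                time.sleep(30)
            # Back to accumulating writes; the backing array can spin down again.
            set_attr("writeback_running", 0)
            set_attr("cache_mode", "writeback")
        time.sleep(600)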
> >> >
> >> > Would that work, or are you asking for something different?
> >> >
> >> > EW
> >> >
> >
> > --
> >
> > piergiorgio
--
piergiorgio
Thread overview: 28+ messages
2014-10-29 7:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
2014-10-29 7:32 ` Roman Mamedov
2014-10-29 8:31 ` Anshuman Aggarwal
2014-10-29 9:05 ` NeilBrown
2014-10-29 9:25 ` Anshuman Aggarwal
2014-10-29 19:27 ` Ethan Wilson
2014-10-30 14:57 ` Anshuman Aggarwal
2014-10-30 17:25 ` Piergiorgio Sartor
2014-10-31 11:05 ` Anshuman Aggarwal
2014-10-31 14:25 ` Matt Garman
2014-11-01 12:55 ` Piergiorgio Sartor [this message]
2014-11-06 2:29 ` Anshuman Aggarwal
2014-10-30 15:00 ` Anshuman Aggarwal
2014-11-03 5:52 ` NeilBrown
2014-11-03 18:04 ` Piergiorgio Sartor
2014-11-06 2:24 ` Anshuman Aggarwal
2014-11-24 7:29 ` Anshuman Aggarwal
2014-11-24 22:50 ` NeilBrown
2014-11-26 6:24 ` Anshuman Aggarwal
2014-12-01 16:00 ` Anshuman Aggarwal
2014-12-01 16:34 ` Anshuman Aggarwal
2014-12-01 21:46 ` NeilBrown
2014-12-02 11:56 ` Anshuman Aggarwal
2014-12-16 16:25 ` Anshuman Aggarwal
2014-12-16 21:49 ` NeilBrown
2014-12-17 6:40 ` Anshuman Aggarwal
2015-01-06 11:40 ` Anshuman Aggarwal
[not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
2014-11-01 5:36 ` Anshuman Aggarwal