linux-raid.vger.kernel.org archive mirror
* Split RAID: Proposal for archival RAID using incremental batch checksum
@ 2014-10-29  7:15 Anshuman Aggarwal
  2014-10-29  7:32 ` Roman Mamedov
                   ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-10-29  7:15 UTC (permalink / raw)
  To: linux-raid

I'm outlining below a proposal for a RAID device mapper virtual block
device for the kernel which adds "split raid" functionality on an
incremental batch basis for a home media server/archived content which
is rarely accessed.

Given a set of N+X block devices (ideally of the same size; otherwise
the smallest common size is used), the Split RAID device-mapper target
generates virtual devices which are pass-through for the N data
devices and write batched/delayed checksums to the X devices, so that
blocks on any of the N devices can be recovered offline in case of a
single disk failure.
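
To illustrate the recovery model with a single parity device (X = 1):
each parity block is simply the XOR of the corresponding blocks on the
N data devices, so any one failed device can be reconstructed offline
from the survivors plus the parity. A toy sketch of that math in
Python (purely illustrative, not the proposed kernel code):

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            out[i] ^= b
    return bytes(out)

data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]  # one block from each of N=4 data devices
parity = xor_blocks(data)                    # the corresponding block on the X device

# Simulate losing device 2 and rebuilding it offline from the rest + parity:
survivors = [blk for i, blk in enumerate(data) if i != 2]
assert xor_blocks(survivors + [parity]) == data[2]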

Advantages over conventional RAID:

- Disks can be spun down, reducing wear and tear compared to MD RAID
levels (such as 1, 10, 5, 6) in the case of rarely accessed archival
content

- Catastrophic data loss from a multiple-device failure is prevented,
since each block device is independent and, unlike striped MD RAID,
data is only lost incrementally.

- Performance degradation for writes can be avoided by keeping the
checksum update asynchronous and delaying the fsync to the checksum
block device.

In the event of an improper shutdown the checksum may not have all the
updated data, but it will be mostly up to date, which is often
acceptable for home media server requirements. A flag can be set when
the checksum block device was shut down properly, indicating that a
full checksum rebuild is not required.
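
As a rough userspace sketch of that batching behaviour (hypothetical
names and thresholds, not the proposed kernel interface): parity
updates accumulate in a buffer and are flushed to the checksum device
only when the buffer fills or a preset time since the last write
expires, and a clean-shutdown flag records whether a full rebuild is
needed.

import time

def write_parity_block(block_no, data):
    pass  # stand-in for the real parity-device I/O

def fsync_parity_device():
    pass  # stand-in for a real fsync on the parity device

class DelayedParityWriter:
    def __init__(self, flush_bytes=64 << 20, flush_secs=3600):
        self.pending = {}              # block number -> parity update (bytes)
        self.pending_bytes = 0
        self.flush_bytes = flush_bytes # flush when this much is buffered...
        self.flush_secs = flush_secs   # ...or this long after the last write
        self.last_write = time.monotonic()
        self.parity_dirty = True       # cleared only by a clean shutdown

    def record_update(self, block_no, parity_update):
        self.pending[block_no] = parity_update
        self.pending_bytes += len(parity_update)
        self.last_write = time.monotonic()
        if self.pending_bytes >= self.flush_bytes:
            self.flush()

    def tick(self):
        # Called periodically; flushes once the time criterion expires.
        if self.pending and time.monotonic() - self.last_write >= self.flush_secs:
            self.flush()

    def flush(self):
        for block_no, update in sorted(self.pending.items()):
            write_parity_block(block_no, update)
        fsync_parity_device()          # single, delayed fsync
        self.pending.clear()
        self.pending_bytes = 0

    def clean_shutdown(self):
        self.flush()
        self.parity_dirty = False      # no full checksum rebuild needed next boot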

Existing solutions considered:

- SnapRAID (http://snapraid.sourceforge.net/), which is a
snapshot-based scheme. Its advantages are that it runs in user space
and has cross-platform support, but it has the huge disadvantage that
every checksum is computed from scratch, slowing the system, causing
immense wear and tear on every snapshot, and losing any updates made
since the last snapshot.

I'd like to get opinions on the pros and cons of this proposal from
more experienced people on the list, and pointers on the following
questions:

- Can this already be done using block device facilities that already
exist in the kernel?

- If not, is device mapper the right API to use? (I think so.)

- What would be the best existing block device code to look at as a
starting point for an implementation?

Neil, I would appreciate your weighing in on this.

Regards,

Anshuman Aggarwal

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  7:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
@ 2014-10-29  7:32 ` Roman Mamedov
  2014-10-29  8:31   ` Anshuman Aggarwal
  2014-10-29  9:05 ` NeilBrown
       [not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
  2 siblings, 1 reply; 28+ messages in thread
From: Roman Mamedov @ 2014-10-29  7:32 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: linux-raid

On Wed, 29 Oct 2014 12:45:34 +0530
Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:

> I'm outlining below a proposal for a RAID device mapper virtual block
> device for the kernel which adds "split raid" functionality on an
> incremental batch basis for a home media server/archived content which
> is rarely accessed.

> Existing solutions considered:

Some of the already-available "home media server" setup schemes you did not
mention:

http://linuxconfig.org/prouhd-raid-for-the-end-user
a smart way of managing MD RAID given multiple devices of various sizes;

http://louwrentius.com/building-a-raid-6-array-of-mixed-drives.html
what to do with a set of mixed-size drives, in simpler terms;

https://romanrm.net/mhddfs
File-level "concatenation" of disks, with smart distribution of new files;

-- 
With respect,
Roman

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  7:32 ` Roman Mamedov
@ 2014-10-29  8:31   ` Anshuman Aggarwal
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-10-29  8:31 UTC (permalink / raw)
  To: Roman Mamedov; +Cc: linux-raid

Actually I already use a combination of these solutions (MD RAID,
multiple devices + LVM2 to join them). Unfortunately, none of these
solutions addresses the following:

- Full data loss when disk failures exceed what the RAID level can
tolerate (2 disks in RAID5, 3 disks in RAID6). The proposed solution
limits the loss to the data on the failed disk.
- Continuous reads/writes to all disks, causing wear and tear that
reduces disk life and increases end-user cost.

mhddfs (or something like it) would probably be used on top of the N
devices in this proposal to join them, but that is up to the user's
requirements.


On 29 October 2014 13:02, Roman Mamedov <rm@romanrm.net> wrote:
> On Wed, 29 Oct 2014 12:45:34 +0530
> Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>
>> I'm outlining below a proposal for a RAID device mapper virtual block
>> device for the kernel which adds "split raid" functionality on an
>> incremental batch basis for a home media server/archived content which
>> is rarely accessed.
>
>> Existing solutions considered:
>
> Some of the already-available "home media server" setup schemes you did not
> mention:
>
> http://linuxconfig.org/prouhd-raid-for-the-end-user
> a smart way of managing MD RAID given multiple devices of various sizes;
>
> http://louwrentius.com/building-a-raid-6-array-of-mixed-drives.html
> what to do with a set of mixed-size drives, in simpler terms;
>
> https://romanrm.net/mhddfs
> File-level "concatenation" of disks, with smart distribution of new files;
>
> --
> With respect,
> Roman

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  7:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
  2014-10-29  7:32 ` Roman Mamedov
@ 2014-10-29  9:05 ` NeilBrown
  2014-10-29  9:25   ` Anshuman Aggarwal
       [not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
  2 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2014-10-29  9:05 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: linux-raid


On Wed, 29 Oct 2014 12:45:34 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> I'm outlining below a proposal for a RAID device mapper virtual block
> device for the kernel which adds "split raid" functionality on an
> incremental batch basis for a home media server/archived content which
> is rarely accessed.
> 
> Given a set of N+X block devices (of the same size but smallest common
> size wins)
> 
> the SplitRAID device mapper device generates virtual devices which are
> passthrough for N devices and write a Batched/Delayed checksum into
> the X devices so as to allow offline recovery of block on the N
> devices in case of a single disk failure.
> 
> Advantages over conventional RAID:
> 
> - Disks can be spun down reducing wear and tear over MD RAID Levels
> (such as 1, 10, 5,6) in the case of rarely accessed archival content
> 
> - Prevent catastrophic data loss for multiple device failure since
> each block device is independent and hence unlike MD RAID will only
> lose data incrementally.
> 
> - Performance degradation for writes can be achieved by keeping the
> checksum update asynchronous and delaying the fsync to the checksum
> block device.
> 
> In the event of improper shutdown the checksum may not have all the
> updated data but will be mostly up to date which is often acceptable
> for home media server requirements. A flag can be set in case the
> checksum block device was shutdown properly indicating that  a full
> checksum rebuild is not required.
> 
> Existing solutions considered:
> 
> - SnapRAID (http://snapraid.sourceforge.net/) which is a snapshot
> based scheme (Its advantages are that its in user space and has cross
> platform support but has the huge disadvantage of every checksum being
> done from scratch slowing the system, causing immense wear and tear on
> every snapshot and also losing any information updates upto the
> snapshot point etc)
> 
> I'd like to get opinions on the pros and cons of this proposal from
> more experienced people on the list to redirect suitably on the
> following questions:
> 
> - Maybe this can already be done using the block devices available in
> the kernel?
> 
> - If not, Device mapper the right API to use? (I think so)
> 
> - What would be the best block devices code to look at to implement?
> 
> Neil, would appreciate your weighing in on this.

Just to be sure I understand, you would have N + X devices.  Each of the N
devices contains an independent filesystem and could be accessed directly if
needed.  Each of the X devices contains some codes so that if at most X
devices in total died, you would still be able to recover all of the data.
If more than X devices failed, you would still get complete data from the
working devices.

Every update would only write to the particular N device on which it is
relevant, and  all of the X devices.  So N needs to be quite a bit bigger
than X for the spin-down to be really worth it.

Am I right so far?

For some reason the writes to X are delayed...  I don't really understand
that part.

Sounds like multi-parity RAID6 with no parity rotation and 
  chunksize == devicesize

I wouldn't use device-mapper myself, but you are unlikely to get an entirely
impartial opinion from me on that topic.

NeilBrown



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  9:05 ` NeilBrown
@ 2014-10-29  9:25   ` Anshuman Aggarwal
  2014-10-29 19:27     ` Ethan Wilson
  2014-10-30 15:00     ` Anshuman Aggarwal
  0 siblings, 2 replies; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-10-29  9:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Right on most counts but please see comments below.

On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> Just to be sure I understand, you would have N + X devices.  Each of the N
> devices contains an independent filesystem and could be accessed directly if
> needed.  Each of the X devices contains some codes so that if at most X
> devices in total died, you would still be able to recover all of the data.
> If more than X devices failed, you would still get complete data from the
> working devices.
>
> Every update would only write to the particular N device on which it is
> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> than X for the spin-down to be really worth it.
>
> Am I right so far?

Perfectly right so far. I typically have an N to X ratio of 4 (4
devices to 1 data), so spin-down is totally worth it for data
protection, but more on that below.

>
> For some reason the writes to X are delayed...  I don't really understand
> that part.

This delay is basically designed around archival devices which are
rarely read from and even more rarely written to. By delaying writes
based on two criteria (a designated cache buffer filling up, or a
preset time since the last write expiring) we can significantly reduce
the writes to the parity device. This assumes that we are OK with
losing a movie or two if the parity disk is not fully up to date, and
are more interested in device longevity.

>
> Sounds like multi-parity RAID6 with no parity rotation and
>   chunksize == devicesize
RAID6 would present us with a joint device and currently only allows
writes to that directly, yes? Any writes would be striped. In any
case, would MD RAID allow the underlying devices to be written to
directly? Also, how would it know that a device has been written to,
and hence that the parity has to be updated? What about the
superblock, which the FS would not know about?

There is also the delayed checksum writing part, which is significant
if one of the objectives is to reduce the amount of writes. Can the
parity update be delayed in the current RAID6 code? I understand the
objective of RAID6 is to ensure data recovery, and we are looking at a
compromise in this case.

If feasible, this could be an enhancement to MD RAID as well, where N
devices are presented instead of a single joint device in the case of
RAID6 (maybe the parts of a multi-part device could be the individual
disks?)

It would certainly solve my problem of where to store the metadata. I
was hoping to just store it as a configuration file to be read by the
initramfs, since in the worst case the checksum goes out of sync and
is rebuilt from scratch.

>
> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
> impartial opinion from me on that topic.

I haven't hacked on the kernel internals much so far, so I will have
to dig out that history. I would welcome any particular links/mail
threads I should look at for guidance (with both your and opposing
points of view).

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  9:25   ` Anshuman Aggarwal
@ 2014-10-29 19:27     ` Ethan Wilson
  2014-10-30 14:57       ` Anshuman Aggarwal
  2014-10-30 15:00     ` Anshuman Aggarwal
  1 sibling, 1 reply; 28+ messages in thread
From: Ethan Wilson @ 2014-10-29 19:27 UTC (permalink / raw)
  To: linux-raid

On 29/10/2014 10:25, Anshuman Aggarwal wrote:
> Right on most counts but please see comments below.
>
> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> Just to be sure I understand, you would have N + X devices.  Each of the N
>> devices contains an independent filesystem and could be accessed directly if
>> needed.  Each of the X devices contains some codes so that if at most X
>> devices in total died, you would still be able to recover all of the data.
>> If more than X devices failed, you would still get complete data from the
>> working devices.
>>
>> Every update would only write to the particular N device on which it is
>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> than X for the spin-down to be really worth it.
>>
>> Am I right so far?
> Perfectly right so far. I typically have a N to X ratio of 4 (4
> devices to 1 data) so spin down is totally worth it for data
> protection but more on that below.
>
>> For some reason the writes to X are delayed...  I don't really understand
>> that part.
> This delay is basically designed around archival devices which are
> rarely read from and even more rarely written to. By delaying writes
> on 2 criteria ( designated cache buffer filling up or preset time
> duration from last write expiring) we can significantly reduce the
> writes on the parity device. This assumes that we are ok to lose a
> movie or two in case the parity disk is not totally up to date but are
> more interested in device longevity.
>
>> Sounds like multi-parity RAID6 with no parity rotation and
>>    chunksize == devicesize
> RAID6 would present us with a joint device and currently only allows
> writes to that directly, yes? Any writes will be striped.

I am not totally sure I understand your design, but it seems to me that 
the following solution could work for you:

MD RAID-6, maybe multi-parity (multi-parity is not implemented in MD
yet, but just do a periodic scrub and 2 parities can be fine; wake-up
is not so expensive that you can't scrub).

Over that you put a RAID1 of 2 x 4TB disks as a bcache cache device
(those two will never spin down) in writeback mode with
writeback_running=off. This will prevent writes to the backend and
leave the backend array spun down.
When bcache is almost full (poll dirty_data), switch to
writeback_running=on and writethrough: it will wake up the backend
RAID6 array and flush all the dirty data. You can then revert to
writeback and writeback_running=off. After this you can spin down the
backend array again.

You also get read caching for free, which helps the backend array to 
stay spun down as much as possible.

Maybe you could modify bcache slightly to implement automatic
switching between the modes as described above, instead of polling the
state from outside.
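
For example, a minimal external polling loop along those lines might
look like the sketch below (the sysfs paths assume a bcache device at
/sys/block/bcache0, and the thresholds are illustrative; this is not a
tested tool):

import time

SYSFS = "/sys/block/bcache0/bcache"
DIRTY_LIMIT = 200 * (1 << 30)            # flush once ~200 GiB is dirty

def read_attr(name):
    with open(f"{SYSFS}/{name}") as f:
        return f.read().strip()

def write_attr(name, value):
    with open(f"{SYSFS}/{name}", "w") as f:
        f.write(value)

def parse_size(text):
    # dirty_data is human readable (e.g. "1.5G"); convert to bytes.
    units = {"k": 1 << 10, "M": 1 << 20, "G": 1 << 30, "T": 1 << 40}
    return float(text[:-1]) * units[text[-1]] if text[-1] in units else float(text)

while True:
    if parse_size(read_attr("dirty_data")) > DIRTY_LIMIT:
        write_attr("writeback_running", "1")   # wake the backend and flush
        write_attr("cache_mode", "writethrough")
        while parse_size(read_attr("dirty_data")) > 0:
            time.sleep(60)
        write_attr("cache_mode", "writeback")  # back to lazy caching
        write_attr("writeback_running", "0")   # backend can spin down again
    time.sleep(600)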

Would that work, or are you asking for something different?

EW


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29 19:27     ` Ethan Wilson
@ 2014-10-30 14:57       ` Anshuman Aggarwal
  2014-10-30 17:25         ` Piergiorgio Sartor
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-10-30 14:57 UTC (permalink / raw)
  To: Ethan Wilson; +Cc: linux-raid

What you are suggesting would work for delaying the writing of the
checksum (but it still keeps 2 disks working non-stop, leading to
failure, cost, etc.).
I am proposing N independent disks which are rarely accessed. When
parity has to be written to the remaining 1, 2, ... X disks, it is
batched up (bcache is feasible for this) and written out once in a
while, depending on how much writing is happening. N-1 disks stay spun
down and only the X disks wake up periodically to have the checksum
written to them (this would be tuned by the user, trading how up to
date the parity needs to be, i.e. the tolerance for rebuilding parity
after a crash, against the disk accesses for each parity write).

It can't be done using RAID6 because RAID5/6 stripe all the data
across the devices, making any read access wake up all the devices.
Ditto for writing parity on every write to a single disk.

The architecture being proposed is a lazy parity write for individual
disks, which suffers neither from RAID's catastrophic data loss nor
from constant concurrent disk activity.




On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
> On 29/10/2014 10:25, Anshuman Aggarwal wrote:
>>
>> Right on most counts but please see comments below.
>>
>> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>>>
>>> Just to be sure I understand, you would have N + X devices.  Each of the
>>> N
>>> devices contains an independent filesystem and could be accessed directly
>>> if
>>> needed.  Each of the X devices contains some codes so that if at most X
>>> devices in total died, you would still be able to recover all of the
>>> data.
>>> If more than X devices failed, you would still get complete data from the
>>> working devices.
>>>
>>> Every update would only write to the particular N device on which it is
>>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>>> than X for the spin-down to be really worth it.
>>>
>>> Am I right so far?
>>
>> Perfectly right so far. I typically have a N to X ratio of 4 (4
>> devices to 1 data) so spin down is totally worth it for data
>> protection but more on that below.
>>
>>> For some reason the writes to X are delayed...  I don't really understand
>>> that part.
>>
>> This delay is basically designed around archival devices which are
>> rarely read from and even more rarely written to. By delaying writes
>> on 2 criteria ( designated cache buffer filling up or preset time
>> duration from last write expiring) we can significantly reduce the
>> writes on the parity device. This assumes that we are ok to lose a
>> movie or two in case the parity disk is not totally up to date but are
>> more interested in device longevity.
>>
>>> Sounds like multi-parity RAID6 with no parity rotation and
>>>    chunksize == devicesize
>>
>> RAID6 would present us with a joint device and currently only allows
>> writes to that directly, yes? Any writes will be striped.
>
>
> I am not totally sure I understand your design, but it seems to me that the
> following solution could work for you:
>
> MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
> but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
> expensive that you can't scrub)
>
> Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
> two will never spin-down) in writeback mode with writeback_running=off .
> This will prevent writes to backend and leave the backend array spun down.
> When bcache is almost full (poll dirty_data), switch to writeback_running=on
> and writethrough: it will wake up the backend raid6 array and flush all
> dirty data. You can then then revert to writeback and writeback_running=off.
> After this you can spin-down the backend array again.
>
> You also get read caching for free, which helps the backend array to stay
> spun down as much as possible.
>
> Maybe you can modify bcache slightly so to implement an automatic switching
> between the modes as described above, instead of polling the state from
> outside.
>
> Would that work, or you are asking something different?
>
> EW
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-29  9:25   ` Anshuman Aggarwal
  2014-10-29 19:27     ` Ethan Wilson
@ 2014-10-30 15:00     ` Anshuman Aggarwal
  2014-11-03  5:52       ` NeilBrown
  1 sibling, 1 reply; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-10-30 15:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid

Would chunksize == disksize work? Wouldn't that lead to the entire
parity being invalidated by any write to any of the disks (assuming MD
operates at a chunk level)? Also please see my reply below.

On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> Right on most counts but please see comments below.
>
> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> Just to be sure I understand, you would have N + X devices.  Each of the N
>> devices contains an independent filesystem and could be accessed directly if
>> needed.  Each of the X devices contains some codes so that if at most X
>> devices in total died, you would still be able to recover all of the data.
>> If more than X devices failed, you would still get complete data from the
>> working devices.
>>
>> Every update would only write to the particular N device on which it is
>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> than X for the spin-down to be really worth it.
>>
>> Am I right so far?
>
> Perfectly right so far. I typically have a N to X ratio of 4 (4
> devices to 1 data) so spin down is totally worth it for data
> protection but more on that below.
>
>>
>> For some reason the writes to X are delayed...  I don't really understand
>> that part.
>
> This delay is basically designed around archival devices which are
> rarely read from and even more rarely written to. By delaying writes
> on 2 criteria ( designated cache buffer filling up or preset time
> duration from last write expiring) we can significantly reduce the
> writes on the parity device. This assumes that we are ok to lose a
> movie or two in case the parity disk is not totally up to date but are
> more interested in device longevity.
>
>>
>> Sounds like multi-parity RAID6 with no parity rotation and
>>   chunksize == devicesize
> RAID6 would present us with a joint device and currently only allows
> writes to that directly, yes? Any writes will be striped.
> In any case would md raid allow the underlying device to be written to
> directly? Also how would it know that the device has been written to
> and hence parity has to be updated? What about the superblock which
> the FS would not know about?
>
> Also except for the delayed checksum writing part which would be
> significant if one of the objectives is to reduce the amount of
> writes. Can we delay that in the code currently for RAID6? I
> understand the objective of RAID6 is to ensure data recovery and we
> are looking at a compromise in this case.
>
> If feasible, this can be an enhancement to MD RAID as well where N
> devices are presented instead of a single joint device in case of
> raid6 (maybe the multi part device can be individual disks?)
>
> It will certainly solve my problem of where to store the metadata. I
> was currently hoping to just store it as a configuration file to be
> read by the initramfs since in this case worst case scenario the
> checksum goes out of sync and is rebuilt from scratch.
>
>>
>> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> impartial opinion from me on that topic.
>
> I haven't hacked around the kernel internals much so far so will have
> to dig out that history. I will welcome any particular links/mail
> threads I should look at for guidance (with both yours and opposing
> points of view)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-30 14:57       ` Anshuman Aggarwal
@ 2014-10-30 17:25         ` Piergiorgio Sartor
  2014-10-31 11:05           ` Anshuman Aggarwal
  0 siblings, 1 reply; 28+ messages in thread
From: Piergiorgio Sartor @ 2014-10-30 17:25 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Ethan Wilson, linux-raid

On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
>  What you are suggesting will work for delaying writing the checksum
> (but still making 2 disks work non stop and lead to failure, cost
> etc).

Hi Anshuman,

I'm a bit missing the point here.

In my experience with my storage systems, I replace
disks because they're too small long before they are
too old (long before they fail).
That's why I end up with a collection of small HDDs,
which, in turn, I recycle in some custom storage
system (using disks of different sizes, as explained
in one of the links posted before).

Honestly, the only reason to spin down the disks, still
in my experience, is to reduce power consumption.
And this can be done with a RAID-6 without problems
and in an extremely flexible way.

So, the bottom line, still in my experience, is that
what you're describing seems quite a niche situation.

Or maybe I did not understand what you're proposing.

Thanks,

bye,

pg

> I am proposing N independent disks which are rarely accessed. When
> parity has to be written to the remaining 1,2 ...X disks ...it is
> batched up (bcache is feasible) and written out once in a while
> depending on how much write is happening. N-1 disks stay spun down and
> only X disks wake up periodically to get checksum written to (this
> would be tweaked by the user based on how up to date he needs the
> parity to be (tolerance of rebuilding parity in case of crash) and vs
> disk access for each parity write)
> 
> It can't be done using any RAID6 because RAID5/6 will stripe all the
> data across the devices making any read access wake up all the
> devices. Ditto for writing to parity on every write to a single disk.
> 
> The architecture being proposed is a lazy write to manage parity for
> individual disks which won't suffer from RAID catastrophic data loss
> and concurrent disk.
> 
> 
> 
> 
> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
> >>
> >> Right on most counts but please see comments below.
> >>
> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >>>
> >>> Just to be sure I understand, you would have N + X devices.  Each of the
> >>> N
> >>> devices contains an independent filesystem and could be accessed directly
> >>> if
> >>> needed.  Each of the X devices contains some codes so that if at most X
> >>> devices in total died, you would still be able to recover all of the
> >>> data.
> >>> If more than X devices failed, you would still get complete data from the
> >>> working devices.
> >>>
> >>> Every update would only write to the particular N device on which it is
> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> >>> than X for the spin-down to be really worth it.
> >>>
> >>> Am I right so far?
> >>
> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
> >> devices to 1 data) so spin down is totally worth it for data
> >> protection but more on that below.
> >>
> >>> For some reason the writes to X are delayed...  I don't really understand
> >>> that part.
> >>
> >> This delay is basically designed around archival devices which are
> >> rarely read from and even more rarely written to. By delaying writes
> >> on 2 criteria ( designated cache buffer filling up or preset time
> >> duration from last write expiring) we can significantly reduce the
> >> writes on the parity device. This assumes that we are ok to lose a
> >> movie or two in case the parity disk is not totally up to date but are
> >> more interested in device longevity.
> >>
> >>> Sounds like multi-parity RAID6 with no parity rotation and
> >>>    chunksize == devicesize
> >>
> >> RAID6 would present us with a joint device and currently only allows
> >> writes to that directly, yes? Any writes will be striped.
> >
> >
> > I am not totally sure I understand your design, but it seems to me that the
> > following solution could work for you:
> >
> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
> > expensive that you can't scrub)
> >
> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
> > two will never spin-down) in writeback mode with writeback_running=off .
> > This will prevent writes to backend and leave the backend array spun down.
> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
> > and writethrough: it will wake up the backend raid6 array and flush all
> > dirty data. You can then then revert to writeback and writeback_running=off.
> > After this you can spin-down the backend array again.
> >
> > You also get read caching for free, which helps the backend array to stay
> > spun down as much as possible.
> >
> > Maybe you can modify bcache slightly so to implement an automatic switching
> > between the modes as described above, instead of polling the state from
> > outside.
> >
> > Would that work, or you are asking something different?
> >
> > EW
> >

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-30 17:25         ` Piergiorgio Sartor
@ 2014-10-31 11:05           ` Anshuman Aggarwal
  2014-10-31 14:25             ` Matt Garman
  2014-11-01 12:55             ` Piergiorgio Sartor
  0 siblings, 2 replies; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-10-31 11:05 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Ethan Wilson, linux-raid

Hi pg,
With MD RAID striping all the writes, not only does it keep ALL disks
spinning to read/write the current content, it also leads to
catastrophic data loss if the number of failed disks (e.g. during a
rebuild) exceeds the number of parity disks.

But more importantly, I find myself setting up higher RAID levels (at
least RAID6, and now thinking of more) just to make sure that MD RAID
will recover my data and not lose the whole cluster if one more disk
fails than the parity allows. The biggest advantage of the scheme I
have outlined is that with a single checksum I am mostly assured of
restoring a failed disk, and in the worst case only the media
(movies/music) on the failing disk is lost, not the whole cluster.

Also, in my experience of disks and usage, what you are saying was
true a while ago, when storage capacities had not hit multiple TBs.
Now, if I am buying 3-4 TB disks, they are likely to last a while,
especially since the incremental % growth in sizes seems to be slowing
down.

Regards,
Anshuman

On 30 October 2014 22:55, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
>>  What you are suggesting will work for delaying writing the checksum
>> (but still making 2 disks work non stop and lead to failure, cost
>> etc).
>
> Hi Anshuman,
>
> I'm a bit missing the point here.
>
> In my experience, with my storage systems, I change
> disks because they're too small, way long before they
> are too old (way long before they fail).
> That's why I end up with a collection of small HDDs.
> which, in turn, I recycled in some custom storage
> system (using disks of different size, like explained
> in one of the links posted before).
>
> Honestly, the only reason to spin down the disks, still
> in my experience, is for reducing power consumption.
> And this can be done with a RAID-6 without problems
> and in a extremely flexible way.
>
> So, the bottom line, still in my experience, is that
> this you're describing seems quite a nice situation.
>
> Or, I did not understood what you're proposing.
>
> Thanks,
>
> bye,
>
> pg
>
>> I am proposing N independent disks which are rarely accessed. When
>> parity has to be written to the remaining 1,2 ...X disks ...it is
>> batched up (bcache is feasible) and written out once in a while
>> depending on how much write is happening. N-1 disks stay spun down and
>> only X disks wake up periodically to get checksum written to (this
>> would be tweaked by the user based on how up to date he needs the
>> parity to be (tolerance of rebuilding parity in case of crash) and vs
>> disk access for each parity write)
>>
>> It can't be done using any RAID6 because RAID5/6 will stripe all the
>> data across the devices making any read access wake up all the
>> devices. Ditto for writing to parity on every write to a single disk.
>>
>> The architecture being proposed is a lazy write to manage parity for
>> individual disks which won't suffer from RAID catastrophic data loss
>> and concurrent disk.
>>
>>
>>
>>
>> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
>> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
>> >>
>> >> Right on most counts but please see comments below.
>> >>
>> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >>>
>> >>> Just to be sure I understand, you would have N + X devices.  Each of the
>> >>> N
>> >>> devices contains an independent filesystem and could be accessed directly
>> >>> if
>> >>> needed.  Each of the X devices contains some codes so that if at most X
>> >>> devices in total died, you would still be able to recover all of the
>> >>> data.
>> >>> If more than X devices failed, you would still get complete data from the
>> >>> working devices.
>> >>>
>> >>> Every update would only write to the particular N device on which it is
>> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >>> than X for the spin-down to be really worth it.
>> >>>
>> >>> Am I right so far?
>> >>
>> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
>> >> devices to 1 data) so spin down is totally worth it for data
>> >> protection but more on that below.
>> >>
>> >>> For some reason the writes to X are delayed...  I don't really understand
>> >>> that part.
>> >>
>> >> This delay is basically designed around archival devices which are
>> >> rarely read from and even more rarely written to. By delaying writes
>> >> on 2 criteria ( designated cache buffer filling up or preset time
>> >> duration from last write expiring) we can significantly reduce the
>> >> writes on the parity device. This assumes that we are ok to lose a
>> >> movie or two in case the parity disk is not totally up to date but are
>> >> more interested in device longevity.
>> >>
>> >>> Sounds like multi-parity RAID6 with no parity rotation and
>> >>>    chunksize == devicesize
>> >>
>> >> RAID6 would present us with a joint device and currently only allows
>> >> writes to that directly, yes? Any writes will be striped.
>> >
>> >
>> > I am not totally sure I understand your design, but it seems to me that the
>> > following solution could work for you:
>> >
>> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
>> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
>> > expensive that you can't scrub)
>> >
>> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
>> > two will never spin-down) in writeback mode with writeback_running=off .
>> > This will prevent writes to backend and leave the backend array spun down.
>> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
>> > and writethrough: it will wake up the backend raid6 array and flush all
>> > dirty data. You can then then revert to writeback and writeback_running=off.
>> > After this you can spin-down the backend array again.
>> >
>> > You also get read caching for free, which helps the backend array to stay
>> > spun down as much as possible.
>> >
>> > Maybe you can modify bcache slightly so to implement an automatic switching
>> > between the modes as described above, instead of polling the state from
>> > outside.
>> >
>> > Would that work, or you are asking something different?
>> >
>> > EW
>> >
>
> --
>
> piergiorgio

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-31 11:05           ` Anshuman Aggarwal
@ 2014-10-31 14:25             ` Matt Garman
  2014-11-01 12:55             ` Piergiorgio Sartor
  1 sibling, 0 replies; 28+ messages in thread
From: Matt Garman @ 2014-10-31 14:25 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Piergiorgio Sartor, Ethan Wilson, Mdadm

(Re-posting as I forgot to change to plaintext mode for the mailing
list, sorry for any dups.)

In a later post, you said you had a 4-to-1 scheme, but it wasn't clear
to me if that was 1 drive worth of data, and 4 drives worth of
checksum/backup, or the other way around.

In your proposed scheme, I assume you want your actual data drives to
be spinning all the time?  Otherwise, when you go to read data (play
music/videos), you have the multi-second spinup delay... or is that OK
with you?

Some other considerations: modern 5400 RPM drives generally consume
less than five watts in idle state[1].  Actual AC draw will be higher
due to power supply inefficiency, so we'll err on the conservative
side and say each drive requires 10 AC watts of power.  My electrical
rates in Chicago are about average for the USA (11 or 12 cents/kWH),
and conveniently it roughly works out such that one always-on watt
costs about $1/year.  So, each always-running hard drive will cost
about $10/year to run, less with a more efficient power supply.  I
know electricity is substantially more expensive in many parts of the
world; or maybe you're running off-the-grid (e.g. solar) and have a
very small power budget?
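
For reference, the arithmetic behind that rule of thumb, as a quick
sketch with the numbers above (illustrative only):

watts_per_drive_ac = 10        # conservative AC draw per idle 5400 RPM drive
rate_per_kwh = 0.115           # ~11-12 cents/kWh, roughly the US average
hours_per_year = 24 * 365      # 8760

kwh_per_year = watts_per_drive_ac * hours_per_year / 1000   # ~87.6 kWh
cost_per_year = kwh_per_year * rate_per_kwh                 # ~$10 per drive

print(f"{kwh_per_year:.1f} kWh/year -> ${cost_per_year:.2f} per always-on drive")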

On Wed, Oct 29, 2014 at 2:15 AM, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
>
> - SnapRAID (http://snapraid.sourceforge.net/) which is a snapshot
> based scheme (Its advantages are that its in user space and has cross
> platform support but has the huge disadvantage of every checksum being
> done from scratch slowing the system, causing immense wear and tear on
> every snapshot and also losing any information updates upto the
> snapshot point etc)


Last time I looked at SnapRAID, it seemed like yours was its target
use case.  The "huge disadvantage of every checksum being done from
scratch" sounds like something a SnapRAID feature enhancement could
address, which might be simpler/easier/faster to get done than a major
enhancement to the Linux kernel (just speculating, though).

But, on the other hand, by your use case description, writes are very
infrequent, and you're willing to buffer checksum updates for quite a
while... so what if you had a *monthly* cron job to do parity syncs?
Schedule it for a time when the system is unlikely to be used to
offset the increased load.  That's only 12 "hard" tasks for the drive
per year.  I'm not an expert, but that doesn't "feel" like a lot of
wear and tear.

On the issue of wear and tear, I've mostly given up trying to
understand what's best for my drives.  One school of thought says many
spinup-spindown cycles are actually harder on the drive than running
24/7.  But maybe consumer drives actually aren't designed for 24/7
operation, so they're better off being cycled up and down.  Or
consumer drives can't handle the vibrations of being in a case with
other 24/7 drives.  But failure to "exercise" the entire drive
regularly enough might result in a situation where an error has
developed but you don't know until it's too late or your warranty
period has expired.


[1] http://www.silentpcreview.com/article29-page2.html


On Fri, Oct 31, 2014 at 6:05 AM, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> Hi pg,
> With MD raid striping all the writes not only does it keep ALL disks
> spinning to read/write the current content, it also leads to
> catastrophic data loss in case the rebuild/disk failure exceeds the
> number of parity disks.
>
> But more importantly, I find myself setting up multiple RAID levels
> (at least RAID6 and now thinking of more) just to make sure that MD
> raid will recover my data and not lose the whole cluster if an
> additional disk fails above the number of parity!!! The biggest
> advantage of the scheme that I have outlined is that with a single
> check sum I am mostly assure of a failed disk restoration and worst
> case only the media (movies/music) on the failing disk are lost not on
> the whole cluster.
>
> Also in my experience about disks and usage, while what you are saying
> was true a while ago when storage capacity had not hit multiple TBs.
> Now if I am buying 3-4 TB disks they are likely to last a while
> especially since the incremental % growth in sizes seem to be slowing
> down.
>
> Regards,
> Anshuman
>
> On 30 October 2014 22:55, Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
>> On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
>>>  What you are suggesting will work for delaying writing the checksum
>>> (but still making 2 disks work non stop and lead to failure, cost
>>> etc).
>>
>> Hi Anshuman,
>>
>> I'm a bit missing the point here.
>>
>> In my experience, with my storage systems, I change
>> disks because they're too small, way long before they
>> are too old (way long before they fail).
>> That's why I end up with a collection of small HDDs.
>> which, in turn, I recycled in some custom storage
>> system (using disks of different size, like explained
>> in one of the links posted before).
>>
>> Honestly, the only reason to spin down the disks, still
>> in my experience, is for reducing power consumption.
>> And this can be done with a RAID-6 without problems
>> and in a extremely flexible way.
>>
>> So, the bottom line, still in my experience, is that
>> this you're describing seems quite a nice situation.
>>
>> Or, I did not understood what you're proposing.
>>
>> Thanks,
>>
>> bye,
>>
>> pg
>>
>>> I am proposing N independent disks which are rarely accessed. When
>>> parity has to be written to the remaining 1,2 ...X disks ...it is
>>> batched up (bcache is feasible) and written out once in a while
>>> depending on how much write is happening. N-1 disks stay spun down and
>>> only X disks wake up periodically to get checksum written to (this
>>> would be tweaked by the user based on how up to date he needs the
>>> parity to be (tolerance of rebuilding parity in case of crash) and vs
>>> disk access for each parity write)
>>>
>>> It can't be done using any RAID6 because RAID5/6 will stripe all the
>>> data across the devices making any read access wake up all the
>>> devices. Ditto for writing to parity on every write to a single disk.
>>>
>>> The architecture being proposed is a lazy write to manage parity for
>>> individual disks which won't suffer from RAID catastrophic data loss
>>> and concurrent disk.
>>>
>>>
>>>
>>>
>>> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
>>> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
>>> >>
>>> >> Right on most counts but please see comments below.
>>> >>
>>> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>>> >>>
>>> >>> Just to be sure I understand, you would have N + X devices.  Each of the
>>> >>> N
>>> >>> devices contains an independent filesystem and could be accessed directly
>>> >>> if
>>> >>> needed.  Each of the X devices contains some codes so that if at most X
>>> >>> devices in total died, you would still be able to recover all of the
>>> >>> data.
>>> >>> If more than X devices failed, you would still get complete data from the
>>> >>> working devices.
>>> >>>
>>> >>> Every update would only write to the particular N device on which it is
>>> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>>> >>> than X for the spin-down to be really worth it.
>>> >>>
>>> >>> Am I right so far?
>>> >>
>>> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
>>> >> devices to 1 data) so spin down is totally worth it for data
>>> >> protection but more on that below.
>>> >>
>>> >>> For some reason the writes to X are delayed...  I don't really understand
>>> >>> that part.
>>> >>
>>> >> This delay is basically designed around archival devices which are
>>> >> rarely read from and even more rarely written to. By delaying writes
>>> >> on 2 criteria ( designated cache buffer filling up or preset time
>>> >> duration from last write expiring) we can significantly reduce the
>>> >> writes on the parity device. This assumes that we are ok to lose a
>>> >> movie or two in case the parity disk is not totally up to date but are
>>> >> more interested in device longevity.
>>> >>
>>> >>> Sounds like multi-parity RAID6 with no parity rotation and
>>> >>>    chunksize == devicesize
>>> >>
>>> >> RAID6 would present us with a joint device and currently only allows
>>> >> writes to that directly, yes? Any writes will be striped.
>>> >
>>> >
>>> > I am not totally sure I understand your design, but it seems to me that the
>>> > following solution could work for you:
>>> >
>>> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
>>> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
>>> > expensive that you can't scrub)
>>> >
>>> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
>>> > two will never spin-down) in writeback mode with writeback_running=off .
>>> > This will prevent writes to backend and leave the backend array spun down.
>>> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
>>> > and writethrough: it will wake up the backend raid6 array and flush all
>>> > dirty data. You can then then revert to writeback and writeback_running=off.
>>> > After this you can spin-down the backend array again.
>>> >
>>> > You also get read caching for free, which helps the backend array to stay
>>> > spun down as much as possible.
>>> >
>>> > Maybe you can modify bcache slightly so to implement an automatic switching
>>> > between the modes as described above, instead of polling the state from
>>> > outside.
>>> >
>>> > Would that work, or you are asking something different?
>>> >
>>> > EW
>>> >
>>
>> --
>>
>> piergiorgio

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
       [not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
@ 2014-11-01  5:36   ` Anshuman Aggarwal
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-11-01  5:36 UTC (permalink / raw)
  To: Matt Garman; +Cc: Mdadm

On 31 October 2014 19:53, Matt Garman <matthew.garman@gmail.com> wrote:
> In a later post, you said you had a 4-to-1 scheme, but it wasn't clear to me
> if that was 1 drive worth of data, and 4 drives worth of checksum/backup, or
> the other way around.

I was wondering if anybody would catch that slip. I meant 4 data to 1
parity, which seems about the right mix to me so far based on my read
and feel of the probability of drive failure.

>
> In your proposed scheme, I assume you want your actual data drives to be
> spinning all the time?  Otherwise, when you go to read data (play
> music/videos), you have the multi-second spinup delay... or is that OK with
> you?

Well, actually, in my experience with 6-8 drives of 2-4TB each, there
is a lot of music/video content that I don't end up playing that
often. Those drives can easily be spun down (maybe for days on end,
and at least all night), and a small initial (one-time) delay before
playing a file whose drive hasn't been accessed recently seems like a
good trade-off (both for power and drive life).

>
> Some other considerations: modern 5400 RPM drives generally consume less
> than five watts in idle state[1].  Actual AC draw will be higher due to
> power supply inefficiency, so we'll err on the conservative side and say
> each drive requires 10 AC watts of power.  My electrical rates in Chicago
> are about average for the USA (11 or 12 cents/kWH), and conveniently it
> roughly works out such that one always-on watt costs about $1/year.  So,
> each always-running hard drive will cost about $10/year to run, less with a
> more efficient power supply.  I know electricity is substantially more
> expensive in many parts of the world; or maybe you're running off-the-grid
> (e.g. solar) and have a very small power budget?

Besides the cost, there is an environmental aspect: if something has
superior efficiency and increases the life of the product, isn't that
a good thing wherever we live on the planet? BTW, great calculation,
but I moved back (to India) from San Francisco some time ago :) and
the electricity cost here is quite high (and availability of supply is
not 100% yet).  I'd like to maximize my backups, and spinning disks
that are not being used for hours on end sounds bad.

Just to add, internet access is metered per GB in many places (and in
mine, sadly :( ) for high-speed access (meaning 4-8 MBps), so I have
to store content locally (before cloud suggestions are thrown around).

>
> On Wed, Oct 29, 2014 at 2:15 AM, Anshuman Aggarwal


> <anshuman.aggarwal@gmail.com> wrote:
>>
>> - SnapRAID (http://snapraid.sourceforge.net/) which is a snapshot
>> based scheme (Its advantages are that its in user space and has cross
>> platform support but has the huge disadvantage of every checksum being
>> done from scratch slowing the system, causing immense wear and tear on
>> every snapshot and also losing any information updates upto the
>> snapshot point etc)
>
>
> Last time I looked at SnapRAID, it seemed like yours was its target use
> case.  The "huge disadvantage of every checksum being done from scratch"
> sounds like a SnapRAID feature enhancement that might be
> simpler/easier/faster-to-get done than a major enhancement to the Linux
> kernel (just speculating though).

SnapRAID can't be enhanced this way without involving the kernel,
because a delta checksum requires knowing which blocks were written
to, and only a kernel-level driver can know that. This is a hard
reality with no way around it, and that was my reason for proposing
this.
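
To illustrate, the kind of per-block dirty tracking that an in-kernel
layer could maintain cheaply might look roughly like the following
(purely a sketch; block size and names are assumptions): every write
marks its blocks in a bitmap, and the periodic parity pass then
recomputes parity only for the blocks dirtied since the last pass.

BLOCK_SIZE = 4096

class DirtyTracker:
    def __init__(self, device_size_bytes):
        nblocks = (device_size_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE
        self.bitmap = bytearray((nblocks + 7) // 8)

    def mark_write(self, offset, length):
        # Called from the write path: remember which blocks changed.
        first = offset // BLOCK_SIZE
        last = (offset + length - 1) // BLOCK_SIZE
        for blk in range(first, last + 1):
            self.bitmap[blk // 8] |= 1 << (blk % 8)

    def dirty_blocks(self):
        # Consumed by the delayed parity pass.
        for blk in range(len(self.bitmap) * 8):
            if self.bitmap[blk // 8] & (1 << (blk % 8)):
                yield blk

    def clear(self):
        for i in range(len(self.bitmap)):
            self.bitmap[i] = 0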


>
> But, on the other hand, by your use case description, writes are very
> infrequent, and you're willing to buffer checksum updates for quite a
> while... so what if you had a *monthly* cron job to do parity syncs?
> Schedule it for a time when the system is unlikely to be used to offset the
> increased load.  That's only 12 "hard" tasks for the drive per year.  I'm
> not an expert, but that doesn't "feel" like a lot of wear and tear.

Well, again, anything from infrequent updates down to weekly or
monthly cron jobs sounds like a bad compromise either way, when a
better incremental update could store the checksums in a buffer and
write them out eventually (2-3 times a day). Almost always the buffer
will get written out, giving us an updated parity with little to no
"extra" wear and tear.

>
> On the issue of wear and tear, I've mostly given up trying to understand
> what's best for my drives.  One school of thought says many spinup-spindown
> cycles are actually harder on the drive than running 24/7.  But maybe
> consumer drives actually aren't designed for 24/7 operation, so they're
> better off being cycled up and down.  Or consumer drives can't handle the
> vibrations of being in a case with other 24/7 drives.  But failure
> to"exercise" the entire drive regularly enough might result in a situation
> where an error has developed but you don't know until it's too late or your
> warranty period has expired.

You are right about consumer drives, where spin-downs are good; a
timeout of an hour or so should reduce unnecessary spin-up/down
cycles. Once spun down, most drives may stay that way for days, which
is better for all of us (energy, wastage of drives, etc.). Spin-down
technology is progressing faster than block failure (also because
block density is going up, making media failure rather than head
failure the primary cause of drive outage).

The drives can be tested periodically (with a non-destructive
badblocks run, etc.) as a pure testing exercise to find errors as they
develop. There is no need to needlessly stress the drives by
reading/writing all parts continuously. Also, RAID speeds are often no
longer required, given the higher R/W throughput coming from the
drives themselves.


Thanks for reading and writing such a thorough reply.

Neil, would you be willing to assist/guide with the design or the
best approach to this? I would like to avoid the obvious pitfalls that
any new writer of a kernel block-level device is bound to face.

Regards,
Anshuman

>
>
> [1] http://www.silentpcreview.com/article29-page2.html
>
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-31 11:05           ` Anshuman Aggarwal
  2014-10-31 14:25             ` Matt Garman
@ 2014-11-01 12:55             ` Piergiorgio Sartor
  2014-11-06  2:29               ` Anshuman Aggarwal
  1 sibling, 1 reply; 28+ messages in thread
From: Piergiorgio Sartor @ 2014-11-01 12:55 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Piergiorgio Sartor, Ethan Wilson, linux-raid

On Fri, Oct 31, 2014 at 04:35:11PM +0530, Anshuman Aggarwal wrote:
> Hi pg,
> With MD raid striping all the writes not only does it keep ALL disks
> spinning to read/write the current content, it also leads to
> catastrophic data loss in case the rebuild/disk failure exceeds the
> number of parity disks.

Hi Anshuman,

yes, but do you have hard evidence that
this is a common RAID-6 problem?
Considering that we now have the bad block list,
the write-intent bitmap and proactive replacement,
a triple failure in RAID-6 does not seem to me
to really be the main issue.
Considering that libraries are available
for more than 2 parities, I think the multiple
failure case is quite a rarity.
Furthermore, I suspect there are other types
of catastrophic situations (lightning, for example)
that can destroy an array completely.
 
> But more importantly, I find myself setting up multiple RAID levels
> (at least RAID6 and now thinking of more) just to make sure that MD
> raid will recover my data and not lose the whole cluster if an
> additional disk fails above the number of parity!!! The biggest
> advantage of the scheme that I have outlined is that with a single
> check sum I am mostly assure of a failed disk restoration and worst
> case only the media (movies/music) on the failing disk are lost not on
> the whole cluster.

Will each disk have its own filesystem?
If that is not the case, you cannot say
that a single disk failure will lose only
some files.

> Also in my experience about disks and usage, while what you are saying
> was true a while ago when storage capacity had not hit multiple TBs.
> Now if I am buying 3-4 TB disks they are likely to last a while
> especially since the incremental % growth in sizes seem to be slowing
> down.

As written above, you can safely replace
disks before they fail, without compromising
the array.

bye,

pg
 
> Regards,
> Anshuman
> 
> On 30 October 2014 22:55, Piergiorgio Sartor
> <piergiorgio.sartor@nexgo.de> wrote:
> > On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
> >>  What you are suggesting will work for delaying writing the checksum
> >> (but still making 2 disks work non stop and lead to failure, cost
> >> etc).
> >
> > Hi Anshuman,
> >
> > I'm a bit missing the point here.
> >
> > In my experience, with my storage systems, I change
> > disks because they're too small, way long before they
> > are too old (way long before they fail).
> > That's why I end up with a collection of small HDDs.
> > which, in turn, I recycled in some custom storage
> > system (using disks of different size, like explained
> > in one of the links posted before).
> >
> > Honestly, the only reason to spin down the disks, still
> > in my experience, is for reducing power consumption.
> > And this can be done with a RAID-6 without problems
> > and in a extremely flexible way.
> >
> > So, the bottom line, still in my experience, is that
> > this you're describing seems quite a nice situation.
> >
> > Or, I did not understood what you're proposing.
> >
> > Thanks,
> >
> > bye,
> >
> > pg
> >
> >> I am proposing N independent disks which are rarely accessed. When
> >> parity has to be written to the remaining 1,2 ...X disks ...it is
> >> batched up (bcache is feasible) and written out once in a while
> >> depending on how much write is happening. N-1 disks stay spun down and
> >> only X disks wake up periodically to get checksum written to (this
> >> would be tweaked by the user based on how up to date he needs the
> >> parity to be (tolerance of rebuilding parity in case of crash) and vs
> >> disk access for each parity write)
> >>
> >> It can't be done using any RAID6 because RAID5/6 will stripe all the
> >> data across the devices making any read access wake up all the
> >> devices. Ditto for writing to parity on every write to a single disk.
> >>
> >> The architecture being proposed is a lazy write to manage parity for
> >> individual disks which won't suffer from RAID catastrophic data loss
> >> and concurrent disk.
> >>
> >>
> >>
> >>
> >> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
> >> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
> >> >>
> >> >> Right on most counts but please see comments below.
> >> >>
> >> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >> >>>
> >> >>> Just to be sure I understand, you would have N + X devices.  Each of the
> >> >>> N
> >> >>> devices contains an independent filesystem and could be accessed directly
> >> >>> if
> >> >>> needed.  Each of the X devices contains some codes so that if at most X
> >> >>> devices in total died, you would still be able to recover all of the
> >> >>> data.
> >> >>> If more than X devices failed, you would still get complete data from the
> >> >>> working devices.
> >> >>>
> >> >>> Every update would only write to the particular N device on which it is
> >> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> >> >>> than X for the spin-down to be really worth it.
> >> >>>
> >> >>> Am I right so far?
> >> >>
> >> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
> >> >> devices to 1 data) so spin down is totally worth it for data
> >> >> protection but more on that below.
> >> >>
> >> >>> For some reason the writes to X are delayed...  I don't really understand
> >> >>> that part.
> >> >>
> >> >> This delay is basically designed around archival devices which are
> >> >> rarely read from and even more rarely written to. By delaying writes
> >> >> on 2 criteria ( designated cache buffer filling up or preset time
> >> >> duration from last write expiring) we can significantly reduce the
> >> >> writes on the parity device. This assumes that we are ok to lose a
> >> >> movie or two in case the parity disk is not totally up to date but are
> >> >> more interested in device longevity.
> >> >>
> >> >>> Sounds like multi-parity RAID6 with no parity rotation and
> >> >>>    chunksize == devicesize
> >> >>
> >> >> RAID6 would present us with a joint device and currently only allows
> >> >> writes to that directly, yes? Any writes will be striped.
> >> >
> >> >
> >> > I am not totally sure I understand your design, but it seems to me that the
> >> > following solution could work for you:
> >> >
> >> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
> >> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
> >> > expensive that you can't scrub)
> >> >
> >> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
> >> > two will never spin-down) in writeback mode with writeback_running=off .
> >> > This will prevent writes to backend and leave the backend array spun down.
> >> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
> >> > and writethrough: it will wake up the backend raid6 array and flush all
> >> > dirty data. You can then then revert to writeback and writeback_running=off.
> >> > After this you can spin-down the backend array again.
> >> >
> >> > You also get read caching for free, which helps the backend array to stay
> >> > spun down as much as possible.
> >> >
> >> > Maybe you can modify bcache slightly so to implement an automatic switching
> >> > between the modes as described above, instead of polling the state from
> >> > outside.
> >> >
> >> > Would that work, or you are asking something different?
> >> >
> >> > EW
> >> >
> >> > --
> >> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> > the body of a message to majordomo@vger.kernel.org
> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> > --
> >
> > piergiorgio

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-10-30 15:00     ` Anshuman Aggarwal
@ 2014-11-03  5:52       ` NeilBrown
  2014-11-03 18:04         ` Piergiorgio Sartor
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: NeilBrown @ 2014-11-03  5:52 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: linux-raid

[-- Attachment #1: Type: text/plain, Size: 4572 bytes --]

On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> Would chunksize==disksize work? Wouldn't that lead to the entire
> parity be invalidated for any write to any of the disks (assuming md
> operates at a chunk level)...also please see my reply below

Operating at a chunk level would be a very poor design choice.  md/raid5
operates in units of 1 page (4K).


> 
> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> > Right on most counts but please see comments below.
> >
> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >> Just to be sure I understand, you would have N + X devices.  Each of the N
> >> devices contains an independent filesystem and could be accessed directly if
> >> needed.  Each of the X devices contains some codes so that if at most X
> >> devices in total died, you would still be able to recover all of the data.
> >> If more than X devices failed, you would still get complete data from the
> >> working devices.
> >>
> >> Every update would only write to the particular N device on which it is
> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> >> than X for the spin-down to be really worth it.
> >>
> >> Am I right so far?
> >
> > Perfectly right so far. I typically have a N to X ratio of 4 (4
> > devices to 1 data) so spin down is totally worth it for data
> > protection but more on that below.
> >
> >>
> >> For some reason the writes to X are delayed...  I don't really understand
> >> that part.
> >
> > This delay is basically designed around archival devices which are
> > rarely read from and even more rarely written to. By delaying writes
> > on 2 criteria ( designated cache buffer filling up or preset time
> > duration from last write expiring) we can significantly reduce the
> > writes on the parity device. This assumes that we are ok to lose a
> > movie or two in case the parity disk is not totally up to date but are
> > more interested in device longevity.
> >
> >>
> >> Sounds like multi-parity RAID6 with no parity rotation and
> >>   chunksize == devicesize
> > RAID6 would present us with a joint device and currently only allows
> > writes to that directly, yes? Any writes will be striped.

If the chunksize equals the device size, then you need a very large write for
it to be striped.

> > In any case would md raid allow the underlying device to be written to
> > directly? Also how would it know that the device has been written to
> > and hence parity has to be updated? What about the superblock which
> > the FS would not know about?

No, you wouldn't write to the underlying device.  You would carefully
partition the RAID5 so each partition aligns exactly with an underlying
device.  Then write to the partition.

> >
> > Also except for the delayed checksum writing part which would be
> > significant if one of the objectives is to reduce the amount of
> > writes. Can we delay that in the code currently for RAID6? I
> > understand the objective of RAID6 is to ensure data recovery and we
> > are looking at a compromise in this case.

"simple matter of programming"
Of course there would be a limit to how much data can be buffered in memory
before it has to be flushed out.
If you are mostly storing movies, then they are probably too large to
buffer.  Why not just write them out straight away?

NeilBrown



> >
> > If feasible, this can be an enhancement to MD RAID as well where N
> > devices are presented instead of a single joint device in case of
> > raid6 (maybe the multi part device can be individual disks?)
> >
> > It will certainly solve my problem of where to store the metadata. I
> > was currently hoping to just store it as a configuration file to be
> > read by the initramfs since in this case worst case scenario the
> > checksum goes out of sync and is rebuilt from scratch.
> >
> >>
> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
> >> impartial opinion from me on that topic.
> >
> > I haven't hacked around the kernel internals much so far so will have
> > to dig out that history. I will welcome any particular links/mail
> > threads I should look at for guidance (with both yours and opposing
> > points of view)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 828 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-03  5:52       ` NeilBrown
@ 2014-11-03 18:04         ` Piergiorgio Sartor
  2014-11-06  2:24         ` Anshuman Aggarwal
  2014-11-24  7:29         ` Anshuman Aggarwal
  2 siblings, 0 replies; 28+ messages in thread
From: Piergiorgio Sartor @ 2014-11-03 18:04 UTC (permalink / raw)
  To: NeilBrown; +Cc: Anshuman Aggarwal, linux-raid

On Mon, Nov 03, 2014 at 04:52:17PM +1100, NeilBrown wrote:
[...]
> "simple matter of programming"
> Of course there would be a limit to how much data can be buffered in memory
> before it has to be flushed out.
> If you are mostly storing movies, then they are probably too large to
> buffer.  Why not just write them out straight away?

One scenario I can envision is the following.

You have a bunch of HDDs in RAID-5/6, which are
almost always in standby (spun down).
Alongside them, you have 2 SSDs in RAID-10.

All the write (and, if possible, read) operations
are directed to the SSDs.
When the SSD RAID is X% full, the RAID-5/6 is
woken up and the data is *moved* (maybe copied,
with a proper cache policy) there.
When reading (a large file), the RAID-5/6 is
woken up, the file is copied to the SSD RAID, and,
when finished, the HDDs are put back in standby.

Of course, this is *not* a block device protocol,
it is a filesystem one.
It is the FS that must handle the caching, because
only the FS can know the file size, for example.

bye,

-- 

piergiorgio

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-03  5:52       ` NeilBrown
  2014-11-03 18:04         ` Piergiorgio Sartor
@ 2014-11-06  2:24         ` Anshuman Aggarwal
  2014-11-24  7:29         ` Anshuman Aggarwal
  2 siblings, 0 replies; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-11-06  2:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

Pls see below

On 3 November 2014 11:22, NeilBrown <neilb@suse.de> wrote:
> On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> Would chunksize==disksize work? Wouldn't that lead to the entire
>> parity be invalidated for any write to any of the disks (assuming md
>> operates at a chunk level)...also please see my reply below
>
> Operating at a chunk level would be a very poor design choice.  md/raid5
> operates in units of 1 page (4K).
>
>
>>
>> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>> > Right on most counts but please see comments below.
>> >
>> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>> >> devices contains an independent filesystem and could be accessed directly if
>> >> needed.  Each of the X devices contains some codes so that if at most X
>> >> devices in total died, you would still be able to recover all of the data.
>> >> If more than X devices failed, you would still get complete data from the
>> >> working devices.
>> >>
>> >> Every update would only write to the particular N device on which it is
>> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >> than X for the spin-down to be really worth it.
>> >>
>> >> Am I right so far?
>> >
>> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>> > devices to 1 data) so spin down is totally worth it for data
>> > protection but more on that below.
>> >
>> >>
>> >> For some reason the writes to X are delayed...  I don't really understand
>> >> that part.
>> >
>> > This delay is basically designed around archival devices which are
>> > rarely read from and even more rarely written to. By delaying writes
>> > on 2 criteria ( designated cache buffer filling up or preset time
>> > duration from last write expiring) we can significantly reduce the
>> > writes on the parity device. This assumes that we are ok to lose a
>> > movie or two in case the parity disk is not totally up to date but are
>> > more interested in device longevity.
>> >
>> >>
>> >> Sounds like multi-parity RAID6 with no parity rotation and
>> >>   chunksize == devicesize
>> > RAID6 would present us with a joint device and currently only allows
>> > writes to that directly, yes? Any writes will be striped.
>
> If the chunksize equals the device size, then you need a very large write for
> it to be striped.
>
>> > In any case would md raid allow the underlying device to be written to
>> > directly? Also how would it know that the device has been written to
>> > and hence parity has to be updated? What about the superblock which
>> > the FS would not know about?
>
> No, you wouldn't write to the underlying device.  You would carefully
> partition the RAID5 so each partition aligns exactly with an underlying
> device.  Then write to the partition.

This is what I'm unclear about. Even with non-rotating parity on RAID
5/6, is it possible to create md partitions such that writes are
effectively not striped (within each partition) and each partition on
the md device ends up writing to only that one underlying device? How
is this managed? My understanding is that RAID 5/6 will stripe data
blocks across all the devices, making all of them spin up for every
read and write.




>
>> >
>> > Also except for the delayed checksum writing part which would be
>> > significant if one of the objectives is to reduce the amount of
>> > writes. Can we delay that in the code currently for RAID6? I
>> > understand the objective of RAID6 is to ensure data recovery and we
>> > are looking at a compromise in this case.
>
> "simple matter of programming"
> Of course there would be a limit to how much data can be buffered in memory
> before it has to be flushed out.
> If you are mostly storing movies, then they are probably too large to
> buffer.  Why not just write them out straight away?

Well, yeah, if the buffer gets filled (such as by a movie) the parity
will get written pretty much right away (the main data drive gets
written to immediately anyway). The delay is to prevent parity-drive
spin-ups due to small updates on any one of the drives in the array
(a small temp file created by some software, for example).
>
> NeilBrown
>
>
>
>> >
>> > If feasible, this can be an enhancement to MD RAID as well where N
>> > devices are presented instead of a single joint device in case of
>> > raid6 (maybe the multi part device can be individual disks?)
>> >
>> > It will certainly solve my problem of where to store the metadata. I
>> > was currently hoping to just store it as a configuration file to be
>> > read by the initramfs since in this case worst case scenario the
>> > checksum goes out of sync and is rebuilt from scratch.
>> >
>> >>
>> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> >> impartial opinion from me on that topic.
>> >
>> > I haven't hacked around the kernel internals much so far so will have
>> > to dig out that history. I will welcome any particular links/mail
>> > threads I should look at for guidance (with both yours and opposing
>> > points of view)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-01 12:55             ` Piergiorgio Sartor
@ 2014-11-06  2:29               ` Anshuman Aggarwal
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-11-06  2:29 UTC (permalink / raw)
  To: Piergiorgio Sartor; +Cc: Ethan Wilson, Mdadm

On 1 November 2014 18:25, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> On Fri, Oct 31, 2014 at 04:35:11PM +0530, Anshuman Aggarwal wrote:
>> Hi pg,
>> With MD raid striping all the writes not only does it keep ALL disks
>> spinning to read/write the current content, it also leads to
>> catastrophic data loss in case the rebuild/disk failure exceeds the
>> number of parity disks.
>
> Hi Anshuman,
>
> yes but do you have hard evidence that
> this is a common RAID-6 problem?
> Considering that we have now bad block list,
> write intent bitmap and proactive replacement,
> it does not seem to me really the main issue,
> having a triple fail in RAID-6.
> Considering that there are available libraries
> for more that 2 parities, I think the multiple
> failure case is quite a rarity.
> Furthermore, I suspect there are other type
> of catastrophic situation (lighting, for example)
> that can destroy an array completely.

I have most definitely lost data when a drive failed and, during
reconstruction, another drive failed (remember the array had been
chugging away with all drives active for 2-3 years). At that point I'm
dead scared of losing another one and hitting catastrophic loss. If I
don't go out and buy a replacement right away I'm on borrowed time for
my whole array. For home use this is not fun.

>
>> But more importantly, I find myself setting up multiple RAID levels
>> (at least RAID6 and now thinking of more) just to make sure that MD
>> raid will recover my data and not lose the whole cluster if an
>> additional disk fails above the number of parity!!! The biggest
>> advantage of the scheme that I have outlined is that with a single
>> check sum I am mostly assure of a failed disk restoration and worst
>> case only the media (movies/music) on the failing disk are lost not on
>> the whole cluster.
>
> Each disk will have its own filesystem?
> If this is not the case, you cannot say
> if a single disk failure will lose only
> some files.

Indeed, each device will be an independent block device and file
system, joined together by some union FS if the user so requires, but
that's not in scope for this discussion.

>
>> Also in my experience about disks and usage, while what you are saying
>> was true a while ago when storage capacity had not hit multiple TBs.
>> Now if I am buying 3-4 TB disks they are likely to last a while
>> especially since the incremental % growth in sizes seem to be slowing
>> down.
>
> As wrote above, you can safely replace
> disks before they fail, without compromising
> the array.

Same point as above. For home use, I might be away or not have time to
give the array the TLC (tender loving care ;) it needs, which is really
the only shortcoming of MD... it is hard on the disks and has the
potential of compromising the whole array (while giving super-fast R/W
performance in return, for sure).

>
> bye,
>
> pg
>
>> Regards,
>> Anshuman
>>
>> On 30 October 2014 22:55, Piergiorgio Sartor
>> <piergiorgio.sartor@nexgo.de> wrote:
>> > On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
>> >>  What you are suggesting will work for delaying writing the checksum
>> >> (but still making 2 disks work non stop and lead to failure, cost
>> >> etc).
>> >
>> > Hi Anshuman,
>> >
>> > I'm a bit missing the point here.
>> >
>> > In my experience, with my storage systems, I change
>> > disks because they're too small, way long before they
>> > are too old (way long before they fail).
>> > That's why I end up with a collection of small HDDs.
>> > which, in turn, I recycled in some custom storage
>> > system (using disks of different size, like explained
>> > in one of the links posted before).
>> >
>> > Honestly, the only reason to spin down the disks, still
>> > in my experience, is for reducing power consumption.
>> > And this can be done with a RAID-6 without problems
>> > and in a extremely flexible way.
>> >
>> > So, the bottom line, still in my experience, is that
>> > this you're describing seems quite a nice situation.
>> >
>> > Or, I did not understood what you're proposing.
>> >
>> > Thanks,
>> >
>> > bye,
>> >
>> > pg
>> >
>> >> I am proposing N independent disks which are rarely accessed. When
>> >> parity has to be written to the remaining 1,2 ...X disks ...it is
>> >> batched up (bcache is feasible) and written out once in a while
>> >> depending on how much write is happening. N-1 disks stay spun down and
>> >> only X disks wake up periodically to get checksum written to (this
>> >> would be tweaked by the user based on how up to date he needs the
>> >> parity to be (tolerance of rebuilding parity in case of crash) and vs
>> >> disk access for each parity write)
>> >>
>> >> It can't be done using any RAID6 because RAID5/6 will stripe all the
>> >> data across the devices making any read access wake up all the
>> >> devices. Ditto for writing to parity on every write to a single disk.
>> >>
>> >> The architecture being proposed is a lazy write to manage parity for
>> >> individual disks which won't suffer from RAID catastrophic data loss
>> >> and concurrent disk.
>> >>
>> >>
>> >>
>> >>
>> >> On 30 October 2014 00:57, Ethan Wilson <ethan.wilson@shiftmail.org> wrote:
>> >> > On 29/10/2014 10:25, Anshuman Aggarwal wrote:
>> >> >>
>> >> >> Right on most counts but please see comments below.
>> >> >>
>> >> >> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >> >>>
>> >> >>> Just to be sure I understand, you would have N + X devices.  Each of the
>> >> >>> N
>> >> >>> devices contains an independent filesystem and could be accessed directly
>> >> >>> if
>> >> >>> needed.  Each of the X devices contains some codes so that if at most X
>> >> >>> devices in total died, you would still be able to recover all of the
>> >> >>> data.
>> >> >>> If more than X devices failed, you would still get complete data from the
>> >> >>> working devices.
>> >> >>>
>> >> >>> Every update would only write to the particular N device on which it is
>> >> >>> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >> >>> than X for the spin-down to be really worth it.
>> >> >>>
>> >> >>> Am I right so far?
>> >> >>
>> >> >> Perfectly right so far. I typically have a N to X ratio of 4 (4
>> >> >> devices to 1 data) so spin down is totally worth it for data
>> >> >> protection but more on that below.
>> >> >>
>> >> >>> For some reason the writes to X are delayed...  I don't really understand
>> >> >>> that part.
>> >> >>
>> >> >> This delay is basically designed around archival devices which are
>> >> >> rarely read from and even more rarely written to. By delaying writes
>> >> >> on 2 criteria ( designated cache buffer filling up or preset time
>> >> >> duration from last write expiring) we can significantly reduce the
>> >> >> writes on the parity device. This assumes that we are ok to lose a
>> >> >> movie or two in case the parity disk is not totally up to date but are
>> >> >> more interested in device longevity.
>> >> >>
>> >> >>> Sounds like multi-parity RAID6 with no parity rotation and
>> >> >>>    chunksize == devicesize
>> >> >>
>> >> >> RAID6 would present us with a joint device and currently only allows
>> >> >> writes to that directly, yes? Any writes will be striped.
>> >> >
>> >> >
>> >> > I am not totally sure I understand your design, but it seems to me that the
>> >> > following solution could work for you:
>> >> >
>> >> > MD raid-6, maybe multi-parity (multi-parity not implemented yet in MD yet,
>> >> > but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
>> >> > expensive that you can't scrub)
>> >> >
>> >> > Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
>> >> > two will never spin-down) in writeback mode with writeback_running=off .
>> >> > This will prevent writes to backend and leave the backend array spun down.
>> >> > When bcache is almost full (poll dirty_data), switch to writeback_running=on
>> >> > and writethrough: it will wake up the backend raid6 array and flush all
>> >> > dirty data. You can then then revert to writeback and writeback_running=off.
>> >> > After this you can spin-down the backend array again.
>> >> >
>> >> > You also get read caching for free, which helps the backend array to stay
>> >> > spun down as much as possible.
>> >> >
>> >> > Maybe you can modify bcache slightly so to implement an automatic switching
>> >> > between the modes as described above, instead of polling the state from
>> >> > outside.
>> >> >
>> >> > Would that work, or you are asking something different?
>> >> >
>> >> > EW
>> >> >
>> >> > --
>> >> > To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> >> > the body of a message to majordomo@vger.kernel.org
>> >> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> > --
>> >
>> > piergiorgio
>
> --
>
> piergiorgio

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-03  5:52       ` NeilBrown
  2014-11-03 18:04         ` Piergiorgio Sartor
  2014-11-06  2:24         ` Anshuman Aggarwal
@ 2014-11-24  7:29         ` Anshuman Aggarwal
  2014-11-24 22:50           ` NeilBrown
  2 siblings, 1 reply; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-11-24  7:29 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> Would chunksize==disksize work? Wouldn't that lead to the entire
>> parity be invalidated for any write to any of the disks (assuming md
>> operates at a chunk level)...also please see my reply below
>
> Operating at a chunk level would be a very poor design choice.  md/raid5
> operates in units of 1 page (4K).

It appears that my requirement may be met by a partitionable md RAID4
array where the partitions each sit on an individual underlying block
device and are not striped across the block devices. Is that currently
possible with md raid? I don't see how, but such an enhancement could
do all that I had outlined earlier.

Is this possible to implement using RAID4 and MD already? Can the
partitions be made to write to individual block devices such that
parity updates don't require reading all devices?

To illustrate:
----------------- RAID-4 ---------------------
Device 1     Device 2     Device 3     Parity
A1           B1           C1           P1
A2           B2           C2           P2
A3           B3           C3           P3
Each device gets written to independently (via a layer of block
devices)... so data on Device 1 is written as contiguous blocks A1, A2,
A3, leading to updates of P1, P2, P3 (computed with XOR, without
causing any reads on Devices 2 and 3).

In RAID4 as it stands, IIUC, data gets striped and all the devices are
presented as a single block device.
>
>
>>
>> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>> > Right on most counts but please see comments below.
>> >
>> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>> >> devices contains an independent filesystem and could be accessed directly if
>> >> needed.  Each of the X devices contains some codes so that if at most X
>> >> devices in total died, you would still be able to recover all of the data.
>> >> If more than X devices failed, you would still get complete data from the
>> >> working devices.
>> >>
>> >> Every update would only write to the particular N device on which it is
>> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >> than X for the spin-down to be really worth it.
>> >>
>> >> Am I right so far?
>> >
>> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>> > devices to 1 data) so spin down is totally worth it for data
>> > protection but more on that below.
>> >
>> >>
>> >> For some reason the writes to X are delayed...  I don't really understand
>> >> that part.
>> >
>> > This delay is basically designed around archival devices which are
>> > rarely read from and even more rarely written to. By delaying writes
>> > on 2 criteria ( designated cache buffer filling up or preset time
>> > duration from last write expiring) we can significantly reduce the
>> > writes on the parity device. This assumes that we are ok to lose a
>> > movie or two in case the parity disk is not totally up to date but are
>> > more interested in device longevity.
>> >
>> >>
>> >> Sounds like multi-parity RAID6 with no parity rotation and
>> >>   chunksize == devicesize
>> > RAID6 would present us with a joint device and currently only allows
>> > writes to that directly, yes? Any writes will be striped.
>
> If the chunksize equals the device size, then you need a very large write for
> it to be striped.
>
>> > In any case would md raid allow the underlying device to be written to
>> > directly? Also how would it know that the device has been written to
>> > and hence parity has to be updated? What about the superblock which
>> > the FS would not know about?
>
> No, you wouldn't write to the underlying device.  You would carefully
> partition the RAID5 so each partition aligns exactly with an underlying
> device.  Then write to the partition.
>
>> >
>> > Also except for the delayed checksum writing part which would be
>> > significant if one of the objectives is to reduce the amount of
>> > writes. Can we delay that in the code currently for RAID6? I
>> > understand the objective of RAID6 is to ensure data recovery and we
>> > are looking at a compromise in this case.
>
> "simple matter of programming"
> Of course there would be a limit to how much data can be buffered in memory
> before it has to be flushed out.
> If you are mostly storing movies, then they are probably too large to
> buffer.  Why not just write them out straight away?
>
> NeilBrown
>
>
>
>> >
>> > If feasible, this can be an enhancement to MD RAID as well where N
>> > devices are presented instead of a single joint device in case of
>> > raid6 (maybe the multi part device can be individual disks?)
>> >
>> > It will certainly solve my problem of where to store the metadata. I
>> > was currently hoping to just store it as a configuration file to be
>> > read by the initramfs since in this case worst case scenario the
>> > checksum goes out of sync and is rebuilt from scratch.
>> >
>> >>
>> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> >> impartial opinion from me on that topic.
>> >
>> > I haven't hacked around the kernel internals much so far so will have
>> > to dig out that history. I will welcome any particular links/mail
>> > threads I should look at for guidance (with both yours and opposing
>> > points of view)
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-24  7:29         ` Anshuman Aggarwal
@ 2014-11-24 22:50           ` NeilBrown
  2014-11-26  6:24             ` Anshuman Aggarwal
  0 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2014-11-24 22:50 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Mdadm

[-- Attachment #1: Type: text/plain, Size: 7687 bytes --]

On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> > <anshuman.aggarwal@gmail.com> wrote:
> >
> >> Would chunksize==disksize work? Wouldn't that lead to the entire
> >> parity be invalidated for any write to any of the disks (assuming md
> >> operates at a chunk level)...also please see my reply below
> >
> > Operating at a chunk level would be a very poor design choice.  md/raid5
> > operates in units of 1 page (4K).
> 
> It appears that my requirement may be met by a partitionable md raid 4
> array where the partitions are all on individual underlying block
> devices not striped across the block devices. Is that currently
> possible with md raid? I dont' see how but such an enhancement could
> do all that I had outlined earlier
> 
> Is this possible to implement using RAID4 and MD already?

Nearly.  RAID4 currently requires the chunk size to be a power of 2.
Rounding down the size of your drives to match that could waste nearly half
the space.  However it should work as a proof-of-concept.

RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
RAID4/5/6 would be quite possible.

>   can the
> partitions be made to write to individual block devices such that
> parity updates don't require reading all devices?

md/raid4 currently tries to minimize total IO requests when performing
an update, but prefers spreading the IO over more devices if the total
number of requests is the same.

So for a 4-drive RAID4, updating a single block can be done by:
  read old data block, read parity, write data, write parity - 4 IO requests
or
  read other 2 data blocks, write data, write parity - 4 IO requests.

In this case it will prefer the second, which is not what you want.
With 5-drive RAID4, the second option will require 5 IO requests, so the first
will be chosen.
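
To make the accounting concrete, here is a tiny stand-alone sketch (not
the raid5.c code, just the arithmetic) that reproduces those counts:

#include <stdio.h>

/* dirty = data blocks in the stripe being updated,
 * ndata = data devices in the array (parity device not counted) */
static int rmw_ios(int dirty)
{
    /* read old data blocks and old parity, write new data and parity */
    return 2 * (dirty + 1);
}

static int rcw_ios(int ndata, int dirty)
{
    /* read the untouched data blocks, write new data and parity */
    return (ndata - dirty) + dirty + 1;
}

int main(void)
{
    printf("4-drive RAID4, 1 block: rmw=%d rcw=%d\n",
           rmw_ios(1), rcw_ios(3, 1));
    printf("5-drive RAID4, 1 block: rmw=%d rcw=%d\n",
           rmw_ios(1), rcw_ios(4, 1));
    return 0;
}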
It is quite trivial to flip this default for testing:

-	if (rmw < rcw && rmw > 0) {
+	if (rmw <= rcw && rmw > 0) {


If you had 5 drives, you could experiment with no code changes.
Make the chunk size the largest power of 2 that fits in the device, and then
partition to align the partitions on those boundaries.
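
For example (illustrative only, not mdadm or kernel code; the 4TB drive
size is just an example), the numbers work out like this:

#include <stdint.h>
#include <stdio.h>

static uint64_t largest_pow2_le(uint64_t sectors)
{
    uint64_t c = 1;
    while (c * 2 <= sectors)
        c *= 2;
    return c;
}

int main(void)
{
    uint64_t dev = 7814037168ULL;           /* ~4TB drive, in sectors */
    uint64_t chunk = largest_pow2_le(dev);  /* 2^32 sectors = 2 TiB   */

    printf("chunk = %llu sectors, %.1f%% of each device unused\n",
           (unsigned long long)chunk,
           100.0 * (double)(dev - chunk) / (double)dev);

    /* with no parity rotation, partition i of the array then maps
     * entirely onto data device i                                  */
    for (int i = 0; i < 4; i++)
        printf("partition %d starts at array sector %llu\n",
               i, (unsigned long long)(i * chunk));
    return 0;
}

That also shows the space wasted by the power-of-2 rounding mentioned
above.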

NeilBrown


> 
> To illustrate:
> -----------------RAID - 4 ---------------------
> |
> Device 1       Device 2       Device 3       Parity
> A1                 B1                 C1                P1
> A2                 B2                 C2                P2
> A3                 B3                 C3                P3
> 
> Each device gets written to independently (via a layer of block
> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
> blocks leading to updation of P1, P2 P3 (without causing any reads on
> devices 2 and 3 using XOR for the parity).
> 
> In RAID4, IIUC data gets striped and all devices become a single block device.
> 
> 
> >
> >
> >>
> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> >> > Right on most counts but please see comments below.
> >> >
> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >> >> Just to be sure I understand, you would have N + X devices.  Each of the N
> >> >> devices contains an independent filesystem and could be accessed directly if
> >> >> needed.  Each of the X devices contains some codes so that if at most X
> >> >> devices in total died, you would still be able to recover all of the data.
> >> >> If more than X devices failed, you would still get complete data from the
> >> >> working devices.
> >> >>
> >> >> Every update would only write to the particular N device on which it is
> >> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
> >> >> than X for the spin-down to be really worth it.
> >> >>
> >> >> Am I right so far?
> >> >
> >> > Perfectly right so far. I typically have a N to X ratio of 4 (4
> >> > devices to 1 data) so spin down is totally worth it for data
> >> > protection but more on that below.
> >> >
> >> >>
> >> >> For some reason the writes to X are delayed...  I don't really understand
> >> >> that part.
> >> >
> >> > This delay is basically designed around archival devices which are
> >> > rarely read from and even more rarely written to. By delaying writes
> >> > on 2 criteria ( designated cache buffer filling up or preset time
> >> > duration from last write expiring) we can significantly reduce the
> >> > writes on the parity device. This assumes that we are ok to lose a
> >> > movie or two in case the parity disk is not totally up to date but are
> >> > more interested in device longevity.
> >> >
> >> >>
> >> >> Sounds like multi-parity RAID6 with no parity rotation and
> >> >>   chunksize == devicesize
> >> > RAID6 would present us with a joint device and currently only allows
> >> > writes to that directly, yes? Any writes will be striped.
> >
> > If the chunksize equals the device size, then you need a very large write for
> > it to be striped.
> >
> >> > In any case would md raid allow the underlying device to be written to
> >> > directly? Also how would it know that the device has been written to
> >> > and hence parity has to be updated? What about the superblock which
> >> > the FS would not know about?
> >
> > No, you wouldn't write to the underlying device.  You would carefully
> > partition the RAID5 so each partition aligns exactly with an underlying
> > device.  Then write to the partition.
> >
> >> >
> >> > Also except for the delayed checksum writing part which would be
> >> > significant if one of the objectives is to reduce the amount of
> >> > writes. Can we delay that in the code currently for RAID6? I
> >> > understand the objective of RAID6 is to ensure data recovery and we
> >> > are looking at a compromise in this case.
> >
> > "simple matter of programming"
> > Of course there would be a limit to how much data can be buffered in memory
> > before it has to be flushed out.
> > If you are mostly storing movies, then they are probably too large to
> > buffer.  Why not just write them out straight away?
> >
> > NeilBrown
> >
> >
> >
> >> >
> >> > If feasible, this can be an enhancement to MD RAID as well where N
> >> > devices are presented instead of a single joint device in case of
> >> > raid6 (maybe the multi part device can be individual disks?)
> >> >
> >> > It will certainly solve my problem of where to store the metadata. I
> >> > was currently hoping to just store it as a configuration file to be
> >> > read by the initramfs since in this case worst case scenario the
> >> > checksum goes out of sync and is rebuilt from scratch.
> >> >
> >> >>
> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
> >> >> impartial opinion from me on that topic.
> >> >
> >> > I haven't hacked around the kernel internals much so far so will have
> >> > to dig out that history. I will welcome any particular links/mail
> >> > threads I should look at for guidance (with both yours and opposing
> >> > points of view)
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 811 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-24 22:50           ` NeilBrown
@ 2014-11-26  6:24             ` Anshuman Aggarwal
  2014-12-01 16:00               ` Anshuman Aggarwal
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-11-26  6:24 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>> > <anshuman.aggarwal@gmail.com> wrote:
>> >
>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>> >> parity be invalidated for any write to any of the disks (assuming md
>> >> operates at a chunk level)...also please see my reply below
>> >
>> > Operating at a chunk level would be a very poor design choice.  md/raid5
>> > operates in units of 1 page (4K).
>>
>> It appears that my requirement may be met by a partitionable md raid 4
>> array where the partitions are all on individual underlying block
>> devices not striped across the block devices. Is that currently
>> possible with md raid? I dont' see how but such an enhancement could
>> do all that I had outlined earlier
>>
>> Is this possible to implement using RAID4 and MD already?
>
> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
> Rounding down the size of your drives to match that could waste nearly half
> the space.  However it should work as a proof-of-concept.
>
> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
> RAID4/5/6 would be quite possible.
>
>>   can the
>> partitions be made to write to individual block devices such that
>> parity updates don't require reading all devices?
>
> md/raid4 will currently tries to minimize total IO requests when performing
> an update, but prefer spreading the IO over more devices if the total number
> of requests is the same.
>
> So for a 4-drive RAID4, Updating a single block can be done by:
>   read old data block, read parity, write data, write parity - 4 IO requests
> or
>   read other 2 data blocks, write data, write parity - 4 IO requests.
>
> In this case it will prefer the second, which is not what you want.
> With 5-drive RAID4, the second option will require 5 IO requests, so the first
> will be chosen.
> It is quite trivial to flip this default for testing
>
> -       if (rmw < rcw && rmw > 0) {
> +       if (rmw <= rcw && rmw > 0) {
>
>
> If you had 5 drives, you could experiment with no code changes.
> Make the chunk size the largest power of 2 that fits in the device, and then
> partition to align the partitions on those boundaries.

If the chunk size is almost the same as the device size, I assume the
entire chunk is not invalidated for parity on a write to a single
block? I.e. if only 1 block is updated, only that block's parity will
be read and written, not the parity for the whole chunk? If that's the
case, what purpose does a chunk serve in md raid? If that's not the
case, it wouldn't work, because a single block update would lead to
parity being written for the entire chunk, which is the size of the
device.
I do have more than 5 drives, though they are currently in use. I will
create a small testing partition of the same size on each device and
run the test on that, after ensuring that the drives do go to sleep.

>
> NeilBrown
>

Thanks,
Anshuman
>
>>
>> To illustrate:
>> -----------------RAID - 4 ---------------------
>> |
>> Device 1       Device 2       Device 3       Parity
>> A1                 B1                 C1                P1
>> A2                 B2                 C2                P2
>> A3                 B3                 C3                P3
>>
>> Each device gets written to independently (via a layer of block
>> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
>> blocks leading to updation of P1, P2 P3 (without causing any reads on
>> devices 2 and 3 using XOR for the parity).
>>
>> In RAID4, IIUC data gets striped and all devices become a single block device.
>>
>>
>> >
>> >
>> >>
>> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>> >> > Right on most counts but please see comments below.
>> >> >
>> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> >> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>> >> >> devices contains an independent filesystem and could be accessed directly if
>> >> >> needed.  Each of the X devices contains some codes so that if at most X
>> >> >> devices in total died, you would still be able to recover all of the data.
>> >> >> If more than X devices failed, you would still get complete data from the
>> >> >> working devices.
>> >> >>
>> >> >> Every update would only write to the particular N device on which it is
>> >> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>> >> >> than X for the spin-down to be really worth it.
>> >> >>
>> >> >> Am I right so far?
>> >> >
>> >> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>> >> > devices to 1 data) so spin down is totally worth it for data
>> >> > protection but more on that below.
>> >> >
>> >> >>
>> >> >> For some reason the writes to X are delayed...  I don't really understand
>> >> >> that part.
>> >> >
>> >> > This delay is basically designed around archival devices which are
>> >> > rarely read from and even more rarely written to. By delaying writes
>> >> > on 2 criteria ( designated cache buffer filling up or preset time
>> >> > duration from last write expiring) we can significantly reduce the
>> >> > writes on the parity device. This assumes that we are ok to lose a
>> >> > movie or two in case the parity disk is not totally up to date but are
>> >> > more interested in device longevity.
>> >> >
>> >> >>
>> >> >> Sounds like multi-parity RAID6 with no parity rotation and
>> >> >>   chunksize == devicesize
>> >> > RAID6 would present us with a joint device and currently only allows
>> >> > writes to that directly, yes? Any writes will be striped.
>> >
>> > If the chunksize equals the device size, then you need a very large write for
>> > it to be striped.
>> >
>> >> > In any case would md raid allow the underlying device to be written to
>> >> > directly? Also how would it know that the device has been written to
>> >> > and hence parity has to be updated? What about the superblock which
>> >> > the FS would not know about?
>> >
>> > No, you wouldn't write to the underlying device.  You would carefully
>> > partition the RAID5 so each partition aligns exactly with an underlying
>> > device.  Then write to the partition.
>> >
>> >> >
>> >> > Also except for the delayed checksum writing part which would be
>> >> > significant if one of the objectives is to reduce the amount of
>> >> > writes. Can we delay that in the code currently for RAID6? I
>> >> > understand the objective of RAID6 is to ensure data recovery and we
>> >> > are looking at a compromise in this case.
>> >
>> > "simple matter of programming"
>> > Of course there would be a limit to how much data can be buffered in memory
>> > before it has to be flushed out.
>> > If you are mostly storing movies, then they are probably too large to
>> > buffer.  Why not just write them out straight away?
>> >
>> > NeilBrown
>> >
>> >
>> >
>> >> >
>> >> > If feasible, this can be an enhancement to MD RAID as well where N
>> >> > devices are presented instead of a single joint device in case of
>> >> > raid6 (maybe the multi part device can be individual disks?)
>> >> >
>> >> > It will certainly solve my problem of where to store the metadata. I
>> >> > was currently hoping to just store it as a configuration file to be
>> >> > read by the initramfs since in this case worst case scenario the
>> >> > checksum goes out of sync and is rebuilt from scratch.
>> >> >
>> >> >>
>> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> >> >> impartial opinion from me on that topic.
>> >> >
>> >> > I haven't hacked around the kernel internals much so far so will have
>> >> > to dig out that history. I will welcome any particular links/mail
>> >> > threads I should look at for guidance (with both yours and opposing
>> >> > points of view)
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-11-26  6:24             ` Anshuman Aggarwal
@ 2014-12-01 16:00               ` Anshuman Aggarwal
  2014-12-01 16:34                 ` Anshuman Aggarwal
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-12-01 16:00 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 26 November 2014 at 11:54, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>>
>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>>> > <anshuman.aggarwal@gmail.com> wrote:
>>> >
>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>>> >> parity be invalidated for any write to any of the disks (assuming md
>>> >> operates at a chunk level)...also please see my reply below
>>> >
>>> > Operating at a chunk level would be a very poor design choice.  md/raid5
>>> > operates in units of 1 page (4K).
>>>
>>> It appears that my requirement may be met by a partitionable md raid 4
>>> array where the partitions are all on individual underlying block
>>> devices not striped across the block devices. Is that currently
>>> possible with md raid? I dont' see how but such an enhancement could
>>> do all that I had outlined earlier
>>>
>>> Is this possible to implement using RAID4 and MD already?
>>
>> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
>> Rounding down the size of your drives to match that could waste nearly half
>> the space.  However it should work as a proof-of-concept.
>>
>> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
>> RAID4/5/6 would be quite possible.
>>
>>>   can the
>>> partitions be made to write to individual block devices such that
>>> parity updates don't require reading all devices?
>>
>> md/raid4 will currently tries to minimize total IO requests when performing
>> an update, but prefer spreading the IO over more devices if the total number
>> of requests is the same.
>>
>> So for a 4-drive RAID4, Updating a single block can be done by:
>>   read old data block, read parity, write data, write parity - 4 IO requests
>> or
>>   read other 2 data blocks, write data, write parity - 4 IO requests.
>>
>> In this case it will prefer the second, which is not what you want.
>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>> will be chosen.
>> It is quite trivial to flip this default for testing
>>
>> -       if (rmw < rcw && rmw > 0) {
>> +       if (rmw <= rcw && rmw > 0) {
>>
>>
>> If you had 5 drives, you could experiment with no code changes.
>> Make the chunk size the largest power of 2 that fits in the device, and then
>> partition to align the partitions on those boundaries.
>
> If the chunk size is almost the same as the device size, I assume the
> entire chunk is not invalidated for parity on writing to a single
> block? i.e. if only 1 block is updated only that blocks parity will be
> read and written and not for the whole chunk? If thats' the case, what
> purpose does a chunk serve in md raid ? If that's not the case, it
> wouldn't work because a single block updation would lead to parity
> being written for the entire chunk, which is the size of the device
>
> I do have more than 5 drives though they are in use currently. I will
> create a small testing partition on each device of the same size and
> run the test on that after ensuring that the drives do go to sleep.
>
>>
>> NeilBrown
>>

Wouldn't the metadata writes wake up all the disks in the array anyway
(defeating the purpose)? This idea would require the metadata to not be
written out to each device (is that even possible, or on the cards?).

I am about to try out your suggestion with the chunk sizes anyway, but
the metadata looks like it could be a major stumbling block.

>
> Thanks,
> Anshuman
>>
>>>
>>> To illustrate:
>>> -----------------RAID - 4 ---------------------
>>> |
>>> Device 1       Device 2       Device 3       Parity
>>> A1                 B1                 C1                P1
>>> A2                 B2                 C2                P2
>>> A3                 B3                 C3                P3
>>>
>>> Each device gets written to independently (via a layer of block
>>> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
>>> blocks leading to updation of P1, P2 P3 (without causing any reads on
>>> devices 2 and 3 using XOR for the parity).
>>>
>>> In RAID4, IIUC data gets striped and all devices become a single block device.
>>>
>>>
>>> >
>>> >
>>> >>
>>> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>> >> > Right on most counts but please see comments below.
>>> >> >
>>> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>>> >> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>>> >> >> devices contains an independent filesystem and could be accessed directly if
>>> >> >> needed.  Each of the X devices contains some codes so that if at most X
>>> >> >> devices in total died, you would still be able to recover all of the data.
>>> >> >> If more than X devices failed, you would still get complete data from the
>>> >> >> working devices.
>>> >> >>
>>> >> >> Every update would only write to the particular N device on which it is
>>> >> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>>> >> >> than X for the spin-down to be really worth it.
>>> >> >>
>>> >> >> Am I right so far?
>>> >> >
>>> >> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>>> >> > devices to 1 data) so spin down is totally worth it for data
>>> >> > protection but more on that below.
>>> >> >
>>> >> >>
>>> >> >> For some reason the writes to X are delayed...  I don't really understand
>>> >> >> that part.
>>> >> >
>>> >> > This delay is basically designed around archival devices which are
>>> >> > rarely read from and even more rarely written to. By delaying writes
>>> >> > on 2 criteria ( designated cache buffer filling up or preset time
>>> >> > duration from last write expiring) we can significantly reduce the
>>> >> > writes on the parity device. This assumes that we are ok to lose a
>>> >> > movie or two in case the parity disk is not totally up to date but are
>>> >> > more interested in device longevity.
>>> >> >
>>> >> >>
>>> >> >> Sounds like multi-parity RAID6 with no parity rotation and
>>> >> >>   chunksize == devicesize
>>> >> > RAID6 would present us with a joint device and currently only allows
>>> >> > writes to that directly, yes? Any writes will be striped.
>>> >
>>> > If the chunksize equals the device size, then you need a very large write for
>>> > it to be striped.
>>> >
>>> >> > In any case would md raid allow the underlying device to be written to
>>> >> > directly? Also how would it know that the device has been written to
>>> >> > and hence parity has to be updated? What about the superblock which
>>> >> > the FS would not know about?
>>> >
>>> > No, you wouldn't write to the underlying device.  You would carefully
>>> > partition the RAID5 so each partition aligns exactly with an underlying
>>> > device.  Then write to the partition.
>>> >
>>> >> >
>>> >> > Also except for the delayed checksum writing part which would be
>>> >> > significant if one of the objectives is to reduce the amount of
>>> >> > writes. Can we delay that in the code currently for RAID6? I
>>> >> > understand the objective of RAID6 is to ensure data recovery and we
>>> >> > are looking at a compromise in this case.
>>> >
>>> > "simple matter of programming"
>>> > Of course there would be a limit to how much data can be buffered in memory
>>> > before it has to be flushed out.
>>> > If you are mostly storing movies, then they are probably too large to
>>> > buffer.  Why not just write them out straight away?
>>> >
>>> > NeilBrown
>>> >
>>> >
>>> >
>>> >> >
>>> >> > If feasible, this can be an enhancement to MD RAID as well where N
>>> >> > devices are presented instead of a single joint device in case of
>>> >> > raid6 (maybe the multi part device can be individual disks?)
>>> >> >
>>> >> > It will certainly solve my problem of where to store the metadata. I
>>> >> > was currently hoping to just store it as a configuration file to be
>>> >> > read by the initramfs since in this case worst case scenario the
>>> >> > checksum goes out of sync and is rebuilt from scratch.
>>> >> >
>>> >> >>
>>> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>>> >> >> impartial opinion from me on that topic.
>>> >> >
>>> >> > I haven't hacked around the kernel internals much so far so will have
>>> >> > to dig out that history. I will welcome any particular links/mail
>>> >> > threads I should look at for guidance (with both yours and opposing
>>> >> > points of view)
>>> >> --
>>> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> >> the body of a message to majordomo@vger.kernel.org
>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>


* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-01 16:00               ` Anshuman Aggarwal
@ 2014-12-01 16:34                 ` Anshuman Aggarwal
  2014-12-01 21:46                   ` NeilBrown
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-12-01 16:34 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 1 December 2014 at 21:30, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 26 November 2014 at 11:54, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
>>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
>>> <anshuman.aggarwal@gmail.com> wrote:
>>>
>>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>>>> > <anshuman.aggarwal@gmail.com> wrote:
>>>> >
>>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>>>> >> parity be invalidated for any write to any of the disks (assuming md
>>>> >> operates at a chunk level)...also please see my reply below
>>>> >
>>>> > Operating at a chunk level would be a very poor design choice.  md/raid5
>>>> > operates in units of 1 page (4K).
>>>>
>>>> It appears that my requirement may be met by a partitionable md raid 4
>>>> array where the partitions are all on individual underlying block
>>>> devices not striped across the block devices. Is that currently
>>>> possible with md raid? I dont' see how but such an enhancement could
>>>> do all that I had outlined earlier
>>>>
>>>> Is this possible to implement using RAID4 and MD already?
>>>
>>> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
>>> Rounding down the size of your drives to match that could waste nearly half
>>> the space.  However it should work as a proof-of-concept.
>>>
>>> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
>>> RAID4/5/6 would be quite possible.
>>>
>>>>   can the
>>>> partitions be made to write to individual block devices such that
>>>> parity updates don't require reading all devices?
>>>
>>> md/raid4 will currently tries to minimize total IO requests when performing
>>> an update, but prefer spreading the IO over more devices if the total number
>>> of requests is the same.
>>>
>>> So for a 4-drive RAID4, Updating a single block can be done by:
>>>   read old data block, read parity, write data, write parity - 4 IO requests
>>> or
>>>   read other 2 data blocks, write data, write parity - 4 IO requests.
>>>
>>> In this case it will prefer the second, which is not what you want.
>>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>>> will be chosen.
>>> It is quite trivial to flip this default for testing
>>>
>>> -       if (rmw < rcw && rmw > 0) {
>>> +       if (rmw <= rcw && rmw > 0) {
>>>
>>>
>>> If you had 5 drives, you could experiment with no code changes.
>>> Make the chunk size the largest power of 2 that fits in the device, and then
>>> partition to align the partitions on those boundaries.
>>
>> If the chunk size is almost the same as the device size, I assume the
>> entire chunk is not invalidated for parity on writing to a single
>> block? i.e. if only 1 block is updated only that blocks parity will be
>> read and written and not for the whole chunk? If thats' the case, what
>> purpose does a chunk serve in md raid ? If that's not the case, it
>> wouldn't work because a single block updation would lead to parity
>> being written for the entire chunk, which is the size of the device
>>
>> I do have more than 5 drives though they are in use currently. I will
>> create a small testing partition on each device of the same size and
>> run the test on that after ensuring that the drives do go to sleep.
>>
>>>
>>> NeilBrown
>>>
>
> Wouldn't the meta data writes wake up all the disks in the cluster
> anyways (defeating the purpose)? This idea will require metadata to
> not be written out to each device (is that even possible or on the
> cards?)
>
> I am about to try out your suggestion with the chunk sizes anyways but
> thought about the metadata being a major stumbling block.
>

And it seems to be confirmed that the metadata writes are waking up the
other drives: on any write to a particular drive, the metadata update
touches all the others.

Am I correct in assuming that all metadata is currently written to the
block device itself, and that even "external" metadata is still embedded
in each of the member devices (only the format of the metadata is
defined externally)? I guess implementing this would mean storing the
metadata elsewhere, which may be major development work. Still, that
may be a flexibility desired in md RAID for other reasons...
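
(As a minimal illustration of what I mean by "embedded": the superblock
that md keeps on each member can be dumped with the following, assuming
/dev/sdb1 is one member and /dev/md0 is the array, both names being
placeholders:

    mdadm --examine /dev/sdb1   # per-member superblock stored on the device itself
    mdadm --detail /dev/md0     # array-wide view assembled from those superblocks
)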

Neil, your thoughts.

>>
>> Thanks,
>> Anshuman
>>>
>>>>
>>>> To illustrate:
>>>> -----------------RAID - 4 ---------------------
>>>> |
>>>> Device 1       Device 2       Device 3       Parity
>>>> A1                 B1                 C1                P1
>>>> A2                 B2                 C2                P2
>>>> A3                 B3                 C3                P3
>>>>
>>>> Each device gets written to independently (via a layer of block
>>>> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
>>>> blocks leading to updation of P1, P2 P3 (without causing any reads on
>>>> devices 2 and 3 using XOR for the parity).
>>>>
>>>> In RAID4, IIUC data gets striped and all devices become a single block device.
>>>>
>>>>
>>>> >
>>>> >
>>>> >>
>>>> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>>> >> > Right on most counts but please see comments below.
>>>> >> >
>>>> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>>>> >> >> Just to be sure I understand, you would have N + X devices.  Each of the N
>>>> >> >> devices contains an independent filesystem and could be accessed directly if
>>>> >> >> needed.  Each of the X devices contains some codes so that if at most X
>>>> >> >> devices in total died, you would still be able to recover all of the data.
>>>> >> >> If more than X devices failed, you would still get complete data from the
>>>> >> >> working devices.
>>>> >> >>
>>>> >> >> Every update would only write to the particular N device on which it is
>>>> >> >> relevant, and  all of the X devices.  So N needs to be quite a bit bigger
>>>> >> >> than X for the spin-down to be really worth it.
>>>> >> >>
>>>> >> >> Am I right so far?
>>>> >> >
>>>> >> > Perfectly right so far. I typically have a N to X ratio of 4 (4
>>>> >> > devices to 1 data) so spin down is totally worth it for data
>>>> >> > protection but more on that below.
>>>> >> >
>>>> >> >>
>>>> >> >> For some reason the writes to X are delayed...  I don't really understand
>>>> >> >> that part.
>>>> >> >
>>>> >> > This delay is basically designed around archival devices which are
>>>> >> > rarely read from and even more rarely written to. By delaying writes
>>>> >> > on 2 criteria ( designated cache buffer filling up or preset time
>>>> >> > duration from last write expiring) we can significantly reduce the
>>>> >> > writes on the parity device. This assumes that we are ok to lose a
>>>> >> > movie or two in case the parity disk is not totally up to date but are
>>>> >> > more interested in device longevity.
>>>> >> >
>>>> >> >>
>>>> >> >> Sounds like multi-parity RAID6 with no parity rotation and
>>>> >> >>   chunksize == devicesize
>>>> >> > RAID6 would present us with a joint device and currently only allows
>>>> >> > writes to that directly, yes? Any writes will be striped.
>>>> >
>>>> > If the chunksize equals the device size, then you need a very large write for
>>>> > it to be striped.
>>>> >
>>>> >> > In any case would md raid allow the underlying device to be written to
>>>> >> > directly? Also how would it know that the device has been written to
>>>> >> > and hence parity has to be updated? What about the superblock which
>>>> >> > the FS would not know about?
>>>> >
>>>> > No, you wouldn't write to the underlying device.  You would carefully
>>>> > partition the RAID5 so each partition aligns exactly with an underlying
>>>> > device.  Then write to the partition.
>>>> >
>>>> >> >
>>>> >> > Also except for the delayed checksum writing part which would be
>>>> >> > significant if one of the objectives is to reduce the amount of
>>>> >> > writes. Can we delay that in the code currently for RAID6? I
>>>> >> > understand the objective of RAID6 is to ensure data recovery and we
>>>> >> > are looking at a compromise in this case.
>>>> >
>>>> > "simple matter of programming"
>>>> > Of course there would be a limit to how much data can be buffered in memory
>>>> > before it has to be flushed out.
>>>> > If you are mostly storing movies, then they are probably too large to
>>>> > buffer.  Why not just write them out straight away?
>>>> >
>>>> > NeilBrown
>>>> >
>>>> >
>>>> >
>>>> >> >
>>>> >> > If feasible, this can be an enhancement to MD RAID as well where N
>>>> >> > devices are presented instead of a single joint device in case of
>>>> >> > raid6 (maybe the multi part device can be individual disks?)
>>>> >> >
>>>> >> > It will certainly solve my problem of where to store the metadata. I
>>>> >> > was currently hoping to just store it as a configuration file to be
>>>> >> > read by the initramfs since in this case worst case scenario the
>>>> >> > checksum goes out of sync and is rebuilt from scratch.
>>>> >> >
>>>> >> >>
>>>> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>>>> >> >> impartial opinion from me on that topic.
>>>> >> >
>>>> >> > I haven't hacked around the kernel internals much so far so will have
>>>> >> > to dig out that history. I will welcome any particular links/mail
>>>> >> > threads I should look at for guidance (with both yours and opposing
>>>> >> > points of view)
>>>> >> --
>>>> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>> >> the body of a message to majordomo@vger.kernel.org
>>>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> >
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>


* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-01 16:34                 ` Anshuman Aggarwal
@ 2014-12-01 21:46                   ` NeilBrown
  2014-12-02 11:56                     ` Anshuman Aggarwal
  0 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2014-12-01 21:46 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Mdadm


On Mon, 1 Dec 2014 22:04:42 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 1 December 2014 at 21:30, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
> > On 26 November 2014 at 11:54, Anshuman Aggarwal
> > <anshuman.aggarwal@gmail.com> wrote:
> >> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
> >>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
> >>> <anshuman.aggarwal@gmail.com> wrote:
> >>>
> >>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> >>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> >>>> > <anshuman.aggarwal@gmail.com> wrote:
> >>>> >
> >>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
> >>>> >> parity be invalidated for any write to any of the disks (assuming md
> >>>> >> operates at a chunk level)...also please see my reply below
> >>>> >
> >>>> > Operating at a chunk level would be a very poor design choice.  md/raid5
> >>>> > operates in units of 1 page (4K).
> >>>>
> >>>> It appears that my requirement may be met by a partitionable md raid 4
> >>>> array where the partitions are all on individual underlying block
> >>>> devices not striped across the block devices. Is that currently
> >>>> possible with md raid? I dont' see how but such an enhancement could
> >>>> do all that I had outlined earlier
> >>>>
> >>>> Is this possible to implement using RAID4 and MD already?
> >>>
> >>> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
> >>> Rounding down the size of your drives to match that could waste nearly half
> >>> the space.  However it should work as a proof-of-concept.
> >>>
> >>> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
> >>> RAID4/5/6 would be quite possible.
> >>>
> >>>>   can the
> >>>> partitions be made to write to individual block devices such that
> >>>> parity updates don't require reading all devices?
> >>>
> >>> md/raid4 will currently tries to minimize total IO requests when performing
> >>> an update, but prefer spreading the IO over more devices if the total number
> >>> of requests is the same.
> >>>
> >>> So for a 4-drive RAID4, Updating a single block can be done by:
> >>>   read old data block, read parity, write data, write parity - 4 IO requests
> >>> or
> >>>   read other 2 data blocks, write data, write parity - 4 IO requests.
> >>>
> >>> In this case it will prefer the second, which is not what you want.
> >>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
> >>> will be chosen.
> >>> It is quite trivial to flip this default for testing
> >>>
> >>> -       if (rmw < rcw && rmw > 0) {
> >>> +       if (rmw <= rcw && rmw > 0) {
> >>>
> >>>
> >>> If you had 5 drives, you could experiment with no code changes.
> >>> Make the chunk size the largest power of 2 that fits in the device, and then
> >>> partition to align the partitions on those boundaries.
> >>
> >> If the chunk size is almost the same as the device size, I assume the
> >> entire chunk is not invalidated for parity on writing to a single
> >> block? i.e. if only 1 block is updated only that blocks parity will be
> >> read and written and not for the whole chunk? If thats' the case, what
> >> purpose does a chunk serve in md raid ? If that's not the case, it
> >> wouldn't work because a single block updation would lead to parity
> >> being written for the entire chunk, which is the size of the device
> >>
> >> I do have more than 5 drives though they are in use currently. I will
> >> create a small testing partition on each device of the same size and
> >> run the test on that after ensuring that the drives do go to sleep.
> >>
> >>>
> >>> NeilBrown
> >>>
> >
> > Wouldn't the meta data writes wake up all the disks in the cluster
> > anyways (defeating the purpose)? This idea will require metadata to
> > not be written out to each device (is that even possible or on the
> > cards?)
> >
> > I am about to try out your suggestion with the chunk sizes anyways but
> > thought about the metadata being a major stumbling block.
> >
> 
> And it seems to be confirmed that the metadata write is waking up the
> other drives. On any write to a particular drive the metadata update
> is accessing all the others.
> 
> Am I correct in assuming that all metadata is currently written as
> part of the block device itself and that the external metadata  is
> still embedded in each of the block devices (only the format of the
> metadata is defined externally?) I guess to implement this we would
> need to store metadata elsewhere which may be a major development
> work. Still that may be a flexibility desired in md raid for other
> reasons...
> 
> Neil, your thoughts.

This is exactly why I suggested testing with existing code and seeing how far
you can get.  Thanks.

For a full solution we probably do need some code changes here, but for
further testing you could:
1/ make sure there is no bitmap (mdadm --grow --bitmap=none)
2/ set the safe_mode_delay to 0
     echo 0 > /sys/block/mdXXX/md/safe_mode_delay

Then it won't try to update the metadata until you stop the array, or a
device fails.
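
Put together on a test array, those two steps might look like this
(assuming the array is /dev/md0; a sketch rather than tested commands):

    mdadm --grow /dev/md0 --bitmap=none          # 1/ drop the write-intent bitmap, if any
    echo 0 > /sys/block/md0/md/safe_mode_delay   # 2/ stop the clean/dirty superblock updates between writes
    cat /sys/block/md0/md/safe_mode_delay        # confirm it took effect (reads back as 0.000)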

Longer term: it would probably be good to only update the bitmap on the
devices that are being written to - and to merge all bitmaps when assembling
the array.  Also when there is a bitmap, the safe_mode functionality should
probably be disabled.

NeilBrown




* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-01 21:46                   ` NeilBrown
@ 2014-12-02 11:56                     ` Anshuman Aggarwal
  2014-12-16 16:25                       ` Anshuman Aggarwal
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-12-02 11:56 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

It works! (At least on a sample 5 MB device with 5 x 1 MB partitions :-)
I will find more space on my drives and do a larger test, but I don't
see why it shouldn't work. A sketch of one way to set up such a test
follows the questions below.
Here are the remaining caveats (and questions):
- Neil, as you pointed out, the power-of-2 chunk size will probably
need a code change (in the kernel, or only in the userspace tool?)
    - Are there performance or other reasons why a terabyte-sized chunk
may not be feasible?
- Implications of safe_mode_delay:
    - Would the metadata be updated on both the block device being
written to and the parity device?
    - If the drive that fails is the same drive being written to, would
the lack of metadata updates on the other devices affect
reconstruction?
- Adding new devices (is it possible to move the parity to the disk
being added? How does device addition work for RAID4 ... is it added as
a zeroed-out device with the parity disk remaining the same?)
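
A proof-of-concept array of this kind can be put together entirely from
loop devices; the following is only a sketch with illustrative names and
sizes, not the exact commands used for the test above:

    # small backing files -> loop devices
    for i in 1 2 3 4 5; do
        dd if=/dev/zero of=/tmp/split$i.img bs=1M count=64
        losetup /dev/loop$i /tmp/split$i.img
    done

    # RAID4 over the five loops: four data members plus one parity member.
    # In the real scheme the chunk is the largest power of 2 that fits in
    # a member; 32 MB is used here purely for illustration.
    mdadm --create /dev/md0 --level=4 --chunk=32768 --raid-devices=5 \
          /dev/loop1 /dev/loop2 /dev/loop3 /dev/loop4 /dev/loop5

    # then partition /dev/md0 so each partition lines up with one member,
    # drop the bitmap and set safe_mode_delay to 0 as discussed earlier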


On 2 December 2014 at 03:16, NeilBrown <neilb@suse.de> wrote:
> On Mon, 1 Dec 2014 22:04:42 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> On 1 December 2014 at 21:30, Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>> > On 26 November 2014 at 11:54, Anshuman Aggarwal
>> > <anshuman.aggarwal@gmail.com> wrote:
>> >> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
>> >>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
>> >>> <anshuman.aggarwal@gmail.com> wrote:
>> >>>
>> >>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>> >>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>> >>>> > <anshuman.aggarwal@gmail.com> wrote:
>> >>>> >
>> >>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>> >>>> >> parity be invalidated for any write to any of the disks (assuming md
>> >>>> >> operates at a chunk level)...also please see my reply below
>> >>>> >
>> >>>> > Operating at a chunk level would be a very poor design choice.  md/raid5
>> >>>> > operates in units of 1 page (4K).
>> >>>>
>> >>>> It appears that my requirement may be met by a partitionable md raid 4
>> >>>> array where the partitions are all on individual underlying block
>> >>>> devices not striped across the block devices. Is that currently
>> >>>> possible with md raid? I dont' see how but such an enhancement could
>> >>>> do all that I had outlined earlier
>> >>>>
>> >>>> Is this possible to implement using RAID4 and MD already?
>> >>>
>> >>> Nearly.  RAID4 currently requires the chunk size to be a power of 2.
>> >>> Rounding down the size of your drives to match that could waste nearly half
>> >>> the space.  However it should work as a proof-of-concept.
>> >>>
>> >>> RAID0 supports non-power-of-2 chunk sizes.  Doing the same thing for
>> >>> RAID4/5/6 would be quite possible.
>> >>>
>> >>>>   can the
>> >>>> partitions be made to write to individual block devices such that
>> >>>> parity updates don't require reading all devices?
>> >>>
>> >>> md/raid4 will currently tries to minimize total IO requests when performing
>> >>> an update, but prefer spreading the IO over more devices if the total number
>> >>> of requests is the same.
>> >>>
>> >>> So for a 4-drive RAID4, Updating a single block can be done by:
>> >>>   read old data block, read parity, write data, write parity - 4 IO requests
>> >>> or
>> >>>   read other 2 data blocks, write data, write parity - 4 IO requests.
>> >>>
>> >>> In this case it will prefer the second, which is not what you want.
>> >>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>> >>> will be chosen.
>> >>> It is quite trivial to flip this default for testing
>> >>>
>> >>> -       if (rmw < rcw && rmw > 0) {
>> >>> +       if (rmw <= rcw && rmw > 0) {
>> >>>
>> >>>
>> >>> If you had 5 drives, you could experiment with no code changes.
>> >>> Make the chunk size the largest power of 2 that fits in the device, and then
>> >>> partition to align the partitions on those boundaries.
>> >>
>> >> If the chunk size is almost the same as the device size, I assume the
>> >> entire chunk is not invalidated for parity on writing to a single
>> >> block? i.e. if only 1 block is updated only that blocks parity will be
>> >> read and written and not for the whole chunk? If thats' the case, what
>> >> purpose does a chunk serve in md raid ? If that's not the case, it
>> >> wouldn't work because a single block updation would lead to parity
>> >> being written for the entire chunk, which is the size of the device
>> >>
>> >> I do have more than 5 drives though they are in use currently. I will
>> >> create a small testing partition on each device of the same size and
>> >> run the test on that after ensuring that the drives do go to sleep.
>> >>
>> >>>
>> >>> NeilBrown
>> >>>
>> >
>> > Wouldn't the meta data writes wake up all the disks in the cluster
>> > anyways (defeating the purpose)? This idea will require metadata to
>> > not be written out to each device (is that even possible or on the
>> > cards?)
>> >
>> > I am about to try out your suggestion with the chunk sizes anyways but
>> > thought about the metadata being a major stumbling block.
>> >
>>
>> And it seems to be confirmed that the metadata write is waking up the
>> other drives. On any write to a particular drive the metadata update
>> is accessing all the others.
>>
>> Am I correct in assuming that all metadata is currently written as
>> part of the block device itself and that the external metadata  is
>> still embedded in each of the block devices (only the format of the
>> metadata is defined externally?) I guess to implement this we would
>> need to store metadata elsewhere which may be a major development
>> work. Still that may be a flexibility desired in md raid for other
>> reasons...
>>
>> Neil, your thoughts.
>
> This is exactly why I suggested testing with existing code and seeing how far
> you can get.  Thanks.
>
> For a full solution we probably do need some code changes here, but for
> further testing you could:
> 1/ make sure there is no bitmap (mdadm --grow --bitmap=none)
> 2/ set the safe_mode_delay to 0
>      echo 0 > /sys/block/mdXXX/md/safe_mode_delay
>
> when it won't try to update the metadata until you stop the array, or a
> device fails.
>
> Longer term: it would probably be good to only update the bitmap on the
> devices that are being written to - and to merge all bitmaps when assembling
> the array.  Also when there is a bitmap, the safe_mode functionality should
> probably be disabled.
>
> NeilBrown
>


* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-02 11:56                     ` Anshuman Aggarwal
@ 2014-12-16 16:25                       ` Anshuman Aggarwal
  2014-12-16 21:49                         ` NeilBrown
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-12-16 16:25 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 2 December 2014 at 17:26, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> It works! (Atleast on a sample 5 MB device with 5 x 1MB partitions :-)
> will find more space on my drives and do a larger test but don't see
> why it shouldn't work)
> Here are the following caveats (and questions):
> - Neil, like you pointed out, the power of 2 chunk size will probably
> need a code change (in the kernel or only in the userspace tool?)
>     - Any performance or other reasons why a terabyte size chunk may
> not be feasible?
> - Implications of safe_mode_delay
>     - Would the metadata be updated on the block device be written to
> and the parity device as well?
>     - If the drive  fails which is the same as the drive being written
> to, would that lack of metadata updates to the other devices affect
> reconstruction?
> - Adding new devices (is it possible to move the parity to the disk
> being added? How does device addition work for RAID4 ...is it added as
> a zero-ed out device with parity disk remaining the same)
>
>

Neil, sorry to bump this thread. Could you please look over the
questions above and address the remaining items that would make this a
working solution? Thanks.


* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-16 16:25                       ` Anshuman Aggarwal
@ 2014-12-16 21:49                         ` NeilBrown
  2014-12-17  6:40                           ` Anshuman Aggarwal
  0 siblings, 1 reply; 28+ messages in thread
From: NeilBrown @ 2014-12-16 21:49 UTC (permalink / raw)
  To: Anshuman Aggarwal; +Cc: Mdadm


On Tue, 16 Dec 2014 21:55:15 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 2 December 2014 at 17:26, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
> > It works! (Atleast on a sample 5 MB device with 5 x 1MB partitions :-)
> > will find more space on my drives and do a larger test but don't see
> > why it shouldn't work)
> > Here are the following caveats (and questions):
> > - Neil, like you pointed out, the power of 2 chunk size will probably
> > need a code change (in the kernel or only in the userspace tool?)

In the kernel too.

> >     - Any performance or other reasons why a terabyte size chunk may
> > not be feasible?

Not that I can think of.

> > - Implications of safe_mode_delay
> >     - Would the metadata be updated on the block device be written to
> > and the parity device as well?

Probably.  Hard to give a specific answer to a vague question.

> >     - If the drive  fails which is the same as the drive being written
> > to, would that lack of metadata updates to the other devices affect
> > reconstruction?

Again, to give a precise answer, a detailed question is needed.  Obviously
any change would have to be made in such a way as to ensure that things which
needed to work, did work.


> > - Adding new devices (is it possible to move the parity to the disk
> > being added? How does device addition work for RAID4 ...is it added as
> > a zero-ed out device with parity disk remaining the same)

RAID5 or RAID6 with ALGORITHM_PARITY_0 puts the parity on the early devices.
Currently if you add a device to such an array ...... I'm not sure what it
will do.  It should be possible to make it just write zeros out.
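
If I remember the keyword correctly, mdadm exposes that layout as
"parity-first", so an array with parity on the first device could be
created with something like (device names being placeholders):

    mdadm --create /dev/md0 --level=5 --layout=parity-first \
          --raid-devices=5 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf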


NeilBrown


> >
> >
> 
> Neil, sorry to try to bump this thread. Could you please look over the
> questions and address the points on the remaining items that can make
> it a working solution? Thanks




* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-16 21:49                         ` NeilBrown
@ 2014-12-17  6:40                           ` Anshuman Aggarwal
  2015-01-06 11:40                             ` Anshuman Aggarwal
  0 siblings, 1 reply; 28+ messages in thread
From: Anshuman Aggarwal @ 2014-12-17  6:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 17 December 2014 at 03:19, NeilBrown <neilb@suse.de> wrote:
> On Tue, 16 Dec 2014 21:55:15 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> On 2 December 2014 at 17:26, Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>> > It works! (Atleast on a sample 5 MB device with 5 x 1MB partitions :-)
>> > will find more space on my drives and do a larger test but don't see
>> > why it shouldn't work)
>> > Here are the following caveats (and questions):
>> > - Neil, like you pointed out, the power of 2 chunk size will probably
>> > need a code change (in the kernel or only in the userspace tool?)
>
> In the kernel too.

Is this something that you would consider implementing soon? Is there
a performance impact, or any other consideration, to removing this
limitation? Could you elaborate on why it was there in the first
place?

If this is a case of "patches are welcome", please guide me on where to
start looking and working, even if it's just a rough pointer to the
relevant code.

>
>> >     - Any performance or other reasons why a terabyte size chunk may
>> > not be feasible?
>
> Not that I can think of.
>
>> > - Implications of safe_mode_delay
>> >     - Would the metadata be updated on the block device be written to
>> > and the parity device as well?
>
> Probably.  Hard to give a specific answer to vague question.

I should clarify.

For example, in a 5-device RAID4, let's say a block is being written to
device 1, parity is on device 5, and devices 2, 3 and 4 are sleeping
(spun down). If we set safe_mode_delay to 0 and md updates the parity
without involving the blocks on the other three devices, just doing a
read, compute and write on device 5, will the metadata be updated on
both device 1 and device 5 even though safe_mode_delay is 0?
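
(To spell out the read-compute-write I mean, it is the usual RAID
read-modify-write: read old D1 and old P, then write new D1 and new P,
where new P = old P xor old D1 xor new D1, so only devices 1 and 5 see
any I/O.)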

>
>> >     - If the drive  fails which is the same as the drive being written
>> > to, would that lack of metadata updates to the other devices affect
>> > reconstruction?
>
> Again, to give a precise answer, a detailed question is needed.  Obviously
> any change would have to made in such a way to ensure that things which
> needed to work, did work.

Continuing from the previous example, let's say device 1 fails after a
write that only updated the metadata on devices 1 and 5 while 2, 3 and 4
were sleeping. In that case, to reconstruct the data from device 1, md
will use devices 2, 3, 4 and 5; but will it then bring the metadata on
2, 3 and 4 up to date from device 5? I hope I am making this clear.

>
>
>> > - Adding new devices (is it possible to move the parity to the disk
>> > being added? How does device addition work for RAID4 ...is it added as
>> > a zero-ed out device with parity disk remaining the same)
>
> RAID5 or RAID6 with ALGORITHM_PARITY_0 puts the parity on the early devices.
> Currently if you add a device to such an array ...... I'm not sure what it
> will do.  It should be possible to make it just write zeros out.
>

Once again, is this something that could make its way onto your
roadmap? If so, great; otherwise, could you steer me towards where in
the md kernel code and the mdadm source I should be looking to make
these changes? Thanks again.

>
> NeilBrown
>
>
>> >
>> >
>>
>> Neil, sorry to try to bump this thread. Could you please look over the
>> questions and address the points on the remaining items that can make
>> it a working solution? Thanks
>


* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
  2014-12-17  6:40                           ` Anshuman Aggarwal
@ 2015-01-06 11:40                             ` Anshuman Aggarwal
  0 siblings, 0 replies; 28+ messages in thread
From: Anshuman Aggarwal @ 2015-01-06 11:40 UTC (permalink / raw)
  To: NeilBrown; +Cc: Mdadm

On 17 December 2014 at 12:10, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 17 December 2014 at 03:19, NeilBrown <neilb@suse.de> wrote:
>> On Tue, 16 Dec 2014 21:55:15 +0530 Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>>
>>> On 2 December 2014 at 17:26, Anshuman Aggarwal
>>> <anshuman.aggarwal@gmail.com> wrote:
>>> > It works! (Atleast on a sample 5 MB device with 5 x 1MB partitions :-)
>>> > will find more space on my drives and do a larger test but don't see
>>> > why it shouldn't work)
>>> > Here are the following caveats (and questions):
>>> > - Neil, like you pointed out, the power of 2 chunk size will probably
>>> > need a code change (in the kernel or only in the userspace tool?)
>>
>> In the kernel too.
>
> Is this something that you would consider implementing soon? Is there
> a performance/other impact to any other consideration to remove this
> limitation.. could you elaborate on the reason why it was there in the
> first place?
>
> If this is a case of patches are welcome, please guide on where to
> start looking/working even if its just
>
>>
>>> >     - Any performance or other reasons why a terabyte size chunk may
>>> > not be feasible?
>>
>> Not that I can think of.
>>
>>> > - Implications of safe_mode_delay
>>> >     - Would the metadata be updated on the block device be written to
>>> > and the parity device as well?
>>
>> Probably.  Hard to give a specific answer to vague question.
>
> I should clarify.
>
> For example in a 5 device RAID4, lets say block is being written to
> device 1 and parity is on device 5 and devices 2,3,4 are sleeping
> (spun down). If we set safe_mode_delay to 0 and md decides to update
> the parity without involving the blocks on the other 3 devices and
> just updates the parity by doing a read, compute, write to device 5
> will the metadata be updated on both device 1 and 5 even though
> safe_mode_delay is 0?
>
>>
>>> >     - If the drive  fails which is the same as the drive being written
>>> > to, would that lack of metadata updates to the other devices affect
>>> > reconstruction?
>>
>> Again, to give a precise answer, a detailed question is needed.  Obviously
>> any change would have to made in such a way to ensure that things which
>> needed to work, did work.
>
> Continuing from the previous example, lets say device 1 fails after a
> write which only updated metadata on 1 and 5 while 2,3,4 were
> sleeping. In that case to access the data from 1, md will use 2,3,4,5
> but will it then update the metadata from 5 onto 2,3,4? I hope I am
> making this clear.
>
>>
>>
>>> > - Adding new devices (is it possible to move the parity to the disk
>>> > being added? How does device addition work for RAID4 ...is it added as
>>> > a zero-ed out device with parity disk remaining the same)
>>
>> RAID5 or RAID6 with ALGORITHM_PARITY_0 puts the parity on the early devices.
>> Currently if you add a device to such an array ...... I'm not sure what it
>> will do.  It should be possible to make it just write zeros out.
>>
>
> Once again, is this something that can make its way to your roadmap?
> If so, great.. otherwise could you steer me towards where in the md
> kernel and mdadm source I should be looking to make these changes.
> Thanks again.
>
>>
>> NeilBrown
>>
>>
>>> >
>>> >
>>>
>>> Neil, sorry to try to bump this thread. Could you please look over the
>>> questions and address the points on the remaining items that can make
>>> it a working solution? Thanks
>>

Hi Neil,
 Could you please find a minute to give your input on the above? Your
guidance would go a long way towards making this a reality, and it may
be useful to the community at large, given the new Seagate 8 TB archival
drives, which seem geared towards occasional use but would still
benefit from RAID-like redundancy.

Many thanks,
Anshuman


end of thread

Thread overview: 28+ messages
2014-10-29  7:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
2014-10-29  7:32 ` Roman Mamedov
2014-10-29  8:31   ` Anshuman Aggarwal
2014-10-29  9:05 ` NeilBrown
2014-10-29  9:25   ` Anshuman Aggarwal
2014-10-29 19:27     ` Ethan Wilson
2014-10-30 14:57       ` Anshuman Aggarwal
2014-10-30 17:25         ` Piergiorgio Sartor
2014-10-31 11:05           ` Anshuman Aggarwal
2014-10-31 14:25             ` Matt Garman
2014-11-01 12:55             ` Piergiorgio Sartor
2014-11-06  2:29               ` Anshuman Aggarwal
2014-10-30 15:00     ` Anshuman Aggarwal
2014-11-03  5:52       ` NeilBrown
2014-11-03 18:04         ` Piergiorgio Sartor
2014-11-06  2:24         ` Anshuman Aggarwal
2014-11-24  7:29         ` Anshuman Aggarwal
2014-11-24 22:50           ` NeilBrown
2014-11-26  6:24             ` Anshuman Aggarwal
2014-12-01 16:00               ` Anshuman Aggarwal
2014-12-01 16:34                 ` Anshuman Aggarwal
2014-12-01 21:46                   ` NeilBrown
2014-12-02 11:56                     ` Anshuman Aggarwal
2014-12-16 16:25                       ` Anshuman Aggarwal
2014-12-16 21:49                         ` NeilBrown
2014-12-17  6:40                           ` Anshuman Aggarwal
2015-01-06 11:40                             ` Anshuman Aggarwal
     [not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
2014-11-01  5:36   ` Anshuman Aggarwal
