* Split RAID: Proposal for archival RAID using incremental batch checksum
@ 2014-10-29 7:15 Anshuman Aggarwal
2014-10-29 7:32 ` Roman Mamedov
` (2 more replies)
0 siblings, 3 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-29 7:15 UTC (permalink / raw)
To: linux-raid
I'm outlining below a proposal for a RAID device mapper virtual block
device for the kernel which adds "split raid" functionality on an
incremental batch basis for a home media server/archived content which
is rarely accessed.
Given a set of N+X block devices (ideally of the same size; otherwise
the smallest device determines the usable size),
the SplitRAID device mapper device generates virtual devices which are
passthrough for the N devices and writes a batched/delayed checksum into
the X devices, so as to allow offline recovery of blocks on the N
devices in case of a single disk failure.
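The proposal does not say which checksum is meant. As a minimal sketch, assuming X = 1 and plain XOR parity (the function names and sample blocks below are made up for illustration), recovery of one lost block could look like this:

```python
def xor_blocks(blocks):
    """XOR a list of equally sized byte blocks into one block."""
    size = len(blocks[0])
    out = bytearray(size)
    for block in blocks:
        if len(block) != size:
            raise ValueError("all blocks must have the same size")
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def recover_block(surviving_blocks, parity_block):
    """Rebuild the one missing data block from the survivors plus parity."""
    return xor_blocks(list(surviving_blocks) + [parity_block])


# Assumption: three data devices (N = 3), one parity device (X = 1),
# and the same block offset read from each device.
data = [b"disk0blk", b"disk1blk", b"disk2blk"]
parity = xor_blocks(data)

# Device 1 fails; its block is rebuilt from devices 0 and 2 plus parity.
rebuilt = recover_block([data[0], data[2]], parity)
```

With X > 1 a different code (for example the Reed-Solomon style syndromes used by RAID6) would be needed; XOR alone only covers a single failure.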
Advantages over conventional RAID:
- Disks can be spun down, reducing wear and tear compared with MD RAID
levels (such as 1, 10, 5, 6) in the case of rarely accessed archival content
- Catastrophic data loss on multiple device failure is prevented, since
each block device is independent and hence, unlike MD RAID, data is only
lost incrementally (one device's worth at a time).
- Performance degradation for writes can be avoided by keeping the
checksum update asynchronous and delaying the fsync to the checksum
block device.
In the event of an improper shutdown the checksum may not reflect all the
latest data but will be mostly up to date, which is often acceptable
for home media server requirements. A flag can be set when the
checksum block device is shut down properly, indicating that a full
checksum rebuild is not required.
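To make the delayed update and the clean-shutdown flag concrete, here is a toy in-memory model. It is only a sketch under assumptions not stated in the proposal: a single XOR parity device, parity deltas (old data XOR new data) accumulated in memory, and a hypothetical `clean_flag` standing in for the on-disk flag. All names are invented for illustration.

```python
def xor_bytes(a, b):
    """XOR two equally sized byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class LazyParity:
    """Toy model: N passthrough data devices plus one lazily updated parity device."""

    def __init__(self, n_devices, n_blocks, block_size):
        zero = bytes(block_size)
        self.block_size = block_size
        self.data = [[zero] * n_blocks for _ in range(n_devices)]
        self.parity = [zero] * n_blocks
        self.pending = {}        # block number -> accumulated parity delta
        self.clean_flag = True   # stands in for the persistent "shut down properly" flag

    def write(self, dev, blk, payload):
        """Write straight through to the data device; only queue the parity change."""
        if len(payload) != self.block_size:
            raise ValueError("payload must be exactly one block")
        delta = xor_bytes(self.data[dev][blk], payload)
        self.data[dev][blk] = payload
        previous = self.pending.get(blk, bytes(self.block_size))
        self.pending[blk] = xor_bytes(previous, delta)
        self.clean_flag = False  # parity is now stale until the next flush

    def flush(self):
        """Apply all queued deltas to the parity device in one batch."""
        for blk, delta in self.pending.items():
            self.parity[blk] = xor_bytes(self.parity[blk], delta)
        self.pending.clear()
        self.clean_flag = True

    def needs_full_rebuild(self):
        """After a crash the in-memory deltas are gone, so stale parity means rebuild."""
        return not self.clean_flag


array = LazyParity(n_devices=3, n_blocks=2, block_size=4)
array.write(0, 0, b"AAAA")
array.write(2, 0, b"CCCC")
stale = array.needs_full_rebuild()  # True: a crash here would require a rebuild
array.flush()
```

Accumulating deltas means a flush touches only the parity device, so the other data devices can stay spun down; the price is that the old block contents must be read before each write.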
Existing solutions considered:
- SnapRAID (http://snapraid.sourceforge.net/), which is a snapshot-based
scheme. (Its advantages are that it's in user space and has cross-platform
support, but it has the huge disadvantage of every checksum being
done from scratch, slowing the system, causing immense wear and tear on
every snapshot, and also losing any updates made since the last
snapshot, etc.)
I'd like to get opinions on the pros and cons of this proposal from
more experienced people on the list, and pointers on the
following questions:
- Maybe this can already be done using the block devices available in
the kernel?
- If not, is device mapper the right API to use? (I think so)
- What would be the best block devices code to look at to implement?
Neil, would appreciate your weighing in on this.
Regards,
Anshuman Aggarwal
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-29 7:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
@ 2014-10-29 7:32 ` Roman Mamedov
2014-10-29 8:31 ` Anshuman Aggarwal
2014-10-29 9:05 ` NeilBrown
[not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
2 siblings, 1 reply; 44+ messages in thread
From: Roman Mamedov @ 2014-10-29 7:32 UTC (permalink / raw)
To: Anshuman Aggarwal; +Cc: linux-raid
On Wed, 29 Oct 2014 12:45:34 +0530
Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> I'm outlining below a proposal for a RAID device mapper virtual block
> device for the kernel which adds "split raid" functionality on an
> incremental batch basis for a home media server/archived content which
> is rarely accessed.
> Existing solutions considered:
Some of the already-available "home media server" setup schemes you did not
mention:
http://linuxconfig.org/prouhd-raid-for-the-end-user
a smart way of managing MD RAID given multiple devices of various sizes;
http://louwrentius.com/building-a-raid-6-array-of-mixed-drives.html
what to do with a set of mixed-size drives, in simpler terms;
https://romanrm.net/mhddfs
File-level "concatenation" of disks, with smart distribution of new files;
--
With respect,
Roman
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-29 7:32 ` Roman Mamedov
@ 2014-10-29 8:31 ` Anshuman Aggarwal
0 siblings, 0 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-29 8:31 UTC (permalink / raw)
To: Roman Mamedov; +Cc: linux-raid
Actually, I already use a combination of these solutions (MD RAID over
multiple devices, joined with LVM2). Unfortunately, none of them
addresses the following:
- Full data loss when disk failures exceed what the RAID level tolerates
(2 disks in RAID5, 3 disks in RAID6). This proposal confines the loss to
the individual disks that failed.
- Continuous reads/writes to all disks, causing wear and tear that shortens
disk life and increases end-user cost.
mhddfs (or something like it) will probably be used on top of the N
devices in this proposal to join them, but that is up to the requirements
of the user.
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-29 7:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
2014-10-29 7:32 ` Roman Mamedov
@ 2014-10-29 9:05 ` NeilBrown
2014-10-29 9:25 ` Anshuman Aggarwal
[not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
2 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2014-10-29 9:05 UTC (permalink / raw)
To: Anshuman Aggarwal; +Cc: linux-raid
Just to be sure I understand, you would have N + X devices. Each of the N
devices contains an independent filesystem and could be accessed directly if
needed. Each of the X devices contains some codes so that if at most X
devices in total died, you would still be able to recover all of the data.
If more than X devices failed, you would still get complete data from the
working devices.
Every update would only write to the particular N device on which it is
relevant, and all of the X devices. So N needs to be quite a bit bigger
than X for the spin-down to be really worth it.
Am I right so far?
For some reason the writes to X are delayed... I don't really understand
that part.
Sounds like multi-parity RAID6 with no parity rotation and
chunksize == devicesize
I wouldn't use device-mapper myself, but you are unlikely to get an entirely
impartial opinion from me on that topic.
NeilBrown
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-29 9:05 ` NeilBrown
@ 2014-10-29 9:25 ` Anshuman Aggarwal
2014-10-29 19:27 ` Ethan Wilson
2014-10-30 15:00 ` Anshuman Aggarwal
0 siblings, 2 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-29 9:25 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid
Right on most counts but please see comments below.
On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> Just to be sure I understand, you would have N + X devices. Each of the N
> devices contains an independent filesystem and could be accessed directly if
> needed. Each of the X devices contains some codes so that if at most X
> devices in total died, you would still be able to recover all of the data.
> If more than X devices failed, you would still get complete data from the
> working devices.
>
> Every update would only write to the particular N device on which it is
> relevant, and all of the X devices. So N needs to be quite a bit bigger
> than X for the spin-down to be really worth it.
>
> Am I right so far?
Perfectly right so far. I typically have an N to X ratio of 4 (4
devices to 1 data) so spin-down is totally worth it for data
protection, but more on that below.
>
> For some reason the writes to X are delayed... I don't really understand
> that part.
This delay is designed around archival devices which are
rarely read from and even more rarely written to. By delaying writes
until one of 2 criteria is met (a designated cache buffer filling up, or a
preset time since the last write expiring) we can significantly reduce the
writes to the parity device. This assumes that we are OK with losing a
movie or two in case the parity disk is not totally up to date, and are
more interested in device longevity.
>
> Sounds like multi-parity RAID6 with no parity rotation and
> chunksize == devicesize
RAID6 would present us with a joint device and currently only allows
writes to that directly, yes? Any writes will be striped.
In any case, would MD RAID allow the underlying devices to be written to
directly? Also, how would it know that a device has been written to
and hence that parity has to be updated? What about the superblock, which
the FS would not know about?
There is also the delayed checksum writing, which is significant
if one of the objectives is to reduce the amount of writes.
Can we delay that in the current RAID6 code? I understand the
objective of RAID6 is to ensure data recovery, and we are looking at a
compromise in this case.
If feasible, this could be an enhancement to MD RAID as well, where N
devices are presented instead of a single joint device in the case of
RAID6 (maybe the multi-part device can be individual disks?)
It would certainly solve my problem of where to store the metadata. I
was hoping to just store it as a configuration file to be read
by the initramfs, since in the worst case the checksum
goes out of sync and is rebuilt from scratch.
>
> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
> impartial opinion from me on that topic.
I haven't hacked around the kernel internals much so far, so I will have
to dig out that history. I would welcome any particular links/mail
threads I should look at for guidance (with both your and opposing
points of view).
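The two flush criteria mentioned in the reply above (a designated cache buffer filling up, or a preset time since the last write expiring) can be sketched as a small policy object. This is only an illustration: the class name, parameter names and thresholds are hypothetical, and a real implementation would live in the kernel rather than in Python.

```python
import time


class FlushPolicy:
    """Decide when queued parity updates should be written out."""

    def __init__(self, max_pending_bytes, max_idle_seconds, clock=time.monotonic):
        self.max_pending_bytes = max_pending_bytes
        self.max_idle_seconds = max_idle_seconds
        self.clock = clock            # injectable so the policy can be tested
        self.pending_bytes = 0
        self.last_write = None

    def record_write(self, nbytes):
        """Account for a data write whose parity update is being delayed."""
        self.pending_bytes += nbytes
        self.last_write = self.clock()

    def should_flush(self):
        """True when either criterion is met and there is something to flush."""
        if self.pending_bytes == 0:
            return False
        if self.pending_bytes >= self.max_pending_bytes:
            return True               # criterion 1: buffer is full
        # criterion 2: no write for longer than the configured idle time
        return self.clock() - self.last_write >= self.max_idle_seconds

    def mark_flushed(self):
        """Reset the counters after parity has been written out."""
        self.pending_bytes = 0
        self.last_write = None


# Usage with a fake clock so the example is deterministic.
now = [0.0]
policy = FlushPolicy(max_pending_bytes=1000, max_idle_seconds=60,
                     clock=lambda: now[0])
policy.record_write(100)
early = policy.should_flush()      # buffer not full, timer not expired
now[0] = 61.0
late = policy.should_flush()       # idle timer has expired
```

Both thresholds are the knobs a user would tune to trade parity freshness against how often the parity disks are woken up.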
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-29 9:25 ` Anshuman Aggarwal
@ 2014-10-29 19:27 ` Ethan Wilson
2014-10-30 14:57 ` Anshuman Aggarwal
2014-10-30 15:00 ` Anshuman Aggarwal
1 sibling, 1 reply; 44+ messages in thread
From: Ethan Wilson @ 2014-10-29 19:27 UTC (permalink / raw)
To: linux-raid
I am not totally sure I understand your design, but it seems to me that the
following solution could work for you:
MD raid-6, maybe multi-parity (multi-parity is not implemented in MD yet,
but just do a periodic scrub and 2 parities can be fine. Wake-up is not so
expensive that you can't scrub)
Over that you put a raid1 of 2 x 4TB disks as a bcache cache device (those
two will never spin down) in writeback mode with writeback_running=off.
This will prevent writes to the backend and leave the backend array spun down.
When bcache is almost full (poll dirty_data), switch to writeback_running=on
and writethrough: it will wake up the backend raid6 array and flush all
dirty data. You can then revert to writeback and writeback_running=off.
After this you can spin down the backend array again.
You also get read caching for free, which helps the backend array to stay
spun down as much as possible.
Maybe you can modify bcache slightly so as to implement automatic switching
between the modes as described above, instead of polling the state from
outside.
Would that work, or are you asking for something different?
EW
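The external polling loop suggested in this message could be driven by a decision function like the one below. It is a sketch only: `cache_mode`, `writeback_running` and `dirty_data` are the bcache attributes named in the message, while the function name, the 90% high-water mark and the `flushing` state variable are assumptions made for illustration. Reading and writing the actual sysfs files is deliberately left out.

```python
def next_bcache_settings(dirty_bytes, cache_bytes, flushing, high_water=0.9):
    """Return (cache_mode, writeback_running, flushing) for the next poll.

    dirty_bytes : current amount of dirty data in the cache
    cache_bytes : usable size of the cache device
    flushing    : True while a flush cycle started earlier is still draining
    """
    if not flushing and dirty_bytes >= high_water * cache_bytes:
        # Cache is almost full: wake the backing array and start draining.
        return ("writethrough", 1, True)
    if flushing and dirty_bytes > 0:
        # Keep draining until no dirty data is left.
        return ("writethrough", 1, True)
    # Idle state: absorb writes in the cache, keep the backing array asleep.
    return ("writeback", 0, False)


cache_size = 4 * 10**12  # assumption: the 4 TB cache device from the message

idle = next_bcache_settings(10**9, cache_size, flushing=False)
start = next_bcache_settings(int(0.95 * cache_size), cache_size, flushing=False)
drain = next_bcache_settings(10**9, cache_size, flushing=True)
done = next_bcache_settings(0, cache_size, flushing=True)
```

Keeping the decision separate from the sysfs I/O makes the hysteresis (start at the high-water mark, stop only when fully drained) easy to check before pointing it at a real array.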
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-29 19:27 ` Ethan Wilson
@ 2014-10-30 14:57 ` Anshuman Aggarwal
2014-10-30 17:25 ` Piergiorgio Sartor
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-30 14:57 UTC (permalink / raw)
To: Ethan Wilson; +Cc: linux-raid
What you are suggesting will work for delaying writing the checksum
(but it still makes 2 disks work non-stop, leading to failure, cost
etc).
I am proposing N independent disks which are rarely accessed. When
parity has to be written to the remaining 1, 2 ... X disks, it is
batched up (bcache is feasible) and written out once in a while
depending on how much writing is happening. N-1 disks stay spun down and
only the X disks wake up periodically to have the checksum written to them
(this would be tweaked by the user based on how up to date he needs the
parity to be, i.e. his tolerance for rebuilding parity after a crash,
versus the disk access needed for each parity write).
It can't be done using any RAID6 because RAID5/6 will stripe all the
data across the devices, making any read access wake up all the
devices. Ditto for writing to parity on every write to a single disk.
The architecture being proposed is a lazy write to manage parity for
individual disks which won't suffer from RAID catastrophic data loss
and concurrent disk.
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-30 14:57 ` Anshuman Aggarwal
@ 2014-10-30 17:25 ` Piergiorgio Sartor
2014-10-31 11:05 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: Piergiorgio Sartor @ 2014-10-30 17:25 UTC (permalink / raw)
To: Anshuman Aggarwal; +Cc: Ethan Wilson, linux-raid
On Thu, Oct 30, 2014 at 08:27:27PM +0530, Anshuman Aggarwal wrote:
> What you are suggesting will work for delaying writing the checksum
> (but still making 2 disks work non stop and lead to failure, cost
> etc).
Hi Anshuman,
I'm missing the point a bit here.
In my experience, with my storage systems, I change
disks because they're too small, long before they
are too old (long before they fail).
That's why I end up with a collection of small HDDs,
which, in turn, I recycle in some custom storage
system (using disks of different sizes, as explained
in one of the links posted before).
Honestly, the only reason to spin down the disks, still
in my experience, is to reduce power consumption.
And this can be done with a RAID-6 without problems
and in an extremely flexible way.
So, the bottom line, still in my experience, is that
what you're describing seems quite a nice situation.
Or, I did not understand what you're proposing.
Thanks,
bye,
pg
--
piergiorgio
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-30 17:25 ` Piergiorgio Sartor
@ 2014-10-31 11:05 ` Anshuman Aggarwal
2014-10-31 14:25 ` Matt Garman
2014-11-01 12:55 ` Piergiorgio Sartor
0 siblings, 2 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-31 11:05 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: Ethan Wilson, linux-raid
Hi pg,
With MD RAID striping all the writes, not only does it keep ALL disks
spinning to read/write the current content, it also leads to
catastrophic data loss in case the rebuild/disk failure exceeds the
number of parity disks.
But more importantly, I find myself setting up multiple RAID levels
(at least RAID6 and now thinking of more) just to make sure that MD
RAID will recover my data and not lose the whole cluster if an
additional disk fails beyond the number of parity disks!!! The biggest
advantage of the scheme that I have outlined is that with a single
checksum I am mostly assured of restoring a failed disk, and in the
worst case only the media (movies/music) on the failing disk are lost,
not the whole cluster.
Also, in my experience with disks and usage, what you are saying
was true a while ago, when storage capacity had not hit multiple TBs.
Now if I am buying 3-4 TB disks they are likely to last a while,
especially since the incremental % growth in sizes seems to be slowing
down.
Regards,
Anshuman
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-31 11:05 ` Anshuman Aggarwal
@ 2014-10-31 14:25 ` Matt Garman
2014-11-01 12:55 ` Piergiorgio Sartor
1 sibling, 0 replies; 44+ messages in thread
From: Matt Garman @ 2014-10-31 14:25 UTC (permalink / raw)
To: Anshuman Aggarwal; +Cc: Piergiorgio Sartor, Ethan Wilson, Mdadm

(Re-posting as I forgot to change to plaintext mode for the mailing
list, sorry for any dups.)

In a later post, you said you had a 4-to-1 scheme, but it wasn't clear
to me if that was 1 drive worth of data, and 4 drives worth of
checksum/backup, or the other way around. In your proposed scheme, I
assume you want your actual data drives to be spinning all the time?
Otherwise, when you go to read data (play music/videos), you have the
multi-second spinup delay... or is that OK with you?

Some other considerations: modern 5400 RPM drives generally consume
less than five watts in idle state[1]. Actual AC draw will be higher
due to power supply inefficiency, so we'll err on the conservative
side and say each drive requires 10 AC watts of power. My electrical
rates in Chicago are about average for the USA (11 or 12 cents/kWh),
and conveniently it roughly works out such that one always-on watt
costs about $1/year. So, each always-running hard drive will cost
about $10/year to run, less with a more efficient power supply. I know
electricity is substantially more expensive in many parts of the
world; or maybe you're running off-the-grid (e.g. solar) and have a
very small power budget?
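The rule of thumb above (one always-on watt costs roughly $1/year) can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in the message (10 W at the wall per drive, 11 to 12 cents/kWh); the function name and the 8760-hour year are illustrative assumptions, not something from the thread.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years (assumption)


def annual_cost_usd(watts: float, cents_per_kwh: float) -> float:
    """Yearly electricity cost of a load that draws `watts` continuously."""
    kwh_per_year = watts * HOURS_PER_YEAR / 1000.0
    return kwh_per_year * cents_per_kwh / 100.0


if __name__ == "__main__":
    for rate in (11.0, 12.0):
        per_watt = annual_cost_usd(1.0, rate)
        per_drive = annual_cost_usd(10.0, rate)
        print(f"{rate:.0f} c/kWh: 1 W = ${per_watt:.2f}/yr, "
              f"10 W drive = ${per_drive:.2f}/yr")
```

At 11 to 12 cents/kWh one watt works out to a little under or a little over a dollar per year, so the $10/year per drive estimate holds to within a few percent.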
On Wed, Oct 29, 2014 at 2:15 AM, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
>
> - SnapRAID (http://snapraid.sourceforge.net/) which is a snapshot
> based scheme (Its advantages are that its in user space and has cross
> platform support but has the huge disadvantage of every checksum being
> done from scratch slowing the system, causing immense wear and tear on
> every snapshot and also losing any information updates upto the
> snapshot point etc)

Last time I looked at SnapRAID, it seemed like yours was its target
use case. The "huge disadvantage of every checksum being done from
scratch" sounds like a SnapRAID feature enhancement that might be
simpler/easier/faster-to-get done than a major enhancement to the
Linux kernel (just speculating though).

But, on the other hand, by your use case description, writes are very
infrequent, and you're willing to buffer checksum updates for quite a
while... so what if you had a *monthly* cron job to do parity syncs?
Schedule it for a time when the system is unlikely to be used to
offset the increased load. That's only 12 "hard" tasks for the drive
per year. I'm not an expert, but that doesn't "feel" like a lot of
wear and tear.

On the issue of wear and tear, I've mostly given up trying to
understand what's best for my drives. One school of thought says many
spinup-spindown cycles are actually harder on the drive than running
24/7. But maybe consumer drives actually aren't designed for 24/7
operation, so they're better off being cycled up and down. Or consumer
drives can't handle the vibrations of being in a case with other 24/7
drives. But failure to "exercise" the entire drive regularly enough
might result in a situation where an error has developed but you don't
know until it's too late or your warranty period has expired.
[1] http://www.silentpcreview.com/article29-page2.html
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-31 11:05 ` Anshuman Aggarwal
2014-10-31 14:25 ` Matt Garman
@ 2014-11-01 12:55 ` Piergiorgio Sartor
2014-11-06 2:29 ` Anshuman Aggarwal
1 sibling, 1 reply; 44+ messages in thread
From: Piergiorgio Sartor @ 2014-11-01 12:55 UTC (permalink / raw)
To: Anshuman Aggarwal; +Cc: Piergiorgio Sartor, Ethan Wilson, linux-raid

On Fri, Oct 31, 2014 at 04:35:11PM +0530, Anshuman Aggarwal wrote:
> Hi pg,
> With MD raid striping all the writes not only does it keep ALL disks
> spinning to read/write the current content, it also leads to
> catastrophic data loss in case the rebuild/disk failure exceeds the
> number of parity disks.

Hi Anshuman,

yes, but do you have hard evidence that this is a common RAID-6
problem? Considering that we now have the bad block list, the write
intent bitmap and proactive replacement, a triple failure in RAID-6
does not really seem to me to be the main issue. Considering that
there are libraries available for more than 2 parities, I think the
multiple failure case is quite a rarity. Furthermore, I suspect there
are other types of catastrophic situation (lightning, for example)
that can destroy an array completely.

> But more importantly, I find myself setting up multiple RAID levels
> (at least RAID6 and now thinking of more) just to make sure that MD
> raid will recover my data and not lose the whole cluster if an
> additional disk fails above the number of parity!!! The biggest
> advantage of the scheme that I have outlined is that with a single
> check sum I am mostly assure of a failed disk restoration and worst
> case only the media (movies/music) on the failing disk are lost not on
> the whole cluster.

Will each disk have its own filesystem? If this is not the case, you
cannot say that a single disk failure will lose only some files.

> Also in my experience about disks and usage, while what you are saying
> was true a while ago when storage capacity had not hit multiple TBs.
> Now if I am buying 3-4 TB disks they are likely to last a while
> especially since the incremental % growth in sizes seem to be slowing
> down.

As I wrote above, you can safely replace disks before they fail,
without compromising the array.

bye,

pg
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-01 12:55 ` Piergiorgio Sartor
@ 2014-11-06 2:29 ` Anshuman Aggarwal
0 siblings, 0 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-06 2:29 UTC (permalink / raw)
To: Piergiorgio Sartor; +Cc: Ethan Wilson, Mdadm

On 1 November 2014 18:25, Piergiorgio Sartor
<piergiorgio.sartor@nexgo.de> wrote:
> On Fri, Oct 31, 2014 at 04:35:11PM +0530, Anshuman Aggarwal wrote:
>> Hi pg,
>> With MD raid striping all the writes not only does it keep ALL disks
>> spinning to read/write the current content, it also leads to
>> catastrophic data loss in case the rebuild/disk failure exceeds the
>> number of parity disks.
>
> Hi Anshuman,
>
> yes but do you have hard evidence that
> this is a common RAID-6 problem?
> Considering that we have now bad block list,
> write intent bitmap and proactive replacement,
> it does not seem to me really the main issue,
> having a triple fail in RAID-6.
> Considering that there are available libraries
> for more that 2 parities, I think the multiple
> failure case is quite a rarity.
> Furthermore, I suspect there are other type
> of catastrophic situation (lighting, for example)
> that can destroy an array completely.

I have most definitely lost data when a drive failed and another drive
failed during reconstruction (remember the array has been chugging
away with all drives active for 2-3 years). At this point I'm dead
scared of losing another one, which would be catastrophic. If I don't
go out and buy a replacement right away I'm on borrowed time for my
whole array. For home use this is not fun.

>> But more importantly, I find myself setting up multiple RAID levels
>> (at least RAID6 and now thinking of more) just to make sure that MD
>> raid will recover my data and not lose the whole cluster if an
>> additional disk fails above the number of parity!!! The biggest
>> advantage of the scheme that I have outlined is that with a single
>> check sum I am mostly assure of a failed disk restoration and worst
>> case only the media (movies/music) on the failing disk are lost not on
>> the whole cluster.
>
> Each disk will have its own filesystem?
> If this is not the case, you cannot say
> if a single disk failure will lose only
> some files.

Indeed, each device will be an independent block device with its own
file system. They can be joined together by some union FS if the user
so requires, but that's not in scope for this discussion.

>> Also in my experience about disks and usage, while what you are saying
>> was true a while ago when storage capacity had not hit multiple TBs.
>> Now if I am buying 3-4 TB disks they are likely to last a while
>> especially since the incremental % growth in sizes seem to be slowing
>> down.
>
> As wrote above, you can safely replace
> disks before they fail, without compromising
> the array.

Same point as above. For home use, I might be away or not have time to
give the array the TLC (tender loving care ;) it needs, which is
really the only shortcoming of MD: it's hard on the disks and has the
potential to compromise the whole array (giving super fast R/W
performance in return, for sure).

> bye,
>
> pg
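The failure behaviour argued for in this message (a single parity can rebuild one failed disk, and anything beyond that costs only the disks that actually failed) can be illustrated with a toy model. This is a minimal sketch assuming plain XOR parity over equally sized devices; the byte strings standing in for disks and the helper names are hypothetical and are not part of the proposal's code.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover(disks, parity, failed):
    """Rebuild the contents of disks[failed] from the survivors plus parity."""
    survivors = [d for i, d in enumerate(disks) if i != failed]
    return xor_blocks(survivors + [parity])


if __name__ == "__main__":
    # Four toy "disks", each holding its own independent data.
    disks = [b"movie-A!", b"movie-B!", b"music-C!", b"photo-D!"]
    parity = xor_blocks(disks)

    # Single failure: disk 2 is rebuilt exactly.
    assert recover(disks, parity, failed=2) == disks[2]

    # Double failure (disks 0 and 2) exceeds single parity. Those two are
    # lost, but disks 1 and 3 are unstriped and remain directly readable.
    still_readable = [i for i in range(len(disks)) if i not in (0, 2)]
    print("readable after double failure:", still_readable)
```

In a striped array the same double failure would leave every file with holes, which is the contrast the message is drawing.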
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-29 9:25 ` Anshuman Aggarwal
2014-10-29 19:27 ` Ethan Wilson
@ 2014-10-30 15:00 ` Anshuman Aggarwal
2014-11-03 5:52 ` NeilBrown
1 sibling, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-10-30 15:00 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid

Would chunksize==disksize work? Wouldn't that lead to the entire
parity being invalidated for any write to any of the disks (assuming
md operates at a chunk level)? Also, please see my reply below.

On 29 October 2014 14:55, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> Right on most counts but please see comments below.
>
> On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>> Just to be sure I understand, you would have N + X devices. Each of the N
>> devices contains an independent filesystem and could be accessed directly if
>> needed. Each of the X devices contains some codes so that if at most X
>> devices in total died, you would still be able to recover all of the data.
>> If more than X devices failed, you would still get complete data from the
>> working devices.
>>
>> Every update would only write to the particular N device on which it is
>> relevant, and all of the X devices. So N needs to be quite a bit bigger
>> than X for the spin-down to be really worth it.
>>
>> Am I right so far?
>
> Perfectly right so far. I typically have a N to X ratio of 4 (4
> devices to 1 data) so spin down is totally worth it for data
> protection but more on that below.
>
>>
>> For some reason the writes to X are delayed... I don't really understand
>> that part.
>
> This delay is basically designed around archival devices which are
> rarely read from and even more rarely written to. By delaying writes
> on 2 criteria ( designated cache buffer filling up or preset time
> duration from last write expiring) we can significantly reduce the
> writes on the parity device. This assumes that we are ok to lose a
> movie or two in case the parity disk is not totally up to date but are
> more interested in device longevity.
>
>>
>> Sounds like multi-parity RAID6 with no parity rotation and
>> chunksize == devicesize
> RAID6 would present us with a joint device and currently only allows
> writes to that directly, yes? Any writes will be striped.
> In any case would md raid allow the underlying device to be written to
> directly? Also how would it know that the device has been written to
> and hence parity has to be updated? What about the superblock which
> the FS would not know about?
>
> Also except for the delayed checksum writing part which would be
> significant if one of the objectives is to reduce the amount of
> writes. Can we delay that in the code currently for RAID6? I
> understand the objective of RAID6 is to ensure data recovery and we
> are looking at a compromise in this case.
>
> If feasible, this can be an enhancement to MD RAID as well where N
> devices are presented instead of a single joint device in case of
> raid6 (maybe the multi part device can be individual disks?)
>
> It will certainly solve my problem of where to store the metadata. I
> was currently hoping to just store it as a configuration file to be
> read by the initramfs since in this case worst case scenario the
> checksum goes out of sync and is rebuilt from scratch.
>
>>
>> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> impartial opinion from me on that topic.
>
> I haven't hacked around the kernel internals much so far so will have
> to dig out that history. I will welcome any particular links/mail
> threads I should look at for guidance (with both yours and opposing
> points of view)
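The two flush criteria described in the quoted reply (a designated buffer filling up, or a preset time since the last write expiring) could look roughly like the toy model below. It is a user-space sketch, not kernel or device-mapper code: the `LazyParity` class, its parameters and the in-memory "disks" are all hypothetical, and single XOR parity is assumed. Pending deltas live only in memory here, so a crash would lose them, which matches the trade-off accepted in the thread (parity is then rebuilt).

```python
import time

BLOCK = 4  # toy block size in bytes


def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


class LazyParity:
    """Data writes go straight to the data disk; parity updates are batched."""

    def __init__(self, n_disks, n_blocks, max_dirty=2, max_age=60.0,
                 clock=time.monotonic):
        self.disks = [bytearray(n_blocks * BLOCK) for _ in range(n_disks)]
        self.parity = bytearray(n_blocks * BLOCK)
        self.pending = {}           # block index -> accumulated XOR delta
        self.max_dirty = max_dirty  # criterion 1: buffer holds this many blocks
        self.max_age = max_age      # criterion 2: seconds since the last write
        self.clock = clock
        self.last_write = None
        self.parity_flushes = 0     # how often the parity disk was woken up

    def write(self, disk, block, data):
        assert len(data) == BLOCK
        lo = block * BLOCK
        old = bytes(self.disks[disk][lo:lo + BLOCK])
        self.disks[disk][lo:lo + BLOCK] = data        # data disk updated now
        delta = _xor(old, data)                       # what parity must absorb
        self.pending[block] = _xor(self.pending.get(block, bytes(BLOCK)), delta)
        self.last_write = self.clock()
        if len(self.pending) >= self.max_dirty:
            self.flush()

    def tick(self):
        """Call periodically; flushes once the last write is old enough."""
        if self.pending and self.clock() - self.last_write >= self.max_age:
            self.flush()

    def flush(self):
        for block, delta in self.pending.items():
            lo = block * BLOCK
            self.parity[lo:lo + BLOCK] = _xor(self.parity[lo:lo + BLOCK], delta)
        self.pending.clear()
        self.parity_flushes += 1

    def parity_is_current(self):
        return not self.pending


if __name__ == "__main__":
    lp = LazyParity(n_disks=4, n_blocks=8, max_dirty=3, max_age=0.0)
    lp.write(0, 0, b"abcd")
    print("parity current right after write:", lp.parity_is_current())
    lp.tick()  # max_age=0.0, so the timer criterion fires immediately
    print("parity current after tick:", lp.parity_is_current())
```

Accumulating XOR deltas per block means repeated writes to the same block cost one parity write at flush time, which is where the reduction in parity-disk activity would come from.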
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-10-30 15:00 ` Anshuman Aggarwal
@ 2014-11-03 5:52 ` NeilBrown
2014-11-03 18:04 ` Piergiorgio Sartor
` (2 more replies)
0 siblings, 3 replies; 44+ messages in thread
From: NeilBrown @ 2014-11-03 5:52 UTC (permalink / raw)
To: Anshuman Aggarwal; +Cc: linux-raid

On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> Would chunksize==disksize work? Wouldn't that lead to the entire
> parity be invalidated for any write to any of the disks (assuming md
> operates at a chunk level)...also please see my reply below

Operating at a chunk level would be a very poor design choice. md/raid5
operates in units of 1 page (4K).

>
> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> > Right on most counts but please see comments below.
> >
> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >> Just to be sure I understand, you would have N + X devices. Each of the N
> >> devices contains an independent filesystem and could be accessed directly if
> >> needed. Each of the X devices contains some codes so that if at most X
> >> devices in total died, you would still be able to recover all of the data.
> >> If more than X devices failed, you would still get complete data from the
> >> working devices.
> >>
> >> Every update would only write to the particular N device on which it is
> >> relevant, and all of the X devices. So N needs to be quite a bit bigger
> >> than X for the spin-down to be really worth it.
> >>
> >> Am I right so far?
> >
> > Perfectly right so far. I typically have a N to X ratio of 4 (4
> > devices to 1 data) so spin down is totally worth it for data
> > protection but more on that below.
> >
> >>
> >> For some reason the writes to X are delayed... I don't really understand
> >> that part.
> >
> > This delay is basically designed around archival devices which are
> > rarely read from and even more rarely written to. By delaying writes
> > on 2 criteria ( designated cache buffer filling up or preset time
> > duration from last write expiring) we can significantly reduce the
> > writes on the parity device. This assumes that we are ok to lose a
> > movie or two in case the parity disk is not totally up to date but are
> > more interested in device longevity.
> >
> >>
> >> Sounds like multi-parity RAID6 with no parity rotation and
> >> chunksize == devicesize
> > RAID6 would present us with a joint device and currently only allows
> > writes to that directly, yes? Any writes will be striped.

If the chunksize equals the device size, then you need a very large write for
it to be striped.

> > In any case would md raid allow the underlying device to be written to
> > directly? Also how would it know that the device has been written to
> > and hence parity has to be updated? What about the superblock which
> > the FS would not know about?

No, you wouldn't write to the underlying device. You would carefully
partition the RAID5 so each partition aligns exactly with an underlying
device. Then write to the partition.

> >
> > Also except for the delayed checksum writing part which would be
> > significant if one of the objectives is to reduce the amount of
> > writes. Can we delay that in the code currently for RAID6? I
> > understand the objective of RAID6 is to ensure data recovery and we
> > are looking at a compromise in this case.

"simple matter of programming"
Of course there would be a limit to how much data can be buffered in memory
before it has to be flushed out.
If you are mostly storing movies, then they are probably too large to
buffer. Why not just write them out straight away?

NeilBrown

> >
> > If feasible, this can be an enhancement to MD RAID as well where N
> > devices are presented instead of a single joint device in case of
> > raid6 (maybe the multi part device can be individual disks?)
> >
> > It will certainly solve my problem of where to store the metadata. I
> > was currently hoping to just store it as a configuration file to be
> > read by the initramfs since in this case worst case scenario the
> > checksum goes out of sync and is rebuilt from scratch.
> >
> >>
> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
> >> impartial opinion from me on that topic.
> >
> > I haven't hacked around the kernel internals much so far so will have
> > to dig out that history. I will welcome any particular links/mail
> > threads I should look at for guidance (with both yours and opposing
> > points of view)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

^ permalink raw reply [flat|nested] 44+ messages in thread
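The suggestion above (chunksize == devicesize, then partition the array so that each partition lines up with one underlying device) can be checked with a little arithmetic. The following is only a minimal sketch under stated assumptions: it models a RAID4-style layout with a dedicated parity device, where data chunk k lands on data device k mod n_data. The function names raid4_locate and devices_touched are hypothetical helpers for illustration and are not md code.

```python
def raid4_locate(logical_sector, n_data, chunk_sectors):
    """Map an array-level sector to (data_device_index, sector_on_device).

    Assumes a non-rotating parity layout: data chunks are laid out
    round-robin over the n_data data devices, parity lives elsewhere.
    """
    chunk_no, offset = divmod(logical_sector, chunk_sectors)
    stripe, dev = divmod(chunk_no, n_data)
    return dev, stripe * chunk_sectors + offset


def devices_touched(start, length, n_data, chunk_sectors):
    """Return the set of data devices covered by sectors [start, start+length)."""
    first_chunk = start // chunk_sectors
    last_chunk = (start + length - 1) // chunk_sectors
    return {chunk % n_data for chunk in range(first_chunk, last_chunk + 1)}
```

With three data devices of 1024 sectors each and a chunk of 1024 sectors, a partition covering array sectors 1024..2047 maps only to data device 1. With a 64-sector chunk the same range is spread over all three data devices, which is the striping behaviour the proposal wants to avoid.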
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-03 5:52 ` NeilBrown
@ 2014-11-03 18:04 ` Piergiorgio Sartor
2014-11-06 2:24 ` Anshuman Aggarwal
2014-11-24 7:29 ` Anshuman Aggarwal
2 siblings, 0 replies; 44+ messages in thread
From: Piergiorgio Sartor @ 2014-11-03 18:04 UTC (permalink / raw)
To: NeilBrown; +Cc: Anshuman Aggarwal, linux-raid

On Mon, Nov 03, 2014 at 04:52:17PM +1100, NeilBrown wrote:
[...]
> "simple matter of programming"
> Of course there would be a limit to how much data can be buffered in memory
> before it has to be flushed out.
> If you are mostly storing movies, then they are probably too large to
> buffer. Why not just write them out straight away?

One scenario I can envision is the following.

You have a bunch of HDDs in RAID-5/6, which are almost
always in standby (spun down).
Alongside them, you have 2 SSDs in RAID-10.

All write (and read, if possible) operations are
directed at the SSDs.
When the SSD RAID is X% full, the RAID-5/6 is activated
and the data is *moved* (maybe copied, with a proper cache
policy) there.
In case of reading (a large file), the RAID-5/6 is
activated, the file is copied to the SSD RAID, and, when
finished, the HDDs are put in standby again.

Of course, this is *not* a block device protocol,
it is a filesystem one.
It is the FS that must handle the caching, because
only the FS can know the file size, for example.

bye,

--

piergiorgio

^ permalink raw reply [flat|nested] 44+ messages in thread
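The tiering scenario described above can be sketched as a toy model. This is an illustration only, assuming file-level granularity as the message suggests. The class name TieredStore, its fields, and the spin-up counter are all hypothetical, and the sketch ignores details such as the SSD tier overflowing when a file is copied back on read.

```python
class TieredStore:
    """Toy model: writes land on an SSD tier; HDDs spin up only to flush or read."""

    def __init__(self, ssd_capacity, flush_threshold):
        self.ssd_capacity = ssd_capacity        # total SSD space, arbitrary units
        self.flush_threshold = flush_threshold  # fraction of capacity, e.g. 0.8
        self.ssd = {}                           # file name -> size
        self.hdd = {}                           # file name -> size
        self.hdd_spinups = 0

    def ssd_used(self):
        return sum(self.ssd.values())

    def write(self, name, size):
        # Every write goes to the SSD tier first.
        self.ssd[name] = size
        if self.ssd_used() >= self.flush_threshold * self.ssd_capacity:
            self._flush()

    def _flush(self):
        # One spin-up moves everything accumulated so far to the HDD array.
        self.hdd_spinups += 1
        self.hdd.update(self.ssd)
        self.ssd.clear()

    def read(self, name):
        if name in self.ssd:
            return "ssd"
        if name in self.hdd:
            # Spin up once, copy the file into the SSD tier, spin down again.
            self.hdd_spinups += 1
            self.ssd[name] = self.hdd[name]
            return "hdd"
        raise KeyError(name)
```

In this model, three 30-unit writes into a 100-unit tier with an 80% threshold cause a single spin-up on the third write, and a repeated read of the same file is served from the SSD tier after the first copy.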
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum 2014-11-03 5:52 ` NeilBrown 2014-11-03 18:04 ` Piergiorgio Sartor @ 2014-11-06 2:24 ` Anshuman Aggarwal 2014-11-24 7:29 ` Anshuman Aggarwal 2 siblings, 0 replies; 44+ messages in thread From: Anshuman Aggarwal @ 2014-11-06 2:24 UTC (permalink / raw) To: NeilBrown; +Cc: Mdadm Pls see below On 3 November 2014 11:22, NeilBrown <neilb@suse.de> wrote: > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal > <anshuman.aggarwal@gmail.com> wrote: > >> Would chunksize==disksize work? Wouldn't that lead to the entire >> parity be invalidated for any write to any of the disks (assuming md >> operates at a chunk level)...also please see my reply below > > Operating at a chunk level would be a very poor design choice. md/raid5 > operates in units of 1 page (4K). > > >> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote: >> > Right on most counts but please see comments below. >> > >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote: >> >> Just to be sure I understand, you would have N + X devices. Each of the N >> >> devices contains an independent filesystem and could be accessed directly if >> >> needed. Each of the X devices contains some codes so that if at most X >> >> devices in total died, you would still be able to recover all of the data. >> >> If more than X devices failed, you would still get complete data from the >> >> working devices. >> >> >> >> Every update would only write to the particular N device on which it is >> >> relevant, and all of the X devices. So N needs to be quite a bit bigger >> >> than X for the spin-down to be really worth it. >> >> >> >> Am I right so far? >> > >> > Perfectly right so far. I typically have a N to X ratio of 4 (4 >> > devices to 1 data) so spin down is totally worth it for data >> > protection but more on that below. >> > >> >> >> >> For some reason the writes to X are delayed... 
I don't really understand
>> >> that part.
>> >
>> > This delay is basically designed around archival devices which are
>> > rarely read from and even more rarely written to. By delaying writes
>> > on 2 criteria ( designated cache buffer filling up or preset time
>> > duration from last write expiring) we can significantly reduce the
>> > writes on the parity device. This assumes that we are ok to lose a
>> > movie or two in case the parity disk is not totally up to date but are
>> > more interested in device longevity.
>> >
>> >>
>> >> Sounds like multi-parity RAID6 with no parity rotation and
>> >> chunksize == devicesize
>> > RAID6 would present us with a joint device and currently only allows
>> > writes to that directly, yes? Any writes will be striped.
>
> If the chunksize equals the device size, then you need a very large write for
> it to be striped.
>
>> > In any case would md raid allow the underlying device to be written to
>> > directly? Also how would it know that the device has been written to
>> > and hence parity has to be updated? What about the superblock which
>> > the FS would not know about?
>
> No, you wouldn't write to the underlying device. You would carefully
> partition the RAID5 so each partition aligns exactly with an underlying
> device. Then write to the partition.

This is what I'm unclear about. Even with non-rotating parity on RAID
5/6, is it possible to create md partitions such that the writes are
effectively not striped (within each partition) and that each
partition on the md device ends up writing only to that one device?
How is this managed? My understanding is that raid5/6 will stripe any
data blocks across all the devices, making all of them spin up for each
read and write.

>
>> >
>> > Also except for the delayed checksum writing part which would be
>> > significant if one of the objectives is to reduce the amount of
>> > writes. Can we delay that in the code currently for RAID6? I
>> > understand the objective of RAID6 is to ensure data recovery and we
>> > are looking at a compromise in this case.
>
> "simple matter of programming"
> Of course there would be a limit to how much data can be buffered in memory
> before it has to be flushed out.
> If you are mostly storing movies, then they are probably too large to
> buffer. Why not just write them out straight away?

Well, yeah, if the buffer gets filled (such as by a movie) the parity
will get written pretty much right away (the main data drive gets
written to immediately anyway). The delay is to prevent parity drive
spin-ups due to small updates on any one of the drives in the array,
for example a small temp file created by some software.

>
> NeilBrown
>
>
>
>> >
>> > If feasible, this can be an enhancement to MD RAID as well where N
>> > devices are presented instead of a single joint device in case of
>> > raid6 (maybe the multi part device can be individual disks?)
>> >
>> > It will certainly solve my problem of where to store the metadata. I
>> > was currently hoping to just store it as a configuration file to be
>> > read by the initramfs since in this case worst case scenario the
>> > checksum goes out of sync and is rebuilt from scratch.
>> >
>> >>
>> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely
>> >> impartial opinion from me on that topic.
>> >
>> > I haven't hacked around the kernel internals much so far so will have
>> > to dig out that history. I will welcome any particular links/mail
>> > threads I should look at for guidance (with both yours and opposing
>> > points of view)

^ permalink raw reply [flat|nested] 44+ messages in thread
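The two flush criteria discussed in this message (the pending buffer filling up, or a preset time elapsing since the last write) can be sketched as a small policy object. This is a minimal sketch, not a kernel design: the class DelayedParityWriter and its method names are hypothetical, time is passed in explicitly as a number of seconds, and the actual parity write is reduced to a counter.

```python
class DelayedParityWriter:
    """Toy policy: batch parity updates and flush on size or idle-time criteria."""

    def __init__(self, buffer_limit, idle_timeout):
        self.buffer_limit = buffer_limit  # pending bytes that force a flush
        self.idle_timeout = idle_timeout  # seconds since the last data write
        self.pending = 0                  # bytes of parity updates not yet written
        self.last_write = None
        self.flushes = 0                  # stands in for parity-drive spin-ups

    def on_data_write(self, nbytes, now):
        # The data drive is written immediately; only parity is deferred.
        self.pending += nbytes
        self.last_write = now
        if self.pending >= self.buffer_limit:
            self.flush()

    def on_timer(self, now):
        # Called periodically; flush once the array has been idle long enough.
        if self.pending and now - self.last_write >= self.idle_timeout:
            self.flush()

    def flush(self):
        self.flushes += 1
        self.pending = 0
```

Under this model a couple of small writes within a minute of each other cost one parity flush after the idle timeout, while a single large write (a movie, in the thread's terms) exceeds the buffer limit and is flushed right away.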
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-03 5:52 ` NeilBrown
2014-11-03 18:04 ` Piergiorgio Sartor
2014-11-06 2:24 ` Anshuman Aggarwal
@ 2014-11-24 7:29 ` Anshuman Aggarwal
2014-11-24 22:50 ` NeilBrown
2 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-24 7:29 UTC (permalink / raw)
To: NeilBrown; +Cc: Mdadm

On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> Would chunksize==disksize work? Wouldn't that lead to the entire
>> parity be invalidated for any write to any of the disks (assuming md
>> operates at a chunk level)...also please see my reply below
>
> Operating at a chunk level would be a very poor design choice. md/raid5
> operates in units of 1 page (4K).

It appears that my requirement may be met by a partitionable md raid 4
array where the partitions are all on individual underlying block
devices, not striped across the block devices. Is that currently
possible with md raid? I don't see how, but such an enhancement could
do all that I had outlined earlier.

Is this possible to implement using RAID4 and MD already? Can the
partitions be made to write to individual block devices such that
parity updates don't require reading all devices?

To illustrate:
-----------------RAID - 4 ---------------------
                      |
Device 1    Device 2    Device 3    Parity
A1          B1          C1          P1
A2          B2          C2          P2
A3          B3          C3          P3

Each device gets written to independently (via a layer of block
devices), so data on Device 1 is written as contiguous blocks A1, A2,
A3, leading to updates of P1, P2, P3 (using XOR for the parity, without
causing any reads on devices 2 and 3).

In RAID4, IIUC, data gets striped and all devices become a single block device.

>
>
>>
>> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>> > Right on most counts but please see comments below.
>> > >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote: >> >> Just to be sure I understand, you would have N + X devices. Each of the N >> >> devices contains an independent filesystem and could be accessed directly if >> >> needed. Each of the X devices contains some codes so that if at most X >> >> devices in total died, you would still be able to recover all of the data. >> >> If more than X devices failed, you would still get complete data from the >> >> working devices. >> >> >> >> Every update would only write to the particular N device on which it is >> >> relevant, and all of the X devices. So N needs to be quite a bit bigger >> >> than X for the spin-down to be really worth it. >> >> >> >> Am I right so far? >> > >> > Perfectly right so far. I typically have a N to X ratio of 4 (4 >> > devices to 1 data) so spin down is totally worth it for data >> > protection but more on that below. >> > >> >> >> >> For some reason the writes to X are delayed... I don't really understand >> >> that part. >> > >> > This delay is basically designed around archival devices which are >> > rarely read from and even more rarely written to. By delaying writes >> > on 2 criteria ( designated cache buffer filling up or preset time >> > duration from last write expiring) we can significantly reduce the >> > writes on the parity device. This assumes that we are ok to lose a >> > movie or two in case the parity disk is not totally up to date but are >> > more interested in device longevity. >> > >> >> >> >> Sounds like multi-parity RAID6 with no parity rotation and >> >> chunksize == devicesize >> > RAID6 would present us with a joint device and currently only allows >> > writes to that directly, yes? Any writes will be striped. > > If the chunksize equals the device size, then you need a very large write for > it to be striped. > >> > In any case would md raid allow the underlying device to be written to >> > directly? 
Also how would it know that the device has been written to >> > and hence parity has to be updated? What about the superblock which >> > the FS would not know about? > > No, you wouldn't write to the underlying device. You would carefully > partition the RAID5 so each partition aligns exactly with an underlying > device. Then write to the partition. > >> > >> > Also except for the delayed checksum writing part which would be >> > significant if one of the objectives is to reduce the amount of >> > writes. Can we delay that in the code currently for RAID6? I >> > understand the objective of RAID6 is to ensure data recovery and we >> > are looking at a compromise in this case. > > "simple matter of programming" > Of course there would be a limit to how much data can be buffered in memory > before it has to be flushed out. > If you are mostly storing movies, then they are probably too large to > buffer. Why not just write them out straight away? > > NeilBrown > > > >> > >> > If feasible, this can be an enhancement to MD RAID as well where N >> > devices are presented instead of a single joint device in case of >> > raid6 (maybe the multi part device can be individual disks?) >> > >> > It will certainly solve my problem of where to store the metadata. I >> > was currently hoping to just store it as a configuration file to be >> > read by the initramfs since in this case worst case scenario the >> > checksum goes out of sync and is rebuilt from scratch. >> > >> >> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely >> >> impartial opinion from me on that topic. >> > >> > I haven't hacked around the kernel internals much so far so will have >> > to dig out that history. 
I will welcome any particular links/mail >> > threads I should look at for guidance (with both yours and opposing >> > points of view) >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 44+ messages in thread
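The parity update that this message asks for (rewriting A1 and updating P1 without reading B1 or C1) relies on a standard XOR identity: new parity = old parity XOR old data XOR new data. A minimal sketch follows, with blocks modelled as equal-length byte strings. The helper names xor_blocks and rmw_parity are hypothetical and the block contents are made up for illustration.

```python
def xor_blocks(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def rmw_parity(old_parity, old_data, new_data):
    """Read-modify-write parity update.

    Only the block being rewritten and the parity block are needed;
    the other data devices are never read.
    """
    return xor_blocks(xor_blocks(old_parity, old_data), new_data)


# One stripe from the illustration: A1, B1, C1 and their parity P1.
a1, b1, c1 = b"\x0f\xf0", b"\x33\x33", b"\x55\xaa"
p1 = xor_blocks(xor_blocks(a1, b1), c1)

# Rewrite A1 and update P1 using only the old A1 and the old P1.
new_a1 = b"\xff\x00"
new_p1 = rmw_parity(p1, a1, new_a1)
```

The updated parity equals what a full recomputation over the new A1, B1 and C1 would give, and XOR-ing it with B1 and C1 recovers the new A1, which is the single-disk recovery case.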
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-24 7:29 ` Anshuman Aggarwal
@ 2014-11-24 22:50 ` NeilBrown
2014-11-26 6:24 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2014-11-24 22:50 UTC (permalink / raw)
To: Anshuman Aggarwal; +Cc: Mdadm

On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> > <anshuman.aggarwal@gmail.com> wrote:
> >
> >> Would chunksize==disksize work? Wouldn't that lead to the entire
> >> parity be invalidated for any write to any of the disks (assuming md
> >> operates at a chunk level)...also please see my reply below
> >
> > Operating at a chunk level would be a very poor design choice. md/raid5
> > operates in units of 1 page (4K).
>
> It appears that my requirement may be met by a partitionable md raid 4
> array where the partitions are all on individual underlying block
> devices not striped across the block devices. Is that currently
> possible with md raid? I dont' see how but such an enhancement could
> do all that I had outlined earlier
>
> Is this possible to implement using RAID4 and MD already?

Nearly. RAID4 currently requires the chunk size to be a power of 2.
Rounding down the size of your drives to match that could waste nearly half
the space. However it should work as a proof-of-concept.

RAID0 supports non-power-of-2 chunk sizes. Doing the same thing for
RAID4/5/6 would be quite possible.

> can the
> partitions be made to write to individual block devices such that
> parity updates don't require reading all devices?

md/raid4 currently tries to minimize total IO requests when performing
an update, but prefers spreading the IO over more devices if the total number
of requests is the same.

So for a 4-drive RAID4, updating a single block can be done by:
  read old data block, read parity, write data, write parity - 4 IO requests
or
  read other 2 data blocks, write data, write parity - 4 IO requests.

In this case it will prefer the second, which is not what you want.
With 5-drive RAID4, the second option will require 5 IO requests, so the first
will be chosen.
It is quite trivial to flip this default for testing

-	if (rmw < rcw && rmw > 0) {
+	if (rmw <= rcw && rmw > 0) {

If you had 5 drives, you could experiment with no code changes.
Make the chunk size the largest power of 2 that fits in the device, and then
partition to align the partitions on those boundaries.

NeilBrown

>
> To illustrate:
> -----------------RAID - 4 ---------------------
>                       |
> Device 1    Device 2    Device 3    Parity
> A1          B1          C1          P1
> A2          B2          C2          P2
> A3          B3          C3          P3
>
> Each device gets written to independently (via a layer of block
> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
> blocks leading to updation of P1, P2 P3 (without causing any reads on
> devices 2 and 3 using XOR for the parity).
>
> In RAID4, IIUC data gets striped and all devices become a single block device.
>
>
> >
> >
> >>
> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> >> > Right on most counts but please see comments below.
> >> >
> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
> >> >> Just to be sure I understand, you would have N + X devices. Each of the N
> >> >> devices contains an independent filesystem and could be accessed directly if
> >> >> needed. Each of the X devices contains some codes so that if at most X
> >> >> devices in total died, you would still be able to recover all of the data.
> >> >> If more than X devices failed, you would still get complete data from the
> >> >> working devices.
> >> >>
> >> >> Every update would only write to the particular N device on which it is
> >> >> relevant, and all of the X devices.
So N needs to be quite a bit bigger > >> >> than X for the spin-down to be really worth it. > >> >> > >> >> Am I right so far? > >> > > >> > Perfectly right so far. I typically have a N to X ratio of 4 (4 > >> > devices to 1 data) so spin down is totally worth it for data > >> > protection but more on that below. > >> > > >> >> > >> >> For some reason the writes to X are delayed... I don't really understand > >> >> that part. > >> > > >> > This delay is basically designed around archival devices which are > >> > rarely read from and even more rarely written to. By delaying writes > >> > on 2 criteria ( designated cache buffer filling up or preset time > >> > duration from last write expiring) we can significantly reduce the > >> > writes on the parity device. This assumes that we are ok to lose a > >> > movie or two in case the parity disk is not totally up to date but are > >> > more interested in device longevity. > >> > > >> >> > >> >> Sounds like multi-parity RAID6 with no parity rotation and > >> >> chunksize == devicesize > >> > RAID6 would present us with a joint device and currently only allows > >> > writes to that directly, yes? Any writes will be striped. > > > > If the chunksize equals the device size, then you need a very large write for > > it to be striped. > > > >> > In any case would md raid allow the underlying device to be written to > >> > directly? Also how would it know that the device has been written to > >> > and hence parity has to be updated? What about the superblock which > >> > the FS would not know about? > > > > No, you wouldn't write to the underlying device. You would carefully > > partition the RAID5 so each partition aligns exactly with an underlying > > device. Then write to the partition. > > > >> > > >> > Also except for the delayed checksum writing part which would be > >> > significant if one of the objectives is to reduce the amount of > >> > writes. Can we delay that in the code currently for RAID6? 
I > >> > understand the objective of RAID6 is to ensure data recovery and we > >> > are looking at a compromise in this case. > > > > "simple matter of programming" > > Of course there would be a limit to how much data can be buffered in memory > > before it has to be flushed out. > > If you are mostly storing movies, then they are probably too large to > > buffer. Why not just write them out straight away? > > > > NeilBrown > > > > > > > >> > > >> > If feasible, this can be an enhancement to MD RAID as well where N > >> > devices are presented instead of a single joint device in case of > >> > raid6 (maybe the multi part device can be individual disks?) > >> > > >> > It will certainly solve my problem of where to store the metadata. I > >> > was currently hoping to just store it as a configuration file to be > >> > read by the initramfs since in this case worst case scenario the > >> > checksum goes out of sync and is rebuilt from scratch. > >> > > >> >> > >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely > >> >> impartial opinion from me on that topic. > >> > > >> > I haven't hacked around the kernel internals much so far so will have > >> > to dig out that history. I will welcome any particular links/mail > >> > threads I should look at for guidance (with both yours and opposing > >> > points of view) ^ permalink raw reply [flat|nested] 44+ messages in thread
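The IO counting in this message can be reproduced with a small model. It follows the arithmetic as stated in the message (total IO requests for a single-block update on an n-drive RAID4) and is not a copy of the kernel's own accounting, which may count things differently. The function names and the prefer_rmw_on_tie flag are hypothetical. The flag stands in for the `<` versus `<=` change shown in the diff.

```python
def rmw_ios(n_drives):
    """Read-modify-write: read old data, read old parity, write data, write parity."""
    return 4


def rcw_ios(n_drives):
    """Reconstruct-write: read every other data block, then write data and parity."""
    other_data_drives = n_drives - 2  # all drives minus parity minus the target
    return other_data_drives + 2


def chosen_strategy(n_drives, prefer_rmw_on_tie=False):
    """Pick the cheaper strategy; ties go to rcw unless the flag is set."""
    rmw, rcw = rmw_ios(n_drives), rcw_ios(n_drives)
    if rmw < rcw or (prefer_rmw_on_tie and rmw == rcw):
        return "rmw"
    return "rcw"
```

For 4 drives both strategies cost 4 requests, so the model picks reconstruct-write by default and read-modify-write with the flag set. For 5 drives reconstruct-write costs 5 requests and read-modify-write is chosen either way, matching the remark that 5 drives would need no code changes.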
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum 2014-11-24 22:50 ` NeilBrown @ 2014-11-26 6:24 ` Anshuman Aggarwal 2014-12-01 16:00 ` Anshuman Aggarwal 0 siblings, 1 reply; 44+ messages in thread From: Anshuman Aggarwal @ 2014-11-26 6:24 UTC (permalink / raw) To: NeilBrown; +Cc: Mdadm On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote: > On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal > <anshuman.aggarwal@gmail.com> wrote: > >> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote: >> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal >> > <anshuman.aggarwal@gmail.com> wrote: >> > >> >> Would chunksize==disksize work? Wouldn't that lead to the entire >> >> parity be invalidated for any write to any of the disks (assuming md >> >> operates at a chunk level)...also please see my reply below >> > >> > Operating at a chunk level would be a very poor design choice. md/raid5 >> > operates in units of 1 page (4K). >> >> It appears that my requirement may be met by a partitionable md raid 4 >> array where the partitions are all on individual underlying block >> devices not striped across the block devices. Is that currently >> possible with md raid? I dont' see how but such an enhancement could >> do all that I had outlined earlier >> >> Is this possible to implement using RAID4 and MD already? > > Nearly. RAID4 currently requires the chunk size to be a power of 2. > Rounding down the size of your drives to match that could waste nearly half > the space. However it should work as a proof-of-concept. > > RAID0 supports non-power-of-2 chunk sizes. Doing the same thing for > RAID4/5/6 would be quite possible. > >> can the >> partitions be made to write to individual block devices such that >> parity updates don't require reading all devices? 
>
> md/raid4 will currently tries to minimize total IO requests when performing
> an update, but prefer spreading the IO over more devices if the total number
> of requests is the same.
>
> So for a 4-drive RAID4, Updating a single block can be done by:
> read old data block, read parity, write data, write parity - 4 IO requests
> or
> read other 2 data blocks, write data, write parity - 4 IO requests.
>
> In this case it will prefer the second, which is not what you want.
> With 5-drive RAID4, the second option will require 5 IO requests, so the first
> will be chosen.
> It is quite trivial to flip this default for testing
>
> - if (rmw < rcw && rmw > 0) {
> + if (rmw <= rcw && rmw > 0) {
>
>
> If you had 5 drives, you could experiment with no code changes.
> Make the chunk size the largest power of 2 that fits in the device, and then
> partition to align the partitions on those boundaries.

If the chunk size is almost the same as the device size, I assume the
entire chunk is not invalidated for parity on writing to a single
block? I.e., if only 1 block is updated, only that block's parity will
be read and written, not the parity for the whole chunk? If that's the
case, what purpose does a chunk serve in md raid? If that's not the
case, it wouldn't work, because a single-block update would lead to
parity being written for the entire chunk, which is the size of the
device.

I do have more than 5 drives, though they are currently in use. I will
create a small test partition of the same size on each device and run
the test on those after ensuring that the drives do go to sleep.
> > NeilBrown > Thanks, Anshuman > >> >> To illustrate: >> -----------------RAID - 4 --------------------- >> | >> Device 1 Device 2 Device 3 Parity >> A1 B1 C1 P1 >> A2 B2 C2 P2 >> A3 B3 C3 P3 >> >> Each device gets written to independently (via a layer of block >> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous >> blocks leading to updation of P1, P2 P3 (without causing any reads on >> devices 2 and 3 using XOR for the parity). >> >> In RAID4, IIUC data gets striped and all devices become a single block device. >> >> >> > >> > >> >> >> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote: >> >> > Right on most counts but please see comments below. >> >> > >> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote: >> >> >> Just to be sure I understand, you would have N + X devices. Each of the N >> >> >> devices contains an independent filesystem and could be accessed directly if >> >> >> needed. Each of the X devices contains some codes so that if at most X >> >> >> devices in total died, you would still be able to recover all of the data. >> >> >> If more than X devices failed, you would still get complete data from the >> >> >> working devices. >> >> >> >> >> >> Every update would only write to the particular N device on which it is >> >> >> relevant, and all of the X devices. So N needs to be quite a bit bigger >> >> >> than X for the spin-down to be really worth it. >> >> >> >> >> >> Am I right so far? >> >> > >> >> > Perfectly right so far. I typically have a N to X ratio of 4 (4 >> >> > devices to 1 data) so spin down is totally worth it for data >> >> > protection but more on that below. >> >> > >> >> >> >> >> >> For some reason the writes to X are delayed... I don't really understand >> >> >> that part. >> >> > >> >> > This delay is basically designed around archival devices which are >> >> > rarely read from and even more rarely written to. 
By delaying writes >> >> > on 2 criteria ( designated cache buffer filling up or preset time >> >> > duration from last write expiring) we can significantly reduce the >> >> > writes on the parity device. This assumes that we are ok to lose a >> >> > movie or two in case the parity disk is not totally up to date but are >> >> > more interested in device longevity. >> >> > >> >> >> >> >> >> Sounds like multi-parity RAID6 with no parity rotation and >> >> >> chunksize == devicesize >> >> > RAID6 would present us with a joint device and currently only allows >> >> > writes to that directly, yes? Any writes will be striped. >> > >> > If the chunksize equals the device size, then you need a very large write for >> > it to be striped. >> > >> >> > In any case would md raid allow the underlying device to be written to >> >> > directly? Also how would it know that the device has been written to >> >> > and hence parity has to be updated? What about the superblock which >> >> > the FS would not know about? >> > >> > No, you wouldn't write to the underlying device. You would carefully >> > partition the RAID5 so each partition aligns exactly with an underlying >> > device. Then write to the partition. >> > >> >> > >> >> > Also except for the delayed checksum writing part which would be >> >> > significant if one of the objectives is to reduce the amount of >> >> > writes. Can we delay that in the code currently for RAID6? I >> >> > understand the objective of RAID6 is to ensure data recovery and we >> >> > are looking at a compromise in this case. >> > >> > "simple matter of programming" >> > Of course there would be a limit to how much data can be buffered in memory >> > before it has to be flushed out. >> > If you are mostly storing movies, then they are probably too large to >> > buffer. Why not just write them out straight away? 
>> > >> > NeilBrown >> > >> > >> > >> >> > >> >> > If feasible, this can be an enhancement to MD RAID as well where N >> >> > devices are presented instead of a single joint device in case of >> >> > raid6 (maybe the multi part device can be individual disks?) >> >> > >> >> > It will certainly solve my problem of where to store the metadata. I >> >> > was currently hoping to just store it as a configuration file to be >> >> > read by the initramfs since in this case worst case scenario the >> >> > checksum goes out of sync and is rebuilt from scratch. >> >> > >> >> >> >> >> >> I wouldn't use device-mapper myself, but you are unlikely to get an entirely >> >> >> impartial opinion from me on that topic. >> >> > >> >> > I haven't hacked around the kernel internals much so far so will have >> >> > to dig out that history. I will welcome any particular links/mail >> >> > threads I should look at for guidance (with both yours and opposing >> >> > points of view) >> >> -- >> >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> >> the body of a message to majordomo@vger.kernel.org >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-raid" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 44+ messages in thread
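The "To illustrate" diagram quoted above relies on XOR parity being updatable from the old data block and the old parity alone, without reading Devices 2 and 3. A minimal sketch of that property follows; the helper names `xor_blocks` and `update_parity` and the two-byte sample blocks are illustrative assumptions, not md code.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length byte blocks together."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Update parity using only the written device's old/new data and the old parity."""
    return xor_blocks(old_parity, old_data, new_data)


# One stripe of the diagram: A1, B1, C1 on three data devices, P1 on the parity device.
a1, b1, c1 = b"\x11\x22", b"\x0f\xf0", b"\xaa\x55"
p1 = xor_blocks(a1, b1, c1)

# Overwrite A1 without touching Device 2 or Device 3.
new_a1 = b"\xde\xad"
new_p1 = update_parity(p1, a1, new_a1)

# The shortcut gives the same parity as recomputing from every data device.
assert new_p1 == xor_blocks(new_a1, b1, c1)

# If Device 1 is lost, its block can be rebuilt from the survivors plus parity.
assert xor_blocks(b1, c1, new_p1) == new_a1
```

This only shows the arithmetic; whether md actually takes this path for a given write depends on the choice NeilBrown describes later in the thread.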
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-26 6:24 ` Anshuman Aggarwal
@ 2014-12-01 16:00 ` Anshuman Aggarwal
2014-12-01 16:34 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-12-01 16:00 UTC (permalink / raw)
To: NeilBrown; +Cc: Mdadm

On 26 November 2014 at 11:54, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>>
>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>>> > <anshuman.aggarwal@gmail.com> wrote:
>>> >
>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>>> >> parity be invalidated for any write to any of the disks (assuming md
>>> >> operates at a chunk level)...also please see my reply below
>>> >
>>> > Operating at a chunk level would be a very poor design choice. md/raid5
>>> > operates in units of 1 page (4K).
>>>
>>> It appears that my requirement may be met by a partitionable md raid 4
>>> array where the partitions are all on individual underlying block
>>> devices not striped across the block devices. Is that currently
>>> possible with md raid? I don't see how but such an enhancement could
>>> do all that I had outlined earlier
>>>
>>> Is this possible to implement using RAID4 and MD already?
>>
>> Nearly. RAID4 currently requires the chunk size to be a power of 2.
>> Rounding down the size of your drives to match that could waste nearly half
>> the space. However it should work as a proof-of-concept.
>>
>> RAID0 supports non-power-of-2 chunk sizes. Doing the same thing for
>> RAID4/5/6 would be quite possible.
>>
>>> can the
>>> partitions be made to write to individual block devices such that
>>> parity updates don't require reading all devices?
>>
>> md/raid4 currently tries to minimize total IO requests when performing
>> an update, but prefers spreading the IO over more devices if the total number
>> of requests is the same.
>>
>> So for a 4-drive RAID4, updating a single block can be done by:
>> read old data block, read parity, write data, write parity - 4 IO requests
>> or
>> read other 2 data blocks, write data, write parity - 4 IO requests.
>>
>> In this case it will prefer the second, which is not what you want.
>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>> will be chosen.
>> It is quite trivial to flip this default for testing
>>
>> -       if (rmw < rcw && rmw > 0) {
>> +       if (rmw <= rcw && rmw > 0) {
>>
>>
>> If you had 5 drives, you could experiment with no code changes.
>> Make the chunk size the largest power of 2 that fits in the device, and then
>> partition to align the partitions on those boundaries.
>
> If the chunk size is almost the same as the device size, I assume the
> entire chunk is not invalidated for parity on writing to a single
> block? i.e. if only 1 block is updated only that block's parity will be
> read and written and not for the whole chunk? If that's the case, what
> purpose does a chunk serve in md raid? If that's not the case, it
> wouldn't work because a single block updation would lead to parity
> being written for the entire chunk, which is the size of the device
>
> I do have more than 5 drives though they are in use currently. I will
> create a small testing partition on each device of the same size and
> run the test on that after ensuring that the drives do go to sleep.
>
>>
>> NeilBrown
>>

Wouldn't the metadata writes wake up all the disks in the cluster
anyways (defeating the purpose)? This idea will require metadata to
not be written out to each device (is that even possible or on the
cards?)

I am about to try out your suggestion with the chunk sizes anyways but
thought about the metadata being a major stumbling block.
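NeilBrown's request counts quoted in this message can be checked with a little arithmetic. The sketch below is a simplified model of his description, counting total I/O requests for a single-block update on an n-drive RAID4; it is not the kernel's actual accounting, and the names `io_requests`, `chosen_path` and `prefer_rmw_on_tie` are assumptions made for illustration. `prefer_rmw_on_tie=True` stands in for the quoted `<` to `<=` change.

```python
def io_requests(n_drives: int) -> tuple:
    """Return (rmw, rcw) total I/O requests for a single-block update on RAID4."""
    data_drives = n_drives - 1        # one drive holds parity
    rmw = 2 + 2                       # read old data + old parity, write data + parity
    rcw = (data_drives - 1) + 2       # read the other data blocks, write data + parity
    return rmw, rcw


def chosen_path(n_drives: int, prefer_rmw_on_tie: bool = False) -> str:
    """Pick read-modify-write or reconstruct-write, as described in the thread."""
    rmw, rcw = io_requests(n_drives)
    if rmw < rcw or (prefer_rmw_on_tie and rmw == rcw):
        return "rmw"
    return "rcw"


for n in (4, 5, 6):
    print(n, io_requests(n), chosen_path(n), chosen_path(n, prefer_rmw_on_tie=True))
```

In this model the tie only occurs at 4 drives, which matches the remark that 5 drives need no code change: from 5 drives up, read-modify-write is already the cheaper option, so only the written device and the parity device are touched.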
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-12-01 16:00 ` Anshuman Aggarwal
@ 2014-12-01 16:34 ` Anshuman Aggarwal
2014-12-01 21:46 ` NeilBrown
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-12-01 16:34 UTC (permalink / raw)
To: NeilBrown; +Cc: Mdadm

On 1 December 2014 at 21:30, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 26 November 2014 at 11:54, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
>>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
>>> <anshuman.aggarwal@gmail.com> wrote:
>>>
>>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
>>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
>>>> > <anshuman.aggarwal@gmail.com> wrote:
>>>> >
>>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
>>>> >> parity be invalidated for any write to any of the disks (assuming md
>>>> >> operates at a chunk level)...also please see my reply below
>>>> >
>>>> > Operating at a chunk level would be a very poor design choice. md/raid5
>>>> > operates in units of 1 page (4K).
>>>>
>>>> It appears that my requirement may be met by a partitionable md raid 4
>>>> array where the partitions are all on individual underlying block
>>>> devices not striped across the block devices. Is that currently
>>>> possible with md raid? I don't see how but such an enhancement could
>>>> do all that I had outlined earlier
>>>>
>>>> Is this possible to implement using RAID4 and MD already?
>>>
>>> Nearly. RAID4 currently requires the chunk size to be a power of 2.
>>> Rounding down the size of your drives to match that could waste nearly half
>>> the space. However it should work as a proof-of-concept.
>>>
>>> RAID0 supports non-power-of-2 chunk sizes. Doing the same thing for
>>> RAID4/5/6 would be quite possible.
>>>
>>>> can the
>>>> partitions be made to write to individual block devices such that
>>>> parity updates don't require reading all devices?
>>>
>>> md/raid4 currently tries to minimize total IO requests when performing
>>> an update, but prefers spreading the IO over more devices if the total number
>>> of requests is the same.
>>>
>>> So for a 4-drive RAID4, updating a single block can be done by:
>>> read old data block, read parity, write data, write parity - 4 IO requests
>>> or
>>> read other 2 data blocks, write data, write parity - 4 IO requests.
>>>
>>> In this case it will prefer the second, which is not what you want.
>>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>>> will be chosen.
>>> It is quite trivial to flip this default for testing
>>>
>>> -       if (rmw < rcw && rmw > 0) {
>>> +       if (rmw <= rcw && rmw > 0) {
>>>
>>>
>>> If you had 5 drives, you could experiment with no code changes.
>>> Make the chunk size the largest power of 2 that fits in the device, and then
>>> partition to align the partitions on those boundaries.
>>
>> If the chunk size is almost the same as the device size, I assume the
>> entire chunk is not invalidated for parity on writing to a single
>> block? i.e. if only 1 block is updated only that block's parity will be
>> read and written and not for the whole chunk? If that's the case, what
>> purpose does a chunk serve in md raid? If that's not the case, it
>> wouldn't work because a single block updation would lead to parity
>> being written for the entire chunk, which is the size of the device
>>
>> I do have more than 5 drives though they are in use currently. I will
>> create a small testing partition on each device of the same size and
>> run the test on that after ensuring that the drives do go to sleep.
>>
>>>
>>> NeilBrown
>>>
>
> Wouldn't the metadata writes wake up all the disks in the cluster
> anyways (defeating the purpose)? This idea will require metadata to
> not be written out to each device (is that even possible or on the
> cards?)
>
> I am about to try out your suggestion with the chunk sizes anyways but
> thought about the metadata being a major stumbling block.
>

And it seems to be confirmed that the metadata write is waking up the
other drives. On any write to a particular drive the metadata update
is accessing all the others.

Am I correct in assuming that all metadata is currently written as
part of the block device itself and that the external metadata is
still embedded in each of the block devices (only the format of the
metadata is defined externally?) I guess to implement this we would
need to store metadata elsewhere, which may be major development
work. Still that may be a flexibility desired in md raid for other
reasons...

Neil, your thoughts.

>>
>> Thanks,
>> Anshuman
>>>
>>>>
>>>> To illustrate:
>>>> -----------------RAID - 4 ---------------------
>>>>                       |
>>>> Device 1    Device 2    Device 3    Parity
>>>> A1          B1          C1          P1
>>>> A2          B2          C2          P2
>>>> A3          B3          C3          P3
>>>>
>>>> Each device gets written to independently (via a layer of block
>>>> devices)...so Data on Device 1 is written as A1, A2, A3 contiguous
>>>> blocks leading to updation of P1, P2 P3 (without causing any reads on
>>>> devices 2 and 3 using XOR for the parity).
>>>>
>>>> In RAID4, IIUC data gets striped and all devices become a single block device.
>>>>
>>>>
>>>> >
>>>> >
>>>> >>
>>>> >> On 29 October 2014 14:55, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>>> >> > Right on most counts but please see comments below.
>>>> >> >
>>>> >> > On 29 October 2014 14:35, NeilBrown <neilb@suse.de> wrote:
>>>> >> >> Just to be sure I understand, you would have N + X devices. Each of the N
>>>> >> >> devices contains an independent filesystem and could be accessed directly if
>>>> >> >> needed.
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-12-01 16:34 ` Anshuman Aggarwal
@ 2014-12-01 21:46 ` NeilBrown
2014-12-02 11:56 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2014-12-01 21:46 UTC (permalink / raw)
To: Anshuman Aggarwal; +Cc: Mdadm

On Mon, 1 Dec 2014 22:04:42 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 1 December 2014 at 21:30, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
> > On 26 November 2014 at 11:54, Anshuman Aggarwal
> > <anshuman.aggarwal@gmail.com> wrote:
> >> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote:
> >>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal
> >>> <anshuman.aggarwal@gmail.com> wrote:
> >>>
> >>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote:
> >>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal
> >>>> > <anshuman.aggarwal@gmail.com> wrote:
> >>>> >
> >>>> >> Would chunksize==disksize work? Wouldn't that lead to the entire
> >>>> >> parity be invalidated for any write to any of the disks (assuming md
> >>>> >> operates at a chunk level)...also please see my reply below
> >>>> >
> >>>> > Operating at a chunk level would be a very poor design choice. md/raid5
> >>>> > operates in units of 1 page (4K).
> >>>>
> >>>> It appears that my requirement may be met by a partitionable md raid 4
> >>>> array where the partitions are all on individual underlying block
> >>>> devices not striped across the block devices. Is that currently
> >>>> possible with md raid? I don't see how but such an enhancement could
> >>>> do all that I had outlined earlier
> >>>>
> >>>> Is this possible to implement using RAID4 and MD already?
> >>>
> >>> Nearly. RAID4 currently requires the chunk size to be a power of 2.
> >>> Rounding down the size of your drives to match that could waste nearly half
> >>> the space. However it should work as a proof-of-concept.
> >>>
> >>> RAID0 supports non-power-of-2 chunk sizes. Doing the same thing for
> >>> RAID4/5/6 would be quite possible.
> >>>
> >>>> can the
> >>>> partitions be made to write to individual block devices such that
> >>>> parity updates don't require reading all devices?
> >>>
> >>> md/raid4 currently tries to minimize total IO requests when performing
> >>> an update, but prefers spreading the IO over more devices if the total number
> >>> of requests is the same.
> >>>
> >>> So for a 4-drive RAID4, updating a single block can be done by:
> >>> read old data block, read parity, write data, write parity - 4 IO requests
> >>> or
> >>> read other 2 data blocks, write data, write parity - 4 IO requests.
> >>>
> >>> In this case it will prefer the second, which is not what you want.
> >>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
> >>> will be chosen.
> >>> It is quite trivial to flip this default for testing
> >>>
> >>> -       if (rmw < rcw && rmw > 0) {
> >>> +       if (rmw <= rcw && rmw > 0) {
> >>>
> >>>
> >>> If you had 5 drives, you could experiment with no code changes.
> >>> Make the chunk size the largest power of 2 that fits in the device, and then
> >>> partition to align the partitions on those boundaries.
> >>
> >> If the chunk size is almost the same as the device size, I assume the
> >> entire chunk is not invalidated for parity on writing to a single
> >> block? i.e. if only 1 block is updated only that block's parity will be
> >> read and written and not for the whole chunk? If that's the case, what
> >> purpose does a chunk serve in md raid? If that's not the case, it
> >> wouldn't work because a single block updation would lead to parity
> >> being written for the entire chunk, which is the size of the device
> >>
> >> I do have more than 5 drives though they are in use currently. I will
> >> create a small testing partition on each device of the same size and
> >> run the test on that after ensuring that the drives do go to sleep.
> >>
> >>>
> >>> NeilBrown
> >>>
> >
> > Wouldn't the metadata writes wake up all the disks in the cluster
> > anyways (defeating the purpose)? This idea will require metadata to
> > not be written out to each device (is that even possible or on the
> > cards?)
> >
> > I am about to try out your suggestion with the chunk sizes anyways but
> > thought about the metadata being a major stumbling block.
> >
>
> And it seems to be confirmed that the metadata write is waking up the
> other drives. On any write to a particular drive the metadata update
> is accessing all the others.
>
> Am I correct in assuming that all metadata is currently written as
> part of the block device itself and that the external metadata is
> still embedded in each of the block devices (only the format of the
> metadata is defined externally?) I guess to implement this we would
> need to store metadata elsewhere which may be a major development
> work. Still that may be a flexibility desired in md raid for other
> reasons...
>
> Neil, your thoughts.

This is exactly why I suggested testing with existing code and seeing how far
you can get. Thanks.

For a full solution we probably do need some code changes here, but for
further testing you could:
1/ make sure there is no bitmap (mdadm --grow --bitmap=none)
2/ set the safe_mode_delay to 0
     echo 0 > /sys/block/mdXXX/md/safe_mode_delay

Then it won't try to update the metadata until you stop the array, or a
device fails.

Longer term: it would probably be good to only update the bitmap on the
devices that are being written to - and to merge all bitmaps when assembling
the array. Also when there is a bitmap, the safe_mode functionality should
probably be disabled.

NeilBrown
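The two test steps NeilBrown lists can be wrapped in a small helper for repeated experiments. This is only a sketch: the function name `quiet_md_metadata`, the `/dev/<name>` device path and the dry-run default are assumptions made here, while the `mdadm --grow ... --bitmap=none` option and the `safe_mode_delay` sysfs file are the ones named in the message. With `dry_run=True` nothing is executed; the function just returns the equivalent shell steps.

```python
import subprocess
from pathlib import Path


def quiet_md_metadata(md_name: str, dry_run: bool = True) -> list:
    """Describe (and optionally apply) the two steps for array `md_name`, e.g. "md0"."""
    # Step 1: remove the write-intent bitmap (device path is an assumed naming scheme).
    bitmap_cmd = ["mdadm", "--grow", f"/dev/{md_name}", "--bitmap=none"]
    # Step 2: write 0 to the array's safe_mode_delay attribute in sysfs.
    delay_path = Path("/sys/block") / md_name / "md" / "safe_mode_delay"

    steps = [" ".join(bitmap_cmd), f"echo 0 > {delay_path}"]
    if not dry_run:
        # Needs root and an assembled array; raises if mdadm fails.
        subprocess.run(bitmap_cmd, check=True)
        delay_path.write_text("0\n")
    return steps


for step in quiet_md_metadata("md0"):
    print(step)
```

Running it with `dry_run=False` should only be done on a scratch array, since dropping the bitmap lengthens any resync after an unclean shutdown.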
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum 2014-12-01 21:46 ` NeilBrown @ 2014-12-02 11:56 ` Anshuman Aggarwal 2014-12-16 16:25 ` Anshuman Aggarwal 0 siblings, 1 reply; 44+ messages in thread From: Anshuman Aggarwal @ 2014-12-02 11:56 UTC (permalink / raw) To: NeilBrown; +Cc: Mdadm It works! (Atleast on a sample 5 MB device with 5 x 1MB partitions :-) will find more space on my drives and do a larger test but don't see why it shouldn't work) Here are the following caveats (and questions): - Neil, like you pointed out, the power of 2 chunk size will probably need a code change (in the kernel or only in the userspace tool?) - Any performance or other reasons why a terabyte size chunk may not be feasible? - Implications of safe_mode_delay - Would the metadata be updated on the block device be written to and the parity device as well? - If the drive fails which is the same as the drive being written to, would that lack of metadata updates to the other devices affect reconstruction? - Adding new devices (is it possible to move the parity to the disk being added? How does device addition work for RAID4 ...is it added as a zero-ed out device with parity disk remaining the same) On 2 December 2014 at 03:16, NeilBrown <neilb@suse.de> wrote: > On Mon, 1 Dec 2014 22:04:42 +0530 Anshuman Aggarwal > <anshuman.aggarwal@gmail.com> wrote: > >> On 1 December 2014 at 21:30, Anshuman Aggarwal >> <anshuman.aggarwal@gmail.com> wrote: >> > On 26 November 2014 at 11:54, Anshuman Aggarwal >> > <anshuman.aggarwal@gmail.com> wrote: >> >> On 25 November 2014 at 04:20, NeilBrown <neilb@suse.de> wrote: >> >>> On Mon, 24 Nov 2014 12:59:47 +0530 Anshuman Aggarwal >> >>> <anshuman.aggarwal@gmail.com> wrote: >> >>> >> >>>> On 3 November 2014 at 11:22, NeilBrown <neilb@suse.de> wrote: >> >>>> > On Thu, 30 Oct 2014 20:30:40 +0530 Anshuman Aggarwal >> >>>> > <anshuman.aggarwal@gmail.com> wrote: >> >>>> > >> >>>> >> Would chunksize==disksize work? 
Wouldn't that lead to the entire
>> >>>> >> parity being invalidated for any write to any of the disks (assuming md
>> >>>> >> operates at a chunk level)...also please see my reply below
>> >>>> >
>> >>>> > Operating at a chunk level would be a very poor design choice. md/raid5
>> >>>> > operates in units of 1 page (4K).
>> >>>>
>> >>>> It appears that my requirement may be met by a partitionable md raid 4
>> >>>> array where the partitions are all on individual underlying block
>> >>>> devices not striped across the block devices. Is that currently
>> >>>> possible with md raid? I don't see how but such an enhancement could
>> >>>> do all that I had outlined earlier
>> >>>>
>> >>>> Is this possible to implement using RAID4 and MD already?
>> >>>
>> >>> Nearly. RAID4 currently requires the chunk size to be a power of 2.
>> >>> Rounding down the size of your drives to match that could waste nearly half
>> >>> the space. However it should work as a proof-of-concept.
>> >>>
>> >>> RAID0 supports non-power-of-2 chunk sizes. Doing the same thing for
>> >>> RAID4/5/6 would be quite possible.
>> >>>
>> >>>> can the
>> >>>> partitions be made to write to individual block devices such that
>> >>>> parity updates don't require reading all devices?
>> >>>
>> >>> md/raid4 currently tries to minimize total IO requests when performing
>> >>> an update, but prefers spreading the IO over more devices if the total number
>> >>> of requests is the same.
>> >>>
>> >>> So for a 4-drive RAID4, updating a single block can be done by:
>> >>> read old data block, read parity, write data, write parity - 4 IO requests
>> >>> or
>> >>> read other 2 data blocks, write data, write parity - 4 IO requests.
>> >>>
>> >>> In this case it will prefer the second, which is not what you want.
>> >>> With 5-drive RAID4, the second option will require 5 IO requests, so the first
>> >>> will be chosen.
>> >>> It is quite trivial to flip this default for testing
>> >>>
>> >>> -       if (rmw < rcw && rmw > 0) {
>> >>> +       if (rmw <= rcw && rmw > 0) {
>> >>>
>> >>>
>> >>> If you had 5 drives, you could experiment with no code changes.
>> >>> Make the chunk size the largest power of 2 that fits in the device, and then
>> >>> partition to align the partitions on those boundaries.
>> >>
>> >> If the chunk size is almost the same as the device size, I assume the
>> >> entire chunk is not invalidated for parity on writing to a single
>> >> block? i.e. if only 1 block is updated only that block's parity will be
>> >> read and written and not for the whole chunk? If that's the case, what
>> >> purpose does a chunk serve in md raid? If that's not the case, it
>> >> wouldn't work because a single block update would lead to parity
>> >> being written for the entire chunk, which is the size of the device
>> >>
>> >> I do have more than 5 drives though they are in use currently. I will
>> >> create a small testing partition on each device of the same size and
>> >> run the test on that after ensuring that the drives do go to sleep.
>> >>
>> >>>
>> >>> NeilBrown
>> >>>
>> >
>> > Wouldn't the metadata writes wake up all the disks in the cluster
>> > anyway (defeating the purpose)? This idea will require metadata to
>> > not be written out to each device (is that even possible or on the
>> > cards?)
>> >
>> > I am about to try out your suggestion with the chunk sizes anyway but
>> > thought about the metadata being a major stumbling block.
>> >
>>
>> And it seems to be confirmed that the metadata write is waking up the
>> other drives. On any write to a particular drive the metadata update
>> is accessing all the others.
>>
>> Am I correct in assuming that all metadata is currently written as
>> part of the block device itself and that the external metadata is
>> still embedded in each of the block devices (only the format of the
>> metadata is defined externally?) I guess to implement this we would
>> need to store metadata elsewhere which may be a major development
>> work. Still that may be a flexibility desired in md raid for other
>> reasons...
>>
>> Neil, your thoughts.
>
> This is exactly why I suggested testing with existing code and seeing how far
> you can get. Thanks.
>
> For a full solution we probably do need some code changes here, but for
> further testing you could:
> 1/ make sure there is no bitmap (mdadm --grow --bitmap=none)
> 2/ set the safe_mode_delay to 0
>      echo 0 > /sys/block/mdXXX/md/safe_mode_delay
>
> then it won't try to update the metadata until you stop the array, or a
> device fails.
>
> Longer term: it would probably be good to only update the bitmap on the
> devices that are being written to - and to merge all bitmaps when assembling
> the array. Also when there is a bitmap, the safe_mode functionality should
> probably be disabled.
>
> NeilBrown
>
^ permalink raw reply [flat|nested] 44+ messages in thread
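[Editorial sketch] The request counting quoted above (read-modify-write
versus reading the other data blocks) can be modelled in a few lines.
This is only a toy model of the totals given in the mail, not the
kernel's actual bookkeeping: the function names and the
prefer_rmw_on_tie flag are made up for illustration, and the flag
merely mirrors the "<" versus "<=" change in the quoted diff.

```python
def rmw_requests() -> int:
    """Read-modify-write: read old data, read old parity, write data, write parity."""
    return 2 + 2


def rcw_requests(n_drives: int) -> int:
    """Reconstruct-write: read the other data blocks, then write data and parity."""
    n_data = n_drives - 1          # RAID4 keeps one dedicated parity drive
    return (n_data - 1) + 2


def chosen_strategy(n_drives: int, prefer_rmw_on_tie: bool = False) -> str:
    """Pick a strategy for a single-block update on an n_drives RAID4."""
    rmw, rcw = rmw_requests(), rcw_requests(n_drives)
    if prefer_rmw_on_tie:
        use_rmw = rmw <= rcw and rmw > 0   # comparison after the quoted change
    else:
        use_rmw = rmw < rcw and rmw > 0    # comparison before the quoted change
    return "rmw" if use_rmw else "rcw"


for drives in (4, 5):
    print(drives, chosen_strategy(drives), chosen_strategy(drives, True))
```

Under these assumptions a 4-drive array ties at 4 requests each, so the
unmodified comparison picks the variant that touches every data drive,
while 5 or more drives already favour read-modify-write, which is the
case that lets the idle drives stay spun down.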
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-12-02 11:56 ` Anshuman Aggarwal
@ 2014-12-16 16:25 ` Anshuman Aggarwal
2014-12-16 21:49 ` NeilBrown
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-12-16 16:25 UTC (permalink / raw)
To: NeilBrown; +Cc: Mdadm
On 2 December 2014 at 17:26, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> It works! (At least on a sample 5 MB device with 5 x 1MB partitions :-)
> will find more space on my drives and do a larger test but don't see
> why it shouldn't work)
> Here are the following caveats (and questions):
> - Neil, like you pointed out, the power of 2 chunk size will probably
> need a code change (in the kernel or only in the userspace tool?)
> - Any performance or other reasons why a terabyte size chunk may
> not be feasible?
> - Implications of safe_mode_delay
> - Would the metadata be updated on the block device being written to
> and the parity device as well?
> - If the drive fails which is the same as the drive being written
> to, would that lack of metadata updates to the other devices affect
> reconstruction?
> - Adding new devices (is it possible to move the parity to the disk
> being added? How does device addition work for RAID4 ...is it added as
> a zero-ed out device with parity disk remaining the same)
>
>
Neil, sorry to try to bump this thread. Could you please look over the
questions and address the points on the remaining items that can make
it a working solution? Thanks
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-12-16 16:25 ` Anshuman Aggarwal
@ 2014-12-16 21:49 ` NeilBrown
2014-12-17 6:40 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: NeilBrown @ 2014-12-16 21:49 UTC (permalink / raw)
To: Anshuman Aggarwal; +Cc: Mdadm
On Tue, 16 Dec 2014 21:55:15 +0530 Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:

> On 2 December 2014 at 17:26, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
> > It works! (At least on a sample 5 MB device with 5 x 1MB partitions :-)
> > will find more space on my drives and do a larger test but don't see
> > why it shouldn't work)
> > Here are the following caveats (and questions):
> > - Neil, like you pointed out, the power of 2 chunk size will probably
> > need a code change (in the kernel or only in the userspace tool?)

In the kernel too.

> > - Any performance or other reasons why a terabyte size chunk may
> > not be feasible?

Not that I can think of.

> > - Implications of safe_mode_delay
> > - Would the metadata be updated on the block device being written to
> > and the parity device as well?

Probably. Hard to give a specific answer to a vague question.

> > - If the drive fails which is the same as the drive being written
> > to, would that lack of metadata updates to the other devices affect
> > reconstruction?

Again, to give a precise answer, a detailed question is needed. Obviously
any change would have to be made in such a way as to ensure that things which
needed to work, did work.

> > - Adding new devices (is it possible to move the parity to the disk
> > being added? How does device addition work for RAID4 ...is it added as
> > a zero-ed out device with parity disk remaining the same)

RAID5 or RAID6 with ALGORITHM_PARITY_0 puts the parity on the early devices.
Currently if you add a device to such an array ...... I'm not sure what it
will do. It should be possible to make it just write zeros out.

NeilBrown

> >
> >
>
> Neil, sorry to try to bump this thread. Could you please look over the
> questions and address the points on the remaining items that can make
> it a working solution? Thanks
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-12-16 21:49 ` NeilBrown
@ 2014-12-17 6:40 ` Anshuman Aggarwal
2015-01-06 11:40 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-12-17 6:40 UTC (permalink / raw)
To: NeilBrown; +Cc: Mdadm
On 17 December 2014 at 03:19, NeilBrown <neilb@suse.de> wrote:
> On Tue, 16 Dec 2014 21:55:15 +0530 Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>
>> On 2 December 2014 at 17:26, Anshuman Aggarwal
>> <anshuman.aggarwal@gmail.com> wrote:
>> > It works! (At least on a sample 5 MB device with 5 x 1MB partitions :-)
>> > will find more space on my drives and do a larger test but don't see
>> > why it shouldn't work)
>> > Here are the following caveats (and questions):
>> > - Neil, like you pointed out, the power of 2 chunk size will probably
>> > need a code change (in the kernel or only in the userspace tool?)
>
> In the kernel too.

Is this something that you would consider implementing soon? Is there
a performance or other impact to consider in removing this
limitation? Could you elaborate on the reason why it was there in the
first place?

If this is a case of patches are welcome, please guide on where to
start looking/working even if it's just

>
>> > - Any performance or other reasons why a terabyte size chunk may
>> > not be feasible?
>
> Not that I can think of.
>
>> > - Implications of safe_mode_delay
>> > - Would the metadata be updated on the block device being written to
>> > and the parity device as well?
>
> Probably. Hard to give a specific answer to vague question.

I should clarify.

For example, in a 5 device RAID4, let's say a block is being written to
device 1, parity is on device 5, and devices 2,3,4 are sleeping
(spun down). If we set safe_mode_delay to 0 and md decides to update
the parity without involving the blocks on the other 3 devices and
just updates the parity by doing a read, compute, write to device 5,
will the metadata be updated on both device 1 and 5 even though
safe_mode_delay is 0?

>
>> > - If the drive fails which is the same as the drive being written
>> > to, would that lack of metadata updates to the other devices affect
>> > reconstruction?
>
> Again, to give a precise answer, a detailed question is needed. Obviously
> any change would have to made in such a way to ensure that things which
> needed to work, did work.

Continuing from the previous example, let's say device 1 fails after a
write which only updated metadata on 1 and 5 while 2,3,4 were
sleeping. In that case, to access the data from 1, md will use 2,3,4,5,
but will it then update the metadata from 5 onto 2,3,4? I hope I am
making this clear.

>
>
>> > - Adding new devices (is it possible to move the parity to the disk
>> > being added? How does device addition work for RAID4 ...is it added as
>> > a zero-ed out device with parity disk remaining the same)
>
> RAID5 or RAID6 with ALGORITHM_PARITY_0 puts the parity on the early devices.
> Currently if you add a device to such an array ...... I'm not sure what it
> will do. It should be possible to make it just write zeros out.
>

Once again, is this something that can make its way to your roadmap?
If so, great.. otherwise could you steer me towards where in the md
kernel and mdadm source I should be looking to make these changes.
Thanks again.

>
> NeilBrown
>
>
>> >
>> >
>>
>> Neil, sorry to try to bump this thread. Could you please look over the
>> questions and address the points on the remaining items that can make
>> it a working solution? Thanks
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
2014-12-17 6:40 ` Anshuman Aggarwal
@ 2015-01-06 11:40 ` Anshuman Aggarwal
0 siblings, 0 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2015-01-06 11:40 UTC (permalink / raw)
To: NeilBrown; +Cc: Mdadm
On 17 December 2014 at 12:10, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> On 17 December 2014 at 03:19, NeilBrown <neilb@suse.de> wrote:
> [...]

Hi Neil,
Could you please find a minute to give your input to the above? Your
guidance will go a long way towards making this a reality and it may be
useful to the community at large with the new Seagate 8TB archival
drives which seem to be more geared towards occasional use but would
still benefit from a RAID like redundancy.

Many thanks,
Anshuman
^ permalink raw reply [flat|nested] 44+ messages in thread
[parent not found: <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>]
* Re: Split RAID: Proposal for archival RAID using incremental batch checksum
[not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
@ 2014-11-01 5:36 ` Anshuman Aggarwal
0 siblings, 0 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-01 5:36 UTC (permalink / raw)
To: Matt Garman; +Cc: Mdadm
On 31 October 2014 19:53, Matt Garman <matthew.garman@gmail.com> wrote:
> In a later post, you said you had a 4-to-1 scheme, but it wasn't clear to me
> if that was 1 drive worth of data, and 4 drives worth of checksum/backup, or
> the other way around.

I was wondering if anybody would catch that slip. I meant 4 data to 1
parity, which seems about the right mix to me so far, based on my
read and feel of the probability of drive failure.

>
> In your proposed scheme, I assume you want your actual data drives to be
> spinning all the time? Otherwise, when you go to read data (play
> music/videos), you have the multi-second spinup delay... or is that OK with
> you?

Well, actually in my experience with 6-8 drives of 2-4TB each, there is
a lot of music/video content that I don't end up playing that often.
Those drives can easily be spun down (maybe for days on end, and at least
all night), and a small initial (one-time) delay before playing a file
whose drive hasn't been accessed seems like a good trade-off (both for
power and drive life).

>
> Some other considerations: modern 5400 RPM drives generally consume less
> than five watts in idle state[1]. Actual AC draw will be higher due to
> power supply inefficiency, so we'll err on the conservative side and say
> each drive requires 10 AC watts of power. My electrical rates in Chicago
> are about average for the USA (11 or 12 cents/kWH), and conveniently it
> roughly works out such that one always-on watt costs about $1/year. So,
> each always-running hard drive will cost about $10/year to run, less with a
> more efficient power supply. I know electricity is substantially more
> expensive in many parts of the world; or maybe you're running off-the-grid
> (e.g. solar) and have a very small power budget?

Besides the cost, there is an environmental aspect. If something has
superior efficiency and increases the life of the product, isn't it a
good thing wherever we live on the planet? BTW great calculation, but I
moved back (to India) from San Francisco some time ago :) and the
electricity cost is quite high (and availability of supply is not 100%
yet). I'd like to maximize my backups, and spinning disks that are not
being used for hours on end sounds bad. Just to add, internet is metered
per GB in many parts (and in mine, sadly :( ) for high speed access
(meaning 4-8 MBps), so I have to store content locally (before cloud
suggestions are thrown around).

>
> On Wed, Oct 29, 2014 at 2:15 AM, Anshuman Aggarwal
> <anshuman.aggarwal@gmail.com> wrote:
>>
>> - SnapRAID (http://snapraid.sourceforge.net/) which is a snapshot
>> based scheme (Its advantages are that its in user space and has cross
>> platform support but has the huge disadvantage of every checksum being
>> done from scratch slowing the system, causing immense wear and tear on
>> every snapshot and also losing any information updates upto the
>> snapshot point etc)
>
>
> Last time I looked at SnapRAID, it seemed like yours was its target use
> case. The "huge disadvantage of every checksum being done from scratch"
> sounds like a SnapRAID feature enhancement that might be
> simpler/easier/faster-to-get done than a major enhancement to the Linux
> kernel (just speculating though).

SnapRAID can't be enhanced without involving the kernel because the
delta checksum will require knowing which blocks were written to, and
only a kernel level driver can know that. This is a hard reality, no
way around it, and that was my reason to propose this.

>
> But, on the other hand, by your use case description, writes are very
> infrequent, and you're willing to buffer checksum updates for quite a
> while... so what if you had a *monthly* cron job to do parity syncs?
> Schedule it for a time when the system is unlikely to be used to offset the
> increased load. That's only 12 "hard" tasks for the drive per year. I'm
> not an expert, but that doesn't "feel" like a lot of wear and tear.

Well, again, anything between infrequent updates and weekly or monthly
crons sounds like a bad compromise when a better incremental update
could store the checksums in a buffer and write them out eventually
(2-3 times a day). Almost always the buffer will get written out,
giving us an updated parity with little to no "extra" wear and tear.

>
> On the issue of wear and tear, I've mostly given up trying to understand
> what's best for my drives. One school of thought says many spinup-spindown
> cycles are actually harder on the drive than running 24/7. But maybe
> consumer drives actually aren't designed for 24/7 operation, so they're
> better off being cycled up and down. Or consumer drives can't handle the
> vibrations of being in a case with other 24/7 drives. But failure
> to "exercise" the entire drive regularly enough might result in a situation
> where an error has developed but you don't know until it's too late or your
> warranty period has expired.

You are right about consumer drives, where spin downs are good ...a
timeout of an hour or so should reduce unnecessary spin up/downs.
Once spun down, most may stay that way for days, which is better for all
of us (energy, wastage of drives etc). Spin down technology is
progressing faster than block failure (also because block density is
going up, causing media failure and not head failure to be the
primary cause of drive outage).

The drive can be tested periodically (by non destructive bad blocks
etc) as a pure testing exercise to find errors being developed. There
is no need to needlessly stress the drives out by reading/writing to
all parts continuously. Also RAID speeds are often no longer required
due to the higher R/W coming from the drives.

Thanks for reading and writing such a thorough reply.

Neil, would you be willing to assist/guide in helping design or with
the best approach to the same? I would like to avoid the obvious
pitfalls that any new kernel block level device writer is bound to
face.

Regards,
Anshuman

>
>
> [1] http://www.silentpcreview.com/article29-page2.html
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
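[Editorial sketch] The running-cost estimate quoted in the message above
(10 W per drive at 11 or 12 cents/kWh, roughly $1 per always-on watt
per year) can be checked with a few lines of arithmetic. The 0.115
USD/kWh midpoint and the function name are assumptions made for this
illustration only; substitute a local tariff as needed.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years


def yearly_cost_usd(watts: float, usd_per_kwh: float) -> float:
    """Cost of drawing `watts` continuously for one year."""
    kwh_per_year = watts * HOURS_PER_YEAR / 1000
    return kwh_per_year * usd_per_kwh


per_watt = yearly_cost_usd(1, 0.115)     # one always-on watt
per_drive = yearly_cost_usd(10, 0.115)   # one drive at 10 W AC draw
print(f"per watt: ${per_watt:.2f}/year, per drive: ${per_drive:.2f}/year")
```

With these inputs the rule of thumb holds to within a cent or so per
watt; a higher tariff scales the result linearly.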
* Split RAID: Proposal for archival RAID using incremental batch checksum
@ 2014-11-21 10:15 Anshuman Aggarwal
2014-11-21 11:41 ` Greg Freemyer
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-21 10:15 UTC (permalink / raw)
To: kernelnewbies
I'd appreciate any help/pointers in implementing the proposal below,
including the right path to get this into the kernel itself.
----------------------------------
I'm outlining below a proposal for a RAID device mapper virtual block
device for the kernel which adds "split raid" functionality on an
incremental batch basis for a home media server/archived content which
is rarely accessed.

Given a set of N+X block devices (of the same size but smallest common
size wins)

the SplitRAID device mapper device generates virtual devices which are
passthrough for N devices and writes a Batched/Delayed checksum into
the X devices so as to allow offline recovery of blocks on the N
devices in case of a single disk failure.

Advantages over conventional RAID:

- Disks can be spun down, reducing wear and tear over MD RAID Levels
(such as 1, 10, 5, 6) in the case of rarely accessed archival content

- Prevent catastrophic data loss for multiple device failure since
each block device is independent and hence, unlike MD RAID, will only
lose data incrementally.

- Performance degradation for writes can be avoided by keeping the
checksum update asynchronous and delaying the fsync to the checksum
block device.

In the event of improper shutdown the checksum may not have all the
updated data but will be mostly up to date, which is often acceptable
for home media server requirements. A flag can be set in case the
checksum block device was shut down properly, indicating that a full
checksum rebuild is not required.

Existing solutions considered:

- SnapRAID (http://snapraid.sourceforge.net/) which is a snapshot
based scheme (Its advantages are that it's in user space and has cross
platform support but has the huge disadvantage of every checksum being
done from scratch, slowing the system, causing immense wear and tear on
every snapshot and also losing any information updates up to the
snapshot point etc)

I'd like to get opinions on the pros and cons of this proposal from
more experienced people on the list to redirect suitably on the
following questions:

- Maybe this can already be done using the block devices available in
the kernel?

- If not, is device mapper the right API to use? (I think so)

- What would be the best block devices code to look at to implement?

Regards,

Anshuman Aggarwal
^ permalink raw reply [flat|nested] 44+ messages in thread
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-21 10:15 Anshuman Aggarwal
@ 2014-11-21 11:41 ` Greg Freemyer
2014-11-21 18:48 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-21 11:41 UTC (permalink / raw)
To: kernelnewbies
On November 21, 2014 5:15:43 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>I'd a appreciate any help/pointers in implementing the proposal below
>including the right path to get this into the kernel itself.
>[...]

I think I understand the proposal.

You say N pass-through drives. I assume concatenated?

If the N drives were instead in a Raid-0 stripe set and your X drives was just a single parity drive, then you would have described Raid-4.

There are use cases for raid 4 and you have described a good one (rarely used data where random w/o performance is not key).

I don't know if mdraid supports raid-4 or not. If not I suggest adding raid-4 support is something else you might want to look at.

Anyway, at a minimum add raid-4 to the existing solutions considered section.

Greg
^ permalink raw reply [flat|nested] 44+ messages in thread
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-21 11:41 ` Greg Freemyer
@ 2014-11-21 18:48 ` Anshuman Aggarwal
2014-11-22 13:17 ` Greg Freemyer
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-21 18:48 UTC (permalink / raw)
To: kernelnewbies
N pass through but with their own filesystems. Concatenation is via
some kind of union fs solution not at the block level. Data is not
supposed to be striped (this is critical so as to prevent all drives
to be required to be accessed for consecutive data)

Idea is that each drive can work independently and the last drive
stores parity to save data in case of failure of any one drive.

Any suggestions from anyone on where to start with such a driver..it
seems like a block driver for the parity drive but which depends on
intercepting the writes to other drives.

On 21 November 2014 17:11, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> On November 21, 2014 5:15:43 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> [...]
^ permalink raw reply [flat|nested] 44+ messages in thread
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-21 18:48 ` Anshuman Aggarwal
@ 2014-11-22 13:17 ` Greg Freemyer
2014-11-22 13:22 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-22 13:17 UTC (permalink / raw)
To: kernelnewbies

Top posting is strongly discouraged on all kernel related mailing lists
including this one. I've moved your reply to the bottom and then replied
after that. In future I will ignore replies that are top posted.

>On 21 November 2014 17:11, Greg Freemyer <greg.freemyer@gmail.com>
>wrote:
>> I think I understand the proposal.
>>
>> You say N pass-through drives. I assume concatenated?
>>
>> If the N drives were instead in a Raid-0 stripe set and your X drives
>> was just a single parity drive, then you would have described Raid-4.
>>
>> There are use cases for raid 4 and you have described a good one
>> (rarely used data where random w/o performance is not key).
>>
>> I don't know if mdraid supports raid-4 or not. If not I suggest
>> adding raid-4 support is something else you might want to look at.
>>
>> Anyway, at a minimum add raid-4 to the existing solutions considered
>> section.
>>
>> Greg

On November 21, 2014 1:48:57 PM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>N pass through but with their own filesystems. Concatenation is via
>some kind of union fs solution not at the block level. Data is not
>supposed to be striped (this is critical so as to prevent all drives
>to be required to be accessed for consecutive data)

I'm ignorant of how unionfs works, so I can offer no feedback about it.

I see no real issue doing it with a block level solution with device
mapper (dm) as the implementation. I'm going to ignore implementation
for the rest of this email and discuss the goal.

Can you detail what you see a single page write to D1 doing?

You talked about batching / delaying the checksum writes, but I didn't
understand how that made things more efficient, nor the reason for the
delay.

I assume you know raid 4 and 5 work like this:

Read D1old
Read Pold
Pnew=(Pold^D1old)^D1new
Write Pnew
Write D1new

I.e. 2 page reads and 2 page writes to update a single page.

The 2 reads and the 2 writes take place in parallel, so if the disks are
otherwise idle, then the time involved is one disk seek and 2 disk
rotations. Let's say 25 msecs for the seek and 12 msecs per rotation.
That is 49 msecs total. I think that is about right for a low
performance rotating drive, but I didn't pull out any specs to double
check my memory.

While that is a lot of i/o overhead (4x), it is how raid 4 and 5 work
and I assume your split raid would have to do something similar. With a
normal non-raided disk a single block write requires a seek and a
rotation, so 37 msecs, thus very little clock time overhead for raid 4
or 5 for small random i/o block writes.

Is that also true of your split raid? The delayed checksum writes
confuse me.

---

Where I'm concerned about your solution for performance is with a full
stripe write. Let's look at how a 4 disk raid 4 would write a full
stripe:

Pnew = D1new ^ D2new ^ D3new
Write D1
Write D2
Write D3
Write P

So only 4 writes to write 3 data blocks. Even better, all take place in
parallel so you can accomplish 3x the data writes to disk that a single
non-raided disk can.

Thus for streaming writes, raid 4 or 5 see a performance boost over a
single drive.

I see nothing similar in your split raid.

The same is true of streaming reads: raid 4 and 5 get performance gains
from reading from the drives in parallel. I don't see any ability for
that same gain in your split raid.

In the real world raid 4 is rarely used because having a static parity
drive offers no advantage I know of over having the parity handled as
raid 5 does it.

===

Thus if your split raid was in kernel and I was setting up a streaming
media server the choice would be between raid 5 and your split raid.
Raid 5 I believe would have superior performance, but split raid would
have a less catastrophic failure mode if 2 drives failed at once.

Do I have that right?

Greg

>Idea is that each drive can work independently and the last drive
>stores parity to save data in case of failure of any one drive.
>
>Any suggestions from anyone on where to start with such a driver..it
>seems like a block driver for the parity drive but which depends on
>intercepting the writes to other drives.

--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
^ permalink raw reply [flat|nested] 44+ messages in thread
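The single-block update and the full-stripe write described in the message above can be sketched in a few lines of Python. This is only an illustration: the helper names (`xor_blocks`, `full_stripe_parity`, `read_modify_write_parity`) and the sample bytes are hypothetical, and the latency constants are the rough figures quoted in the message, not measured values.

```python
from functools import reduce
from operator import xor


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length blocks byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


def full_stripe_parity(*data_new: bytes) -> bytes:
    """Full-stripe write: parity comes from the new data alone, no reads."""
    return xor_blocks(*data_new)


def read_modify_write_parity(d_old: bytes, p_old: bytes, d_new: bytes) -> bytes:
    """Single-block update: Pnew = (Pold ^ D1old) ^ D1new."""
    return xor_blocks(p_old, d_old, d_new)


# Three hypothetical data blocks and their parity.
d1, d2, d3 = b"\x01\x02", b"\x10\x20", b"\x0f\xf0"
parity = full_stripe_parity(d1, d2, d3)

# Update D1 alone: 2 reads (D1old, Pold) and 2 writes (D1new, Pnew).
d1_new = b"\xaa\x55"
parity_new = read_modify_write_parity(d1, parity, d1_new)

# The incremental result matches recomputing parity from scratch.
assert parity_new == full_stripe_parity(d1_new, d2, d3)

# Rough latency figures from the message (assumed, not measured).
SEEK_MS, ROTATION_MS = 25, 12
rmw_latency_ms = SEEK_MS + 2 * ROTATION_MS  # one rotation to read, one to write
plain_latency_ms = SEEK_MS + ROTATION_MS    # single non-raided block write
print(rmw_latency_ms, plain_latency_ms)     # prints "49 37"
```

The assertion is the point of the read-modify-write shortcut: updating parity from the old data and old parity gives the same result as reading every data drive again.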
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-22 13:17 ` Greg Freemyer
@ 2014-11-22 13:22 ` Anshuman Aggarwal
2014-11-22 14:03 ` Greg Freemyer
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-22 13:22 UTC (permalink / raw)
To: kernelnewbies

On 22 November 2014 at 18:47, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> Thus if your split raid was in kernel and I was setting up a streaming
> media server the choice would be between raid 5 and your split raid.
> Raid 5 I believe would have superior performance, but split raid would
> have a less catastrophic failure mode if 2 drives failed at once.
>
> Do I have right?
>
> Greg

You have the motivation and goal quite opposite from what is intended.
In a home media server, the RAID6 mdadm setup that I currently have
keeps all the disks spinning and running for writes which could be done
just to the last disk while the others are in sleep mode (head parked
etc.)

It's not about performance at all. It's about longevity of the HDDs. The
entire proposal is focused on extending the life of the drives.

By not using stripes, we restrict writes to happen to just 1 drive and
the XOR output to the parity drive which then explains the delayed and
batched checksum (resulting in fewer writes to the parity drive). The
intention is that if a drive fails then maybe we lose 1 or 2 movies
but the rest is restorable from parity.

Also another advantage over RAID5 or RAID6 is that in the event of
multiple drive failure we only lose the content on the failed drive
not the whole cluster/RAID.

Did I clarify better this time around?
^ permalink raw reply [flat|nested] 44+ messages in thread
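The failure behaviour claimed in the message above can be sketched as follows. This is a toy model, not the proposed driver: the drive names, the sample contents, and the `rebuild` helper are hypothetical, and it assumes the parity is fully up to date when a drive fails.

```python
from functools import reduce
from operator import xor


def xor_blocks(*blocks: bytes) -> bytes:
    """XOR any number of equal-length blocks byte by byte."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Each data drive is an independent device with its own contents (no striping).
drives = {"D1": b"movie-A!", "D2": b"movie-B!", "D3": b"movie-C!"}
parity = xor_blocks(*drives.values())


def rebuild(failed: str, drives: dict, parity: bytes) -> bytes:
    """Recreate one failed drive from the parity and all surviving drives."""
    survivors = [data for name, data in drives.items() if name != failed]
    return xor_blocks(parity, *survivors)


# Single failure: the lost drive is recoverable from parity.
assert rebuild("D2", drives, parity) == drives["D2"]

# Double failure: single parity cannot rebuild either drive, but because
# nothing is striped, every surviving drive is still readable on its own.
failed = {"D1", "D2"}
still_readable = {n: d for n, d in drives.items() if n not in failed}
print(sorted(still_readable))  # prints "['D3']"
```

In a striped RAID5 the same double failure would leave every file with missing chunks, which is the contrast the message draws.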
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-22 13:22 ` Anshuman Aggarwal
@ 2014-11-22 14:03 ` Greg Freemyer
2014-11-22 14:43 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-22 14:03 UTC (permalink / raw)
To: kernelnewbies

On Sat, Nov 22, 2014 at 8:22 AM, Anshuman Aggarwal
<anshuman.aggarwal@gmail.com> wrote:
> By not using stripes, we restrict writes to happen to just 1 drive and
> the XOR output to the parity drive which then explains the delayed and
> batched checksum (resulting in fewer writes to the parity drive). The
> intention is that if a drive fails then maybe we lose 1 or 2 movies
> but the rest is restorable from parity.
>
> Also another advantage over RAID5 or RAID6 is that in the event of
> multiple drive failure we only lose the content on the failed drive
> not the whole cluster/RAID.
>
> Did I clarify better this time around?

I still don't understand the delayed checksum/parity.

With classic raid 4, writing 1 GB of data to just D1 would require 1
GB of data first be read from D1 and 1 GB read from P then 1 GB
written to both D1 and P. 4 GB worth of I/O total.

With your proposal, if you stream 1 GB of data to a file on D1:

- Does the old/previous data on D1 have to be read?

- How much data goes to the parity drive?

- Does the old data on the parity drive have to be read?

- Why does delaying it reduce that volume compared to Raid 4?

- In the event drive 1 fails, can its content be re-created from the
other drives?

Greg
--
Greg Freemyer
^ permalink raw reply [flat|nested] 44+ messages in thread
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-22 14:03 ` Greg Freemyer
@ 2014-11-22 14:43 ` Anshuman Aggarwal
2014-11-22 14:54 ` Greg Freemyer
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-22 14:43 UTC (permalink / raw)
To: kernelnewbies

On 22 November 2014 at 19:33, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> I still don't understand the delayed checksum/parity.
>
> With classic raid 4, writing 1 GB of data to just D1 would require 1
> GB of data first be read from D1 and 1 GB read from P then 1 GB
> written to both D1 and P. 4 GB worth of I/O total.
>
> With your proposal, if you stream 1 GB of data to a file on D1:
>
> - Does the old/previous data on D1 have to be read?
>
> - How much data goes to the parity drive?
>
> - Does the old data on the parity drive have to be read?
>
> - Why does delaying it reduce that volume compared to Raid 4?
>
> - In the event drive 1 fails, can its content be re-created from the
> other drives?
>
> Greg
> --
> Greg Freemyer

Two things:
Delayed writes are basically to allow the parity drive to stay spun down
if the parity writing is only 1 block, instead of spinning up the drive
for every write (obviously the data drive has to be spun up). Delays
will be both time and size constrained.
A large write, such as 1 GB of data to a file, would trigger a
configurable maximum delay limit, which would then dump to the parity
drive immediately, preventing memory overuse.

This again ties in to the fact that the content is not 'critical' so
if parity was not dumped when a drive fails, worst case you only lose
the latest file.

Delayed writes may be done via bcache or a similar implementation
which caches the writes in memory and need not be part of the split
raid driver at all.
^ permalink raw reply [flat|nested] 44+ messages in thread
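A minimal sketch of a parity update that is "both time and size constrained", as the message above puts it. Everything here is an assumption made for illustration: the `ParityBatcher` class, its parameters, and the dict standing in for the parity drive are hypothetical. It also assumes the old contents of each data block are available so that a parity delta (old XOR new) can be queued; the thread leaves open whether that read is needed.

```python
import time
from typing import Callable, Dict, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


class ParityBatcher:
    """Queue parity deltas in memory; flush on a size or age limit."""

    def __init__(self, parity: Dict[int, bytes], max_bytes: int,
                 max_delay_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.parity = parity            # stand-in for the parity drive
        self.max_bytes = max_bytes      # size limit on pending deltas
        self.max_delay_s = max_delay_s  # age limit on the oldest delta
        self.clock = clock
        self.pending: Dict[int, bytes] = {}
        self.oldest: Optional[float] = None
        self.flushes = 0                # each flush = one parity spin-up

    def record_write(self, block_no: int, old: bytes, new: bytes) -> None:
        """Queue the parity delta for one data block write."""
        if not self.pending:
            self.oldest = self.clock()
        delta = xor_bytes(old, new)
        prev = self.pending.get(block_no)
        self.pending[block_no] = delta if prev is None else xor_bytes(prev, delta)
        self.maybe_flush()

    def maybe_flush(self) -> None:
        """Flush if the pending deltas are too large or too old."""
        if not self.pending:
            return
        too_big = sum(len(d) for d in self.pending.values()) >= self.max_bytes
        too_old = self.clock() - self.oldest >= self.max_delay_s
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        """Apply all queued deltas to the parity in one batch."""
        for block_no, delta in self.pending.items():
            old_parity = self.parity.get(block_no, bytes(len(delta)))
            self.parity[block_no] = xor_bytes(old_parity, delta)
        self.pending.clear()
        self.oldest = None
        self.flushes += 1


now = [0.0]  # fake clock so the demo is deterministic
parity_drive: Dict[int, bytes] = {}
batcher = ParityBatcher(parity_drive, max_bytes=8, max_delay_s=60.0,
                        clock=lambda: now[0])
zero = bytes(4)

batcher.record_write(0, zero, b"\x01\x02\x03\x04")  # 4 bytes pending
assert parity_drive == {}                           # parity drive untouched
batcher.record_write(1, zero, b"\xff\x00\xff\x00")  # size limit reached
assert batcher.flushes == 1

batcher.record_write(0, b"\x01\x02\x03\x04", zero)  # small write stays queued
now[0] = 61.0
batcher.maybe_flush()                               # age limit reached
assert batcher.flushes == 2 and parity_drive[0] == zero
```

Any delta still queued when a drive fails is exactly the window in which "you only lose the latest file"; a real implementation would also need a timer to call `maybe_flush` without waiting for the next write.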
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-22 14:43 ` Anshuman Aggarwal
@ 2014-11-22 14:54 ` Greg Freemyer
2014-11-24 5:36 ` SandeepKsinha
0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-22 14:54 UTC (permalink / raw)
To: kernelnewbies

On November 22, 2014 9:43:23 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>Two things:
>Delayed writes basically to allow the parity drive to spin down if the
>parity writing is only 1 block instead of spinning up the drive for
>every write (obviously the data drive has to be spun up). Delays will
>be both time and size constrained.
>For a large write such as a 1 GB of data to file it would trigger a
>configurable maximum delaying limit which would then dump to parity
>drive immediately preventing memory overuse.
>
>This again ties in to the fact that the content is not 'critical' so
>if parity was not dumped when a drive fails, worst case you only lose
>the latest file.
>
>Delayed writes may be done via bcache or a similar implementation
>which caches the writes in memory and need not be part of the split
>raid driver at all.

That provided little clarity.

File systems like xfs queue (delay) significant amounts of actual data
before writing it to disk. The same is true of journal data. If all you
are doing is caching the parity up until there is enough to bother with,
then a filesystem designed for streamed data already does that for the
data drive, thus you don't need to do anything new for the parity drive,
just run it in sync with the data drive.

At this point I interpret your proposal to be:

Implement a Raid 4 like setup, but instead of striping the data drives,
concatenate them.

That is something I haven't seen done, but I can see why you would want
it. Implementing via unionfs I don't understand, but as a new device
mapper mechanism it seems very logical.

Obviously, I'm not a device mapper maintainer, so I'm not saying it
would be accepted, but if I'm right you can now have a discussion of
just a few sentences which explain your goal.

Greg
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
^ permalink raw reply [flat|nested] 44+ messages in thread
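The difference between striping and concatenating the data drives, as summarised in the message above, comes down to how a logical sector maps to a drive. The sketch below is a hypothetical illustration: the function names, the chunk size, and the drive size are assumptions, and it models only the address mapping, not parity or I/O.

```python
from typing import Callable, Set, Tuple

Location = Tuple[int, int]  # (data drive index, sector offset on that drive)


def striped_location(sector: int, n_data: int, chunk_sectors: int) -> Location:
    """Raid 4 style: consecutive chunks rotate across the data drives."""
    chunk, within = divmod(sector, chunk_sectors)
    stripe, drive = divmod(chunk, n_data)
    return drive, stripe * chunk_sectors + within


def concatenated_location(sector: int, drive_sectors: int) -> Location:
    """Concatenation: fill one drive completely before moving to the next."""
    drive, offset = divmod(sector, drive_sectors)
    return drive, offset


def drives_touched(locate: Callable[[int], Location],
                   start: int, count: int) -> Set[int]:
    """Which data drives must be spun up for a run of consecutive sectors."""
    return {locate(s)[0] for s in range(start, start + count)}


N_DATA, CHUNK, DRIVE_SECTORS = 3, 8, 1000  # assumed geometry

striped = drives_touched(lambda s: striped_location(s, N_DATA, CHUNK), 0, 64)
concatenated = drives_touched(
    lambda s: concatenated_location(s, DRIVE_SECTORS), 0, 64)
print(sorted(striped), sorted(concatenated))  # prints "[0, 1, 2] [0]"
```

A sequential 64-sector access wakes every data drive in the striped layout but only one drive in the concatenated layout, which is the spin-down property the proposal is after.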
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-22 14:54 ` Greg Freemyer
@ 2014-11-24 5:36 ` SandeepKsinha
2014-11-24 6:48 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: SandeepKsinha @ 2014-11-24 5:36 UTC (permalink / raw)
To: kernelnewbies

On Sat, Nov 22, 2014 at 8:24 PM, Greg Freemyer <greg.freemyer@gmail.com> wrote:
> At this point I interpret your proposal to be:
>
> Implement a Raid 4 like setup, but instead if stripping the date data
> drives, concatenate them.
>
> That is something I haven't seen done, but I can see why you would want
> it. Implementing via unionfs I don't understand, but as a new device
> mapper mechanism it seems very logical.
>
> Obviously, I'm not a device mapper maintainer, so I'm not saying it would
> be accepted, but if I'm right you can now have a discussion of just a few
> sentences which explain your goal.

RAID4 support does not exist in the mainline. Anshuman, you might want to
reach out to Neil Brown who is the maintainer for dmraid.
IIUC, your requirement can be well implemented by writing a new device
mapper target. That will make it modular and will help you make
improvements to it easily.

--
Regards,
Sandeep.

"To learn is to change. Education is a process that changes the learner."
^ permalink raw reply [flat|nested] 44+ messages in thread
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-24 5:36 ` SandeepKsinha
@ 2014-11-24 6:48 ` Anshuman Aggarwal
2014-11-24 13:19 ` Greg Freemyer
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-24 6:48 UTC (permalink / raw)
To: kernelnewbies

Sandeep,
This isn't exactly RAID4 (only thing in common is a single parity
disk but the data is not striped at all). I did bring it up on the
linux-raid mailing list and have had a short conversation with Neil.
He wasn't too excited about device mapper but didn't indicate why or
why not.

I would like to have this as a layer for each block device on top of
the original block devices (intercepting write requests to the block
devices and updating the parity disk). Is device mapper the right
interface? What are the others? Also, if I don't store the metadata on
the block device itself (to allow the block device to be unaware of
the RAID4 on top), how would the kernel be informed of which devices
together form the Split RAID?

Appreciate the help.

Thanks,
Anshuman

On 24 November 2014 at 11:06, SandeepKsinha <sandeepksinha@gmail.com> wrote:
> RAID4 support does not exist in the mainline. Anshuman, you might want to
> reach out to Neil Brown who is the maintainer for dmraid.
> IIUC, your requirement can be well implemented by writing a new device
> mapper target. That will make it modular and will help you make improvements
> to it easily.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-24 6:48 ` Anshuman Aggarwal
@ 2014-11-24 13:19 ` Greg Freemyer
2014-11-24 17:28 ` Anshuman Aggarwal
0 siblings, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-24 13:19 UTC (permalink / raw)
To: kernelnewbies

On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>Sandeep,
> This isn't exactly RAID4 (only thing in common is a single parity
>disk but the data is not striped at all). I did bring it up on the
>linux-raid mailing list and have had a short conversation with Neil.
>He wasn't too excited about device mapper but didn't indicate why or
>why not.

If it was early in your proposal it may simply be he didn't understand
it.

The delayed writes to the parity disk you described would have been
tough for device mapper to manage. It doesn't typically maintain its
own longer term buffers, so that would have been something that might
have given him concern. The only reason you provided was reduced wear
and tear for the parity drive.

Reduced wear and tear in this case is a red herring. The kernel already
buffers writes to the data disk, so no need to separately buffer parity
writes.

>I would like to have this as a layer for each block device on top of
>the original block devices (intercepting write requests to the block
>devices and updating the parity disk). Is device mapper the right
>interface?

I think yes, but dm and md are actually separate. I think of dm as a
subset of md, but if you are going to really do this you will need to
learn the details better than I know them:

https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt

You will need to add code to both the dm and md kernel code.

I assume you know that both mdraid (mdadm) and lvm userspace tools are
used to manage device mapper, so you would have to add user space
support to mdraid/lvm as well.

> What are the others?

Well, btrfs as an example incorporates a lot of raid capability into
the filesystem. Thus btrfs is a monolithic driver that has consumed
much of the dm/md layer. I can't speak to why they are doing that, but
I find it troubling. Having monolithic aspects to the kernel has always
been something the Linux kernel avoided.

> Also if I don't store the metadata on
>the block device itself (to allow the block device to be unaware of
>the RAID4 on top...how would the kernel be informed of which devices
>together form the Split RAID.

I don't understand the question.

I haven't thought through the process, but with mdraid/lvm you would
identify the physical drives as under dm control (mdadm for md,
pvcreate for dm). Then configure the split raid setup.

Have you gone through the process of creating a raid5 with mdadm? If
not, at least read a howto about it.

https://raid.wiki.kernel.org/index.php/RAID_setup

I assume you would have mdadm form your multi-disk split raid volume
composed of all the physical disks, then use lvm commands to define the
block range on the first drive as a lv (logical volume). Same for the
other data drives.

Then use mkfs to put a filesystem on each lv.

The filesystem has no knowledge there is a split raid below it. It
simply reads/writes to the lv; device mapper is layered below it and
triggers the required i/o calls.

I.e. for a read, it is a straight passthrough. For a write, the old data
and old parity have to be read in, modified, and written out. Device
mapper does this now for raid 4/5/6, so most of the code is in place.

>Appreciate the help.
>
>Thanks,
>Anshuman

I just realized I replied to a top post.

Seriously, don't do that on kernel lists if you want to be taken
seriously. It immediately identifies you as unfamiliar with the kernel
mailing list netiquette.

Greg
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
^ permalink raw reply [flat|nested] 44+ messages in thread
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-24 13:19 ` Greg Freemyer
@ 2014-11-24 17:28 ` Anshuman Aggarwal
2014-11-24 18:10 ` Valdis.Kletnieks at vt.edu
2014-11-25 4:56 ` Greg Freemyer
0 siblings, 2 replies; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-24 17:28 UTC (permalink / raw)
To: kernelnewbies

On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer@gmail.com> wrote:
>
>
> On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>Sandeep,
>> This isn't exactly RAID4 (only thing in common is a single parity
>>disk but the data is not striped at all). I did bring it up on the
>>linux-raid mailing list and have had a short conversation with Neil.
>>He wasn't too excited about device mapper but didn't indicate why or
>>why not.
>
> If it was early in your proposal it may simply be he didn't understand it.
>
> The delayed writes to the parity disk you described would have been tough for device mapper to manage. It doesn't typically maintain its own longer term buffers, so that would have been something that might have given him concern. The only reason you provided was reduced wear and tear for the parity drive.
>
> Reduced wear and tear in this case is a red herring. The kernel already buffers writes to the data disk, so no need to separately buffer parity writes.

Fair enough, the delay in buffering for the parity writes is an independent issue which can be deferred easily.

>
>>I would like to have this as a layer for each block device on top of
>>the original block devices (intercepting write requests to the block
>>devices and updating the parity disk). Is device mapper the write
>>interface?
>
> I think yes, but dm and md are actually separate.
> I think of dm as a subset of md, but if you are going to really do this you will need to learn the details better than I know them:
>
> https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt
>
> You will need to add code to both the dm and md kernel code.
>
> I assume you know that both mdraid (mdadm) and lvm userspace tools are used to manage device mapper, so you would have to add user space support to mdraid/lvm as well.
>
>> What are the others?
>
> Well btrfs as an example incorporates a lot of raid capability into the filesystem. Thus btrfs is a monolithic driver that has consumed much of the dm/md layer. I can't speak to why they are doing that, but I find it troubling. Having monolithic aspects to the kernel has always been something the Linux kernel avoided.
>
>> Also if I don't store the metadata on
>>the block device itself (to allow the block device to be unaware of
>>the RAID4 on top...how would the kernel be informed of which devices
>>together form the Split RAID.
>
> I don't understand the question.

mdadm typically stores a metadata superblock on the block device, which identifies the block device as part of the RAID and typically prevents it from being recognized directly by file system code. I was wondering whether Split RAID block devices can be kept unaware of the RAID scheme on top, so that they remain fully mountable and usable without the raid drivers (of course invalidating the parity if any of them are written to). This would allow a parity disk to be added to existing block devices without having to set up a superblock on the underlying devices.

Hope that is clear now?

>
> I haven't thought through the process, but with mdraid/lvm you would identify the physical drives as under dm control. (mdadm for md, pvcreate for dm). Then configure the split raid setup.
>
> Have you gone through the process of creating a raid5 with mdadm. If not at least read a howto about it.
>
> https://raid.wiki.kernel.org/index.php/RAID_setup

Actually, I have maintained a 6-disk RAID5/RAID6 setup with mdadm for more than a few years and handled multiple failures. I am reasonably familiar with md reconstruction too. It is the performance-oriented but disk-intensive nature of mdadm that I would like to move away from for a home media server.

>
> I assume you would have mdadm form your multi-disk split raid volume composed of all the physical disks, then use lvm commands to define the block range on the the first drive as a lv (logical volume). Same for the other data drives.
>
> Then use mkfs to put a filesystem on each lv.

Maybe it can also be done via md raid, creating a partitionable array where each partition corresponds to an underlying block device without any striping.

>
> The filesystem has no knowledge there is a split raid below it. It simply reads/writes to the overall, device mapper is layered below it and triggers the required i/o calls.
>
> Ie. For a read, it is a straight passthrough. For a write, the old data and old parity have to be read in, modified, written out. Device mapper does this now for raid 4/5/6, so most of the code is in place.

Exactly. Reads are passthrough; writes trigger the parity write. The only remaining concern for me is that the md superblock will require each block device to be initialized using mdadm. That can be acceptable, I suppose, but an ideal solution would use existing block devices (which would be left untouched), put a passthrough block device on top of them, and manage the parity updates on the parity block device. The information about which block devices comprise the array can be stored in a config file and does not need a superblock as badly as a conventional RAID setup does.

>
>>Appreciate the help.
>>
>>Thanks,
>>Anshuman
>
> I just realized I replied to a top post.
>
> Seriously, don't do that on kernel lists if you want to be taken seriously.
> It immediately identifies you as unfamiliar with the kernel mailing list netiquette.
>
> Greg
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Sorry. Just getting used to the kernel mailing list and most tools put the default reply on the top. Thanks for replying and reminding me.

Anshuman
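As a rough illustration of the "config file instead of superblock" idea in the message above, the sketch below parses an array description with Python's standard `configparser`. The file format, the section name `splitraid0` and the keys `data` and `parity` are all invented for this example; no existing tool is known to read such a file.

```python
import configparser

# Hypothetical config format; section and key names are assumptions.
EXAMPLE = """
[splitraid0]
data = /dev/sdb1, /dev/sdc1, /dev/sdd1
parity = /dev/sde1
"""


def load_arrays(text: str) -> dict:
    """Return {array name: {"data": [devices], "parity": device}}."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    arrays = {}
    for name in parser.sections():
        data = [d.strip() for d in parser[name]["data"].split(",") if d.strip()]
        parity = parser[name]["parity"].strip()
        if parity in data:
            raise ValueError(f"{name}: parity device is also listed as data")
        arrays[name] = {"data": data, "parity": parity}
    return arrays
```

Because nothing is written to the member devices, such a file is the only record of the array membership, so losing or mis-editing it would have to be handled outside the devices themselves.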
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-24 17:28 ` Anshuman Aggarwal
@ 2014-11-24 18:10 ` Valdis.Kletnieks at vt.edu
2014-11-25 4:56 ` Greg Freemyer
1 sibling, 0 replies; 44+ messages in thread
From: Valdis.Kletnieks at vt.edu @ 2014-11-24 18:10 UTC (permalink / raw)
To: kernelnewbies

On Mon, 24 Nov 2014 22:58:08 +0530, Anshuman Aggarwal said:

> prevents it from directly recognized by file system code . I was
> wondering if Split RAID block devices can be made to be unaware to the
> RAID scheme on top and be fully mountable and usable without the raid
> drivers (of course invalidating the parity if any of them are written

Well, there are two basic cases:

1) You have one device and you're adding a parity device - which is basically just creating a raid-1 mirror when you get down to it.

2) You have some collection of devices in a stripe/concat/whatever, and are adding a parity device. This only works if the existing stripe/concat is already functional *without* the parity device (which implies that said stripe or concat has to be an already-supported structure).
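Case 1 above can be checked with a few lines, assuming the parity in question is plain XOR parity: with a single data device there is nothing to XOR against, so the parity device ends up as a byte-for-byte copy. The function name `parity_of` is hypothetical.

```python
def parity_of(devices: list) -> bytes:
    """XOR parity across equal-length device images (bytes objects)."""
    if not devices:
        raise ValueError("need at least one device")
    size = len(devices[0])
    parity = bytearray(size)  # starts as all zeros
    for dev in devices:
        if len(dev) != size:
            raise ValueError("devices must have the same size")
        for i, byte in enumerate(dev):
            parity[i] ^= byte
    return bytes(parity)
```

With one input, `parity_of([image])` returns `image` unchanged, which is the raid-1 mirror described in case 1.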
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-24 17:28 ` Anshuman Aggarwal
2014-11-24 18:10 ` Valdis.Kletnieks at vt.edu
@ 2014-11-25 4:56 ` Greg Freemyer
2014-11-27 17:50 ` Anshuman Aggarwal
1 sibling, 1 reply; 44+ messages in thread
From: Greg Freemyer @ 2014-11-25 4:56 UTC (permalink / raw)
To: kernelnewbies

On November 24, 2014 12:28:08 PM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer@gmail.com> wrote:
>>
>>
>> On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>>Sandeep,
>>> This isn't exactly RAID4 (only thing in common is a single parity
>>>disk but the data is not striped at all). I did bring it up on the
>>>linux-raid mailing list and have had a short conversation with Neil.
>>>He wasn't too excited about device mapper but didn't indicate why or
>>>why not.
>>
>> If it was early in your proposal it may simply be he didn't understand it.
>>
>> The delayed writes to the parity disk you described would have been tough for device mapper to manage. It doesn't typically maintain its own longer term buffers, so that would have been something that might have given him concern. The only reason you provided was reduced wear and tear for the parity drive.
>>
>> Reduced wear and tear in this case is a red herring. The kernel already buffers writes to the data disk, so no need to separately buffer parity writes.
>
>Fair enough, the delay in buffering for the parity writes is an
>independent issue which can be deferred easily.
>
>>
>>>I would like to have this as a layer for each block device on top of
>>>the original block devices (intercepting write requests to the block
>>>devices and updating the parity disk). Is device mapper the write
>>>interface?
>>
>> I think yes, but dm and md are actually separate.
>> I think of dm as a subset of md, but if you are going to really do this you will need to learn the details better than I know them:
>>
>> https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt
>>
>> You will need to add code to both the dm and md kernel code.
>>
>> I assume you know that both mdraid (mdadm) and lvm userspace tools are used to manage device mapper, so you would have to add user space support to mdraid/lvm as well.
>>
>>> What are the others?
>>
>> Well btrfs as an example incorporates a lot of raid capability into the filesystem. Thus btrfs is a monolithic driver that has consumed much of the dm/md layer. I can't speak to why they are doing that, but I find it troubling. Having monolithic aspects to the kernel has always been something the Linux kernel avoided.
>>
>>> Also if I don't store the metadata on
>>>the block device itself (to allow the block device to be unaware of
>>>the RAID4 on top...how would the kernel be informed of which devices
>>>together form the Split RAID.
>>
>> I don't understand the question.
>
>mdadm typically has a metadata superblock stored on the block device
>which identifies the block device as part of the RAID and typically
>prevents it from directly recognized by file system code . I was
>wondering if Split RAID block devices can be made to be unaware to the
>RAID scheme on top and be fully mountable and usable without the raid
>drivers (of course invalidating the parity if any of them are written
>to). This allows a parity disk to be added to existing block devices
>without having to setup the superblock on the underlying devices.
>
>Hope that is clear now?

Thank you, I knew about the superblock, but didn't realize that was what you were talking about.

Does this address your desire?
https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#mdadm_v3.0_--_Adding_the_Concept_of_User-Space_Managed_External_Metadata_Formats

Fyi: I'm ignorant of any real details and I have not used the above new feature, but it seems to be what you are asking for.

>>
>> I haven't thought through the process, but with mdraid/lvm you would identify the physical drives as under dm control. (mdadm for md, pvcreate for dm). Then configure the split raid setup.
>>
>> Have you gone through the process of creating a raid5 with mdadm. If not at least read a howto about it.
>>
>> https://raid.wiki.kernel.org/index.php/RAID_setup
>
>Actually, I have maintained a RAID5, RAID6 6 disk cluster with mdadm
>for more than a few years and handled multiple failures. I am
>reasonably familiar with md reconstruction too. It is the performance
>oriented but disk intensive nature of mdadm that I would like to vary
>on for a home media server.
>
>>
>> I assume you would have mdadm form your multi-disk split raid volume composed of all the physical disks, then use lvm commands to define the block range on the the first drive as a lv (logical volume). Same for the other data drives.
>>
>> Then use mkfs to put a filesystem on each lv.
>
>Maybe it can also be done via md raid creating a partitionable array
>where each partition corresponds to an underlying block device without
>any striping.
>

I think I agree.

>>
>> The filesystem has no knowledge there is a split raid below it. It simply reads/writes to the overall, device mapper is layered below it and triggers the required i/o calls.
>>
>> Ie. For a read, it is a straight passthrough. For a write, the old data and old parity have to be read in, modified, written out. Device mapper does this now for raid 4/5/6, so most of the code is in place.
>
>Exactly. Reads are passthrough, writes lead to the parity write being triggered.
>Only remaining concern for me is that the md super block
>will require block device to be initialized using mdadm. That can be
>acceptable I suppose, but an ideal solution would be able to use
>existing block devices (which would be untouched)...put passthrough
>block device on top of them and manage the parity updation on the
>parity block device. The information about which block devices
>comprise the array can be stored in a config file etc and does not
>need a superblock as badly as a raid setup.

Hopefully the new user space feature does just that.

Greg
--
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-25 4:56 ` Greg Freemyer
@ 2014-11-27 17:50 ` Anshuman Aggarwal
2014-11-27 18:31 ` Greg Freemyer
0 siblings, 1 reply; 44+ messages in thread
From: Anshuman Aggarwal @ 2014-11-27 17:50 UTC (permalink / raw)
To: kernelnewbies

On 25 November 2014 at 10:26, Greg Freemyer <greg.freemyer@gmail.com> wrote:
>
>
> On November 24, 2014 12:28:08 PM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer@gmail.com> wrote:
>>>
>>>
>>> On November 24, 2014 1:48:48 AM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>>>Sandeep,
>>>> This isn't exactly RAID4 (only thing in common is a single parity
>>>>disk but the data is not striped at all). I did bring it up on the
>>>>linux-raid mailing list and have had a short conversation with Neil.
>>>>He wasn't too excited about device mapper but didn't indicate why or
>>>>why not.
>>>
>>> If it was early in your proposal it may simply be he didn't understand it.
>>>
>>> The delayed writes to the parity disk you described would have been tough for device mapper to manage. It doesn't typically maintain its own longer term buffers, so that would have been something that might have given him concern. The only reason you provided was reduced wear and tear for the parity drive.
>>>
>>> Reduced wear and tear in this case is a red herring. The kernel already buffers writes to the data disk, so no need to separately buffer parity writes.
>>
>>Fair enough, the delay in buffering for the parity writes is an
>>independent issue which can be deferred easily.
>>
>>>
>>>>I would like to have this as a layer for each block device on top of
>>>>the original block devices (intercepting write requests to the block
>>>>devices and updating the parity disk). Is device mapper the write
>>>>interface?
>>>
>>> I think yes, but dm and md are actually separate.
>>> I think of dm as a subset of md, but if you are going to really do this you will need to learn the details better than I know them:
>>>
>>> https://www.kernel.org/doc/Documentation/device-mapper/dm-raid.txt
>>>
>>> You will need to add code to both the dm and md kernel code.
>>>
>>> I assume you know that both mdraid (mdadm) and lvm userspace tools are used to manage device mapper, so you would have to add user space support to mdraid/lvm as well.
>>>
>>>> What are the others?
>>>
>>> Well btrfs as an example incorporates a lot of raid capability into the filesystem. Thus btrfs is a monolithic driver that has consumed much of the dm/md layer. I can't speak to why they are doing that, but I find it troubling. Having monolithic aspects to the kernel has always been something the Linux kernel avoided.
>>>
>>>> Also if I don't store the metadata on
>>>>the block device itself (to allow the block device to be unaware of
>>>>the RAID4 on top...how would the kernel be informed of which devices
>>>>together form the Split RAID.
>>>
>>> I don't understand the question.
>>
>>mdadm typically has a metadata superblock stored on the block device
>>which identifies the block device as part of the RAID and typically
>>prevents it from directly recognized by file system code . I was
>>wondering if Split RAID block devices can be made to be unaware to the
>>RAID scheme on top and be fully mountable and usable without the raid
>>drivers (of course invalidating the parity if any of them are written
>>to). This allows a parity disk to be added to existing block devices
>>without having to setup the superblock on the underlying devices.
>>
>>Hope that is clear now?
>
> Thank you, I knew about the superblock, but didn't realize that was what you were talking about.
>
> Does this address your desire?
>
> https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#mdadm_v3.0_--_Adding_the_Concept_of_User-Space_Managed_External_Metadata_Formats
>
> Fyi: I'm ignorant of any real details and I have not used the above new feature, but it seems to be what you asking for.
>

It doesn't seem to, because it appears that the unified container would still need to be created before putting any data on the device. Ideally, the split raid can be added as an afterthought by just adding a parity disk (block device) to an existing set of disks (block devices).

>>>
>>> I haven't thought through the process, but with mdraid/lvm you would identify the physical drives as under dm control. (mdadm for md, pvcreate for dm). Then configure the split raid setup.
>>>
>>> Have you gone through the process of creating a raid5 with mdadm. If not at least read a howto about it.
>>>
>>> https://raid.wiki.kernel.org/index.php/RAID_setup
>>
>>Actually, I have maintained a RAID5, RAID6 6 disk cluster with mdadm
>>for more than a few years and handled multiple failures. I am
>>reasonably familiar with md reconstruction too. It is the performance
>>oriented but disk intensive nature of mdadm that I would like to vary
>>on for a home media server.
>>
>>>
>>> I assume you would have mdadm form your multi-disk split raid volume composed of all the physical disks, then use lvm commands to define the block range on the the first drive as a lv (logical volume). Same for the other data drives.
>>>
>>> Then use mkfs to put a filesystem on each lv.
>>
>>Maybe it can also be done via md raid creating a partitionable array
>>where each partition corresponds to an underlying block device without
>>any striping.
>>
>
> I think I agree.
>
>>>
>>> The filesystem has no knowledge there is a split raid below it. It simply reads/writes to the overall, device mapper is layered below it and triggers the required i/o calls.
>>>
>>> Ie. For a read, it is a straight passthrough.
>>> For a write, the old data and old parity have to be read in, modified, written out. Device mapper does this now for raid 4/5/6, so most of the code is in place.
>>
>>Exactly. Reads are passthrough, writes lead to the parity write being
>>triggered. Only remaining concern for me is that the md super block
>>will require block device to be initialized using mdadm. That can be
>>acceptable I suppose, but an ideal solution would be able to use
>>existing block devices (which would be untouched)...put passthrough
>>block device on top of them and manage the parity updation on the
>>parity block device. The information about which block devices
>>comprise the array can be stored in a config file etc and does not
>>need a superblock as badly as a raid setup.
>
> Hopefully the new user space feature does just that.
>
> Greg

Although the user space feature doesn't seem to do that, Neil has suggested a way to use RAID-4 to create a split-raid-like array. I will post on this mailing list if it succeeds.

>
> --
> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
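The "parity added as an afterthought" idea in this message can be sketched as follows, assuming XOR parity and equally sized devices modelled as `bytes` objects. The data devices are only read when the parity is first built, and any single lost device can later be rebuilt from the parity plus the survivors. All function names are hypothetical, and this says nothing about how Neil's RAID-4 suggestion works.

```python
from functools import reduce
from typing import List, Optional


def xor_all(images: List[bytes]) -> bytes:
    """XOR a list of equal-length byte strings column by column."""
    if not images:
        raise ValueError("need at least one image")
    if len({len(img) for img in images}) != 1:
        raise ValueError("all images must have the same length")
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*images))


def build_parity(data_disks: List[bytes]) -> bytes:
    """Compute parity from disks that already hold data; they are only read."""
    return xor_all(data_disks)


def rebuild(data_disks: List[Optional[bytes]], parity: bytes) -> bytes:
    """Rebuild the single missing disk (marked None) from parity and survivors."""
    missing = [i for i, d in enumerate(data_disks) if d is None]
    if len(missing) != 1:
        raise ValueError("exactly one disk must be missing")
    survivors = [d for d in data_disks if d is not None]
    return xor_all(survivors + [parity])
```

If two data devices are lost, `rebuild` refuses to run, but the remaining devices still hold their own intact filesystems, which is the incremental-loss property the proposal relies on.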
* Split RAID: Proposal for archival RAID using incremental batch checksum
2014-11-27 17:50 ` Anshuman Aggarwal
@ 2014-11-27 18:31 ` Greg Freemyer
0 siblings, 0 replies; 44+ messages in thread
From: Greg Freemyer @ 2014-11-27 18:31 UTC (permalink / raw)
To: kernelnewbies

On Thu, Nov 27, 2014 at 12:50 PM, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
> On 25 November 2014 at 10:26, Greg Freemyer <greg.freemyer@gmail.com> wrote:
>>
>>
>> On November 24, 2014 12:28:08 PM EST, Anshuman Aggarwal <anshuman.aggarwal@gmail.com> wrote:
>>>On 24 November 2014 at 18:49, Greg Freemyer <greg.freemyer@gmail.com> wrote:

<snip>

>>>>> Also if I don't store the metadata on
>>>>>the block device itself (to allow the block device to be unaware of
>>>>>the RAID4 on top...how would the kernel be informed of which devices
>>>>>together form the Split RAID.
>>>>
>>>> I don't understand the question.
>>>
>>>mdadm typically has a metadata superblock stored on the block device
>>>which identifies the block device as part of the RAID and typically
>>>prevents it from directly recognized by file system code . I was
>>>wondering if Split RAID block devices can be made to be unaware to the
>>>RAID scheme on top and be fully mountable and usable without the raid
>>>drivers (of course invalidating the parity if any of them are written
>>>to). This allows a parity disk to be added to existing block devices
>>>without having to setup the superblock on the underlying devices.
>>>
>>>Hope that is clear now?
>>
>> Thank you, I knew about the superblock, but didn't realize that was what you were talking about.
>>
>> Does this address your desire?
>>
>> https://raid.wiki.kernel.org/index.php/RAID_superblock_formats#mdadm_v3.0_--_Adding_the_Concept_of_User-Space_Managed_External_Metadata_Formats
>>
>> Fyi: I'm ignorant of any real details and I have not used the above new feature, but it seems to be what you asking for.
>>
>
> It doesn't seem to because it appears that the unified container would
> still need to be the created before putting any data on the device.
> Ideally, the split raid can be added as an after thought by just
> adding a parity disk (block device) to an existing set of disks (block
> devices)

So what precisely does "creating a container" really do? I.e., have you run strace on "mdadm --create --verbose /dev/md/imsm /dev/sd[b-g] --raid-devices 4 --metadata=imsm"?

I'm assuming that for your use case /etc/ could hold a metadata file that defines a container and then a second metadata file that defines the splitRAID setup.

>>>>
>>>> The filesystem has no knowledge there is a split raid below it. It simply reads/writes to the overall, device mapper is layered below it and triggers the required i/o calls.
>>>>
>>>> Ie. For a read, it is a straight passthrough. For a write, the old data and old parity have to be read in, modified, written out. Device mapper does this now for raid 4/5/6, so most of the code is in place.
>>>
>>>Exactly. Reads are passthrough, writes lead to the parity write being
>>>triggered. Only remaining concern for me is that the md super block
>>>will require block device to be initialized using mdadm. That can be
>>>acceptable I suppose, but an ideal solution would be able to use
>>>existing block devices (which would be untouched)...put passthrough
>>>block device on top of them and manage the parity updation on the
>>>parity block device. The information about which block devices
>>>comprise the array can be stored in a config file etc and does not
>>>need a superblock as badly as a raid setup.
>>
>> Hopefully the new user space feature does just that.
>>
>> Greg
>
> Although the user space feature doesn't seem to, Neil has suggested a
> way to try out using RAID-4 in a manner so as to create a split raid
> like array. Will post on this mailing list if it succeeds.

I've used a hardware raid setup with raid-1 that did what you want.
If needed, you could pull out a drive, connect it straight to another computer, and everything just worked (except mirroring).

Since you're working with Neil you have the expert on the case, but don't forget that most drives have unused space between sector 1 and the start of the first partition. I.e., traditionally sectors 1-62 were unused/blank. Newer systems start the first partition at sector 2048, so sectors 1-2047 are blank. I don't recall off-hand which sectors a GPT setup uses, but I assume you can find an area that is rarely used.

Greg

>> --
>> Sent from my Android phone with K-9 Mail. Please excuse my brevity.
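For a sense of how much room the pre-partition gap mentioned above offers, here is a small calculation. It assumes 512-byte sectors and an MBR-style layout where sector 0 holds the partition table; drives with 4K native sectors and GPT layouts are not covered. The helper name `gap_bytes` is hypothetical.

```python
SECTOR_SIZE = 512  # bytes; an assumption, 4K-native drives differ


def gap_bytes(first_partition_sector: int, sector_size: int = SECTOR_SIZE) -> int:
    """Bytes in sectors 1 .. first_partition_sector - 1 (sector 0 is excluded)."""
    if first_partition_sector < 1:
        raise ValueError("first partition cannot start before sector 1")
    return (first_partition_sector - 1) * sector_size


# Traditional layout: first partition at sector 63, so sectors 1-62 are free.
print(gap_bytes(63))    # 31744 bytes (31 KiB)
# Newer layout: first partition at sector 2048, so sectors 1-2047 are free.
print(gap_bytes(2048))  # 1048064 bytes (just under 1 MiB)
```

Either gap is small compared with a disk but comfortably larger than a few kilobytes of array metadata, provided nothing else (a boot loader, for instance) already uses those sectors.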
end of thread, other threads:[~2015-01-06 11:40 UTC | newest]
Thread overview: 44+ messages
2014-10-29 7:15 Split RAID: Proposal for archival RAID using incremental batch checksum Anshuman Aggarwal
2014-10-29 7:32 ` Roman Mamedov
2014-10-29 8:31 ` Anshuman Aggarwal
2014-10-29 9:05 ` NeilBrown
2014-10-29 9:25 ` Anshuman Aggarwal
2014-10-29 19:27 ` Ethan Wilson
2014-10-30 14:57 ` Anshuman Aggarwal
2014-10-30 17:25 ` Piergiorgio Sartor
2014-10-31 11:05 ` Anshuman Aggarwal
2014-10-31 14:25 ` Matt Garman
2014-11-01 12:55 ` Piergiorgio Sartor
2014-11-06 2:29 ` Anshuman Aggarwal
2014-10-30 15:00 ` Anshuman Aggarwal
2014-11-03 5:52 ` NeilBrown
2014-11-03 18:04 ` Piergiorgio Sartor
2014-11-06 2:24 ` Anshuman Aggarwal
2014-11-24 7:29 ` Anshuman Aggarwal
2014-11-24 22:50 ` NeilBrown
2014-11-26 6:24 ` Anshuman Aggarwal
2014-12-01 16:00 ` Anshuman Aggarwal
2014-12-01 16:34 ` Anshuman Aggarwal
2014-12-01 21:46 ` NeilBrown
2014-12-02 11:56 ` Anshuman Aggarwal
2014-12-16 16:25 ` Anshuman Aggarwal
2014-12-16 21:49 ` NeilBrown
2014-12-17 6:40 ` Anshuman Aggarwal
2015-01-06 11:40 ` Anshuman Aggarwal
[not found] ` <CAJvUf-BktH_E6jb5d94VuMVEBf_Be4i_8u_kBYU52Df1cu0gmg@mail.gmail.com>
2014-11-01 5:36 ` Anshuman Aggarwal
-- strict thread matches above, loose matches on Subject: below --
2014-11-21 10:15 Anshuman Aggarwal
2014-11-21 11:41 ` Greg Freemyer
2014-11-21 18:48 ` Anshuman Aggarwal
2014-11-22 13:17 ` Greg Freemyer
2014-11-22 13:22 ` Anshuman Aggarwal
2014-11-22 14:03 ` Greg Freemyer
2014-11-22 14:43 ` Anshuman Aggarwal
2014-11-22 14:54 ` Greg Freemyer
2014-11-24 5:36 ` SandeepKsinha
2014-11-24 6:48 ` Anshuman Aggarwal
2014-11-24 13:19 ` Greg Freemyer
2014-11-24 17:28 ` Anshuman Aggarwal
2014-11-24 18:10 ` Valdis.Kletnieks at vt.edu
2014-11-25 4:56 ` Greg Freemyer
2014-11-27 17:50 ` Anshuman Aggarwal
2014-11-27 18:31 ` Greg Freemyer