* Any benefit to write intent bitmaps on Raid1
From: Steven Ellis @ 2009-04-09  0:24 UTC
To: Linux RAID

Given I have a pair of 1TB drives in RAID1, I'd prefer to reduce any recovery
sync time. Would an internal bitmap help dramatically, and are there any
other benefits?

Steve

--------------------------------------------
Steven Ellis - Technical Director
OpenMedia Limited - The Home of myPVR
email   - steven@openmedia.co.nz
website - http://www.openmedia.co.nz
* Re: Any benefit to write intent bitmaps on Raid1
From: Bryan Mesich @ 2009-04-09  1:30 UTC
To: Steven Ellis; Cc: Linux RAID

On Thu, Apr 09, 2009 at 12:24:05PM +1200, Steven Ellis wrote:

> Given I have a pair of 1TB drives in RAID1, I'd prefer to reduce any
> recovery sync time. Would an internal bitmap help dramatically, and are
> there any other benefits?

If one of your drives goes pear shaped and needs to be replaced, then no, a
write intent bitmap will not help you. When you replace the drive, the
incoming drive will need to do a full resync.

Many times a read/write error causes the drive to be failed. In this case,
the bad block that caused the read/write error should get re-mapped by the
drive's firmware. If you have a write intent bitmap enabled, re-adding the
same drive only resyncs the out-of-sync regions (the bitmap chunks marked
dirty), not the whole array.

There is some overhead when using a write intent bitmap, as the bitmap needs
to be updated as data is written to the device. For most people the overhead
is not noticeable. If you really need performance, the bitmap can be moved to
another disk (or disks) that is not a member of the RAID1 array in question.

I've used write intent bitmaps many times in a SAN environment in which FC
initiators mirror 2 block devices. Both block devices come from different FC
targets. This makes maintenance much easier, since all we have to do is
break the RAID1 mirror on the initiator (we also have good uptime :). A
write intent bitmap speeds the re-syncing process up, since we only resync
the out-of-sync data.

Bryan
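[Editor's sketch] A minimal toy model of the mechanism Bryan describes, written in Python purely for illustration; the chunk size, array size and method names below are assumptions, not md's real on-disk bitmap format. A write marks the covering bitmap chunk dirty before the data goes out, the bit is cleared once both mirror legs have the data, and a re-added (not replaced) drive only needs the dirty chunks copied.

# Toy model of an md-style write-intent bitmap on a two-disk mirror.
CHUNK = 64 * 1024 * 1024          # assume one bit covers 64 MiB of the array
ARRAY_SIZE = 1 * 10**12           # ~1 TB mirror

class Mirror:
    def __init__(self):
        nchunks = (ARRAY_SIZE + CHUNK - 1) // CHUNK
        self.dirty = [False] * nchunks    # the write-intent bitmap
        self.detached = False             # one leg temporarily missing

    def write(self, offset, length):
        chunks = range(offset // CHUNK, (offset + length - 1) // CHUNK + 1)
        # 1. set the bit(s) covering the write *before* issuing the write
        for c in chunks:
            self.dirty[c] = True
        # 2. write to both legs (or just the surviving leg if one is detached)
        # 3. once both legs acknowledge, the bits may be cleared lazily
        if not self.detached:
            for c in chunks:
                self.dirty[c] = False

    def re_add(self):
        # Re-adding the *same* disk: only dirty chunks need copying.
        todo = [i for i, d in enumerate(self.dirty) if d]
        print(f"resyncing {len(todo)} chunks "
              f"({len(todo) * CHUNK // 2**20} MiB) instead of the full 1 TB")

m = Mirror()
m.detached = True                 # one leg drops out
m.write(0, 4096)                  # a few writes while degraded
m.write(500 * 10**9, 1 << 20)
m.re_add()                        # fast: only 2 chunks to resync

Replacing a failed disk with a brand-new one bypasses this entirely, which is why both Bryan and Neil note that a bitmap makes no difference in that case.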
* Re: Any benefit to write intent bitmaps on Raid1
From: Neil Brown @ 2009-04-09  5:59 UTC
To: Steven Ellis; Cc: Linux RAID

On Thursday April 9, steven@openmedia.co.nz wrote:

> Given I have a pair of 1TB drives in RAID1, I'd prefer to reduce any
> recovery sync time. Would an internal bitmap help dramatically, and are
> there any other benefits?

Bryan answered some of this but...

 - If your machine crashes, then resync will be much faster if you
   have a bitmap.
 - If one drive becomes disconnected, and then can be reconnected,
   recovery will be much faster.
 - If one drive fails and has to be replaced, a bitmap makes no
   difference(*).
 - There might be a performance hit - it is very dependent on your
   workload.
 - You can add or remove a bitmap at any time, so you can try to
   measure the impact on your particular workload fairly easily.

(*) I've been wondering about adding another bitmap which would record
which sections of the array have valid data. Initially nothing would
be valid and so wouldn't need recovery. Every time we write to a new
section we add that section to the 'valid' sections and make sure that
section is in-sync.
When a device is replaced, we would only need to recover the parts of
the array that are known to contain valid data.
As filesystems start using the new "invalidate" command for block
devices, we could clear bits for sections that the filesystem says are
not needed any more...
But currently it is just a vague idea.

NeilBrown
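[Editor's sketch] A rough Python illustration of Neil's "(*)" idea above, assuming a simple in-memory bitmap; none of these names exist in md. A second bitmap records which sections have ever held data, the first write to a section marks it valid (and syncs it), a filesystem "invalidate"/discard clears the bit, and rebuilding a replacement disk only touches valid sections.

# Toy model of the proposed "valid data" bitmap (all names assumed).
SECTION = 64 * 1024 * 1024                   # granularity of one bit
NSECTIONS = (10**12 + SECTION - 1) // SECTION

valid = [False] * NSECTIONS                  # nothing valid on a fresh array

def write(offset, length):
    for s in range(offset // SECTION, (offset + length - 1) // SECTION + 1):
        if not valid[s]:
            valid[s] = True                  # first write: mark the section valid
            # ... and kick off a (low-priority) resync of just this section ...

def discard(offset, length):
    # Filesystem says this range is unused again; clear only sections that
    # are fully covered by the discarded range.
    first = (offset + SECTION - 1) // SECTION
    last = (offset + length) // SECTION
    for s in range(first, last):
        valid[s] = False

def rebuild_replacement_disk():
    todo = sum(valid)
    print(f"recovering {todo} of {NSECTIONS} sections "
          f"(~{100.0 * todo / NSECTIONS:.1f}% of the array)")

write(0, 10 * 2**30)                         # 10 GiB of data written so far
discard(5 * 2**30, 2**30)                    # 1 GiB later trimmed
rebuild_replacement_disk()                   # only ~144 sections to recover

On a mostly-empty array this turns a full-device rebuild into a recovery of only the handful of sections that ever held data.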
* Re: Any benefit to write intent bitmaps on Raid1
From: Goswin von Brederlow @ 2009-04-09  6:26 UTC
To: Neil Brown; Cc: Steven Ellis, Linux RAID

Neil Brown <neilb@suse.de> writes:

> (*) I've been wondering about adding another bitmap which would record
> which sections of the array have valid data. Initially nothing would
> be valid and so wouldn't need recovery. Every time we write to a new
> section we add that section to the 'valid' sections and make sure that
> section is in-sync.
> When a device is replaced, we would only need to recover the parts of
> the array that are known to contain valid data.
> As filesystems start using the new "invalidate" command for block
> devices, we could clear bits for sections that the filesystem says are
> not needed any more...
> But currently it is just a vague idea.
>
> NeilBrown

If you are up for experimenting, I would go for a completely new
approach. Instead of working with physical blocks and marking where
blocks are used and out of sync, how about adding a mapping layer on
the device and using virtual blocks? You reduce the reported disk size
by maybe 1% so there are always some spare blocks, and initially all
blocks are unmapped (unused). Whenever there is a write, you pick an
unused block, write to it, and change the in-memory mapping of the
logical block to the physical block. Every X seconds, on a barrier or
a sync, you commit the mapping from memory to disk in such a way that
it is synchronized between all disks in the raid. Every committed
mapping then represents a valid raid set. After a commit, all blocks
superseded since the previous mapping can be marked as free again -
better, free them against the second-to-last mapping, so there are
always two valid mappings to choose from after a crash.

This would obviously need a lot more space than a bitmap, but space is
(relatively) cheap. One benefit, imho, is that sync/barrier would not
have to stop all activity on the raid and wait for the sync/barrier to
finish. It just has to finalize the mapping for the commit and can
then start a new in-memory mapping while the finalized one is written
to disk.

Just some thoughts,
        Goswin
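[Editor's sketch] A toy Python version of Goswin's proposal, under the assumption of a purely in-memory structure with no on-disk mapping format and no real devices. Writes always go to freshly allocated physical blocks, a periodic commit snapshots the logical-to-physical map, and the last two committed maps are retained so a crash can always fall back to a consistent one.

# Toy copy-on-write remapping layer (all structures and sizes are assumptions,
# not a real md or device-mapper format).

class RemapDevice:
    def __init__(self, phys_blocks, reserve=0.01):
        # Advertise ~1% less than the physical size so spare blocks always exist.
        self.logical_blocks = int(phys_blocks * (1 - reserve))
        self.free = set(range(phys_blocks))   # unused physical blocks
        self.mapping = {}                     # logical -> physical, in memory
        self.committed = []                   # the last two committed mappings
        self.superseded = set()               # old blocks not yet safe to reuse

    def write(self, logical, data):
        phys = self.free.pop()                # every write goes to a fresh block
        # ... write `data` at `phys` on every disk in the set ...
        old = self.mapping.get(logical)
        if old is not None:
            self.superseded.add(old)          # reusable only two commits later
        self.mapping[logical] = phys

    def commit(self):
        # Flush the current mapping to all disks; each committed mapping is a
        # consistent raid set, so a crash can always fall back to one of them.
        self.committed = (self.committed + [dict(self.mapping)])[-2:]
        referenced = set(self.mapping.values())
        for m in self.committed:
            referenced |= set(m.values())
        reusable = self.superseded - referenced
        self.free |= reusable
        self.superseded -= reusable

dev = RemapDevice(phys_blocks=1000)
dev.write(0, b"hello"); dev.commit()
dev.write(0, b"world"); dev.commit()          # logical 0 now points elsewhere
dev.commit()                                  # the original block becomes free again

Neil's objection below about the size of such a mapping applies directly: a per-4K-block table like the toy dict above would be far too large to hold in memory for a real 1 TB device.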
* Re: Any benefit to write intent bitmaps on Raid1
From: Neil Brown @ 2009-04-10  9:04 UTC
To: Goswin von Brederlow; Cc: Steven Ellis, Linux RAID

On Thursday April 9, goswin-v-b@web.de wrote:
> Neil Brown <neilb@suse.de> writes:
>
> > (*) I've been wondering about adding another bitmap which would record
> > which sections of the array have valid data. [...]
>
> If you are up for experimenting, I would go for a completely new
> approach. Instead of working with physical blocks and marking where
> blocks are used and out of sync, how about adding a mapping layer on
> the device and using virtual blocks? [...]
>
> This would obviously need a lot more space than a bitmap, but space is
> (relatively) cheap. One benefit, imho, is that sync/barrier would not
> have to stop all activity on the raid and wait for the sync/barrier to
> finish. It just has to finalize the mapping for the commit and can
> then start a new in-memory mapping while the finalized one is written
> to disk.

While there is obviously real value in this functionality, I can't
help thinking that it belongs in the file system, not the block
device.

But then I've always seen logical volume management as an interim hack
until filesystems were able to span multiple volumes in a sensible
way. As time goes on it seems less and less 'interim'.

I may well implement a filesystem that has this sort of
functionality. I'm very unlikely to implement it in the md layer.
But you never know what will happen...

Thanks for the thoughts.

NeilBrown
* Re: Any benefit to write intent bitmaps on Raid1
From: Goswin von Brederlow @ 2009-04-11  2:56 UTC
To: Neil Brown; Cc: Steven Ellis, Linux RAID

Neil Brown <neilb@suse.de> writes:

> On Thursday April 9, goswin-v-b@web.de wrote:
> > If you are up for experimenting, I would go for a completely new
> > approach. Instead of working with physical blocks and marking where
> > blocks are used and out of sync, how about adding a mapping layer on
> > the device and using virtual blocks? [...]
>
> While there is obviously real value in this functionality, I can't
> help thinking that it belongs in the file system, not the block
> device.

I believe it is the only way to actually remove the race conditions
inherent in software raid, and there are some uses that don't work well
with a filesystem. E.g. creating a filesystem holding only a swapfile,
instead of swapping to the raid device directly, seems a bit stupid.
The same goes for databases that use block devices.

> But then I've always seen logical volume management as an interim hack
> until filesystems were able to span multiple volumes in a sensible
> way. As time goes on it seems less and less 'interim'.
>
> I may well implement a filesystem that has this sort of
> functionality. I'm very unlikely to implement it in the md layer.
> But you never know what will happen...

ZFS already does this. Btrfs does it, but only with raid1. I find,
though, that ZFS doesn't really integrate the two: it just has the raid
and filesystem layers in a single binary, but still as two separate
layers. That makes changing the layout inflexible - e.g. you can't grow
from 4 to 5 disks per stripe.

MfG
        Goswin
* Re: Any benefit to write intent bitmaps on Raid1
From: Neil Brown @ 2009-04-11  5:35 UTC
To: Goswin von Brederlow; Cc: Steven Ellis, Linux RAID

On Saturday April 11, goswin-v-b@web.de wrote:
> Neil Brown <neilb@suse.de> writes:
>
> > While there is obviously real value in this functionality, I can't
> > help thinking that it belongs in the file system, not the block
> > device.
>
> I believe it is the only way to actually remove the race conditions
> inherent in software raid, and there are some uses that don't work well
> with a filesystem. E.g. creating a filesystem holding only a swapfile,
> instead of swapping to the raid device directly, seems a bit stupid.
> The same goes for databases that use block devices.

I agree that it would remove some races, make resync unnecessary, and
thus remove the small risk of data loss when a system with a degraded
raid5 crashes. I doubt it is the only way, and it may not even be a
good way, though I'm not certain.

Your mapping of logical to physical blocks - it would technically need
to map each sector independently, but let's be generous (and fairly
realistic) and map each 4K block independently.
Then with a 1TB device, you have 2**28 entries in the table, each 4
bytes, so 2**30 bytes, or 1 gigabyte.
You suggest this table is kept in memory. While memory is cheap, I
don't think it is that cheap yet.
So you would need to make compromises: either not keeping it all in
memory, or having larger block sizes (and so needing to pre-read for
updates), or having a more complicated data structure. Or, more
likely, all of the above.
You could make it work, but there would be a performance hit.

Now look at your cases where a filesystem doesn't work well:
 1/ Swap. That is a non-issue. After a crash, the contents of swap
    are irrelevant. Without a crash, the races you refer to are
    irrelevant.
 2/ Databases that use block devices directly. Why do they use the
    block device directly rather than using O_DIRECT to a
    pre-allocated file? Because they believe that the filesystem
    introduces a performance penalty. What reason is there to believe
    that the performance penalty of your remapped raid would
    necessarily be less than that of a filesystem? I cannot see one.

BTW, an alternative approach to closing those races (assuming that I
am understanding you correctly) is to journal all updates to a
separate device - possibly an SSD or battery-backed RAM. That could
have the added benefit of reducing latency, though it may impact
throughput. I'm not sure that is an approach with a real future
either, but it is a valid alternative.

> ZFS already does this. Btrfs does it, but only with raid1. I find,
> though, that ZFS doesn't really integrate the two: it just has the raid
> and filesystem layers in a single binary, but still as two separate
> layers. That makes changing the layout inflexible - e.g. you can't grow
> from 4 to 5 disks per stripe.

I thought ZFS was more integrated than that, but I haven't looked
deeply.
My vague notion was that when ZFS wanted to write "some data" it would
break it into sets of N blocks, calculate a parity block for each N,
then write those N+1 blocks to N+1 different devices, wherever there
happened to be unused space. Then the addresses of those N+1 blocks
would be stored in the file metadata, which would be written in a
similar way, possibly with a different (smaller) N.

This idea (which might be completely wrong) implies very tight
integration between the layers.

With this setup you could conceivably change the default N at any
time. Old data wouldn't be relocated, but new writes would be written
with the new N. If you have a background defragmentation process, it
could, over a period of time, arrange for the whole filesystem to be
re-laid out with the new N.
Clearly data would still be recoverable after a single drive failure.

The problem I see with this approach is the cost of recovering to a
hot-spare after device failure. Finding which blocks need to be
written where would require scanning all the metadata on the entire
filesystem. And much of this would not be contiguous, so much seeking
would be involved. I wouldn't be surprised if recovering a device in a
nearly-full filesystem took an order of magnitude longer with that
approach than with md-style raid.

Given that observation: maybe I am wrong about RAID-Z. However it is
the only model I can come up with that matches the various snippets I
have heard about it.

(hmm... maybe a secondary indexing scheme could help... might get it
down to taking only twice as long, which could be acceptable ....
maybe I will try implementing that after all and see how it
works... in my spare time)

NeilBrown
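[Editor's sketch] A quick check of Neil's arithmetic above, plus the effect of the "larger block sizes" compromise he mentions. The 4-byte-per-entry flat table is an illustrative assumption taken from the mail, not a real on-disk format.

# Size of a flat logical->physical mapping table for a 1 TiB device,
# assuming one 4-byte entry per mapped block.
DEVICE = 2**40          # 1 TiB

for block_size in (512, 4096, 65536, 1048576):
    entries = DEVICE // block_size
    table_mib = entries * 4 / 2**20
    print(f"block {block_size:>8} B: {entries:>12} entries, "
          f"table {table_mib:>8.1f} MiB")

At 4 KiB blocks this reproduces the 1 GiB figure (2**28 entries x 4 bytes); per-sector mapping would be 8 GiB, and even 64 KiB blocks still cost 64 MiB of table per terabyte before counting the multiple on-disk copies a commit scheme needs.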
* Re: Any benefit to write intent bitmaps on Raid1
From: Goswin von Brederlow @ 2009-04-11  8:46 UTC
To: Neil Brown; Cc: Steven Ellis, Linux RAID

Neil Brown <neilb@suse.de> writes:

> On Saturday April 11, goswin-v-b@web.de wrote:
> > I believe it is the only way to actually remove the race conditions
> > inherent in software raid, and there are some uses that don't work
> > well with a filesystem. [...]
>
> I agree that it would remove some races, make resync unnecessary, and
> thus remove the small risk of data loss when a system with a degraded
> raid5 crashes. I doubt it is the only way, and it may not even be a
> good way, though I'm not certain.

Ok, not the only way. You could have a journal where you first write
which block is to be updated and with what data, sync, and then write
the data to the actual block. After a crash the journal could just be
replayed.

> Your mapping of logical to physical blocks - it would technically need
> to map each sector independently, but let's be generous (and fairly
> realistic) and map each 4K block independently.
> Then with a 1TB device, you have 2**28 entries in the table, each 4
> bytes, so 2**30 bytes, or 1 gigabyte.
> You suggest this table is kept in memory. While memory is cheap, I
> don't think it is that cheap yet.
> So you would need to make compromises: either not keeping it all in
> memory, or having larger block sizes (and so needing to pre-read for
> updates), or having a more complicated data structure. Or, more
> likely, all of the above.

Plus, as a plain array, you would have to keep multiple copies of that
1GB on disk. A B-tree where only the used parts are in memory, or
something similar, would really be necessary. Mapping extents instead
of individual blocks would also be useful, as would a defragmenter that
remaps blocks into larger contiguous segments. But now it has become
really complex.

> You could make it work, but there would be a performance hit.
>
> Now look at your cases where a filesystem doesn't work well:
>  1/ Swap. That is a non-issue. After a crash, the contents of swap
>     are irrelevant. Without a crash, the races you refer to are
>     irrelevant.

What about suspend to swap?

>  2/ Databases that use block devices directly. Why do they use the
>     block device directly rather than using O_DIRECT to a
>     pre-allocated file? Because they believe that the filesystem
>     introduces a performance penalty. What reason is there to believe
>     that the performance penalty of your remapped raid would
>     necessarily be less than that of a filesystem? I cannot see one.

You are assuming we could change the DB to use files instead. :)

> BTW, an alternative approach to closing those races (assuming that I
> am understanding you correctly) is to journal all updates to a
> separate device - possibly an SSD or battery-backed RAM. That could
> have the added benefit of reducing latency, though it may impact
> throughput. I'm not sure that is an approach with a real future
> either, but it is a valid alternative.

That is what hardware raids do.

> I thought ZFS was more integrated than that, but I haven't looked
> deeply.
> My vague notion was that when ZFS wanted to write "some data" it would
> break it into sets of N blocks, calculate a parity block for each N,
> then write those N+1 blocks to N+1 different devices, wherever there
> happened to be unused space. Then the addresses of those N+1 blocks
> would be stored in the file metadata, which would be written in a
> similar way, possibly with a different (smaller) N.
>
> This idea (which might be completely wrong) implies very tight
> integration between the layers.

But first you define a storage pool from segments of X devices with a
certain raid level. The higher level then uses virtual addresses into
that pool. If you want to grow your zfs, you have to add new disks and
create a new pool from them. All the docs I've seen didn't mention any
support for changing an existing pool.

> With this setup you could conceivably change the default N at any
> time. Old data wouldn't be relocated, but new writes would be written
> with the new N. If you have a background defragmentation process, it
> could, over a period of time, arrange for the whole filesystem to be
> re-laid out with the new N.

As I understand it, the pool creates a virtual->physical mapping and
the higher layers use the virtual address. By increasing the number of
disks in a pool, all physical addresses would change, just like when
growing a raid, and the higher layers would have to readjust their
addresses. At least that is my understanding.

> Clearly data would still be recoverable after a single drive failure.
>
> The problem I see with this approach is the cost of recovering to a
> hot-spare after device failure. Finding which blocks need to be
> written where would require scanning all the metadata on the entire
> filesystem. And much of this would not be contiguous, so much seeking
> would be involved. I wouldn't be surprised if recovering a device in a
> nearly-full filesystem took an order of magnitude longer with that
> approach than with md-style raid.

One huge improvement comes from splitting data and metadata into
separate segments, thereby keeping the metadata close together. If one
also takes care to write the parent of a metablock before its children,
and defragments them frequently, they should stay pretty linear. And
how much metadata is there in the filesystem? My 4.6TB movie archive
has 30000 inodes in use, so that would be a few MB of metadata. Hardly
relevant. For a news spool it would look different.

> Given that observation: maybe I am wrong about RAID-Z. However it is
> the only model I can come up with that matches the various snippets I
> have heard about it.

The snippets I've read about ZFS lead me to believe that the raid level
is restricted to the pools. So in effect you just have lots of internal
md devices. Resync speed in ZFS should be exactly like normal raid.

MfG
        Goswin
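[Editor's sketch] A minimal Python illustration of the journal alternative that Goswin and Neil both mention above, assuming a toy in-memory journal rather than a real separate device. Each update is recorded (block number plus data) and synced before the in-place writes, so after a crash the journal can simply be replayed to make all mirror legs consistent.

# Toy write-ahead journal for a two-leg mirror (all names and layout assumed).

class JournaledMirror:
    def __init__(self):
        self.legs = [dict(), dict()]   # two mirror legs: block -> data
        self.journal = []              # records already "synced" to stable storage

    def write(self, block, data, crash_after_first_leg=False):
        # 1. Record (block, data) in the journal and sync it first.
        self.journal.append((block, data))
        # 2. Only then update the legs in place.
        self.legs[0][block] = data
        if not crash_after_first_leg:
            self.legs[1][block] = data  # normally both legs get the write

    def replay(self):
        # After a crash, replay the journal onto every leg; re-applying a
        # record that already made it to disk is harmless.
        for block, data in self.journal:
            for leg in self.legs:
                leg[block] = data

m = JournaledMirror()
m.write(42, b"new data", crash_after_first_leg=True)   # legs now disagree
m.replay()                                              # journal repairs the gap
assert m.legs[0][42] == m.legs[1][42] == b"new data"
print("mirrors consistent after journal replay")

Neil's throughput caveat applies: every write is written twice (journal plus in-place), which is why he suggests an SSD or battery-backed RAM for the journal - effectively what hardware RAID controllers with NVRAM do, as Goswin notes.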
* Re: Any benefit to write intent bitmaps on Raid1
From: Bill Davidsen @ 2009-04-11 13:08 UTC
To: Goswin von Brederlow; Cc: Neil Brown, Steven Ellis, Linux RAID

Goswin von Brederlow wrote:
> Neil Brown <neilb@suse.de> writes:
>
> > Now look at your cases where a filesystem doesn't work well:
> >  1/ Swap. That is a non-issue. After a crash, the contents of swap
> >     are irrelevant. Without a crash, the races you refer to are
> >     irrelevant.
>
> What about suspend to swap?

Suspend is a "without a crash" case; I wouldn't want to restore from
swap if the system failed to complete a clean shutdown.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money
back." - Representative Earl Pomeroy, Democrat of North Dakota on the A.I.G.
executives who were paid bonuses after a federal bailout.
* Re: Any benefit to write intent bitmaps on Raid1
From: Bill Davidsen @ 2009-04-09 22:51 UTC
To: Neil Brown; Cc: Steven Ellis, Linux RAID

Neil Brown wrote:
> On Thursday April 9, steven@openmedia.co.nz wrote:
>
> > Given I have a pair of 1TB drives in RAID1, I'd prefer to reduce any
> > recovery sync time. Would an internal bitmap help dramatically, and
> > are there any other benefits?
>
> Bryan answered some of this but...
>
>  - If your machine crashes, then resync will be much faster if you
>    have a bitmap.
>  - If one drive becomes disconnected, and then can be reconnected,
>    recovery will be much faster.
>  - If one drive fails and has to be replaced, a bitmap makes no
>    difference(*).
>  - There might be a performance hit - it is very dependent on your
>    workload.
>  - You can add or remove a bitmap at any time, so you can try to
>    measure the impact on your particular workload fairly easily.
>
> (*) I've been wondering about adding another bitmap which would record
> which sections of the array have valid data. Initially nothing would
> be valid and so wouldn't need recovery. Every time we write to a new
> section we add that section to the 'valid' sections and make sure that
> section is in-sync.
> When a device is replaced, we would only need to recover the parts of
> the array that are known to contain valid data.
> As filesystems start using the new "invalidate" command for block
> devices, we could clear bits for sections that the filesystem says are
> not needed any more...
> But currently it is just a vague idea.

It's obvious that this idea would provide a speedup, and it might be
useful for physical dump software which would just save the "used"
portions of the array. Only you have an idea of how much effort this
would take, although my thought is "very little" for the stable case
and "bunches" for the case of an array size change.

I have been making a COW copy of an entire drive with qemu-img, then
booting it under KVM, and besides giving an interesting slant to the
term "dual boot," I can back up the changes file (a sparse file)
quickly and in little space with a backup tool that knows about sparse
files. There is lots of room to imagine uses for this if we had it.

-- 
bill davidsen <davidsen@tmr.com>
  CTO TMR Associates, Inc

"You are disgraced professional losers. And by the way, give us our money
back." - Representative Earl Pomeroy, Democrat of North Dakota on the A.I.G.
executives who were paid bonuses after a federal bailout.
* Re: Any benefit to write intent bitmaps on Raid1
From: Neil Brown @ 2009-04-10  9:10 UTC
To: Bill Davidsen; Cc: Steven Ellis, Linux RAID

On Thursday April 9, davidsen@tmr.com wrote:
> Neil Brown wrote:
> > (*) I've been wondering about adding another bitmap which would record
> > which sections of the array have valid data. [...]
>
> It's obvious that this idea would provide a speedup, and it might be
> useful for physical dump software which would just save the "used"
> portions of the array. Only you have an idea of how much effort this
> would take, although my thought is "very little" for the stable case
> and "bunches" for the case of an array size change.

The only difficulty I can see with the "size change" case is needing
to find space for a bigger bitmap.
If the space exists, you just copy the bitmap (if needed) and you are
done.
If the space doesn't exist, you change the chunk size (the space
covered per bit) and use the same space.
The rest, I agree, should be fairly easy.

One possible awkwardness is that every time you write to a new segment
which requires setting a new bit, you would need to kick off a resync
for that segment. That could have an adverse and unpredictable effect
on throughput. Of course you don't *need* that resync to complete
until a reboot, so you could do it with very low priority, and it
might be OK.

> I have been making a COW copy of an entire drive with qemu-img, then
> booting it under KVM, and besides giving an interesting slant to the
> term "dual boot," I can back up the changes file (a sparse file)
> quickly and in little space with a backup tool that knows about sparse
> files. There is lots of room to imagine uses for this if we had it.

Interesting ideas...

NeilBrown
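[Editor's sketch] Neil's "change the chunk size and use the same space" option above is easy to quantify. This short Python sketch assumes a fixed on-disk area reserved for the bitmap bits (the 4 KiB figure is illustrative, not md's actual reservation).

# If the bitmap must fit in a fixed on-disk area, growing the array means
# each bit has to cover a larger chunk.
BITMAP_BYTES = 4096                 # assumed space reserved for bitmap bits
BITS = BITMAP_BYTES * 8             # 32768 bits available

def min_chunk(array_bytes):
    # smallest power-of-two chunk so that BITS bits cover the whole array
    chunk = 1 << 20                 # start at 1 MiB
    while chunk * BITS < array_bytes:
        chunk *= 2
    return chunk

for tb in (1, 2, 4, 8):
    size = tb * 2**40
    print(f"{tb} TiB array: chunk = {min_chunk(size) // 2**20} MiB per bit")

Doubling the array size without finding more bitmap space simply doubles the chunk size, so post-crash resyncs get somewhat coarser but the scheme keeps working.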