* Distributed spares
From: Bill Davidsen @ 2008-10-13 21:50 UTC
To: Neil Brown; +Cc: Linux RAID

Over a year ago I mentioned RAID-5e, a RAID-5 with the spare(s) distributed
over multiple drives. This has come up again, so I thought I'd just mention
why, and what advantages it offers.

By spreading the spare over multiple drives, the head motion of normal access
is spread over one (or several) more drives. This reduces seeks, improves
performance, etc. The benefit shrinks as the number of drives in the array
grows; obviously with four drives, using only three for normal operation is
slower than using four. And by using all the drives all the time, the chance
of a spare having gone bad undetected is reduced.

This becomes important as array drive counts shrink. Lower cost for drives
($100/TB!), and attempts to cut power use by running fewer drives, result in
an overall drop in drive count, which matters in serious applications.

All that said, I would really like to bring this up one more time, even if
the answer is "no interest."

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..." Otto von Bismarck

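To make the idea concrete, here is a minimal sketch of one possible rotating
data/parity/spare layout. It is illustrative only: the rotation rule is an
assumption made up for this example, not the layout used by md or by any
actual RAID-5E/5EE implementation.

  # Illustrative only: a made-up rotating layout for an N-device array that
  # carries N-2 data chunks, 1 parity chunk and 1 idle "spare" chunk per
  # stripe, so every drive sees normal I/O and no whole drive sits idle.

  def raid5e_stripe_map(stripe, ndev):
      """Return the role of each device in one stripe."""
      parity = (ndev - 1 - stripe) % ndev   # parity rotates across devices
      spare = (parity + 1) % ndev           # spare chunk trails the parity
      roles, d = [], 0
      for dev in range(ndev):
          if dev == parity:
              roles.append("P")
          elif dev == spare:
              roles.append("S")
          else:
              roles.append("D%d" % d)
              d += 1
      return roles

  if __name__ == "__main__":
      for s in range(6):
          print("stripe %d: %s" % (s, "  ".join(raid5e_stripe_map(s, 5))))

With five drives this gives the same usable capacity as a four-drive RAID-5
plus a dedicated hot spare; the difference is that all five spindles serve
normal reads and writes.
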
* Re: Distributed spares
From: Justin Piszcz @ 2008-10-13 22:11 UTC
To: Bill Davidsen; +Cc: Neil Brown, Linux RAID

On Mon, 13 Oct 2008, Bill Davidsen wrote:

> Over a year ago I mentioned RAID-5e, a RAID-5 with the spare(s) distributed
> over multiple drives. This has come up again, so I thought I'd just mention
> why, and what advantages it offers.
>
> By spreading the spare over multiple drives, the head motion of normal
> access is spread over one (or several) more drives. This reduces seeks,
> improves performance, etc. The benefit shrinks as the number of drives in
> the array grows; obviously with four drives, using only three for normal
> operation is slower than using four. And by using all the drives all the
> time, the chance of a spare having gone bad undetected is reduced.
>
> This becomes important as array drive counts shrink. Lower cost for drives
> ($100/TB!), and attempts to cut power use by running fewer drives, result
> in an overall drop in drive count, which matters in serious applications.
>
> All that said, I would really like to bring this up one more time, even if
> the answer is "no interest."

Bill,

Not a bad idea; however, can the same not be achieved (somewhat) by
performing daily short and weekly long SMART self-tests on the drives to
validate their health? I find this to work fairly well on a large scale.

Justin.

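For reference, the schedule Justin describes is what smartmontools' smartd is
built for. A sketch; device names are placeholders, and whether a -d option
is also needed depends on the controller:

  # /etc/smartd.conf: monitor health, run a short self-test daily at 02:00
  # and a long self-test every Saturday at 03:00, mail root on trouble.
  /dev/sda -a -s (S/../.././02|L/../../6/03) -m root
  /dev/sdb -a -s (S/../.././02|L/../../6/03) -m root

  # One-off tests by hand:
  #   smartctl -t short /dev/sda      (a couple of minutes)
  #   smartctl -t long  /dev/sda      (full surface read)
  #   smartctl -l selftest /dev/sda   (view the results)
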
* Re: Distributed spares
From: Billy Crook @ 2008-10-13 22:30 UTC
To: Justin Piszcz; +Cc: Bill Davidsen, Neil Brown, Linux RAID

Just my two cents... Those daily SMART tests or regularly running badblocks
are fine, but they're not 'real' load. A test can't prove everything is
right; at best it can only prove it didn't find anything wrong. A
distributed spare would exert 'real' load on the spare because the spare
disks ARE the live disks.

On a side note, it would be handy to have a daemon that could run in the
background on large RAID-1s or RAID-6s and, once a month, pull each disk out
of the array sequentially, completely overwrite it, check it with badblocks
several times, run the SMART tests, etc., then rejoin it, reinstall grub,
wait an hour and move on. The point being, of course, to kill weak drives
off early and in a controlled manner. It would be even nicer if there were a
way to hot-transfer one raid component to another without setting anything
faulty. I suppose you could make all the components of the real array be
single-disk RAID-1 arrays for that purpose. Then you could have one extra
disk set aside for this sort of scrubbing, and never even be down one of
your parities. I guess I should add that to my todo list...

-Billy

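A rough sketch of one pass of the cycle Billy describes, using only existing
tools; array and device names are placeholders. Note that the array runs
degraded while the disk is out, which is exactly the objection raised further
down the thread:

  #!/bin/sh
  # Exercise and scrub one member (/dev/sdc1) of /dev/md0, then rejoin it.
  # WARNING: md0 is degraded for the whole test; a second failure loses data.
  set -e
  mdadm /dev/md0 --fail /dev/sdc1 --remove /dev/sdc1
  badblocks -wsv /dev/sdc1            # destructive write test, four patterns
  smartctl -t long /dev/sdc           # long self-test...
  sleep 7200                          # ...crudely wait for it to finish
  smartctl -l selftest /dev/sdc       # inspect the result before trusting it
  mdadm /dev/md0 --add /dev/sdc1      # rejoin; md resyncs the member
  grub-install /dev/sdc               # reinstall the boot loader if this disk boots
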
* Re: Distributed spares
From: Keld Jørn Simonsen @ 2008-10-13 23:29 UTC
To: Billy Crook; +Cc: Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID

On Mon, Oct 13, 2008 at 05:30:49PM -0500, Billy Crook wrote:
> Just my two cents... Those daily SMART tests or regularly running badblocks
> are fine, but they're not 'real' load. A test can't prove everything is
> right; at best it can only prove it didn't find anything wrong. A
> distributed spare would exert 'real' load on the spare because the spare
> disks ARE the live disks.
>
> On a side note, it would be handy to have a daemon that could run in the
> background on large RAID-1s or RAID-6s and, once a month, pull each disk
> out of the array sequentially, completely overwrite it, check it with
> badblocks several times, run the SMART tests, etc., then rejoin it,
> reinstall grub, wait an hour and move on. The point being, of course, to
> kill weak drives off early and in a controlled manner. It would be even
> nicer if there were a way to hot-transfer one raid component to another
> without setting anything faulty. I suppose you could make all the
> components of the real array be single-disk RAID-1 arrays for that
> purpose. Then you could have one extra disk set aside for this sort of
> scrubbing, and never even be down one of your parities. I guess I should
> add that to my todo list...

I have also been thinking a little about this. My idea is that if bit errors
develop on disks, then at first there is maybe one bit error, and the CRC
check on the disk sectors finds and corrects it. If you rewrite such
sectors, that bit error is corrected, and you prevent the one-bit error from
developing into a two-bit error that is not correctable by the CRC. Is there
some merit to this idea?

Furthermore, if bad luck has struck, then in the case of redundant RAIDs you
could, when the on-disk check fails, identify the block in error and
recreate it from the redundant information. That would be good for RAID-1,
RAID-10, RAID-5 and RAID-6. If the block then could not be rewritten without
errors, it could be added to a bad-block list and remapped.

I think there is nothing novel in a scheme like this, but I would like to
know whether it is implemented somewhere. Articles say that bit errors on
disks are becoming more and more frequent, so schemes like this may help the
scary scenario somewhat.

best regards
keld

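Part of this does already exist in md: a scrub pass reads every stripe, and a
sector that fails to read is reconstructed from the other members and
rewritten, which gives the drive a chance to remap it. A sketch of how it is
usually driven (md0 is a placeholder):

  # "check" reads everything and counts parity/copy mismatches;
  # "repair" additionally rewrites whatever disagrees.  Unreadable
  # sectors are reconstructed and rewritten in either mode.
  echo check > /sys/block/md0/md/sync_action
  # ...wait for it to finish, then:
  cat /sys/block/md0/md/mismatch_cnt
  echo repair > /sys/block/md0/md/sync_action   # only if rewrites are wanted

  # Several distributions already run this from cron monthly, e.g. Debian's
  # /usr/share/mdadm/checkarray script.
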
* Re: Distributed spares
From: Martin K. Petersen @ 2008-10-14 10:12 UTC
To: Keld Jørn Simonsen
Cc: Billy Crook, Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID

>>>>> "Keld" == Keld Jørn Simonsen <keld@dkuug.dk> writes:

Keld> I have also been thinking a little about this. My idea is that if
Keld> bit errors develop on disks, then at first there is maybe one bit
Keld> error, and the CRC check on the disk sectors finds and corrects it.

Keld> If you rewrite such sectors, that bit error is corrected, and you
Keld> prevent the one-bit error from developing into a two-bit error
Keld> that is not correctable by the CRC.

I think you are assuming that disks are much simpler than they actually are.

A modern disk drive protects a 512-byte sector with a pretty strong ECC that
is capable of correcting errors of up to ~50 bytes. Yes, that's bytes.

Also, many drive firmwares will internally keep track of problematic media
areas and rewrite or reallocate the affected blocks. That includes rewriting
sectors that are susceptible to bleed due to being adjacent to write hot
spots.

-- 
Martin K. Petersen	Oracle Linux Engineering

* Re: Distributed spares
From: Keld Jørn Simonsen @ 2008-10-14 13:06 UTC
To: Martin K. Petersen
Cc: Billy Crook, Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID

On Tue, Oct 14, 2008 at 06:12:29AM -0400, Martin K. Petersen wrote:
> I think you are assuming that disks are much simpler than they actually
> are.
>
> A modern disk drive protects a 512-byte sector with a pretty strong ECC
> that is capable of correcting errors of up to ~50 bytes. Yes, that's
> bytes.
>
> Also, many drive firmwares will internally keep track of problematic
> media areas and rewrite or reallocate the affected blocks. That includes
> rewriting sectors that are susceptible to bleed due to being adjacent to
> write hot spots.

Good to know. Could you tell me whether this is actually true for ordinary
state-of-the-art SATA disks, or only for more expensive drives? Do you have
a good reference for it?

best regards
keld

* RE: Distributed spares
From: David Lethe @ 2008-10-14 13:20 UTC
To: Martin K. Petersen, Keld Jørn Simonsen
Cc: Billy Crook, Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID

> -----Original Message-----
> From: Martin K. Petersen
> Sent: Tuesday, October 14, 2008 5:12 AM
>
> I think you are assuming that disks are much simpler than they actually
> are.
>
> A modern disk drive protects a 512-byte sector with a pretty strong ECC
> that is capable of correcting errors of up to ~50 bytes. Yes, that's
> bytes.
>
> Also, many drive firmwares will internally keep track of problematic
> media areas and rewrite or reallocate the affected blocks. That includes
> rewriting sectors that are susceptible to bleed due to being adjacent to
> write hot spots.

Martin is absolutely correct. Enterprise-class drives have come a long way.
They will scan and fix blocks (though certainly not 100% of them) in the
background. The $99 disk drives you get at the local computer retailer now
even have limited BGMS (background media scan) and repair capability. If you
run the built-in diagnostics on a disk drive, you can be presented with a
list of known bad blocks, and when a drive powers up you can sometimes get a
bad-block display in POST.

How about a baby step? When you run offline or online tests, or even plain
media scans, you get a list of known defects. How about a program that
rewrites a RAID-1/3/5/6 stripe, given just the physical device name and the
known bad block number?

As for checking out a disk: the prior poster's idea of putting the RAID into
degraded mode for the purpose of checking out a disk is, frankly, nuts.
NEVER degrade anything. Just use the hot spare: hot-clone the disk in
question to the hot spare, then make that disk the new hot spare, and
repeat. Think of it as a "rotating the tires" mode.

David @ santools com
http://www.santools.com/smart/unix/manual

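Something close to David's "baby step" can be approximated with md's existing
knobs, if the kernel exposes sync_min/sync_max: bound the scrub window around
the suspect area and trigger a repair. The hard part, translating a component
drive's defect LBA into an array sector (data offset, chunk size, layout), is
not shown here, and the sector number below is a placeholder:

  # Hypothetical sketch: repair only the region around one known-bad area.
  ARRAY_SECTOR=123456789    # placeholder: must be computed from the defect LBA
  WINDOW=2048               # sectors of slack on either side

  echo $((ARRAY_SECTOR - WINDOW)) > /sys/block/md0/md/sync_min
  echo $((ARRAY_SECTOR + WINDOW)) > /sys/block/md0/md/sync_max
  echo repair > /sys/block/md0/md/sync_action
  # ...wait, then restore the full range:
  echo 0   > /sys/block/md0/md/sync_min
  echo max > /sys/block/md0/md/sync_max
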
* non-degraded component replacement was Re: Distributed spares
From: David Greaves @ 2008-10-14 12:02 UTC
To: Billy Crook; +Cc: Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID, dean

Billy Crook wrote:
> It would be even nicer if there were a way to hot-transfer one raid
> component to another without setting anything faulty. I suppose you could
> make all the components of the real array be single-disk RAID-1 arrays for
> that purpose. Then you could have one extra disk set aside for this sort
> of scrubbing, and never even be down one of your parities. I guess I
> should add that to my todo list...

IMHO this one should be high on the todo list, especially if it's a
prerequisite for other improvements to resilience.

Right now, if a drive fails or shows signs of going bad, you are in a very
risky situation. I'm sure most here know why: removing the failing drive and
installing a good one to re-sync leaves you very vulnerable; if another
drive fails (even one bad block) then you lose data.

The solution involves RAID-1, but it needs a twist of RAID-5/6:
http://arctic.org/~dean/proactive-raid5-disk-replacement.txt

I think this is what was discussed. Assume md0 has drives A B C D, D is
failing, and E is new:

* add E as a spare
* set E to mirror the 'failing' drive D (with a bitmap?)
* subsequent writes go to both D and E
* recover 99+% of the data from D to E by simple mirroring
* any md0 or D->E read failure on D is recovered by reading A, B and C, not
  E, unless E is in sync; D is not failed out (and it's these tricks that
  stop users from doing all this manually)
* any md0 sector read failure on A, B or C can still (hopefully) be served
  from D even if that region has not yet been mirrored to E (also not
  possible manually)
* once E is fully mirrored, D is removed and the job is done

Personally I think this feature is more important than the reshaping
requests; of course that's just one opinion :)

David

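Billy's layering idea, quoted above, can be spelled out with plain mdadm. A
sketch assuming the array is built this way from day one; device names are
placeholders, and each extra RAID-1 layer costs a superblock and some
bookkeeping:

  # Each member of the "real" array is a deliberately one-sided RAID-1.
  mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda1 missing
  mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdb1 missing
  mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdc1 missing
  mdadm --create /dev/md4 --level=1 --raid-devices=2 /dev/sdd1 missing
  mdadm --create /dev/md0 --level=5 --raid-devices=4 \
        /dev/md1 /dev/md2 /dev/md3 /dev/md4

  # Later, migrate member 1 from sda1 to a new disk without degrading md0:
  mdadm /dev/md1 --add /dev/sde1                      # mirrors sda1 -> sde1 live
  # ...wait for the RAID-1 resync to complete...
  mdadm /dev/md1 --fail /dev/sda1 --remove /dev/sda1  # retire the old disk

The write-up linked above gets a similar effect on an already-existing array
by temporarily slipping a superblock-less mirror under the suspect member;
see that document for the exact steps and caveats.
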
* Re: non-degraded component replacement was Re: Distributed spares
From: Billy Crook @ 2008-10-14 13:18 UTC
To: David Greaves; +Cc: Justin Piszcz, Bill Davidsen, Neil Brown, Linux RAID, dean

On Tue, Oct 14, 2008 at 07:02, David Greaves <david@dgreaves.com> wrote:
> IMHO this one should be high on the todo list, especially if it's a
> prerequisite for other improvements to resilience.

Here's the process as I thought it out; I'm sure it can be improved upon.
Component C is the current drive that one wishes to take out of service;
component N is the new drive that one wishes to put in service in its place.

1. Redirect incoming writes from component C to components C AND N.
2. Check that component N is the same size as C or larger.
3. Create a counter curBlock to track the position of the copy, initialized
   to 0.
4. While curBlock < component C's block count:
   - copy block curBlock from component C to curBlock on component N;
   - if the copy fails, try to reconstruct that block from the other disks
     using parity and write the result to component N;
   - increment curBlock.
5. Once the copy is complete, optionally verify: reset curBlock to 0 and,
   while curBlock < component C's block count, compare block curBlock on C
   with block curBlock on N; if a compare fails, terminate with an error and
   stop mirroring writes to N; otherwise increment curBlock.
6. Redirect reads to component N only.
7. Stop writing to component C and write only to N.
8. Present some notification that the process is done.

At all points during the process, redundancy should be as good as or better
than before, and the process can be aborted at any time without disrupting
the array. This could be represented, IMHO, with a different status
character in /proc/mdstat, say M (for migrating), and I'd call the
capability "hot raid component migration" -- just so long as people realise
it's an option for replacing raid components more safely. I bet the majority
of the code needed is already in the raid1 personality. You could accomplish
the same thing by building your 'real' array on top of single-disk raid1
arrays, but oh, that would be messy to look at!

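A toy user-space model of the copy and verify phases above, operating on two
image files, just to pin down the control flow. It ignores the live write
mirroring, locking and kernel integration entirely, and
reconstruct_from_parity is a named stand-in rather than a real function:

  import os

  BLOCK = 64 * 1024

  def migrate(c_path, n_path, reconstruct_from_parity=None):
      """Copy component image C to N block by block, then verify."""
      size = os.path.getsize(c_path)          # N must already exist, >= size
      with open(c_path, "rb") as c, open(n_path, "r+b") as n:
          pos = 0
          while pos < size:                    # copy phase
              try:
                  c.seek(pos)
                  buf = c.read(BLOCK)
              except OSError:                  # unreadable: rebuild instead
                  if reconstruct_from_parity is None:
                      raise
                  buf = reconstruct_from_parity(pos, BLOCK)
              n.seek(pos)
              n.write(buf)
              pos += len(buf)
          pos = 0
          while pos < size:                    # verify phase
              c.seek(pos)
              n.seek(pos)
              if c.read(BLOCK) != n.read(BLOCK):
                  raise RuntimeError("verify failed at offset %d" % pos)
              pos += BLOCK
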
* Re: Distributed spares
From: Bill Davidsen @ 2008-10-14 23:20 UTC
To: Justin Piszcz; +Cc: Neil Brown, Linux RAID

Justin Piszcz wrote:
> Bill,
>
> Not a bad idea; however, can the same not be achieved (somewhat) by
> performing daily short and weekly long SMART self-tests on the drives to
> validate their health? I find this to work fairly well on a large scale.

Not really: the performance benefit comes from spreading head motion over
(at least) one more drive. You can get a check on basic functionality with
SMART, but it doesn't beat the drive the way real load does. Add to that the
unfortunate problem that more realistic testing also takes up I/O bandwidth
for non-productive transfers. Better to be doing actual live data transfers
to those drives if you can.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..." Otto von Bismarck

* Re: Distributed spares
From: Neil Brown @ 2008-10-14 10:04 UTC
To: Bill Davidsen; +Cc: Linux RAID

On Monday October 13, davidsen@tmr.com wrote:
> Over a year ago I mentioned RAID-5e, a RAID-5 with the spare(s) distributed
> over multiple drives. This has come up again, so I thought I'd just mention
> why, and what advantages it offers.
>
> By spreading the spare over multiple drives, the head motion of normal
> access is spread over one (or several) more drives. This reduces seeks,
> improves performance, etc. The benefit shrinks as the number of drives in
> the array grows; obviously with four drives, using only three for normal
> operation is slower than using four. And by using all the drives all the
> time, the chance of a spare having gone bad undetected is reduced.
>
> This becomes important as array drive counts shrink. Lower cost for drives
> ($100/TB!), and attempts to cut power use by running fewer drives, result
> in an overall drop in drive count, which matters in serious applications.
>
> All that said, I would really like to bring this up one more time, even if
> the answer is "no interest."

How are your coding skills?

The tricky bit is encoding the new state. We can no longer tell the
difference between "optimal" and "degraded" based on the number of in-sync
devices. We also need some state flag to say that the "distributed spare"
has been constructed. Maybe that could be encoded in the "layout".

We would also need to allow a "recovery" pass to happen without having
actually added any spares, or having any non-in-sync devices. That probably
means passing the decision "is a recovery pending" down into the personality
rather than making it in common code. Maybe have some field in the mddev
structure which the personality sets if a recovery is worth trying. Or maybe
just try it anyway after any significant change, and if the personality
finds nothing can be done, it aborts.

I'm happy to advise on, review, and eventually accept patches.

NeilBrown

* Re: Distributed spares
From: Bill Davidsen @ 2008-10-16 23:50 UTC
To: Neil Brown; +Cc: Linux RAID

Neil Brown wrote:
> How are your coding skills?
>
> The tricky bit is encoding the new state. We can no longer tell the
> difference between "optimal" and "degraded" based on the number of in-sync
> devices. We also need some state flag to say that the "distributed spare"
> has been constructed. Maybe that could be encoded in the "layout".
>
> We would also need to allow a "recovery" pass to happen without having
> actually added any spares, or having any non-in-sync devices. That
> probably means passing the decision "is a recovery pending" down into the
> personality rather than making it in common code. Maybe have some field in
> the mddev structure which the personality sets if a recovery is worth
> trying. Or maybe just try it anyway after any significant change, and if
> the personality finds nothing can be done, it aborts.

My coding skills are fine here, but I have to do a lot of planning before
even considering this. Here's why: say you have a five-drive RAID-5e and you
are running happily. A drive fails! Now you can rebuild onto the spare, but
the spare "drive" must be assembled from the spare space on the remaining
functional drives, so it can't be laid out pre-failure; the allocation has
to be defined after you see what you have left. Does that sound ugly and
complex? Does to me, too. So I'm thinking about this, and doing some
reading, but it's not quite as simple as I thought.

> I'm happy to advise on, review, and eventually accept patches.

Actually, what I think I would do is build a test bed in software before
trying this in the kernel, then run the kernel part in a virtual machine. I
have another idea, which has about 75% of the benefit with 10% of the
complexity. Since it sounds too good to be true it probably is; I'll get
back to you after I think about the simpler solution. I distrust free-lunch
algorithms.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..." Otto von Bismarck

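One possible way around deciding the allocation after the failure: if the
spare chunks follow a fixed rotation (as in the earlier illustrative layout),
the post-failure placement is already determined; stripe by stripe, the chunk
that lived on the dead device is rebuilt into that stripe's spare slot. A
sketch under the same assumed rotation, which again is not any real
implementation's rule:

  def raid5e_stripe_map(stripe, ndev):
      # same made-up rotation as the earlier sketch
      parity = (ndev - 1 - stripe) % ndev
      spare = (parity + 1) % ndev
      roles, d = [], 0
      for dev in range(ndev):
          if dev == parity:
              roles.append("P")
          elif dev == spare:
              roles.append("S")
          else:
              roles.append("D%d" % d)
              d += 1
      return roles

  def degraded_map(stripe, ndev, failed):
      roles = raid5e_stripe_map(stripe, ndev)
      spare = roles.index("S")
      if failed != spare:                     # if the dead disk only held the
          roles[spare] = roles[failed] + "'"  # spare chunk, nothing to rebuild
      roles[failed] = "x"
      return roles

  if __name__ == "__main__":
      for s in range(5):
          print("stripe %d: %s" % (s, "  ".join(degraded_map(s, 5, failed=2))))

The open question is then less about allocation and more about recording, in
the layout or superblock, which device the rebuilt chunks now stand in for.
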
* RE: Distributed spares
From: David Lethe @ 2008-10-17 4:09 UTC
To: Bill Davidsen, Neil Brown; +Cc: Linux RAID

> -----Original Message-----
> From: Bill Davidsen
> Sent: Thursday, October 16, 2008 6:50 PM
>
> My coding skills are fine here, but I have to do a lot of planning before
> even considering this. Here's why: say you have a five-drive RAID-5e and
> you are running happily. A drive fails! Now you can rebuild onto the
> spare, but the spare "drive" must be assembled from the spare space on the
> remaining functional drives, so it can't be laid out pre-failure; the
> allocation has to be defined after you see what you have left. Does that
> sound ugly and complex? Does to me, too. So I'm thinking about this, and
> doing some reading, but it's not quite as simple as I thought.
>
> Actually, what I think I would do is build a test bed in software before
> trying this in the kernel, then run the kernel part in a virtual machine.
> I have another idea, which has about 75% of the benefit with 10% of the
> complexity. Since it sounds too good to be true it probably is; I'll get
> back to you after I think about the simpler solution. I distrust
> free-lunch algorithms.

With all due respect, RAID-5E isn't practical. There are too many corner
cases, from the performance implications to simply deciding where to put the
parity and spare blocks so that when a disk fails you haven't ended up with
the hot-spare chunk sitting on the drive that just died. What about all the
dozens of utilities that would need to know about RAID-5E to work properly?
The cynic in me says that if patches dealing with mdadm on established RAID
levels are still being recalled (like today), RAID-5E is going to be much
worse. Algorithms dealing with drive failures and unrecoverable read/write
errors in normal operation as well as during rebuilds, expansions, and
journalization/optimization are not well understood. It is new territory.

If you want multiple distributed spares, just use RAID-6; it is better than
RAID-5 in that respect, and nobody has to reinvent the wheel. Your "hot
spare" is still distributed across all of the disks, and you can survive
multiple drive failures. If your motivation is performance, then buy faster
disks and additional controller(s), optimize your storage pools, and tune
your md settings to be more compatible with your filesystem parameters. Or
even look at your application and see if anything can be done to reduce the
I/O count. The fastest I/Os are the ones you eliminate.

David

* Re: Distributed spares
From: Bill Davidsen @ 2008-10-17 13:46 UTC
To: David Lethe; +Cc: Neil Brown, Linux RAID

David Lethe wrote:
> With all due respect, RAID-5E isn't practical. There are too many corner
> cases, from the performance implications to simply deciding where to put
> the parity and spare blocks so that when a disk fails you haven't ended up
> with the hot-spare chunk sitting on the drive that just died.

Having run 38 multi-TB machines for an ISP using RAID-5e in the SCSI
controller, I feel pretty sure the practicality is established; only the
ability to reinvent that particular wheel is in question. The complexity is
that the hot-spare drive needs to be defined after the first drive failure,
using the spare sectors on the remaining functional drives.

> What about all the dozens of utilities that would need to know about
> RAID-5E to work properly?

What did you have in mind? Virtually all utilities working at the filesystem
level have no need to know anything; they treat any array as a drive, in
black-box fashion.

> The cynic in me says that if patches dealing with mdadm on established
> RAID levels are still being recalled (like today), RAID-5E is going to be
> much worse.

Let's definitely not add anything to the kernel, then, since a
feature-static kernel is so much more stable. Features like the software
RAID-10 (not 1+0) are not established in any standard I've seen, but they
work just fine. And this is not unexplored territory: a distributed spare is
called RAID-5e on the IBM servers I used, and I believe Storage Computer (in
NH) has a similar feature they call "RAID-7" and trademark.

> Algorithms dealing with drive failures and unrecoverable read/write errors
> in normal operation as well as during rebuilds, expansions, and
> journalization/optimization are not well understood. It is new territory.

That's why I'm being quite cautious about saying I can do this; the coding
is easy, it's finding out what to code that's hard. It appears that
configuration decisions need to be made after the failure event, before the
rebuild. Yes, it's complex. But from experience I can say that performance
during rebuild is far better with a distributed spare than beating the snot
out of one newly added spare, as with other RAID levels. So there's a
performance benefit in both the normal case and the rebuild case, and a side
benefit of faster rebuild time. The full recovery after replacing the failed
drive is also an interesting time. :-(

> If you want multiple distributed spares, just use RAID-6; it is better
> than RAID-5 in that respect, and nobody has to reinvent the wheel. If your
> motivation is performance, then buy faster disks and additional
> controller(s), optimize your storage pools, and tune your md settings to
> be more compatible with your filesystem parameters.

The motivation is to get the best performance from the hardware you have.
Adding hardware cost so you can use storage hardware inefficiently is
*really* not practical. Neither power, cooling, drives, nor floor space is
cheap enough to use poorly.

> The fastest I/Os are the ones you eliminate.

And the fastest seeks are the ones you don't do, because you spread head
motion over more drives. Once you have a distributed spare in the kernel,
you have a free performance gain; or as free as using either more CPU or
more memory for mapping will allow. Most people will trade a little of
either for better performance.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
   be valid when the war is over..." Otto von Bismarck

* Re: Distributed spares
From: Neil Brown @ 2008-10-20 1:11 UTC
To: Bill Davidsen; +Cc: David Lethe, Linux RAID

On Friday October 17, davidsen@tmr.com wrote:
> Having run 38 multi-TB machines for an ISP using RAID-5e in the SCSI
> controller, I feel pretty sure the practicality is established; only the
> ability to reinvent that particular wheel is in question. The complexity
> is that the hot-spare drive needs to be defined after the first drive
> failure, using the spare sectors on the remaining functional drives.

I don't think that will be particularly complex. It will just be a bit of
code in raid5_compute_sector. The detail of "which device has failed" would
be stored in ->algorithm somehow.

There is an interesting question of how general we want the code to be,
e.g. do we want to be able to configure an array with two distributed
spares? I suspect that people would rarely want two, and never want three,
so it would be worth making two work if the code didn't get too complex,
which I don't think it would (but I'm not certain).

> But from experience I can say that performance during rebuild is far
> better with a distributed spare than beating the snot out of one newly
> added spare, as with other RAID levels. So there's a performance benefit
> in both the normal case and the rebuild case, and a side benefit of
> faster rebuild time.

I cannot see why rebuilding a raid5e would be faster than rebuilding a raid5
onto a fresh device. In each case you need to read from n-1 devices and
write to one device, so all devices are constantly doing IO at the same
rate. In the raid5 case you could get better streaming, as each device is
either "always reading" or "always writing", whereas in a raid5e rebuild
devices will sometimes read and sometimes write. So if anything I would
expect raid5e to rebuild more slowly, though you would probably only notice
this with small chunk sizes.

I agree that (with suitably large chunk sizes) you should be able to get
better throughput on raid5e.

NeilBrown

* Re: Distributed spares
From: Gabor Gombas @ 2008-10-17 13:09 UTC
To: Neil Brown; +Cc: Bill Davidsen, Linux RAID

On Tue, Oct 14, 2008 at 09:04:25PM +1100, Neil Brown wrote:
> The tricky bit is encoding the new state. We can no longer tell the
> difference between "optimal" and "degraded" based on the number of in-sync
> devices. We also need some state flag to say that the "distributed spare"
> has been constructed. Maybe that could be encoded in the "layout".

Or you need to add a "virtual" spare that does not have an actual block
device behind it. Or rather, it could be a virtual disk constructed from the
spare chunks on the data disks; maybe device mapper could be used here? If
you could create such a virtual disk, then maybe the normal RAID-5 code
could just do the rest. Of course, the mapping of stripes to disk locations
has to be changed to account for the "black holes" now belonging to the
virtual spare device.

Hmm: if you create DM devices for all the disks that just leave the proper
holes out of the mapping, and you also create a DM device from the holes,
and you build an MD RAID-5 plus spare on top of those DM devices, then you
could have RAID-5e right now, completely from userspace. The superblock
would have to be changed so nobody ever tries to assemble a RAID-5 from the
raw disks, and there may be some other details...

Gabor

-- 
---------------------------------------------------------
     MTA SZTAKI Computer and Automation Research Institute
                Hungarian Academy of Sciences
---------------------------------------------------------

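A sketch of the device-mapper half of that experiment, assuming three 4 GiB
members that each reserve their last 1 GiB as spare space; all names, sizes
and offsets are invented, and the superblock/assembly problem Gabor mentions
is the real catch. Putting the reserved space at the end keeps each per-disk
data view a simple prefix, so no stripe remapping is needed; the obvious
flaw, raised earlier in the thread, is that a failed disk takes a third of
the "spare" with it.

  # Three 4 GiB partitions (8388608 sectors); the last 1 GiB (2097152
  # sectors, starting at sector 6291456) of each is reserved.
  for d in a b c; do
      # data view: the first 3 GiB of each disk
      echo "0 6291456 linear /dev/sd${d}1 0" | dmsetup create sd${d}_data
  done

  # virtual spare: the three reserved tails concatenated (3 GiB, i.e. the
  # same size as one data view, which a spare has to be)
  printf '%s\n' \
      '0 2097152 linear /dev/sda1 6291456' \
      '2097152 2097152 linear /dev/sdb1 6291456' \
      '4194304 2097152 linear /dev/sdc1 6291456' | dmsetup create vspare

  # In principle the array is then built on top of the views:
  #   mdadm --create /dev/md0 --level=5 --raid-devices=3 --spare-devices=1 \
  #         /dev/mapper/sda_data /dev/mapper/sdb_data /dev/mapper/sdc_data \
  #         /dev/mapper/vspare
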