* Multi-layer raid status
From: David Brown @ 2018-01-30 15:30 UTC
To: linux-raid

Does anyone know the current state of multi-layer raid (in the Linux md
layer) for recovery?

I am thinking of a setup like this (hypothetical example - it is not a
real setup):

md0 = sda + sdb, raid1
md1 = sdc + sdd, raid1
md2 = sde + sdf, raid1
md3 = sdg + sdh, raid1

md4 = md0 + md1 + md2 + md3, raid5

If you have an error reading a sector on sda, the raid1 pair finds the
mirror copy on sdb, re-writes the data to sda (which re-locates the bad
sector) and passes the good data on to the raid5 layer.  Everyone is
happy, and the error is corrected quickly.

Rebuilds are fast, as they are single-disk copies.

However, if you have an error reading a sector on sda /and/ when reading
the mirror copy on sdb, then the raid1 pair has no data to give to the
raid5 layer.  The raid5 layer will then read the rest of the stripe and
calculate the missing data.  I presume it will then re-write the
calculated data to md0, which will in turn write it to sda and sdb, and
all will be well again.

But what about rebuilds?  A rebuild or recovery of the raid1 layer is
not triggered by a read from the raid5 level - it will be handled at the
raid1 level.  If sda is replaced, then the raid1 level will build it by
copying from sdb.  If a read error is encountered while copying, is
there any way for the recovery code to know that it can get the missing
data by asking the raid5 level?  Is it possible to mark the matching sda
sector as bad, so that a future raid5 read (such as from a scrub) will
see that md0 stripe as bad, and re-write it?
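For concreteness, the layout above could be assembled with something like the following - an illustrative sketch only, not commands from the thread; the device names are hypothetical, and a real system would add metadata, chunk-size and filesystem choices of its own:

```shell
# Four raid1 pairs (hypothetical device names), each becoming one leg
# of a raid5 built on top.  Requires root and real (or loop) devices.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdc /dev/sdd
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sde /dev/sdf
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdg /dev/sdh

# The raid5 layer striped across the four mirrors:
mdadm --create /dev/md4 --level=5 --raid-devices=4 \
      /dev/md0 /dev/md1 /dev/md2 /dev/md3
```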
* Re: Multi-layer raid status
From: NeilBrown @ 2018-02-02 6:03 UTC
To: David Brown, linux-raid

On Tue, Jan 30 2018, David Brown wrote:
> [...]
> However, if you have an error reading a sector in sda /and/ when reading
> the mirror copy in sdb, then the raid1 pair has no data to give to the
> raid5 layer.  The raid5 layer will then read the rest of the stripe and
> calculate the missing data.  I presume it will then re-write the
> calculated data to md0, which will in turn write it to sda and sdb, and
> all will be well again.

If sda and sdb have bad-block-logs configured, this should work.  Not
everyone trusts them though.

> But what about rebuilds?  A rebuild or recovery of the raid1 layer is
> not triggered by a read from the raid5 level - it will be handled at the
> raid1 level.  If sda is replaced, then the raid1 level will build it by
> copying from sdb.  If a read error is encountered while copying, is
> there any way for the recovery code to know that it can get the missing
> data by asking the raid5 level?  Is it possible to mark the matching sda
> sector as bad, so that a future raid5 read (such as from a scrub) will
> see that md0 stripe as bad, and re-write it?

"Is it possible to mark the matching sda sector as bad"

This is exactly what the bad-block-list functionality is meant to do.

NeilBrown
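On kernels and mdadm versions with bad-block-log support, the per-device lists can be inspected from userspace; a hedged sketch, assuming an array md0 with member sda:

```shell
# Each md member exposes its recorded bad blocks through sysfs (when a
# bad block log is present in the metadata).  Entries are "sector length"
# pairs; a trailing "+" marks an entry not yet committed to the metadata.
cat /sys/block/md0/md/dev-sda/bad_blocks

# mdadm can show the same list straight from the member's superblock:
mdadm --examine-badblocks /dev/sda
```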
* Re: Multi-layer raid status
From: David Brown @ 2018-02-02 10:41 UTC
To: NeilBrown, linux-raid

On 02/02/18 07:03, NeilBrown wrote:
> On Tue, Jan 30 2018, David Brown wrote:
>> [...]
>> Is it possible to mark the matching sda
>> sector as bad, so that a future raid5 read (such as from a scrub) will
>> see that md0 stripe as bad, and re-write it?
>
> "Is it possible to mark the matching sda sector as bad"
>
> This is exactly what the bad-block-list functionality is meant to do.

Marvellous - thank you for the information.

Using bad block lists and then doing a higher-level scrub should
certainly work, and it is a good general solution, as it means you don't
need direct interaction between the layers (just the normal top-down
processing of layered block devices).  The disadvantage is that there
may be quite a delay between the raid1 rebuild and the next full re-read
of the entire raid5 array - all you really need is a single read at the
higher level to trigger the fixup.

Is there any way to map from block numbers at the lower raid level to
block numbers at a higher level?  I suppose in general the lower level
does not know what is above it.  I guess a user-mode tool could look at
/proc/mdstat and work through it to figure out the layers, then look
through the bad block lists and calculate the required high-level reads.
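A user-mode tool like that would mostly be offset arithmetic. As a rough sketch (the helper functions are hypothetical, and the mapping is deliberately rotation-agnostic - it only identifies the parent stripe to read, which is all that is needed to trigger reconstruction; real values for the data offset and chunk size would come from `mdadm --examine`):

```shell
#!/bin/sh
# Map a bad sector on a raid1 member to the raid5 stripe above it, so a
# single targeted read of the parent can trigger a rewrite.  All values
# are in 512-byte sectors.

# parent_stripe MEMBER_SECTOR DATA_OFFSET CHUNK_SECTORS
# -> chunk-sized stripe number within the raid5 component
parent_stripe() {
    member_sector=$1; data_offset=$2; chunk_sectors=$3
    echo $(( (member_sector - data_offset) / chunk_sectors ))
}

# parent_read_range STRIPE CHUNK_SECTORS NDATA_DISKS
# -> "start length" of the full parent stripe, so that one read covers
#    every data chunk regardless of how the parity rotates
parent_read_range() {
    stripe=$1; chunk_sectors=$2; ndata=$3
    echo "$(( stripe * chunk_sectors * ndata )) $(( chunk_sectors * ndata ))"
}
```

For example, a bad block at member sector 10240, with a 2048-sector data offset and 1024-sector (512 KiB) chunks, falls in parent stripe 8; with three data disks, `dd if=/dev/md4 bs=512 skip=24576 count=3072 of=/dev/null` would then read the whole stripe, hit the faked read error from the bad block list, and let the raid5 layer reconstruct and rewrite the chunk.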
* Re: Multi-layer raid status
From: Wols Lists @ 2018-02-02 11:17 UTC
To: David Brown, NeilBrown, linux-raid

On 02/02/18 10:41, David Brown wrote:
> Using bad block lists and then doing a higher-level scrub should
> certainly work [...]  The disadvantage is that there
> may be quite a delay between the raid1 rebuild and the next full re-read
> of the entire raid5 array - all you really need is a single read at the
> higher level to trigger the fixup.

This would be a perfect use case for my "full parity reads" mode.  At
the moment, raid just reads enough disks to return the requested data,
but I proposed a mode where it reads the full stripe, does the parity
checks, and either returns a read error (2-disk raid-1, raid-5) or
corrects the stripe (raid-6) if things don't add up.

Okay, it would knacker performance a bit, but where you've got a nested
raid like this, you switch it on, run a read over the whole filesystem
(a "tar / --no-follow > /dev/null" sort of thing), and it would sort
out integrity all the way down the stack.

Cheers,
Wol
* Re: Multi-layer raid status
From: David Brown @ 2018-02-02 11:32 UTC
To: Wols Lists, NeilBrown, linux-raid

On 02/02/18 12:17, Wols Lists wrote:
> This would be a perfect use case for my "full parity reads" mode.
> [...]

You already do that during a scrub.  You don't want to do it during
normal operation - unless you have a usage pattern with mostly big
reads, you will cripple performance.  A small performance drop is
acceptable if it can be shown to significantly improve reliability - but
making every read a full-stripe read will give you random read
performance closer to that of a single disk than of a raid array.

> Okay, it would knacker performance a bit, but where you've got a nested
> raid like this, you switch it on, run a read over the whole filesystem
> [...] and it would sort out integrity all the way down the stack.

That's a scrub.  You do it as a very low priority task on a regular
basis.
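For reference, such a scrub is driven through sysfs; a minimal sketch (the array name is assumed):

```shell
# Start a read-and-verify pass over the whole array:
echo check > /sys/block/md4/md/sync_action

# Watch progress, then see how many inconsistent stripes were found:
cat /proc/mdstat
cat /sys/block/md4/md/mismatch_cnt

# "repair" additionally rewrites stripes whose parity does not match:
echo repair > /sys/block/md4/md/sync_action
```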
* Re: Multi-layer raid status
From: Reindl Harald @ 2018-02-02 12:12 UTC
To: David Brown, Wols Lists, NeilBrown, linux-raid

On 02.02.2018 12:32, David Brown wrote:
> That's a scrub.  You do it as a very low priority task on a regular
> basis.

If only that "low priority" actually worked these days...

dev.raid.speed_limit_min = 25000
dev.raid.speed_limit_max = 1000000

In the good old days "dev.raid.speed_limit_max" behaved the way its
name suggests: when you used your machine, the scrub simply took
longer.  These days (for many months, if not years now) you are better
off lowering "dev.raid.speed_limit_max", typing "sysctl -p", and only
*after* that continuing to use your machine.
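The knobs in question are the md sync throttles; a sketch of the workaround being described, with illustrative values:

```shell
# Temporarily cap resync/scrub bandwidth (KiB/s, per device) so the
# machine stays responsive while the scrub runs:
sysctl -w dev.raid.speed_limit_max=50000

# ... let the scrub or rebuild proceed ...

# Restore the stock ceiling afterwards (200000 is the kernel default):
sysctl -w dev.raid.speed_limit_max=200000
```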
* Re: Multi-layer raid status
From: Wols Lists @ 2018-02-02 14:24 UTC
To: David Brown, NeilBrown, linux-raid

On 02/02/18 11:32, David Brown wrote:
> You already do that during a scrub.  You don't want to do it during
> normal operation [...] but
> making every read a full-stripe read will give you random read
> performance closer to that of a single disk than of a raid array.

Unless integrity is more important than speed?  Unless (as in your own
example) you know there's a problem and you want to find it?

Yup, I know it will knacker performance - I said so.  But there are
plenty of use cases where it would actually be very useful, and probably
the lesser of two evils.

(Actually, re-reading your original email, it sounds like the right
thing to do would be to call hdparm to mark the sector bad on sda,
rather than use the bad-block list, so it will be rewritten and clear
itself.  And this is also a perfect example of where my technique would
be useful - it's probably not the raid-5 parity block that got
corrupted, so the data itself has been corrupted, and my utility would
find the damaged file for you so you could recover it from backup.  A
scrub at the raid-5 level would just "fix" the parity and leave you with
a corrupted file waiting to blow up on you.)

Cheers,
Wol
* Re: Multi-layer raid status
From: David Brown @ 2018-02-02 14:50 UTC
To: Wols Lists, NeilBrown, linux-raid

On 02/02/18 15:24, Wols Lists wrote:
> Unless integrity is more important than speed?

There are scenarios where it is realistic to expect integrity problems
- but sudden decay of a disk sector is not a likely event.  There is
/no/ good reason for saying that when you read sector 1000 from disk A,
you should also read sector 1000 from disk B just in case it happened
to go bad.  Reading a whole stripe when you need to read one sector
gives you /nothing/.  Reading the whole stripe and checking the parity
gives you /almost/ nothing - if there is an error on the sector you are
reading, the disk tells you.  Undetected read errors are the pink
unicorns of the computing world - there are people who swear they have
seen them, but real evidence is very hard to come by.  And even then,
there are much better ways to deal with them (btrfs checksums, for
example).

And, yet again, you have regular scrubs.  These have low bandwidth cost
(because you run them slowly, and because they do not flood your block
and stripe caches), and will detect any such errors.

Integrity is important, but it is not so important that nothing else
matters.  Do you make sure all your servers are six stories underground
in concrete bunkers?  Don't tell me you are unwilling to pay that cost
- surely you don't want to risk losing data to a meteorite strike?  Do
you drive a tank to work?  After all, surely your personal safety is
more important than speed or fuel costs.

> Unless (as in your own example) you know there's a problem and you
> want to find it?

First, it is not an unknown problem - it is a known event.  Second,
reading full stripes for every disk read will not help in any way,
because your chances of reading the sector in question are tiny for
most normal usage patterns.  Third, normal regular scrubs will catch it
just the same, merely with a bit of delay.  If you want to catch it
faster and don't mind low performance, increase the scrub bandwidth.
All I am asking is whether it is possible to have a targeted scrub on
just the relevant blocks, to minimise the low-redundancy period.

> Yup, I know it will knacker performance - I said so.  But there are
> plenty of use cases where it would actually be very useful, and
> probably the lesser of two evils.

What are these cases?  We have already eliminated the rebuild situation
I described.  And in particular, which use-cases are you thinking of
where you would not be better off with alternative integrity
improvements (like higher redundancy levels) that do not kill
performance?

> (Actually, re-reading your original email, it sounds like the right
> thing to do would be to call hdparm to mark the sector bad on sda,
> rather than use the bad-block list [...])

That does not make sense.  The bad block list described by Neil will do
the job correctly.  hdparm bad block marking could also work, but it
does so at a lower level, and the sector is /not/ corrected
automatically, AFAIK.  It also would not help if the raid1 were not
directly on a hard disk (think of a disk partition, another raid, an
LVM partition, an iSCSI disk, a remote block device, an encrypted block
device, etc.).
* Re: Multi-layer raid status
From: Wols Lists @ 2018-02-02 15:03 UTC
To: David Brown, NeilBrown, linux-raid

On 02/02/18 14:50, David Brown wrote:
> What are these cases?  We have already eliminated the rebuild situation
> I described.  And in particular, which use-cases are you thinking of
> where you would not be better off with alternative integrity
> improvements (like higher redundancy levels) that do not kill
> performance?

In particular, when you KNOW you've got a damaged raid, and you want to
know which files are affected.  The whole point of my technique is that
it either uses the raid to recover (if it can) or it propagates a read
error back to the application.  It does NOT "fix" the data and leave a
corrupted file behind.

> That does not make sense.  The bad block list described by Neil will do
> the job correctly.  hdparm bad block marking could also work, but it
> does so at a lower level, and the sector is /not/ corrected
> automatically, AFAIK.  [...]

Nor does the bad block list correct the error automatically, if that's
true.  The bad block list fakes a read error; hdparm causes a real read
error.  When the raid-5 scrub hits, either version triggers a rewrite.

The thing about the bad-block list is that the disk block is NOT
rewritten.  It's moved, and that disk space is LOST.  With hdparm, the
block gets rewritten, and if the rewrite succeeds the space is
recovered.

Cheers,
Wol
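For the record, hdparm can both exercise and force-rewrite an individual sector, though both operations destroy that sector's contents; a hedged sketch (the sector number is hypothetical, and hdparm insists on its safety flag for the destructive variants):

```shell
# Read a suspect sector directly; this fails with an I/O error if the
# drive considers the sector bad:
hdparm --read-sector 123456 /dev/sda

# Force-write the sector (destroys its contents).  On a modern drive a
# successful low-level write lets the firmware either reuse the sector
# or remap it transparently from its spare pool:
hdparm --yes-i-know-what-i-am-doing --write-sector 123456 /dev/sda
```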
* Re: Multi-layer raid status
From: David Brown @ 2018-02-02 15:40 UTC
To: Wols Lists, NeilBrown, linux-raid

On 02/02/18 16:03, Wols Lists wrote:
> In particular, when you KNOW you've got a damaged raid, and you want to
> know which files are affected.  The whole point of my technique is that
> it either uses the raid to recover (if it can) or it propagates a read
> error back to the application.  It does NOT "fix" the data and leave a
> corrupted file behind.

If you read a block and the read fails, the raid system will already
read the whole stripe to re-create the missing data.  If it can
re-create it, it writes the new data back to the disk and returns it to
the application.  If it cannot, it gives the read error back to the
application.

I cannot imagine a situation where you would have a disk that you know
has incorrect data as part of your array and in normal use for a file
system.

For the situation I originally described, if there were no support for
bad block lists, then you would need a more complex procedure for the
rebuild.  (I believe it would be something like this.  Enable the
write-intent bitmap for the raid5 level, take the raid1 pair with the
missing drive out of the raid5, rebuild the raid1 pair, and if the
build is successful then put it back in the raid5 and let the
write-intent logic bring it up to speed.  If the build had errors,
you'd have to unmount the filesystem, let the write-intent logic finish
writing, then scrub the raid5.)  But since there is the bad block list
to handle my concerns, there is no problem there.

> Nor does the bad block list correct the error automatically, if that's
> true.  [...]
>
> The thing about the bad-block list is that the disk block is NOT
> rewritten.  It's moved, and that disk space is LOST.  With hdparm, the
> block gets rewritten, and if the rewrite succeeds the space is
> recovered.

I don't know the details of when blocks are removed from bad lists
(either the md raid bad block list, or the hdparm list) and re-tried.
But it does not matter - the fraction of wasted space is negligible.
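That fallback rebuild procedure would look roughly like this - illustrative commands only, with hypothetical device names and no failure handling:

```shell
# Give the raid5 a write-intent bitmap, so the raid1 leg can be removed
# and later re-added without forcing a full resync:
mdadm --grow /dev/md4 --bitmap=internal

# Take the degraded raid1 out of the raid5 and rebuild it in isolation:
mdadm /dev/md4 --fail /dev/md0 --remove /dev/md0
mdadm /dev/md0 --add /dev/sda     # replacement disk, rebuilt from sdb

# Once the raid1 rebuild finishes cleanly, put it back; the bitmap
# limits the raid5 catch-up to regions written in the meantime:
mdadm /dev/md4 --re-add /dev/md0
```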
* Re: Multi-layer raid status
From: Wols Lists @ 2018-02-02 16:49 UTC
To: David Brown, NeilBrown, linux-raid

On 02/02/18 15:40, David Brown wrote:
> If you read a block and the read fails, the raid system will already
> read the whole stripe to re-create the missing data.  [...]
>
> I cannot imagine a situation where you would have a disk that you know
> has incorrect data as part of your array and in normal use for a file
> system.

Can't you?  When I was discussing this originally I had a bunch of
examples given to me.  Let's take just one, which as far as I can tell
is real, and is probably far more common than system developers would
like to admit.

A drive glitches, and writes a load of data - intended for, let's say,
track 1398 - to track 1938 by mistake.  Okay, that particular example
is a decimal blunder, and a drive would more likely make a bit-flip
mistake instead, but writing data to the wrong place is apparently a
well-recognised intermittent failure mode.  (And it's not even always
the hardware to blame - it could just be an unfortunate cosmic ray
incident.)

Or - and it was reported on this list - a drive suffers a power glitch
and dumps the entire contents of its write buffer.

Either way, we now have a raid array which APPEARS to be functioning
normally, while a bunch of stripes are corrupt.  If you're lucky (and
yes, this does seem to be the normal state of affairs), it's just the
parity which has been corrupted, which a scrub will fix.  But if it's
not the parity, then with raid-1 and raid-5 you can kiss your data
bye-bye, and with raid-6 a scrub will send your data to data heaven.
And saying "it's never happened to me" doesn't mean it's never happened
to anyone else.

Let's go back a few years, to the development of the ext file system
from version 2 to version 4.  I can't remember the exact saying, but
it's something along the lines of "premature optimisation is the root
of all evil".  When an ext2 system crashed, you could easily spend
hours running fsck before the system was usable.  So the developers
developed ext3, with a journal.  By chance, this always wrote the data
blocks before the journal, so when the system crashed, the journal
fixed the file system, and the users were very happy they didn't need a
fsck.  Then the developers decided to optimise further into ext4, and
broke the link between data and journal!  So now an ext4 system might
boot faster after a crash, shaving seconds off journal replay time -
but the system took MUCH LONGER to be available to users, because the
filesystem now corrupted user data, and instead of running the
system-level fsck, users had to replace it with an application data
integrity tool.

So yes, my "integrity checking raid" might be slow.  That is why it
would be disabled by default, and require flipping a runtime switch to
enable it.  But it's a hell of a lot faster than an "mkfs and reload
from backup", which is the alternative if your disk is corrupt (as
opposed to crashed and dead).  And my way gives you a list of corrupted
files that need restoring, as opposed to "scrub, fix, and cross your
fingers".

And one last question - if my idea is stupid, why did somebody think it
worthwhile to write raid6check?  Why is it that so many kernel-level
guys seem to treat user data integrity with contempt?

Cheers,
Wol
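raid6check, shipped in the mdadm source tree, is the tool alluded to here: because raid6 carries two syndromes, it can report which component of a stripe is inconsistent rather than merely that the stripe mismatches. The invocation is roughly as follows (arguments from memory, so treat this as a sketch on a hypothetical raid6 array):

```shell
# raid6check <md-device> <start-stripe> <number-of-stripes>
# A length of 0 means "check through to the end of the array":
raid6check /dev/md5 0 0
```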