* resync'ing - what is going on
@ 2008-07-10 16:54 Keld Jørn Simonsen
2008-07-10 17:45 ` Jon Nelson
` (2 more replies)
0 siblings, 3 replies; 7+ messages in thread
From: Keld Jørn Simonsen @ 2008-07-10 16:54 UTC (permalink / raw)
To: linux-raid
I would like to know what is going on wrt resyncing, how it is done.
This is because I have some ideas to speed up the process.
I have noticed for a 4 drive raid10,f2 array that only about 25 % of the
IO speed is used during the rebuild. I would like to have something like
90 % as a goal.
This is especially for raid10,f2, where I think I can make it much
better, but possibly also for other raid types, as input to an
explanation on the wiki of what is really going on.
Are there references on the net? I tried to look but did not really find
anything.
I don't really understand why resync is going on for raid10,f2.
But maybe it checks all of the array, and checks that the two copies are
identical. Is that so? I got some communication from Neil that some
writing is involved in the resync, and I don't understand why.
And what happens if a discrepancy is found? Which of the 2 copies is the
good one? Maybe one could look at whether there are any CRC errors or disk
read retries going on. I could understand it for a raid10,f3 - then if one
was different from the 2 other copies - you could correct the odd copy.
For raid5 and raid6 I could imagine that the parity blocks were checked.
I could of course read the code, but I would like an overview before
delving into that part.
best regards
keld
* Re: resync'ing - what is going on
2008-07-10 16:54 resync'ing - what is going on Keld Jørn Simonsen
@ 2008-07-10 17:45 ` Jon Nelson
2008-07-10 18:03 ` Andre Noll
2008-07-11 4:51 ` Neil Brown
2 siblings, 0 replies; 7+ messages in thread
From: Jon Nelson @ 2008-07-10 17:45 UTC (permalink / raw)
To: Keld Jørn Simonsen; +Cc: linux-raid
On Thu, Jul 10, 2008 at 11:54 AM, Keld Jørn Simonsen <keld@dkuug.dk> wrote:
> I would like to know what is going on wrt resyncing, how it is done.
> This is because I have some ideas to speed up the process.
> I have noted for a 4 drive raid10,f2 array that only about 25 % of the
> IO speed is used during the rebuild, I would like to have something like
> 90 % as a goal.
>
> This is especially for raid10,f2, where I think I can make it much
> better, but possibly also for other raid types, as input to an
> explanation on the wiki of what is really going on.

Well, my guess (from a quick scan of the source) is that raid10 resyncs
by starting at the first logical block or stripe and proceeding through
the last. This means that there is quite a bit of seeking going on.
IMO, the loop should look more like this:

for component in all_components:
    for stripe_or_block in component:
        read_or_check_or_whatever(stripe_or_block)

This way, assuming no interference from other sources, each component
does the minimal seeking.

--
Jon
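Jon's loop can be written out as runnable Python. This is only a sketch of the iteration order, not md code: `walk_per_component`, the list-of-lists model of the component devices, and the `visit` callback are all names invented here for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def walk_per_component(
    components: Sequence[Sequence[str]],
    visit: Callable[[int, int, str], None],
) -> List[Tuple[int, int]]:
    """Visit every block of one component before touching the next one.

    components[d][o] stands for the block at offset o on device d.
    This is a hypothetical in-memory model, not an md data structure.
    Returns the (device, offset) pairs in the order they were visited.
    """
    order: List[Tuple[int, int]] = []
    for dev, blocks in enumerate(components):
        for offset, block in enumerate(blocks):
            visit(dev, offset, block)  # read, check, or whatever
            order.append((dev, offset))
    return order


# Two toy devices with three blocks each.
disks = [["a0", "a1", "a2"], ["b0", "b1", "b2"]]
seen: List[str] = []
order = walk_per_component(disks, lambda dev, off, blk: seen.append(blk))
print(order)  # each device is walked front to back, one device at a time
```

In this model each device sees strictly increasing offsets, which is the "minimal seeking" property Jon is after.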
* Re: resync'ing - what is going on
2008-07-10 16:54 resync'ing - what is going on Keld Jørn Simonsen
2008-07-10 17:45 ` Jon Nelson
@ 2008-07-10 18:03 ` Andre Noll
2008-07-11 4:51 ` Neil Brown
2 siblings, 0 replies; 7+ messages in thread
From: Andre Noll @ 2008-07-10 18:03 UTC (permalink / raw)
To: Keld Jørn Simonsen; +Cc: linux-raid
On 18:54, Keld Jørn Simonsen wrote:
> And what happens if a discrepancy is found? Which of the 2 copies is the
> good one?

Nobody can tell. IIRC, the md code picks the one with the smallest
drive id.

> For raid5 and raid6 I could imagine that the parity blocks were checked.

For raid5 you still can't tell which of the copies (if any) is good.
For raid6 you can tell, under the assumption that exactly one drive
has bad data. However, the raid6 code currently does not try to fix up
errors. Have a look at the archives for more.

Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
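Andre's point about raid5 can be illustrated with plain XOR parity. The sketch below is a toy model that uses short byte strings as blocks; `xor_blocks` and the sample data are made up for illustration and have nothing to do with the md implementation. It shows that a single parity block detects an inconsistency but gives the same evidence whether a data block or the parity block went bad.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"\x01\x02", b"\x10\x20", b"\x0f\xf0"]
parity = xor_blocks(data)

# A consistent stripe XORs to all zeros.
clean = xor_blocks(data + [parity])

# Flip one bit in a data block: the check fails ...
bad_data = [data[0], b"\x11\x20", data[2]]
syndrome_data = xor_blocks(bad_data + [parity])

# ... but flipping the same bit in the parity block instead
# produces exactly the same non-zero result.
bad_parity = bytes([parity[0] ^ 0x01, parity[1]])
syndrome_parity = xor_blocks(data + [bad_parity])

print(clean.hex(), syndrome_data.hex(), syndrome_parity.hex())
```

Because both corruptions leave the same residue, the check alone cannot say which block to trust. raid6 carries a second, independent syndrome, which is what makes locating a single bad drive possible in principle.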
* Re: resync'ing - what is going on
2008-07-10 16:54 resync'ing - what is going on Keld Jørn Simonsen
2008-07-10 17:45 ` Jon Nelson
2008-07-10 18:03 ` Andre Noll
@ 2008-07-11 4:51 ` Neil Brown
2008-07-11 22:29 ` Keld Jørn Simonsen
2 siblings, 1 reply; 7+ messages in thread
From: Neil Brown @ 2008-07-11 4:51 UTC (permalink / raw)
To: Keld Jørn Simonsen; +Cc: linux-raid
On Thursday July 10, keld@dkuug.dk wrote:
> I would like to know what is going on wrt resyncing, how it is done.
> This is because I have some ideas to speed up the process.
> I have noted for a 4 drive raid10,f2 array that only about 25 % of the
> IO speed is used during the rebuild, I would like to have something like
> 90 % as a goal.

"resync" and "recovery" are handled very differently in raid10.
"check" and "repair" are special cases of "resync".

"recovery" walks addresses from the start to the end of the component
drives.
At each address, it considers each drive which is being recovered and
finds a place on a different device to read the block for the current
(drive,address) from. It schedules a read and when the read request
completes it schedules the write.

On an f2 layout, this will read one drive from halfway to the end,
then from the start to halfway, and will write the other drive
sequentially.

"resync" walks the addresses from the start to end of the array.
At each address it reads every device block which stores that
array block. When all the reads complete the results are compared.
If they are not all the same, the "first" block is written out
to the others. (I think I might have told you before that it reads
one block and writes the others. I checked the code and that is
wrong).

Here "first" means (I think) the block with the earliest device
address, and if there are several of those, the block with the least
device index.

So for f2, this will read from both the start and the middle of
both devices. It will read 64K at a time, so you should get at least
a 32K read at each position before a seek (more with a larger chunk
size).

Clearly this won't be fast.

The reason this algorithm was chosen was that it makes sense for every
possible raid10 layout, even though it might not be optimal for some
of them.

> This is especially for raid10,f2, where I think I can make it much
> better, but possibly also for other raid types, as input to an
> explanation on the wiki of what is really going on.

Were I to try to make it fast for f2, I would probably shuffle the
bits in each request so that it did all the 'odd' chunks first, then
all the even chunks.
e.g. map
   0 1 2 3 4 5 6 7 8 ...
to
   0 1 4 5 8 9 ... 2 3 6 7 10 11 ...
(assuming a chunk size of '2').

The problem with this is that if you shut down while part way through a
resync, and then boot into a kernel which used a different
sequence, it would finish the resync checking the wrong blocks.
This is annoying but should not be insurmountable.

This way we leave the basic algorithm the same, but introduce
variations in the sequence for different specific layouts.

> Are there references on the net? I tried to look but did not really find
> something.

Just the source, sorry.

> I don't really understand why resync is going on for raid10,f2.
> But maybe it checks all of the array, and checks that the two copies are
> identical. Is that so? I got some communication with Neil that some
> writing is involved in the resync, I don't understand why.

raid1 does resync simply by reading one device and writing all the
others, and this is conceptually easiest. I had mistakenly thought
that I had used the same approach in raid10.

> And what happens if a discrepancy is found? Which of the 2 copies is the
> good one? Maybe one could look if there are any CRC errors, or disk read
> retries going on. I could understand if it was a raid10,f3 - then if one
> was different from the 2 other copies - you could correct the odd copy.

There is no "good" block - if they are different, then all are wrong.
md/raid just tries to return a consistent value, and leaves it up to
the filesystem to find and correct any errors.

> For raid5 and raid6 I could imagine that the parity blocks were checked.

If any inconsistency is found during a resync of raid4/5/6 the parity
blocks are changed to remove the inconsistency. This may not be
"right", but it is least likely to be "wrong".

> I could of course read the code, but I would like an overview before
> delving into that part.

Sensible :-)
Enjoy your reading.

NeilBrown
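The reordering Neil describes (handle every other chunk-sized group first, then come back for the ones that were skipped) can be sketched in a few lines of Python. This only reproduces the example mapping from the message; `reorder_for_f2` and its parameters are hypothetical names, and nothing like this exists in md as far as this thread says.

```python
from typing import List


def reorder_for_f2(n_blocks: int, chunk: int) -> List[int]:
    """Return block numbers 0..n_blocks-1 in two passes.

    Blocks are grouped into chunks of `chunk` blocks. The first pass
    takes chunks 0, 2, 4, ...; the second pass takes chunks 1, 3, 5, ...
    """
    groups = [
        list(range(start, min(start + chunk, n_blocks)))
        for start in range(0, n_blocks, chunk)
    ]
    first_pass = groups[0::2]
    second_pass = groups[1::2]
    return [block for group in first_pass + second_pass for block in group]


# Neil's example: chunk size 2, first twelve blocks.
print(reorder_for_f2(12, 2))
# → [0, 1, 4, 5, 8, 9, 2, 3, 6, 7, 10, 11]
```

The restart problem Neil mentions shows up directly here: a checkpoint that records only "resynced up to position N in the sequence" means different blocks depending on which ordering function produced the sequence.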
* Re: resync'ing - what is going on
2008-07-11 4:51 ` Neil Brown
@ 2008-07-11 22:29 ` Keld Jørn Simonsen
2008-07-12 10:44 ` Neil Brown
0 siblings, 1 reply; 7+ messages in thread
From: Keld Jørn Simonsen @ 2008-07-11 22:29 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
I took the information here and made something for the wiki.
Comments welcome. /keld

= recovery and resync =

The following is a recollection of what Neil Brown and others have
written on the linux-raid mailing list.

"resync" and "recovery" are handled very differently in raid10.
"check" and "repair" are special cases of "resync".

== recovery ==

The purpose of the recovery process is to fill a new disk with the
relevant information from a running array.

The assumption is that all data on the new disk needs to be written.

"recovery" walks addresses from the start to the end of the component
drives.
At each address, it considers each drive which is being recovered and
finds a place on a different device to read the block for the current
(drive,address) from. It schedules a read and when the read request
completes it schedules the write.

On an f2 layout, this will read one drive from halfway to the end,
then from the start to halfway, and will write the other drive
sequentially.

== resync ==

The purpose of resync is to ensure that all data on the array is
synchronized. There is an assumption that most, if not all, of the data
is already OK.

"resync" walks the addresses from the start to end of the array.
At each address it reads every device block which stores that
array block. When all the reads complete the results are compared.
If they are not all the same, the "first" block is written out
to the others.

Here "first" means (I think) the block with the earliest device
address, and if there are several of those, the block with the least
device index.

So for f2, this will read from both the start and the middle of
both devices. It will read 64K (the chunk size) at a time, so you
should get at least a 32K read at each position before a seek (more
with a larger chunk size).

Clearly this won't be fast.

The reason this algorithm was chosen was that it makes sense for every
possible raid10 layout, even though it might not be optimal for some
of them.

Were I to try to make it fast for f2, I would probably shuffle the
bits in each request so that it did all the 'odd' chunks first, then
all the even chunks.
e.g. map
   0 1 2 3 4 5 6 7 8 ...
to
   0 1 4 5 8 9 ... 2 3 6 7 10 11 ...
(assuming a chunk size of '2').

The problem with this is that if you shut down while part way through a
resync, and then boot into a kernel which used a different
sequence, it would finish the resync checking the wrong blocks.
This is annoying but should not be insurmountable.

This way we leave the basic algorithm the same, but introduce
variations in the sequence for different specific layouts.

Another idea would be to read a number of chunks from one part of the
f2 mirror, say 10 MB, and then read the corresponding 10 MB from the
other half of the f2 array. With current disk technology (80 MB/s) this
would mean 125 ms spent reading, and then 8 ms spent moving heads.

raid1 does resync simply by reading one device and writing all the
others, and this is conceptually easiest.

When repairing, there is no "good" block - if they are different, then
all are wrong. md/raid just tries to return a consistent value, and
leaves it up to the filesystem to find and correct any errors.
md/raid does not try to take advantage of information on failed CRC
on disk hardware, should that info be available to the kernel.

If any inconsistency is found during a resync of raid4/5/6 the parity
blocks are changed to remove the inconsistency. This may not be
"right", but it is least likely to be "wrong".
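The 10 MB figures in the draft can be checked with a little arithmetic. The sketch below uses a deliberately crude model (one head movement per sequential batch, nothing else costs time), which is an assumption made here and not something measured; `streaming_fraction` is a hypothetical helper name.

```python
def streaming_fraction(batch_mb: float, rate_mb_s: float, seek_ms: float) -> float:
    """Fraction of wall-clock time spent transferring data when each
    sequential batch is followed by exactly one head movement."""
    read_ms = batch_mb / rate_mb_s * 1000.0
    return read_ms / (read_ms + seek_ms)


# Figures from the text: 10 MB batches, 80 MB/s, 8 ms per head movement.
read_ms = 10 / 80 * 1000.0
fraction = streaming_fraction(10, 80, 8)
print(f"{read_ms:.0f} ms reading, {fraction:.1%} of the time transferring")
# → 125 ms reading, 94.0% of the time transferring
```

Under this model 10 MB batches would clear the 90 % goal stated at the start of the thread; real drives add rotational latency, queueing and competing IO, so the number is best read as an upper bound.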
* Re: resync'ing - what is going on
2008-07-11 22:29 ` Keld Jørn Simonsen
@ 2008-07-12 10:44 ` Neil Brown
2008-07-12 11:50 ` Keld Jørn Simonsen
0 siblings, 1 reply; 7+ messages in thread
From: Neil Brown @ 2008-07-12 10:44 UTC (permalink / raw)
To: Keld Jørn Simonsen; +Cc: linux-raid
On Saturday July 12, keld@dkuug.dk wrote:
> I took the information here and made something for the wiki.
> Comments welcome. /keld

Thanks for doing that. I'm not motivated to work on the wiki myself,
but I'm more than happy for anyone to take any secrets I divulge in
emails and publish them there.

> = recovery and resync =
>
> The following is a recollection of what Neil Brown and others have
> written on the linux-raid mailing list.
>
> "resync" and "recovery" are handled very differently in raid10.
> "check" and "repair" are special cases of "resync".
>
> == recovery ==
>
> The purpose of the recovery process is to fill a new disk with the
> relevant information from a running array.
>
> The assumption is that all data on the new disk needs to be written.
>
> "recovery" walks addresses from the start to the end of the component
> drives.
...
>
> == resync ==
...
>
> "resync" walks the addresses from the start to end of the array.

It's probably worth noting that this difference between recovery and
resync is specific to raid10.

recovery always follows the address space of component drives.

resync:
   on raid10, follows the address space of the array
   on raid4/5/6, follows the address space of component drives.
   on raid1, the two address spaces are the same.

Another way to say it is that raid10-resync follows the address space
of the array, everything else follows the component drives.

NeilBrown
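The difference between walking the array address space and walking a component drive can be made concrete with a toy model of an f2 layout. The placement rule used below is an assumption for illustration: the first copy of chunk c sits on device c mod n in the first half, and the second copy sits on the next device at the same offset in the second half. Real md geometry has more parameters, and the function names are invented here.

```python
from typing import List, Tuple

Read = Tuple[int, int]  # (device, offset in chunks)


def f2_copies(chunk: int, n_devices: int, half: int) -> List[Read]:
    """Both copies of an array chunk under the simplified f2 model."""
    dev, offset = chunk % n_devices, chunk // n_devices
    return [(dev, offset), ((dev + 1) % n_devices, half + offset)]


def resync_reads(n_devices: int, half: int) -> List[Read]:
    """raid10 resync: walk the array, reading every copy of each chunk."""
    reads: List[Read] = []
    for chunk in range(n_devices * half):
        reads.extend(f2_copies(chunk, n_devices, half))
    return reads


def recovery_reads(target: int, n_devices: int, half: int) -> List[Read]:
    """Recovery: walk the target drive, reading the other copy each time."""
    reads: List[Read] = []
    for offset in range(2 * half):
        if offset < half:
            # Target holds a first copy; its twin is in the far half
            # of the next device.
            reads.append(((target + 1) % n_devices, half + offset))
        else:
            # Target holds a second copy; its twin is in the near half
            # of the previous device.
            reads.append(((target - 1) % n_devices, offset - half))
    return reads


# Two devices, each half holding three chunks; device 0 is being rebuilt.
print(recovery_reads(0, 2, 3))
# → [(1, 3), (1, 4), (1, 5), (1, 0), (1, 1), (1, 2)]

# Offsets that device 0 is asked for during a resync of a smaller array.
print([off for dev, off in resync_reads(2, 2) if dev == 0])
# → [0, 2, 1, 3]
```

In this model recovery reads the surviving drive from halfway to the end and then from the start to halfway, matching the description earlier in the thread, while resync makes each drive alternate between its first and second half.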
* Re: resync'ing - what is going on
2008-07-12 10:44 ` Neil Brown
@ 2008-07-12 11:50 ` Keld Jørn Simonsen
0 siblings, 0 replies; 7+ messages in thread
From: Keld Jørn Simonsen @ 2008-07-12 11:50 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On Sat, Jul 12, 2008 at 08:44:48PM +1000, Neil Brown wrote:
> On Saturday July 12, keld@dkuug.dk wrote:
> > I took the information here and made something for the wiki.
> > Comments welcome. /keld
>
> Thanks for doing that. I'm not motivated to work on the wiki myself,
> but I'm more than happy for anyone to take any secrets I divulge in
> emails and publish them there.
>
> > = recovery and resync =
> >
> > The following is a recollection of what Neil Brown and others have
> > written on the linux-raid mailing list.
> >
> > "resync" and "recovery" are handled very differently in raid10.
> > "check" and "repair" are special cases of "resync".
> >
> > == recovery ==
> >
> > The purpose of the recovery process is to fill a new disk with the
> > relevant information from a running array.
> >
> > The assumption is that all data on the new disk needs to be written.
> >
> > "recovery" walks addresses from the start to the end of the component
> > drives.
> ...
> >
> > == resync ==
> ...
> >
> > "resync" walks the addresses from the start to end of the array.
>
> It's probably worth noting that this difference between recovery and
> resync is specific to raid10.
>
> recovery always follows the address space of component drives.
>
> resync:
>    on raid10, follows the address space of the array
>    on raid4/5/6, follows the address space of component drives.
>    on raid1, the two address spaces are the same.
>
> Another way to say it is that raid10-resync follows the address space
> of the array, everything else follows the component drives.

OK, I added that info:

   For raid10 "resync" walks the addresses from the start to end of
   the array. (For all other raid types "resync" follows the component
   drives).

I also added that for recovery, it is assumed that the running drives
are in sync. Is that true? I wrote it as:

   The assumption is that all data on the new disk needs to be written,
   and that the other data on the running array is correct.

I also wrote this to indicate the difference between a component drive
walk and a full array walk:

   "recovery" walks addresses from the start to the end of the
   component drives. Thus only data for the specific component drive is
   addressed.
end of thread, other threads:[~2008-07-12 11:50 UTC | newest]

Thread overview: 7+ messages
2008-07-10 16:54 resync'ing - what is going on Keld Jørn Simonsen
2008-07-10 17:45 ` Jon Nelson
2008-07-10 18:03 ` Andre Noll
2008-07-11 4:51 ` Neil Brown
2008-07-11 22:29 ` Keld Jørn Simonsen
2008-07-12 10:44 ` Neil Brown
2008-07-12 11:50 ` Keld Jørn Simonsen