* resync'ing - what is going on
From: Keld Jørn Simonsen @ 2008-07-10 16:54 UTC (permalink / raw)
To: linux-raid
I would like to know what is going on wrt resyncing, how it is done.
This is because I have some ideas to speed up the process.
I have noted for a 4 drive raid10,f2 array that only about 25 % of the
IO speed is used during the rebuild; I would like to have something like
90 % as a goal.
This is especially for raid10,f2, where I think I can make it much
better, but possibly also for other raid types, as input to an
explanation on the wiki of what is really going on.
Are there references on the net? I tried to look but did not really find
anything.
I don't really understand why resync is needed for raid10,f2.
But maybe it checks all of the array, verifying that the two copies are
identical. Is that so? I got some communication from Neil that some
writing is involved in the resync, but I don't understand why.
And what happens if a discrepancy is found? Which of the 2 copies is the
good one? Maybe one could check whether there are any CRC errors, or disk
read retries going on. I could understand it for raid10,f3 - then if one
copy was different from the 2 other copies, you could correct the odd one out.
For raid5 and raid6 I could imagine that the parity blocks were checked.
I could of course read the code, but I would like an overview before
delving into that part.
best regards
keld
* Re: resync'ing - what is going on
From: Jon Nelson @ 2008-07-10 17:45 UTC (permalink / raw)
To: Keld Jørn Simonsen; +Cc: linux-raid
On Thu, Jul 10, 2008 at 11:54 AM, Keld Jørn Simonsen <keld@dkuug.dk> wrote:
> I would like to know what is going on wrt resyncing, how it is done.
> This is because I have some ideas to speed up the process.
> I have noted for a 4 drive raid10,f2 array that only about 25 % of the
> IO speed is used during the rebuild; I would like to have something like
> 90 % as a goal.
>
> This is especially for raid10,f2, where I think I can make it much
> better, but possibly also for other raid types, as input to an
> explanation on the wiki of what is really going on.
Well, my guess (and from a quick scan of the source) is that raid10
resyncs by starting at the first logical block or stripe and proceeding
through to the last. This means that there is quite a bit of seeking
going on. IMO, the loop should look more like this:
    for component in all_components:
        for stripe_or_block in component:
            read_or_check_or_whatever(stripe_or_block)
This way, assuming no interference from other sources, each component
does the minimal seeking.
--
Jon
--
* Re: resync'ing - what is going on
From: Andre Noll @ 2008-07-10 18:03 UTC (permalink / raw)
To: Keld Jørn Simonsen; +Cc: linux-raid
On 18:54, Keld Jørn Simonsen wrote:
> And what happens if a discrepancy is found? Which of the 2 copies is the
> good one?
Nobody can tell. IIRC, the md code picks the one with the smallest drive
id.
> For raid5 and raid6 I could imagine that the parity blocks were checked.
For raid5 you still can't tell which of the copies (if any) is
good. For raid6 you can tell under the assumption that exactly one
drive has bad data. However, the raid6 code currently does not try
to fix up errors.
Have a look at the archives for more.
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
* Re: resync'ing - what is going on
From: Neil Brown @ 2008-07-11 4:51 UTC (permalink / raw)
To: Keld Jørn Simonsen; +Cc: linux-raid
On Thursday July 10, keld@dkuug.dk wrote:
> I would like to know what is going on wrt resyncing, how it is done.
> This is because I have some ideas to speed up the process.
> I have noted for a 4 drive raid10,f2 array that only about 25 % of the
> IO speed is used during the rebuild; I would like to have something like
> 90 % as a goal.
"resync" and "recovery" are handled very differently in raid10.
"check" and "repair" are special cases of "resync".
"recovery" walks addresses from the start to the end of the component
drives.
At each address, it considers each drive which is being recovered and
finds a place on a different device to read the block for the current
(drive,address) from. It schedules a read and when the read request
completes it schedules the write.
On an f2 layout, this will read one drive from halfway to the end,
then from the start to halfway, and will write the other drive
sequentially.
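That recovery walk can be sketched in a few lines. The structures and the
mirror_of() layout below are illustrative stand-ins, not the actual kernel
data structures in drivers/md/raid10.c:

```python
# Toy model of md/raid10 "recovery": walk the address space of the
# component drives and, for every block of a drive being rebuilt, read
# the other copy of that block from a healthy device and write it back.

def recover(drives, recovering, mirror_of):
    """drives: list of per-drive block lists; recovering: set of drive
    indices being rebuilt; mirror_of(d, a) -> (d2, a2), the location of
    the other copy of block (d, a)."""
    for addr in range(len(drives[0])):          # start to end of each drive
        for d in recovering:
            src_d, src_a = mirror_of(d, addr)
            # schedule the read, then the write, for this (drive, addr)
            drives[d][addr] = drives[src_d][src_a]

# 2-drive f2-style layout: each drive's second half mirrors the other
# drive's first half.
half = 2
drives = [["a0", "a1", "b0", "b1"],   # healthy drive 0
          ["?"] * 4]                  # drive 1 is blank, being rebuilt

def mirror_of(d, a):
    # the other copy of (d, a) lives on the other drive, half a disk away
    return (1 - d, (a + half) % (2 * half))

recover(drives, {1}, mirror_of)
print(drives[1])                      # rebuilt drive 1: ['b0', 'b1', 'a0', 'a1']
```

Note how the reads jump between the two halves of the source drive while the
writes to the rebuilt drive are purely sequential, as described above.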
"resync" walks the addresses from the start to end of the array.
At each address it reads every device block which stores that
array block. When all the reads complete the results are compared.
If they are not all the same, the "first" block is written out
to the others. (I think I might have told you before that it reads
one block and writes the others; I checked the code, and that is
wrong.)
Here "first" means (I think) the block with the earliest device
address, and if there are several of those, the block with the least
device index.
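A toy sketch of that resync pass, with the "first"-copy rule made explicit
(the layout callback and names here are illustrative, not the kernel's own):

```python
# Toy model of md/raid10 "resync": walk the array's addresses, read every
# copy of each array block, and on a mismatch overwrite the other copies
# with the "first" one (lowest device address, then lowest device index).

def resync(drives, nblocks, copies_of):
    """copies_of(b) -> [(drive, addr), ...] holding array block b."""
    for b in range(nblocks):                      # start to end of array
        locs = sorted(copies_of(b), key=lambda da: (da[1], da[0]))
        vals = [drives[d][a] for d, a in locs]
        if any(v != vals[0] for v in vals[1:]):   # copies disagree
            for d, a in locs[1:]:
                drives[d][a] = vals[0]            # write out the "first"

# mirrored pair with one corrupted copy of array block 1
drives = [["x0", "x1"], ["x0", "??"]]
resync(drives, 2, lambda b: [(0, b), (1, b)])
print(drives[1])                                  # ['x0', 'x1']
```

The sort key encodes the rule quoted above: earliest device address wins,
with device index as the tie-breaker.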
So for f2, this will read from both the start and the middle of
both devices. It will read 64K at a time, so you should get at least
a 32K read at each position before a seek (more with a larger chunk
size).
Clearly this won't be fast.
The reason this algorithm was chosen was that it makes sense for every
possible raid10 layout, even though it might not be optimal for some
of them.
>
> This is especially for raid10,f2, where I think I can make it much
> better, but posssibly also for other raid types, as input to an
> explanation on the wiki of what is really going on.
Were I to try to make it fast for f2, I would probably shuffle the
bits in each request so that it did all the 'odd' chunks first, then
all the even chunks.
e.g. map
0 1 2 3 4 5 6 7 8 ...
to
0 1 4 5 8 9 ..... 2 3 6 7 10 11 ....
(assuming a chunk size of '2').
The problem with this is that if you shut down while part way through a
resync, and then boot into a kernel which used a different
sequence, it would finish the resync checking the wrong blocks.
This is annoying but should not be insurmountable.
This way we leave the basic algorithm the same, but introduce
variations in the sequence for different specific layouts.
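The odd/even reordering above is easy to state as code; this sketch just
produces the visiting order for one pass (chunk size 2, matching the
example):

```python
# Sketch of the proposed f2 resync reordering: visit all "even" chunks
# first, then all "odd" chunks, so each pass reads one half of the
# drives sequentially instead of seeking between halves.

def shuffled(n, chunk=2):
    addrs = list(range(n))
    evens = [a for a in addrs if (a // chunk) % 2 == 0]
    odds = [a for a in addrs if (a // chunk) % 2 == 1]
    return evens + odds

print(shuffled(12))   # [0, 1, 4, 5, 8, 9, 2, 3, 6, 7, 10, 11]
```

Every address is still visited exactly once, which is why only the
resume-after-shutdown bookkeeping, not the algorithm itself, needs care.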
>
> Are there references on the net? I tried to look but did not really find
> anything.
Just the source, sorry.
>
> I don't really understand why resync is going on for raid10,f2.
> But maybe it checks all of the array, and checks that the two copies are
> identical. Is that so? I got some communication with Neil that some
> writing is involved in the resync, I don't understand why.
raid1 does resync simply by reading one device and writing all the
others, and this is conceptually easiest. I had mistakenly thought
that I had used the same approach in raid10.
>
> And what happens if a discrepancy is found? Which of the 2 copies are the
> good one? Maybe one could look if there are any CRC errors, or disk read
> retries going on. I could understand if it was a raid10,f3 - then if one
> was different from the 2 other copies - you could correct the odd copy.
There is no "good" block - if they are different, then all are wrong.
md/raid just tries to return a consistent value, and leaves it up to
the filesystem to find and correct any errors.
>
> For raid5 and raid6 I could imagine that the parity blocks were checked.
If any inconsistency is found during a resync of raid4/5/6 the parity
blocks are changed to remove the inconsistency. This may not be
"right", but it is least likely to be "wrong".
>
> I could of course read the code, but I would like an overview before
> delving into that part.
Sensible :-)
Enjoy your reading.
NeilBrown
* Re: resync'ing - what is going on
From: Keld Jørn Simonsen @ 2008-07-11 22:29 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
I took the information here and made something for the wiki.
Comments welcome. /keld
= recovery and resync =
The following is a recollection of what Neil Brown and others have
written on the linux-raid mailing list.
"resync" and "recovery" are handled very differently in raid10.
"check" and "repair" are special cases of "resync".
== recovery ==
The purpose of the recovery process is to fill a new disk with the
relevant information from a running array.
The assumption is that all data on the new disk needs to be written.
"recovery" walks addresses from the start to the end of the component
drives.
At each address, it considers each drive which is being recovered and
finds a place on a different device to read the block for the current
(drive,address) from. It schedules a read and when the read request
completes it schedules the write.
On an f2 layout, this will read one drive from halfway to the end,
then from the start to halfway, and will write the other drive
sequentially.
== resync ==
The purpose of resync is to ensure that all data on the array is
synchronized. There is an assumption that most, if not all, of the
data is already OK.
"resync" walks the addresses from the start to end of the array.
At each address it reads every device block which stores that
array block. When all the reads complete the results are compared.
If they are not all the same, the "first" block is written out
to the others.
Here "first" means (I think) the block with the earliest device
address, and if there are several of those, the block with the least
device index.
So for f2, this will read from both the start and the middle of
both devices. It will read 64K (the chunk size) at a time, so you
should get at least a 32K read at each position before a seek
(more with a larger chunk size).
Clearly this won't be fast.
The reason this algorithm was chosen was that it makes sense for every
possible raid10 layout, even though it might not be optimal for some
of them.
Were I to try to make it fast for f2, I would probably shuffle the
bits in each request so that it did all the 'odd' chunks first, then
all the even chunks.
e.g. map
0 1 2 3 4 5 6 7 8 ...
to
0 1 4 5 8 9 ..... 2 3 6 7 10 11 ....
(assuming a chunk size of '2').
The problem with this is that if you shut down while part way through a
resync, and then boot into a kernel which used a different
sequence, it would finish the resync checking the wrong blocks.
This is annoying but should not be insurmountable.
This way we leave the basic algorithm the same, but introduce
variations in the sequence for different specific layouts.
Another idea would be to read a number of chunks from one part of the f2
mirror, say 10 MB, and then read the corresponding 10 MB from the other
half of the f2 array. With current disk technology (80 MB/s) this would
mean 125 ms spent reading, and then 8 ms spent moving heads.
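As a quick sanity check of those numbers (80 MB/s sequential rate and an
8 ms seek are the figures assumed in the paragraph above):

```python
# Back-of-envelope check of the batched-resync idea: read a 10 MB batch
# from one half of the f2 mirror, seek, then read the corresponding
# 10 MB from the other half.

batch_mb, rate_mb_s, seek_s = 10, 80, 0.008
read_s = batch_mb / rate_mb_s            # time reading one batch
efficiency = read_s / (read_s + seek_s)  # fraction of time spent reading
print(round(read_s * 1000), round(efficiency * 100))   # 125 94
```

So batching at this size would spend roughly 94 % of the time reading,
close to the 90 % goal stated at the top of the thread.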
raid1 does resync simply by reading one device and writing all the
others, and this is conceptually easiest.
When repairing, there is no "good" block - if they are different, then
all are wrong. md/raid just tries to return a consistent value, and
leaves it up to the filesystem to find and correct any errors.
md/raid does not try to take advantage of information about failed CRCs
in the disk hardware, should that info be available to the kernel.
If any inconsistency is found during a resync of raid4/5/6 the parity
blocks are changed to remove the inconsistency. This may not be
"right", but it is least likely to be "wrong".
On Fri, Jul 11, 2008 at 02:51:33PM +1000, Neil Brown wrote:
> On Thursday July 10, keld@dkuug.dk wrote:
> > I would like to know what is going on wrt resyncing, how it is done.
> > This is because I have some ideas to speed up the process.
> > I have noted for a 4 drive raid10,f2 array that only about 25 % of the
> > IO speed is used during the rebuild; I would like to have something like
> > 90 % as a goal.
>
> "resync" and "recovery" are handled very differently in raid10.
> "check" and "repair" are special cases of "resync".
>
> "recovery" walks addresses from the start to the end of the component
> drives.
> At each address, it considers each drive which is being recovered and
> finds a place on a different device to read the block for the current
> (drive,address) from. It schedules a read and when the read request
> completes it schedules the write.
>
> On an f2 layout, this will read one drive from halfway to the end,
> then from the start to halfway, and will write the other drive
> sequentially.
>
> "resync" walks the addresses from the start to end of the array.
> At each address it reads every device block which stores that
> array block. When all the reads complete the results are compared.
> If they are not all the same, the "first" block is written out
> to the others. (I think I might have told you before that it reads
> one block and writes the others; I checked the code, and that is
> wrong.)
>
> Here "first" means (I think) the block with the earliest device
> address, and if there are several of those, the block with the least
> device index.
>
> So for f2, this will read from both the start and the middle of
> both devices. It will read 64K at a time, so you should get at least
> a 32K read at each position before a seek (more with a larger chunk
> size).
>
> Clearly this won't be fast.
>
> The reason this algorithm was chosen was that it makes sense for every
> possible raid10 layout, even though it might not be optimal for some
> of them.
>
> >
> > This is especially for raid10,f2, where I think I can make it much
> > better, but possibly also for other raid types, as input to an
> > explanation on the wiki of what is really going on.
>
> Were I to try to make it fast for f2, I would probably shuffle the
> bits in each request so that it did all the 'odd' chunks first, then
> all the even chunks.
> e.g. map
> 0 1 2 3 4 5 6 7 8 ...
> to
> 0 1 4 5 8 9 ..... 2 3 6 7 10 11 ....
> (assuming a chunk size of '2').
>
> The problem with this is that if you shut down while part way through a
> resync, and then boot into a kernel which used a different
> sequence, it would finish the resync checking the wrong blocks.
> This is annoying but should not be insurmountable.
>
> This way we leave the basic algorithm the same, but introduce
> variations in the sequence for different specific layouts.
>
> >
> > Are there references on the net? I tried to look but did not really find
> > anything.
>
> Just the source, sorry.
>
> >
> > I don't really understand why resync is going on for raid10,f2.
> > But maybe it checks all of the array, and checks that the two copies are
> > identical. Is that so? I got some communication with Neil that some
> > writing is involved in the resync, I don't understand why.
>
> raid1 does resync simply by reading one device and writing all the
> others, and this is conceptually easiest. I had mistakenly thought
> that I had used the same approach in raid10.
>
> >
> > And what happens if a discrepancy is found? Which of the 2 copies is the
> > good one? Maybe one could look if there are any CRC errors, or disk read
> > retries going on. I could understand if it was a raid10,f3 - then if one
> > was different from the 2 other copies - you could correct the odd copy.
>
> There is no "good" block - if they are different, then all are wrong.
> md/raid just tries to return a consistent value, and leaves it up to
> the filesystem to find and correct any errors.
>
> >
> > For raid5 and raid6 I could imagine that the parity blocks were checked.
>
> If any inconsistency is found during a resync of raid4/5/6 the parity
> blocks are changed to remove the inconsistency. This may not be
> "right", but it is least likely to be "wrong".
>
> >
> > I could of course read the code, but I would like an overview before
> > delving into that part.
>
> Sensible :-)
> Enjoy your reading.
>
> NeilBrown
* Re: resync'ing - what is going on
From: Neil Brown @ 2008-07-12 10:44 UTC (permalink / raw)
To: Keld Jørn Simonsen; +Cc: linux-raid
On Saturday July 12, keld@dkuug.dk wrote:
> I took the information here and made something for the wiki.
> Comments welcome. /keld
Thanks for doing that. I'm not motivated to work on the wiki myself,
but I'm more than happy for anyone to take any secrets I divulge in
emails and publish them there.
>
> = recovery and resync =
>
> The following is a recollection of what Neil Brown and others have
> written
> on the linux-raid mailing list.
>
> "resync" and "recovery" are handled very differently in raid10.
> "check" and "repair" are special cases of "resync".
>
> == recovery ==
>
> The purpose of the recovery process is to fill a new disk with the
> relevant
> information from a running array.
>
> The assumption is that all data on the new disk needs to be written.
>
> "recovery" walks addresses from the start to the end of the component
> drives.
...
>
> == resync ==
...
>
> "resync" walks the addresses from the start to end of the array.
It's probably worth noting that this difference between recovery and
resync is specific to raid10.
recovery always follows the address space of component drives.
resync:
on raid10, follows the address space of the array
on raid4/5/6, follows the address space of component drives.
on raid1, the two address spaces are the same.
Another way to say it is that raid10-resync follows the address space
of the array, everything else follows the component drives.
NeilBrown
* Re: resync'ing - what is going on
From: Keld Jørn Simonsen @ 2008-07-12 11:50 UTC (permalink / raw)
To: Neil Brown; +Cc: linux-raid
On Sat, Jul 12, 2008 at 08:44:48PM +1000, Neil Brown wrote:
> On Saturday July 12, keld@dkuug.dk wrote:
> > I took the information here and made something for the wiki.
> > Comments welcome. /keld
>
> Thanks for doing that. I'm not motivated to work on the wiki myself,
> but I'm more than happy for anyone to take any secrets I divulge in
> emails and publish them there.
>
> >
> > = recovery and resync =
> >
> > The following is a recollection of what Neil Brown and others have
> > written
> > on the linux-raid mailing list.
> >
> > "resync" and "recovery" are handled very differently in raid10.
> > "check" and "repair" are special cases of "resync".
> >
> > == recovery ==
> >
> > The purpose of the recovery process is to fill a new disk with the
> > relevant
> > information from a running array.
> >
> > The assumption is that all data on the new disk needs to be written.
> >
> > "recovery" walks addresses from the start to the end of the component
> > drives.
> ...
> >
> > == resync ==
> ...
> >
> > "resync" walks the addresses from the start to end of the array.
>
> It's probably worth noting that this difference between recovery and
> resync is specific to raid10.
>
> recovery always follows the address space of component drives.
>
> resync:
> on raid10, follows the address space of the array
> on raid4/5/6, follows the address space of component drives.
> on raid1, the two address spaces are the same.
>
> Another way to say it is that raid10-resync follows the address space
> of the array, everything else follows the component drives.
OK, I added that info:
For raid10 "resync" walks the addresses from the start to end of the
array. (For all other raid types "resync" follows the component drives).
I also added that for recovery, it is assumed
that the running drives are in sync. Is that true? I wrote it as:
The assumption is that all data on the new disk needs to be written, and
that the other data on the running array is correct.
I also wrote this to indicate the difference between a component drive
walk, and a full array walk:
"recovery" walks addresses from the start to the end of the component
drives. Thus only data for the specific component drive is addressed.