* sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-01-19 16:19 UTC
To: linux-raid

Greetings,
I am looking into a scenario in which the md raid5/6 array is resyncing
(e.g., after a fresh creation) and there is a drive failure. As written in
Neil's blog entry "Closing the RAID5 write hole"
(http://neil.brown.name/blog/20110614101708): "if a device fails during the
resync, md doesn't take special action - it just allows the array to be used
without a resync even though there could be corrupt data".

However, I noticed that at this point sb->resync_offset in the superblock is
not set to MaxSector. At this point, if a drive is added/re-added to the
array, then drive recovery starts, i.e., md assumes that data/parity on the
surviving drives are correct, and uses them to rebuild the new drive. This
state of data/parity being correct should be reflected as
sb->resync_offset==MaxSector, shouldn't it?

One issue that I ran into is the following: I reached a situation in which,
during array assembly, sb->resync_offset==sb->size. At this point, the
following code in mdadm assumes that the array is clean:

	info->array.state =
		(__le64_to_cpu(sb->resync_offset) >= __le64_to_cpu(sb->size))
		? 1 : 0;

As a result, mdadm lets the array assembly flow through fine to the kernel,
but in the kernel the following code refuses to start the array:

	if (mddev->degraded > dirty_parity_disks &&
	    mddev->recovery_cp != MaxSector) {

At this point, specifying --force to mdadm --assemble doesn't help, because
mdadm thinks that the array is clean (clean==1), and therefore doesn't do the
"force-array" update, which would knock off the sb->resync_offset value. So
there is no way to start the array, unless specifying the
start_dirty_degraded=1 kernel parameter.

So one question is: should mdadm compare sb->resync_offset to MaxSector and
not to sb->size? In the kernel code, resync_offset is always compared to
MaxSector.

Another question is: should sb->resync_offset be set to MaxSector by the
kernel as soon as it starts rebuilding a drive? I think this would be
consistent with what Neil wrote in the blog entry.

Here is the scenario to reproduce the issue I described:
# Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
# Fail drive D. Array aborts the resync and then immediately restarts it
  (it seems to checkpoint mddev->recovery_cp, but I am not sure that it
  restarts from that checkpoint).
# Re-add drive D to the array. It is added as a spare, array continues resyncing.
# Fail drive C. Array aborts the resync, and then starts rebuilding drive D.
  At this point sb->resync_offset is some valid value (usually 0, not
  MaxSector and not sb->size).
# Stop the array. At this point sb->resync_offset is sb->size in all the
  superblocks.

Another question I have: when exactly does md decide to update
sb->resync_offset in the superblock? I am playing with similar scenarios with
raid5, and sometimes I end up with MaxSector and sometimes with valid values.
From the code, it looks like only this logic updates it:

	if (mddev->in_sync)
		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
	else
		sb->resync_offset = cpu_to_le64(0);

except for resizing and setting through sysfs. But I don't understand how
this value should be managed in general.

Thanks!
Alex.
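To make the first question concrete: the suggestion amounts to mdadm
mirroring the kernel's sentinel test instead of comparing against the device
size. A minimal, self-contained sketch follows; MAX_SECTOR here is only a
stand-in for the kernel's MaxSector value, and the real mdadm code would
still apply the __le64_to_cpu() byte-swap to the on-disk field:

	#include <stdint.h>

	/* Stand-in for the kernel's MaxSector "no resync pending" sentinel. */
	#define MAX_SECTOR (~(uint64_t)0)

	/* Sketch of the proposed check: the array is clean only when
	 * resync_offset holds the sentinel, matching the kernel's
	 * "recovery_cp != MaxSector" test rather than comparing against
	 * sb->size. */
	static int array_is_clean(uint64_t resync_offset)
	{
		return resync_offset == MAX_SECTOR;
	}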
* Re: sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-01-24 14:25 UTC
To: NeilBrown; +Cc: linux-raid

Hello Neil,
I hope you can find some time to look at my doubts in the email below.
Meanwhile I realized I have more doubts about resync. Hopefully, you will be
able to give some information on those too...

# I am looking at the "--force" parameter for assembly, and also at the
"start_dirty_degraded" kernel parameter. They are actually very different:
"--force" marks the array as clean (sets sb->resync_offset=MaxSector), while
if start_dirty_degraded==1, the kernel actually starts resyncing the array.
For RAID5, it starts and stops immediately (correct?). But for RAID6 coming
up with one missing drive, the kernel will do the resync using the remaining
redundant drive. So start_dirty_degraded==1 is "better" than just forgetting
about the resync with "--force", isn't it? Because we will still have one
parity block correct.

Do you think the following logic is appropriate: always set the
start_dirty_degraded=1 kernel parameter. In mdadm, during assembly, detect
dirty+degraded, and if "--force" is not given - abort. If "--force" is given,
don't knock off sb->resync_offset (like the code does today), assemble the
array and let the kernel start the resync (if there is still a redundant
drive).

# I saw an explanation on the list that for RAID6 a full stripe is always
rewritten. Given this, I think I don't understand why the initial resync of
the array is needed. For those areas never written to, the parity may remain
incorrect, because reading data from there is not expected to return anything
meaningful. For those areas written to, the parity will be recalculated while
writing. So reading from those areas should have correct parity in degraded
mode. I must be missing something here for sure, can you tell me what?

Thanks,
Alex.

On Thu, Jan 19, 2012 at 6:19 PM, Alexander Lyakas <alex.bolshoy@gmail.com> wrote:
> Greetings,
> I am looking into a scenario in which the md raid5/6 array is
> resyncing (e.g., after a fresh creation) and there is a drive failure.
> As written in Neil's blog entry "Closing the RAID5 write hole"
> (http://neil.brown.name/blog/20110614101708): "if a device fails
> during the resync, md doesn't take special action - it just allows the
> array to be used without a resync even though there could be corrupt
> data".
>
> However, I noticed that at this point sb->resync_offset in the
> superblock is not set to MaxSector. At this point, if a drive is
> added/re-added to the array, then drive recovery starts, i.e., md
> assumes that data/parity on the surviving drives are correct, and uses
> them to rebuild the new drive. This state of data/parity being correct
> should be reflected as sb->resync_offset==MaxSector, shouldn't it?
>
> One issue that I ran into is the following: I reached a situation in
> which, during array assembly, sb->resync_offset==sb->size. At this
> point, the following code in mdadm assumes that the array is clean:
>	info->array.state =
>		(__le64_to_cpu(sb->resync_offset) >= __le64_to_cpu(sb->size))
>		? 1 : 0;
> As a result, mdadm lets the array assembly flow through fine to the
> kernel, but in the kernel the following code refuses to start the
> array:
>	if (mddev->degraded > dirty_parity_disks &&
>	    mddev->recovery_cp != MaxSector) {
>
> At this point, specifying --force to mdadm --assemble doesn't help,
> because mdadm thinks that the array is clean (clean==1), and therefore
> doesn't do the "force-array" update, which would knock off the
> sb->resync_offset value. So there is no way to start the array, unless
> specifying the start_dirty_degraded=1 kernel parameter.
>
> So one question is: should mdadm compare sb->resync_offset to
> MaxSector and not to sb->size? In the kernel code, resync_offset is
> always compared to MaxSector.
>
> Another question is: should sb->resync_offset be set to MaxSector by
> the kernel as soon as it starts rebuilding a drive? I think this would
> be consistent with what Neil wrote in the blog entry.
>
> Here is the scenario to reproduce the issue I described:
> # Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
> # Fail drive D. Array aborts the resync and then immediately restarts
> it (it seems to checkpoint mddev->recovery_cp, but I am not sure
> that it restarts from that checkpoint).
> # Re-add drive D to the array. It is added as a spare, array continues resyncing.
> # Fail drive C. Array aborts the resync, and then starts rebuilding
> drive D. At this point sb->resync_offset is some valid value (usually
> 0, not MaxSector and not sb->size).
> # Stop the array. At this point sb->resync_offset is sb->size in all
> the superblocks.
>
> Another question I have: when exactly does md decide to update
> sb->resync_offset in the superblock? I am playing with similar
> scenarios with raid5, and sometimes I end up with MaxSector and
> sometimes with valid values. From the code, it looks like only this
> logic updates it:
>	if (mddev->in_sync)
>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>	else
>		sb->resync_offset = cpu_to_le64(0);
> except for resizing and setting through sysfs. But I don't understand
> how this value should be managed in general.
>
> Thanks!
> Alex.
* Re: sb->resync_offset value after resync failure
From: NeilBrown @ 2012-01-26 0:41 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Tue, 24 Jan 2012 16:25:04 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Hello Neil,
> I hope you can find some time to look at my doubts in the email below.
> Meanwhile I realized I have more doubts about resync. Hopefully, you
> will be able to give some information on those too...
>
> # I am looking at the "--force" parameter for assembly, and also at the
> "start_dirty_degraded" kernel parameter. They are actually very
> different: "--force" marks the array as clean (sets
> sb->resync_offset=MaxSector), while if start_dirty_degraded==1, the
> kernel actually starts resyncing the array. For RAID5, it starts and
> stops immediately (correct?). But for RAID6 coming up with one missing
> drive, the kernel will do the resync using the remaining redundant drive.
> So start_dirty_degraded==1 is "better" than just forgetting about the
> resync with "--force", isn't it? Because we will still have one parity
> block correct.

Correct.
Degraded RAID5 is either clean or faulty. There is no sense in which such an
array can be simply "dirty" as there is no redundancy.
Singly-degraded RAID6 can be "dirty" as it still has one disk of redundancy
which could be inconsistent with the rest. So there is an extra state that we
need to handle which we currently do not.

> Do you think the following logic is appropriate: always set the
> start_dirty_degraded=1 kernel parameter. In mdadm, during assembly,
> detect dirty+degraded, and if "--force" is not given - abort. If "--force"
> is given, don't knock off sb->resync_offset (like the code does today),
> assemble the array and let the kernel start the resync (if there is still
> a redundant drive).

I would rather export mddev->ok_start_degraded via sysfs. Then mdadm can
respond to the --force flag on a dirty/degraded RAID6 by either setting the
flag if it exists, or marking the array 'clean' and starting a 'repair'.

> # I saw an explanation on the list that for RAID6 a full stripe is
> always rewritten. Given this, I think I don't understand why the
> initial resync of the array is needed. For those areas
> never written to, the parity may remain incorrect, because reading
> data from there is not expected to return anything meaningful. For
> those areas written to, the parity will be
> recalculated while writing. So reading from those areas should have
> correct parity in degraded mode. I must be missing something here for
> sure, can you tell me what?

The initial sync isn't needed and you are free to specify --assume-clean if
you like. However I don't guarantee that RAID6 will always perform
reconstruct-write (rather than read-modify-write).
Also, most people scrub their arrays periodically and if you haven't done an
initial sync, the first time you do a scrub you will likely get a high
mismatch count, which might be confusing.
So I resync by default because it is safest. If you know what you are doing,
then explicitly disabling that is perfectly OK.

NeilBrown
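To make the sysfs suggestion concrete, here is a rough, untested sketch of
what such a per-array attribute could look like. It assumes md's usual
md_sysfs_entry pattern; the attribute name and the exact wiring are
assumptions for illustration, not an actual patch:

	/* Sketch only: expose mddev->ok_start_degraded per array, so mdadm
	 * could set it at assembly time instead of relying on the global
	 * start_dirty_degraded module parameter.  Not compile-tested. */
	static ssize_t
	ok_start_degraded_show(struct mddev *mddev, char *page)
	{
		return sprintf(page, "%d\n", mddev->ok_start_degraded);
	}

	static ssize_t
	ok_start_degraded_store(struct mddev *mddev, const char *buf, size_t len)
	{
		unsigned long n;

		if (kstrtoul(buf, 10, &n) || n > 1)
			return -EINVAL;
		mddev->ok_start_degraded = n;
		return len;
	}

	static struct md_sysfs_entry md_ok_start_degraded =
	__ATTR(ok_start_degraded, S_IRUGO|S_IWUSR,
	       ok_start_degraded_show, ok_start_degraded_store);

With something along these lines, mdadm could write 1 to the attribute before
running the array when --force is given on a dirty, singly-degraded RAID6.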
* Re: sb->resync_offset value after resync failure
From: NeilBrown @ 2012-01-26 0:51 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Thu, 19 Jan 2012 18:19:55 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Greetings,
> I am looking into a scenario in which the md raid5/6 array is
> resyncing (e.g., after a fresh creation) and there is a drive failure.
> As written in Neil's blog entry "Closing the RAID5 write hole"
> (http://neil.brown.name/blog/20110614101708): "if a device fails
> during the resync, md doesn't take special action - it just allows the
> array to be used without a resync even though there could be corrupt
> data".
>
> However, I noticed that at this point sb->resync_offset in the
> superblock is not set to MaxSector. At this point, if a drive is
> added/re-added to the array, then drive recovery starts, i.e., md
> assumes that data/parity on the surviving drives are correct, and uses
> them to rebuild the new drive. This state of data/parity being correct
> should be reflected as sb->resync_offset==MaxSector, shouldn't it?
>
> One issue that I ran into is the following: I reached a situation in
> which, during array assembly, sb->resync_offset==sb->size. At this
> point, the following code in mdadm assumes that the array is clean:
>	info->array.state =
>		(__le64_to_cpu(sb->resync_offset) >= __le64_to_cpu(sb->size))
>		? 1 : 0;
> As a result, mdadm lets the array assembly flow through fine to the
> kernel, but in the kernel the following code refuses to start the
> array:
>	if (mddev->degraded > dirty_parity_disks &&
>	    mddev->recovery_cp != MaxSector) {
>
> At this point, specifying --force to mdadm --assemble doesn't help,
> because mdadm thinks that the array is clean (clean==1), and therefore
> doesn't do the "force-array" update, which would knock off the
> sb->resync_offset value. So there is no way to start the array, unless
> specifying the start_dirty_degraded=1 kernel parameter.
>
> So one question is: should mdadm compare sb->resync_offset to
> MaxSector and not to sb->size? In the kernel code, resync_offset is
> always compared to MaxSector.

Yes, mdadm should be consistent with the kernel. Patches welcome.

> Another question is: should sb->resync_offset be set to MaxSector by
> the kernel as soon as it starts rebuilding a drive? I think this would
> be consistent with what Neil wrote in the blog entry.

Maybe every time we update ->curr_resync_completed we should update
->recovery_cp as well if it is below the new ->curr_resync_completed??

> Here is the scenario to reproduce the issue I described:
> # Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
> # Fail drive D. Array aborts the resync and then immediately restarts
> it (it seems to checkpoint mddev->recovery_cp, but I am not sure
> that it restarts from that checkpoint).
> # Re-add drive D to the array. It is added as a spare, array continues resyncing.
> # Fail drive C. Array aborts the resync, and then starts rebuilding
> drive D. At this point sb->resync_offset is some valid value (usually
> 0, not MaxSector and not sb->size).

Does it start the rebuilding from the start? I hope it does.

> # Stop the array. At this point sb->resync_offset is sb->size in all
> the superblocks.

At some point in there you had a RAID6 with two missing devices, so it is
either failed or completely in-sync. I guess we assume the latter.
Is that wrong?

> Another question I have: when exactly does md decide to update
> sb->resync_offset in the superblock? I am playing with similar
> scenarios with raid5, and sometimes I end up with MaxSector and
> sometimes with valid values. From the code, it looks like only this
> logic updates it:
>	if (mddev->in_sync)
>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>	else
>		sb->resync_offset = cpu_to_le64(0);
> except for resizing and setting through sysfs. But I don't understand
> how this value should be managed in general.

I'm not sure what you are asking here ... that code explains exactly when
resync_offset should be set, and how. What more is there to say?

NeilBrown

> Thanks!
> Alex.
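The "update ->recovery_cp along with ->curr_resync_completed" idea above
boils down to something like the following helper, called wherever the
checkpoint advances. This is only an illustrative sketch of the intent, not
an actual md patch:

	/* Sketch: whenever the resync checkpoint advances, pull recovery_cp
	 * forward too, so the recorded resync position never claims less
	 * progress than has actually been completed. */
	static void advance_recovery_cp(struct mddev *mddev)
	{
		if (mddev->recovery_cp < mddev->curr_resync_completed)
			mddev->recovery_cp = mddev->curr_resync_completed;
	}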
* Re: sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-02-01 20:56 UTC
To: NeilBrown; +Cc: linux-raid

Hi Neil,
based on your hints I dug some more into resync failure cases, and the
handling of sb->resync_offset and mddev->recovery_cp. Here are some
observations:

The only cases in which recovery_cp is set (except setting via sysfs, setting
bitmap bits via sysfs, etc.) are:
# on array creation, recovery_cp is set to 0 (and the bitmap is fully set)
# recovery_cp becomes MaxSector only if MD_RECOVERY_SYNC (with/without
  REQUESTED) completes successfully. On raid6 this may happen when a
  singly-degraded array completes resync.
# if resync does not complete successfully (MD_RECOVERY_INTR or crash), then
  recovery_cp remains valid (not MaxSector).

# The only real influence that recovery_cp seems to have is:
1) abort the assembly if ok_start_degraded is not set
2) when loading the superblock, it looks like recovery_cp may cause the
   beginning of the bitmap not to be loaded. I did not dig further into the
   bitmap at this point.
3) resume the resync in md_check_recovery() if recovery_cp is valid.

Are these observations valid?

With this scheme I saw several interesting issues:
# After resync is aborted/interrupted, recovery_cp is updated (either to
MaxSector or another value). However, the superblock is not updated at this
point. If there is no additional activity on the array, the superblock will
not be updated. I saw cases when resync completes fine, recovery_cp is set to
MaxSector, but is not persisted in the superblock. If I crash the machine at
this point, then after reboot the array is still considered dirty (has a
valid resync_offset in the superblock). Is there an option to force the
superblock update at this point?

# When resync aborts due to drive failure, then MD_RECOVERY_INTR is set, and
sync_request() returns mddev->dev_sectors - sector_nr. As a result,
recovery_cp is set to the device size, and that's what I see in my raid6
scenario. At this point three things can happen:
1) If there is an additional drive to resync (like in raid6), then the resync
   restarts (from sector 0)
2) If there is a spare, it starts rebuilding the spare, and, as a result,
   persists sb->resync_offset==sb->size in the superblock
3) Otherwise, it restarts the resync (due to a valid recovery_cp),
   immediately finishes it, and sets recovery_cp=MaxSector (but does not
   necessarily write the superblock right away).

So basically, we can have a rebuilding drive and sb->resync_offset==sb->size
in the superblock. It will be cleared only after the rebuild completes and
this "empty resync" happens. Are you (we?) comfortable with this behavior?
(Due to the mdadm issue I mentioned earlier, in this case mdadm thinks that
the array is clean, while the kernel thinks the array is dirty, while it's
actually rebuilding.)

Now I will answer your suggestions:

>> So one question is: should mdadm compare sb->resync_offset to
>> MaxSector and not to sb->size? In the kernel code, resync_offset is
>> always compared to MaxSector.
>
> Yes, mdadm should be consistent with the kernel. Patches welcome.

I think at this point it's better to consider the array as dirty in mdadm,
and either let the kernel set MaxSector in this "empty resync" or set it via
mdadm+force. Let's see if I can submit this trivial patch.

>> Another question is: should sb->resync_offset be set to MaxSector by
>> the kernel as soon as it starts rebuilding a drive? I think this would
>> be consistent with what Neil wrote in the blog entry.
>
> Maybe every time we update ->curr_resync_completed we should update
> ->recovery_cp as well if it is below the new ->curr_resync_completed??

I'm not sure it will help a lot. Except for not loading part of the bitmap
(which I am unsure about), the real value of recovery_cp does not really
matter. Is this correct?

>> Here is the scenario to reproduce the issue I described:
>> # Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
>> # Fail drive D. Array aborts the resync and then immediately restarts
>> it (it seems to checkpoint mddev->recovery_cp, but I am not sure
>> that it restarts from that checkpoint).
>> # Re-add drive D to the array. It is added as a spare, array continues resyncing.
>> # Fail drive C. Array aborts the resync, and then starts rebuilding
>> drive D. At this point sb->resync_offset is some valid value (usually
>> 0, not MaxSector and not sb->size).
>
> Does it start the rebuilding from the start? I hope it does.

Yes, it starts from j==0 (and then it goes according to the bitmap, I guess
and hope).

>> # Stop the array. At this point sb->resync_offset is sb->size in all
>> the superblocks.
>
> At some point in there you had a RAID6 with two missing devices, so it is
> either failed or completely in-sync. I guess we assume the latter.
> Is that wrong?

It's not that it's wrong, but if we assume the array is clean, then
resync_offset should be MaxSector in the superblock. But it's not, as I
showed earlier.

About exposing ok_start_degraded via sysfs: is it possible to push this value
through the RUN_ARRAY ioctl? I saw that it can carry an mdu_param_t, which
has a max_fault /* unused for now */ field. Perhaps this field can be used?

I have another question about md_check_recovery(), this code:

	if (mddev->safemode &&
	    !atomic_read(&mddev->writes_pending) &&
	    !mddev->in_sync &&
	    mddev->recovery_cp == MaxSector) {
		mddev->in_sync = 1;

Why is mddev->recovery_cp == MaxSector checked here? This basically means
that if we have a valid recovery_cp, then in_sync will never be set to 1, and
this code:

	if (mddev->in_sync)
		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
	else
		sb->resync_offset = cpu_to_le64(0);

should always set resync_offset to 0. I saw that "in_sync" basically tells
whether there are pending/in-flight writes.

Finally, what would you recommend doing if a raid5 resync fails? At this
point the kernel doesn't fail the array (as you pointed out in your blog),
but if the machine reboots, the array cannot be re-assembled without "force",
because it is dirty and has a failed drive. So it's kind of... inconsistent?

Thanks for your help,
Alex.
* Re: sb->resync_offset value after resync failure
From: NeilBrown @ 2012-02-01 22:19 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Wed, 1 Feb 2012 22:56:56 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Hi Neil,
> based on your hints I dug some more into resync failure cases, and the
> handling of sb->resync_offset and mddev->recovery_cp. Here are some
> observations:
>
> The only cases in which recovery_cp is set (except setting via sysfs,
> setting bitmap bits via sysfs, etc.) are:
> # on array creation, recovery_cp is set to 0 (and the bitmap is fully set)
> # recovery_cp becomes MaxSector only if MD_RECOVERY_SYNC (with/without
> REQUESTED) completes successfully. On raid6 this may happen when a
> singly-degraded array completes resync.
> # if resync does not complete successfully (MD_RECOVERY_INTR or
> crash), then recovery_cp remains valid (not MaxSector).
>
> # The only real influence that recovery_cp seems to have is:
> 1) abort the assembly if ok_start_degraded is not set
> 2) when loading the superblock, it looks like recovery_cp may cause
> the beginning of the bitmap not to be loaded. I did not dig further
> into the bitmap at this point.
> 3) resume the resync in md_check_recovery() if recovery_cp is valid.
>
> Are these observations valid?

Looks about right.
'recovery_cp' reflects the 'dirty' flag in the superblock.
'0' means the whole array is dirty. 'MaxSector' means none of the array is
dirty. Other numbers mean that part of the array is dirty, part isn't.
(I should really rename it to 'resync_cp' - 'cp' for 'checkpoint' of course).

> With this scheme I saw several interesting issues:
> # After resync is aborted/interrupted, recovery_cp is updated (either
> to MaxSector or another value). However, the superblock is not updated
> at this point. If there is no additional activity on the array, the
> superblock will not be updated. I saw cases when resync completes
> fine, recovery_cp is set to MaxSector, but is not persisted in the
> superblock. If I crash the machine at this point, then after reboot
> the array is still considered dirty (has a valid resync_offset in
> the superblock). Is there an option to force the superblock update at this
> point?

Maybe we just need to move the 'skip:' label to before:

	set_bit(MD_CHANGE_DEVS, &mddev->flags);

it is that flag which causes the superblock to be updated.

> # When resync aborts due to drive failure, then MD_RECOVERY_INTR is
> set, and sync_request() returns mddev->dev_sectors - sector_nr. As a
> result, recovery_cp is set to the device size, and that's what I see in my
> raid6 scenario. At this point three things can happen:
> 1) If there is an additional drive to resync (like in raid6), then the
> resync restarts (from sector 0)

It should really just keep resyncing from where it is up to, shouldn't it?

> 2) If there is a spare, it starts rebuilding the spare, and, as a
> result, persists sb->resync_offset==sb->size in the superblock

Hmm.. that's wrong too. It makes some sense for RAID5 as if there is a spare
recovering we must assume the rest of the array is in-sync. But not for
RAID6.

> 3) Otherwise, it restarts the resync (due to a valid recovery_cp),
> immediately finishes it, and sets recovery_cp=MaxSector (but does not
> necessarily write the superblock right away).

This bit seems right - assuming you are referring to the 'raid5, no spare'
case.

> So basically, we can have a rebuilding drive and
> sb->resync_offset==sb->size in the superblock. It will be cleared only
> after the rebuild completes and this "empty resync" happens.
> Are you (we?) comfortable with this behavior? (Due to the mdadm issue I
> mentioned earlier, in this case mdadm thinks that the array is clean,
> while the kernel thinks the array is dirty, while it's actually
> rebuilding.)

No, not really. There do seem to be some bugs there.

> Now I will answer your suggestions:
>
> >> So one question is: should mdadm compare sb->resync_offset to
> >> MaxSector and not to sb->size? In the kernel code, resync_offset is
> >> always compared to MaxSector.
> >
> > Yes, mdadm should be consistent with the kernel. Patches welcome.
>
> I think at this point it's better to consider the array as dirty in
> mdadm, and either let the kernel set MaxSector in this "empty resync"
> or set it via mdadm+force. Let's see if I can submit this trivial
> patch.

Sounds reasonable.

> >> Another question is: should sb->resync_offset be set to MaxSector by
> >> the kernel as soon as it starts rebuilding a drive? I think this would
> >> be consistent with what Neil wrote in the blog entry.
> >
> > Maybe every time we update ->curr_resync_completed we should update
> > ->recovery_cp as well if it is below the new ->curr_resync_completed??
>
> I'm not sure it will help a lot. Except for not loading part of the
> bitmap (which I am unsure about), the real value of recovery_cp does
> not really matter. Is this correct?

It is stored in the metadata so if you stop the array in the middle of a
resync, then start the array again, the resync starts from wherever it was up
to. So having a correct value certainly is useful.
(Having a value that is less than the correct value is not a big problem;
having a value that is too big would be bad.)

> >> Here is the scenario to reproduce the issue I described:
> >> # Create a raid6 array with 4 drives A,B,C,D. Array starts resyncing.
> >> # Fail drive D. Array aborts the resync and then immediately restarts
> >> it (it seems to checkpoint mddev->recovery_cp, but I am not sure
> >> that it restarts from that checkpoint).
> >> # Re-add drive D to the array. It is added as a spare, array continues resyncing.
> >> # Fail drive C. Array aborts the resync, and then starts rebuilding
> >> drive D. At this point sb->resync_offset is some valid value (usually
> >> 0, not MaxSector and not sb->size).
> >
> > Does it start the rebuilding from the start? I hope it does.
>
> Yes, it starts from j==0 (and then it goes according to the bitmap, I
> guess and hope).
>
> >> # Stop the array. At this point sb->resync_offset is sb->size in all
> >> the superblocks.
> >
> > At some point in there you had a RAID6 with two missing devices, so it is
> > either failed or completely in-sync. I guess we assume the latter.
> > Is that wrong?
>
> It's not that it's wrong, but if we assume the array is clean, then
> resync_offset should be MaxSector in the superblock. But it's not, as
> I showed earlier.

Fair enough - it would be better if it were MaxSector in this case.

> About exposing ok_start_degraded via sysfs: is it possible to push
> this value through the RUN_ARRAY ioctl? I saw that it can carry an
> mdu_param_t, which has a max_fault /* unused for now */ field. Perhaps
> this field can be used?

I'm not keen on adding more functionality on to the ioctls. I'd much rather
see something via sysfs...
Not sure what is best though, I'd have to have a good think about it.

> I have another question about md_check_recovery(), this code:
>	if (mddev->safemode &&
>	    !atomic_read(&mddev->writes_pending) &&
>	    !mddev->in_sync &&
>	    mddev->recovery_cp == MaxSector) {
>		mddev->in_sync = 1;
> Why is mddev->recovery_cp == MaxSector checked here? This basically
> means that if we have a valid recovery_cp, then in_sync will never be
> set to 1, and this code:
>	if (mddev->in_sync)
>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>	else
>		sb->resync_offset = cpu_to_le64(0);
> should always set resync_offset to 0. I saw that "in_sync" basically
> tells whether there are pending/in-flight writes.

If recovery_cp != MaxSector, then the array is not considered to be 'clean'
so we don't want to mark it as 'clean'.
On a graceful shutdown we do get in_sync to 1 - in __md_stop_writes.
Maybe we could do it at other times if the array is idle - not sure.

> Finally, what would you recommend doing if a raid5 resync fails? At
> this point the kernel doesn't fail the array (as you pointed out in your
> blog), but if the machine reboots, the array cannot be re-assembled without
> "force", because it is dirty and has a failed drive. So it's kind
> of... inconsistent?

I don't have a good answer.
The logical thing to do is to cause the whole array to fail - but I think
that would be unpopular.
I'm happy to listen to well-reasoned suggestions, but I don't have any good
suggestion other than the current situation.

Thanks,
NeilBrown
* Re: sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-02-06 11:41 UTC
To: NeilBrown; +Cc: linux-raid

Hi Neil,

>> # After resync is aborted/interrupted, recovery_cp is updated (either
>> to MaxSector or another value). However, the superblock is not updated
>> at this point. If there is no additional activity on the array, the
>> superblock will not be updated. I saw cases when resync completes
>> fine, recovery_cp is set to MaxSector, but is not persisted in the
>> superblock. If I crash the machine at this point, then after reboot
>> the array is still considered dirty (has a valid resync_offset in
>> the superblock). Is there an option to force the superblock update at this
>> point?
>
> Maybe we just need to move the 'skip:' label to before:
>
>	set_bit(MD_CHANGE_DEVS, &mddev->flags);
>
> it is that flag which causes the superblock to be updated.

Thanks for the hint, will try it. For production though I'm a bit afraid to
recompile my own kernel at this point.

>> # When resync aborts due to drive failure, then MD_RECOVERY_INTR is
>> set, and sync_request() returns mddev->dev_sectors - sector_nr. As a
>> result, recovery_cp is set to the device size, and that's what I see in my
>> raid6 scenario. At this point three things can happen:
>> 1) If there is an additional drive to resync (like in raid6), then the
>> resync restarts (from sector 0)
>
> It should really just keep resyncing from where it is up to, shouldn't it?

According to the code:

	else if (!mddev->bitmap)
		j = mddev->recovery_cp;

if the array has a bitmap, then it starts from j==0.

>> 2) If there is a spare, it starts rebuilding the spare, and, as a
>> result, persists sb->resync_offset==sb->size in the superblock
>
> Hmm.. that's wrong too. It makes some sense for RAID5 as if there is a spare
> recovering we must assume the rest of the array is in-sync. But not for
> RAID6.
>
>> 3) Otherwise, it restarts the resync (due to a valid recovery_cp),
>> immediately finishes it, and sets recovery_cp=MaxSector (but does not
>> necessarily write the superblock right away).
>
> This bit seems right - assuming you are referring to the 'raid5, no spare'
> case.

Yes, and also to raid6 with two drives failed during resync.

>> > Maybe every time we update ->curr_resync_completed we should update
>> > ->recovery_cp as well if it is below the new ->curr_resync_completed??
>>
>> I'm not sure it will help a lot. Except for not loading part of the
>> bitmap (which I am unsure about), the real value of recovery_cp does
>> not really matter. Is this correct?
>
> It is stored in the metadata so if you stop the array in the middle of a
> resync, then start the array again, the resync starts from wherever it was up
> to. So having a correct value certainly is useful.
> (Having a value that is less than the correct value is not a big problem;
> having a value that is too big would be bad.)

Neil, as I pointed out earlier, it looks like resync always starts from j==0
if there is a bitmap. So there is no real importance to the exact value of
recovery_cp (I think). Only whether it is MaxSector or not. Am I missing
something?

>> I have another question about md_check_recovery(), this code:
>>	if (mddev->safemode &&
>>	    !atomic_read(&mddev->writes_pending) &&
>>	    !mddev->in_sync &&
>>	    mddev->recovery_cp == MaxSector) {
>>		mddev->in_sync = 1;
>> Why is mddev->recovery_cp == MaxSector checked here? This basically
>> means that if we have a valid recovery_cp, then in_sync will never be
>> set to 1, and this code:
>>	if (mddev->in_sync)
>>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
>>	else
>>		sb->resync_offset = cpu_to_le64(0);
>> should always set resync_offset to 0. I saw that "in_sync" basically
>> tells whether there are pending/in-flight writes.
>
> If recovery_cp != MaxSector, then the array is not considered to be 'clean'
> so we don't want to mark it as 'clean'.
> On a graceful shutdown we do get in_sync to 1 - in __md_stop_writes.
> Maybe we could do it at other times if the array is idle - not sure.

It looks like I am missing something. If 'in_sync'==0, then resync_offset
will always be 0 in the superblock. If 'in_sync'==1, then recovery_cp will be
in the superblock. But the code in md_check_recovery() only sets in_sync=1 if
there is no recovery_cp. If there is a valid recovery_cp, then in_sync will
not be set to 1 and the superblock will always contain 0. That's why I am
wondering why we are checking recovery_cp here? To end up having a valid
recovery_cp in the superblock... we need recovery_cp==MaxSector. Strange?

Can you please answer two more questions?

1) When testing the sb->resync_offset patch with the latest version of mdadm,
some of my unit tests failed due to a failure in "--add". Debugging with
strace reveals the following: write_bitmap1() calls awrite() with a 4096-byte
buffer that was allocated on the stack in this function, and is not aligned.
However, the fd in use was opened with O_RDWR|O_EXCL|O_DIRECT. As a result,
write() fails with errno==EINVAL. This appears to be a regression at least vs
mdadm-3.1.5, in which an aligned buffer was used.
Portions of strace:

	open("/dev/sdj", O_RDWR|O_EXCL|O_DIRECT) = 4
	...
	ioctl(4, BLKSSZGET, 0x7fffa70b98d4) = 0
	write(4, "bitm\4\0\0\0G*8\34m\245\261\253\34\341\312\3\214S\203\247\200\0\0\0\0\0\0\0"...,
	512) = -1 EINVAL (Invalid argument)

The buffer pointer at this point is 0x7fffa70b9930 (clearly not aligned to
512). Do you want me to perhaps prepare a patch which simply __aligns__ the
buffer on the stack?

2) Why, in md_check_recovery(), is rebuilding a spare preferable to resyncing
the parity, as in this code:

	} else if ((spares = remove_and_add_spares(mddev))) {
		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
	} else if (mddev->recovery_cp < MaxSector) {
		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
		clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);

This leads to a situation in which the array "remembers" that it has to
complete the resync (in sb->resync_offset), but after the spare completes
rebuilding, the bitmap is cleared, and the subsequent resync starts and
immediately completes. Was that decision to prefer rebuild over resync
intentional?

Thanks for your help,
Alex.
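For reference, "aligning the buffer on the stack" for an O_DIRECT write
usually means over-allocating the buffer and rounding the pointer up to the
sector size, roughly as below. This is an illustrative sketch only; the
function name and the fixed 512-byte write size are assumptions, not the
actual write_bitmap1() code:

	#include <stdint.h>
	#include <string.h>
	#include <unistd.h>

	/* Sketch: copy the caller's data into a 512-byte-aligned region of
	 * an over-allocated stack buffer so that a sector-sized O_DIRECT
	 * write does not fail with EINVAL. */
	static ssize_t write_aligned_512(int fd, const void *src, size_t len)
	{
		char raw[4096 + 512];
		char *buf = (char *)(((uintptr_t)raw + 511) & ~(uintptr_t)511);

		memset(buf, 0, 512);
		memcpy(buf, src, len);		/* len <= 512 assumed in this sketch */
		return write(fd, buf, 512);	/* aligned pointer, sector-sized length */
	}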
* Re: sb->resync_offset value after resync failure
From: NeilBrown @ 2012-02-07 0:55 UTC
To: Alexander Lyakas; +Cc: linux-raid

On Mon, 6 Feb 2012 13:41:57 +0200 Alexander Lyakas <alex.bolshoy@gmail.com> wrote:

> Hi Neil,
>
> >> # After resync is aborted/interrupted, recovery_cp is updated (either
> >> to MaxSector or another value). However, the superblock is not updated
> >> at this point. If there is no additional activity on the array, the
> >> superblock will not be updated. I saw cases when resync completes
> >> fine, recovery_cp is set to MaxSector, but is not persisted in the
> >> superblock. If I crash the machine at this point, then after reboot
> >> the array is still considered dirty (has a valid resync_offset in
> >> the superblock). Is there an option to force the superblock update at this
> >> point?
> >
> > Maybe we just need to move the 'skip:' label to before:
> >
> >	set_bit(MD_CHANGE_DEVS, &mddev->flags);
> >
> > it is that flag which causes the superblock to be updated.
>
> Thanks for the hint, will try it. For production though I'm a bit
> afraid to recompile my own kernel at this point.
>
> >> # When resync aborts due to drive failure, then MD_RECOVERY_INTR is
> >> set, and sync_request() returns mddev->dev_sectors - sector_nr. As a
> >> result, recovery_cp is set to the device size, and that's what I see in my
> >> raid6 scenario. At this point three things can happen:
> >> 1) If there is an additional drive to resync (like in raid6), then the
> >> resync restarts (from sector 0)
> >
> > It should really just keep resyncing from where it is up to, shouldn't it?
>
> According to the code:
>	else if (!mddev->bitmap)
>		j = mddev->recovery_cp;
> if the array has a bitmap, then it starts from j==0.

Good point. When there is a bitmap the recovery_cp number is irrelevant as
the bitmap stores equivalent information with much finer detail.

> >> 2) If there is a spare, it starts rebuilding the spare, and, as a
> >> result, persists sb->resync_offset==sb->size in the superblock

I just looked into this and tested it and I don't agree.
->resync_offset does not get set to sb->size. It gets set to the value that
curr_resync got up to, which is correct ... or fairly close; it should really
be curr_resync_completed.

So I don't think there is a big problem there.

> > Hmm.. that's wrong too. It makes some sense for RAID5 as if there is a spare
> > recovering we must assume the rest of the array is in-sync. But not for
> > RAID6.
> >
> >> 3) Otherwise, it restarts the resync (due to a valid recovery_cp),
> >> immediately finishes it, and sets recovery_cp=MaxSector (but does not
> >> necessarily write the superblock right away).
> >
> > This bit seems right - assuming you are referring to the 'raid5, no spare'
> > case.
>
> Yes, and also to raid6 with two drives failed during resync.

Yes.

> >> > Maybe every time we update ->curr_resync_completed we should update
> >> > ->recovery_cp as well if it is below the new ->curr_resync_completed??
> >>
> >> I'm not sure it will help a lot. Except for not loading part of the
> >> bitmap (which I am unsure about), the real value of recovery_cp does
> >> not really matter. Is this correct?
> >
> > It is stored in the metadata so if you stop the array in the middle of a
> > resync, then start the array again, the resync starts from wherever it was up
> > to. So having a correct value certainly is useful.
> > (Having a value that is less than the correct value is not a big problem;
> > having a value that is too big would be bad.)
>
> Neil, as I pointed out earlier, it looks like resync always starts
> from j==0 if there is a bitmap. So there is no real importance to
> the exact value of recovery_cp (I think). Only whether it is MaxSector
> or not. Am I missing something?

Yes, when you have a bitmap recovery_cp is much less important.

> >> I have another question about md_check_recovery(), this code:
> >>	if (mddev->safemode &&
> >>	    !atomic_read(&mddev->writes_pending) &&
> >>	    !mddev->in_sync &&
> >>	    mddev->recovery_cp == MaxSector) {
> >>		mddev->in_sync = 1;
> >> Why is mddev->recovery_cp == MaxSector checked here? This basically
> >> means that if we have a valid recovery_cp, then in_sync will never be
> >> set to 1, and this code:
> >>	if (mddev->in_sync)
> >>		sb->resync_offset = cpu_to_le64(mddev->recovery_cp);
> >>	else
> >>		sb->resync_offset = cpu_to_le64(0);
> >> should always set resync_offset to 0. I saw that "in_sync" basically
> >> tells whether there are pending/in-flight writes.
> >
> > If recovery_cp != MaxSector, then the array is not considered to be 'clean'
> > so we don't want to mark it as 'clean'.
> > On a graceful shutdown we do get in_sync to 1 - in __md_stop_writes.
> > Maybe we could do it at other times if the array is idle - not sure.
>
> It looks like I am missing something. If 'in_sync'==0, then
> resync_offset will always be 0 in the superblock. If 'in_sync'==1,
> then recovery_cp will be in the superblock. But the code in
> md_check_recovery() only sets in_sync=1 if there is no recovery_cp. If
> there is a valid recovery_cp, then in_sync will not be set to 1 and
> the superblock will always contain 0. That's why I am wondering why we
> are checking recovery_cp here? To end up having a valid recovery_cp in
> the superblock... we need recovery_cp==MaxSector. Strange?

->resync_offset is only really meaningful and useful after a clean shutdown
of the array. The clean-shutdown code path uses a different rule for setting
in_sync to 1. In that case you do get a useful value written to
->resync_offset.

> Can you please answer two more questions?
>
> 1) When testing the sb->resync_offset patch with the latest version of
> mdadm, some of my unit tests failed due to a failure in "--add".
> Debugging with strace reveals the following: write_bitmap1() calls
> awrite() with a 4096-byte buffer that was allocated on the stack in
> this function, and is not aligned. However, the fd in use was opened
> with O_RDWR|O_EXCL|O_DIRECT. As a result, write() fails with
> errno==EINVAL. This appears to be a regression at least vs mdadm-3.1.5,
> in which an aligned buffer was used.
> Portions of strace:
>	open("/dev/sdj", O_RDWR|O_EXCL|O_DIRECT) = 4
>	...
>	ioctl(4, BLKSSZGET, 0x7fffa70b98d4) = 0
>	write(4, "bitm\4\0\0\0G*8\34m\245\261\253\34\341\312\3\214S\203\247\200\0\0\0\0\0\0\0"...,
>	512) = -1 EINVAL (Invalid argument)
> The buffer pointer at this point is 0x7fffa70b9930 (clearly not aligned to 512).
> Do you want me to perhaps prepare a patch which simply __aligns__ the
> buffer on the stack?

Hmmm... I messed that up, didn't I?
I think I'd rather fix awrite so that it *always* does an aligned write.
I'll make a patch - thanks.

> 2) Why, in md_check_recovery(), is rebuilding a spare preferable to
> resyncing the parity, as in this code:
>	} else if ((spares = remove_and_add_spares(mddev))) {
>		clear_bit(MD_RECOVERY_SYNC, &mddev->recovery);
>		clear_bit(MD_RECOVERY_CHECK, &mddev->recovery);
>		clear_bit(MD_RECOVERY_REQUESTED, &mddev->recovery);
>		set_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>	} else if (mddev->recovery_cp < MaxSector) {
>		set_bit(MD_RECOVERY_SYNC, &mddev->recovery);
>		clear_bit(MD_RECOVERY_RECOVER, &mddev->recovery);
>
> This leads to a situation in which the array "remembers" that it has to
> complete the resync (in sb->resync_offset), but after the spare completes
> rebuilding, the bitmap is cleared, and the subsequent resync starts and
> immediately completes. Was that decision to prefer rebuild over resync
> intentional?

Yes.
In most cases rebuilding the spare will bring the array back into sync anyway.
Also rebuilding a spare will do more to improve redundancy than a resync.

There might be room for improvement here for RAID6 - if the array is dirty
and degraded we should both resync and recover at the same time ... and it is
possible that we already do. I'd need to look more closely at the code to be
sure.

Thanks,
NeilBrown
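One way awrite() can "always do an aligned write" is to bounce the caller's
buffer through sector-aligned scratch storage, roughly as below. This is a
sketch of the idea only; the real mdadm awrite() has a different interface,
and this assumes writes of at most 4096 bytes:

	#include <string.h>
	#include <unistd.h>

	/* Sketch: always issue the write from an aligned bounce buffer, so
	 * callers may pass unaligned memory even on an O_DIRECT fd.  The
	 * length must still be a multiple of the sector size. */
	static ssize_t awrite_sketch(int fd, const void *buf, size_t len)
	{
		static char bounce[4096] __attribute__((aligned(4096)));

		if (len > sizeof(bounce))
			return -1;	/* keep the sketch simple */
		memcpy(bounce, buf, len);
		return write(fd, bounce, len);
	}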
* Re: sb->resync_offset value after resync failure
From: Alexander Lyakas @ 2012-02-07 9:11 UTC
To: NeilBrown; +Cc: linux-raid

Hi,

>> >> 2) If there is a spare, it starts rebuilding the spare, and, as a
>> >> result, persists sb->resync_offset==sb->size in the superblock
>
> I just looked into this and tested it and I don't agree.
> ->resync_offset does not get set to sb->size. It gets set to the value that
> curr_resync got up to, which is correct ... or fairly close; it should really
> be curr_resync_completed.
>
> So I don't think there is a big problem there.

In the scenario I mentioned, I clearly end up having
sb->resync_offset==sb->size (and it actually remains this way until the spare
rebuild completes and the "empty resync" completes). But I am using 2.6.38-8;
perhaps in your kernel this does not happen anymore. If you're interested in
investigating this further, I can send you a repro script.

Thanks for your help!