raid1 - mismatches after resuming interrupted recovery


* raid1 - mismatches after resuming interrupted recovery
@ 2015-07-06 16:19 Nate Dailey
  2015-10-30 15:58 ` Jes Sorensen
  0 siblings, 1 reply; 4+ messages in thread
From: Nate Dailey @ 2015-07-06 16:19 UTC (permalink / raw)
  To: linux-raid

I've found that if I interrupt a recovery by removing the target device, write
to a region below the recovery checkpoint, then re-add the device and let the
recovery complete, mismatch_cnt is non-zero after doing a check.

Here's exactly what I'm doing:

- create a 5 GB raid1 with internal bitmap

- do a check, verify zero mismatch_cnt

- remove one member device

- dd 256MB with 2GB seek

- lower sync_speed_min/max to 500

- re-add removed device

- wait 15 sec

- remove the same member device again

- dd 1MB with 1 GB seek

- restore sync_speed_min/max to system defaults

- re-add removed device

- when recovery completes, do another check

At this point the mismatch_cnt is non-zero.
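
The steps above can be sketched as a command list. This is a dry-run helper that only builds and prints shell commands; it is not the script used in the original report. The device names (/dev/md0, /dev/loop0, /dev/loop1), the dd data source and block size, the use of --fail before --remove, and the polling loop on sync_action are all assumptions made for illustration, and the commands have not been verified against a real array.

```python
def repro_commands(md="md0", keep="/dev/loop0", target="/dev/loop1"):
    """Return the reproduction steps as a list of shell command strings.

    All device names are hypothetical; both members are assumed to be
    roughly 5 GB block devices that already exist.
    """
    dev = f"/dev/{md}"
    sysfs = f"/sys/block/{md}/md"
    # Poll until the array reports no resync/recovery/check in progress.
    wait_idle = f'while [ "$(cat {sysfs}/sync_action)" != idle ]; do sleep 1; done'
    check = [f"echo check > {sysfs}/sync_action", wait_idle,
             f"cat {sysfs}/mismatch_cnt"]
    # Assumption: an active member is marked failed before it is removed.
    remove = f"mdadm {dev} --fail {target} --remove {target}"
    readd = f"mdadm {dev} --re-add {target}"

    def set_speed(value):
        # Writing "system" makes the array follow the system-wide limits again.
        return [f"echo {value} > {sysfs}/sync_speed_min",
                f"echo {value} > {sysfs}/sync_speed_max"]

    def write(count_mib, seek_mib):
        # bs=1M makes seek= count in MiB; the data source is an assumption.
        return (f"dd if=/dev/urandom of={dev} bs=1M count={count_mib} "
                f"seek={seek_mib} oflag=direct")

    cmds = [f"mdadm --create {dev} --level=1 --raid-devices=2 "
            f"--bitmap=internal {keep} {target}", wait_idle]
    cmds += check                                  # expect mismatch_cnt == 0
    cmds += [remove, write(256, 2048)]             # 256MB at a 2GB offset
    cmds += set_speed(500)                         # slow the recovery down
    cmds += [readd, "sleep 15", remove, write(1, 1024)]  # 1MB at a 1GB offset
    cmds += set_speed("system")
    cmds += [readd, "sleep 2", wait_idle]          # let recovery start, then finish
    cmds += check                                  # the report sees non-zero here
    return cmds


if __name__ == "__main__":
    print("\n".join(repro_commands()))
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; review each line before running it as root against a scratch array.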

I originally hit this on RHEL 7.1, but tested 4.1.1 from kernel.org and it 
happens there too.

I'm out of my league in terms of trying to fix this, but would be happy to test 
a fix. I wonder if it's really necessary to resume a bitmap recovery from the 
checkpoint? Wouldn't the bitmap always reflect what needs to be copied?
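
One way to picture the question is a toy model, which is not the kernel's md code. It assumes that recovery copies only chunks marked dirty in the bitmap, and that a resumed recovery starts at the saved checkpoint instead of at chunk 0. The chunk count, the chunk indices, and the function names are hypothetical.

```python
def recover(source, target, dirty, start, stop=None):
    """Copy dirty chunks in [start, stop) to the target; return the checkpoint."""
    end = len(source) if stop is None else stop
    for i in range(start, end):
        if i in dirty:
            target[i] = source[i]
            dirty.discard(i)
    return end  # position a later recovery would resume from


def mismatches_after_resume(ignore_offset):
    """Replay the reported sequence on a 10-chunk toy array."""
    chunks = 10
    source = [0] * chunks   # the member that stays in the array
    target = [0] * chunks   # the member that is removed and re-added
    dirty = set()           # stand-in for the write-intent bitmap

    def write(i, value):
        # Writes made while the target is absent only reach the source.
        source[i] = value
        dirty.add(i)

    write(4, 1)
    write(5, 1)                                   # first write, target removed
    offset = recover(source, target, dirty, start=0, stop=5)  # interrupted
    write(2, 2)                                   # second write, below checkpoint
    start = 0 if ignore_offset else offset
    recover(source, target, dirty, start=start)   # recovery runs to completion
    return sum(s != t for s, t in zip(source, target))


if __name__ == "__main__":
    print("resume from checkpoint:", mismatches_after_resume(False))  # 1
    print("ignore checkpoint:     ", mismatches_after_resume(True))   # 0
```

In this model the chunk written below the checkpoint is never copied when recovery resumes from the saved offset, while starting from chunk 0 and trusting the bitmap leaves no mismatch. Whether the real recovery path behaves this way is exactly what the question asks.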

Nate


* Re: raid1 - mismatches after resuming interrupted recovery
  2015-07-06 16:19 raid1 - mismatches after resuming interrupted recovery Nate Dailey
@ 2015-10-30 15:58 ` Jes Sorensen
  2015-10-30 16:57   ` Nate Dailey
  0 siblings, 1 reply; 4+ messages in thread
From: Jes Sorensen @ 2015-10-30 15:58 UTC (permalink / raw)
  To: NeilBrown; +Cc: Nate Dailey, linux-raid

Nate Dailey <nate.dailey@stratus.com> writes:
> I've found that if I interrupt a recovery by removing the target
> device, do IO before the recovery checkpoint, then re-add the device
> and let the recovery complete, the mismatch_cnt is non-zero after
> doing a check.

Neil,

While I am on the nagging path, here is another one.

Jes



* Re: raid1 - mismatches after resuming interrupted recovery
  2015-10-30 15:58 ` Jes Sorensen
@ 2015-10-30 16:57   ` Nate Dailey
  2015-10-30 18:07     ` Jes Sorensen
  0 siblings, 1 reply; 4+ messages in thread
From: Nate Dailey @ 2015-10-30 16:57 UTC (permalink / raw)
  To: Jes Sorensen, NeilBrown; +Cc: linux-raid

This is the same issue as "ignore recovery_offset if bitmap exists"; this
message describes how I hit the problem (before attempting to put a patch
together to fix it).

Nate






* Re: raid1 - mismatches after resuming interrupted recovery
  2015-10-30 16:57   ` Nate Dailey
@ 2015-10-30 18:07     ` Jes Sorensen
  0 siblings, 0 replies; 4+ messages in thread
From: Jes Sorensen @ 2015-10-30 18:07 UTC (permalink / raw)
  To: Nate Dailey; +Cc: NeilBrown, linux-raid

Nate Dailey <nate.dailey@stratus.com> writes:
> This is the same as "ignore recovery_offset if bitmap exists",
> describing how I hit the problem (before attempting to put a patch
> together to fix it).

Thanks for the clarification, Nate. I thought these were two different
issues.

Clearly my grey hair is eating away at my brain. Neil, please do your
best to ignore me :)

Jes


