linux-raid.vger.kernel.org archive mirror
* non-fresh data unavailable bug
From: Brett Russ @ 2010-01-14 15:10 UTC
  To: linux-raid

Slightly related to my last message here ("Re: non-fresh behavior"), we have 
seen cases where the following happens:
* a healthy 2-disk raid1 (disks A & B) incurs a problem with disk B
* disk B is removed; the unit is now degraded
* replacement disk C is added; recovery from A to C begins
* during recovery, disk A incurs a brief lapse in connectivity.  At this 
point C is still up, yet it has only a partial copy of the data.
* a subsequent assemble operation on the raid1 results in disk A being 
kicked out as non-fresh, yet C is allowed in.

This presents quite a data-unavailability problem and basically requires 
recognizing the situation and hand-assembling the array with disk A 
(only) first, then adding C back in.  Unfortunately this situation is 
hard to reproduce and we don't have a dump of the 'mdadm --examine' 
output for it yet.
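
For what it's worth, the rough sequence we've been trying to script 
against loop devices looks like this (device names are placeholders and 
this is only a sketch, not a confirmed reproducer):

  # healthy 2-disk raid1: A=/dev/loop0, B=/dev/loop1
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/loop0 /dev/loop1
  # disk B develops a problem and is removed
  mdadm /dev/md0 --fail /dev/loop1
  mdadm /dev/md0 --remove /dev/loop1
  # replacement disk C is added; recovery from A to C begins
  mdadm /dev/md0 --add /dev/loop2
  # while recovery is still running, interrupt A briefly (e.g. yank the
  # device), then stop and re-assemble to see which member gets kicked
  mdadm --stop /dev/md0
  mdadm --assemble --force /dev/md0 /dev/loop0 /dev/loop2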

Any thoughts on this while we try to get a better reproduction case?

Thanks,
Brett




* Re: non-fresh data unavailable bug
From: Michael Evans @ 2010-01-14 19:24 UTC
  To: Brett Russ; +Cc: linux-raid

On Thu, Jan 14, 2010 at 7:10 AM, Brett Russ <bruss@netezza.com> wrote:
> Slightly related to my last message here Re:non-fresh behavior, we have seen
> cases where the following happens:
> * healthy 2 disk raid1 (disks A & B) incurs a problem with disk B
> * disk B is removed, unit is now degraded
> * replacement disk C is added; recovery from A to C begins
> * during recovery, disk A incurs a brief lapse in connectivity.  At this
> point C is still up yet only has a partial copy of the data.
> * a subsequent assemble operation on the raid1 results in disk A being
> kicked out as non-fresh, yet C is allowed in.
>
> This presents quite a data-unavailability problem and basically requires
> recognizing the situation and hand assembling the array with disk A (only)
> first, then adding C back in.  Unfortunately this situation is hard to
> reproduce and we don't have a dump of the 'mdadm --examine' output for it
> yet.
>
> Any thoughts on this while we try to get a better reproduction case?
>
> Thanks,
> Brett
>
>

I believe the desired and logical behavior here is to refuse to run
an incomplete array unless explicitly forced to do so.  Incremental
assembly might be what you're seeing.

The only way to access the data from those devices, presuming the array
is incomplete without the device that had the hiccup, would be to force
assembly with the older device included and hope.  I very much
recommend running it read-only until you can determine which assembly
pattern produces the most viable results.
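
Something along these lines, say (sdA1 and sdC1 are placeholders for the
complete and the partially-rebuilt member, respectively):

  mdadm --stop /dev/md0
  # force-assemble with only the older, complete member and mark it read-only
  mdadm --assemble --force --run /dev/md0 /dev/sdA1
  mdadm --readonly /dev/md0
  # inspect the data without risking further damage
  mount -o ro /dev/md0 /mnt
  # ... check files ...
  umount /mnt
  # once satisfied, switch back to read-write and re-add the rebuilt member
  mdadm --readwrite /dev/md0
  mdadm /dev/md0 --add /dev/sdC1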


* Re: non-fresh data unavailable bug
From: Brett Russ @ 2010-01-15 15:36 UTC
  To: linux-raid

On 01/14/2010 02:24 PM, Michael Evans wrote:
> On Thu, Jan 14, 2010 at 7:10 AM, Brett Russ <bruss@netezza.com> wrote:
>> Slightly related to my last message here Re:non-fresh behavior, we have seen
>> cases where the following happens:
>> * healthy 2 disk raid1 (disks A & B) incurs a problem with disk B
>> * disk B is removed, unit is now degraded
>> * replacement disk C is added; recovery from A to C begins
>> * during recovery, disk A incurs a brief lapse in connectivity.  At this
>> point C is still up yet only has a partial copy of the data.
>> * a subsequent assemble operation on the raid1 results in disk A being
>> kicked out as non-fresh, yet C is allowed in.
>
> I believe the desired and logical behavior here is to refuse running
> an incomplete array unless explicitly forced to do so.  Incremental
> assembly might be what you're seeing.

This brings up a good point.  I didn't mention that the assemble in the 
last step above was forced.  Thus, the "bug" I'm reporting is that, under 
duress, mdadm/md chose to assemble the array with a partially recovered 
(but "newer") member instead of the older member, which was the recovery 
*source* for the newer member.

What I think should happen is that members which are *destinations* for 
recovery should *never* receive a higher event count, timestamp, or any 
other freshness marking than the recovery sources.  By definition they 
are incomplete and can't be trusted, so they should never trump a 
complete member during assembly.  I would assume the code already does 
this, but perhaps there is a hole.
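
For example, a quick way to see what each member thinks of itself before 
assembling is something like this (device names are placeholders):

  # compare the superblocks of the recovery source and destination
  mdadm --examine /dev/sdA1 | egrep 'Events|Update Time|State'
  mdadm --examine /dev/sdC1 | egrep 'Events|Update Time|State'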

One other piece of information that may be relevant: we're using 
2-member RAID1 units with one member marked write-mostly.  At this time, 
I don't have the specifics for which member (A or B) was the 
write-mostly member in the example above, but I can find that out.
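
For reference, our units are created roughly like this (device names 
here are placeholders):

  # 2-member raid1 with the second member marked write-mostly
  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        /dev/sdA1 --write-mostly /dev/sdB1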

> I very much recommend running it read-only until you can determine which
> assembly pattern produces the most viable results.

Good tip.  We were able to manually recover the array in the case 
outlined above; now we're looking at fixing the kernel to prevent it 
from happening again.

Thanks,
Brett



* Re: non-fresh data unavailable bug
From: Neil Brown @ 2010-01-18  3:32 UTC
  To: Brett Russ; +Cc: linux-raid

On Fri, 15 Jan 2010 10:36:39 -0500
Brett Russ <bruss@netezza.com> wrote:

> On 01/14/2010 02:24 PM, Michael Evans wrote:
> > On Thu, Jan 14, 2010 at 7:10 AM, Brett Russ <bruss@netezza.com> wrote:
> >> Slightly related to my last message here Re:non-fresh behavior, we have seen
> >> cases where the following happens:
> >> * healthy 2 disk raid1 (disks A & B) incurs a problem with disk B
> >> * disk B is removed, unit is now degraded
> >> * replacement disk C is added; recovery from A to C begins
> >> * during recovery, disk A incurs a brief lapse in connectivity.  At this
> >> point C is still up yet only has a partial copy of the data.
> >> * a subsequent assemble operation on the raid1 results in disk A being
> >> kicked out as non-fresh, yet C is allowed in.
> >
> > I believe the desired and logical behavior here is to refuse running
> > an incomplete array unless explicitly forced to do so.  Incremental
> > assembly might be what you're seeing.
> 
> This brings up a good point.  I didn't mention that the assemble in the 
> last step above was forced.  Thus, the "bug" I'm reporting is that under 
> duress, mdadm/md chose to assemble the array with a partially recovered 
> (but "newer") member instead of the older member which was the recovery 
> *source* for the newer member.
> 
> What I think should happen is members that are *destinations* for 
> recovery should *never* receive a higher event count, timestamp, or any 
> other marking than the recovery sources.  By definition they are 
> incomplete and can't be trusted, thus they should never trump a complete 
> member during assemble.  I would assume the code already does this but 
> perhaps there is a hole.
> 
> One other piece of information that may be relevant--we're using 2 
> member RAID1 units with one member marked write-mostly.  At this time, I 
> don't have the specifics for which member (A or B) was the write-mostly 
> member in the example above, but I can find that out.
> 
> > I very much recommend running it read-only until you can determine which
> > assembly pattern produces the most viable results.
> 
> Good tip.  We were able to manually recover the array in the case 
> outlined above, now we're looking back to fixing the kernel to prevent 
> it happening again.
>


Thanks for the report.  It sounds like a real problem.
I'm travelling at the moment, so reproducing it would be a challenge.
If you are able to, can you report the output of
  mdadm -E /dev/list-of-devices
at the key points in the process, and also add "-v" to any
mdadm --assemble
command you use, and report the output?
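
For example, something like this, with the device names adjusted to 
match your actual members (sdA1/sdC1 here are only placeholders):

  # superblock state of each member at the key points
  mdadm -E /dev/sdA1 /dev/sdC1
  # verbose assemble, so we can see why each member is accepted or rejected
  mdadm -v --assemble /dev/md0 /dev/sdA1 /dev/sdC1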

Thanks,
NeilBrown


