From: Neil Brown <neilb@suse.de>
To: Mikael Abrahamsson <swmike@swm.pp.se>
Cc: linux-raid@vger.kernel.org
Subject: Re: 2 drives failed, one "active", one with wrong event count
Date: Mon, 1 Feb 2010 09:37:13 +1100
Message-ID: <20100201093713.7ee33041@notabene.brown>
In-Reply-To: <alpine.DEB.1.10.1001301546400.15329@uplift.swm.pp.se>
On Sat, 30 Jan 2010 22:20:34 +0100 (CET)
Mikael Abrahamsson <swmike@swm.pp.se> wrote:
> On Fri, 29 Jan 2010, Mikael Abrahamsson wrote:
>
> > Yes, that solved the problem. Thanks a bunch!
>
> Now I have another problem. Last time one other drive was kicked out
> during the resync due to UNC read errors. I ddrescued this drive to
> another drive on another system, and inserted the drive I copied to. So
> basically I have 5 drives which contain valid information of which one has
> a lower event count, and one drive being resync:ed. This state doesn't
> seem to be ok...
>
> I guess if I removed the drive being resync:ed to and assembled it with
> --force it would update the event count of sdh (the copy of the drive that
> previously had read errors) and all would be fine. The bad part is that I
> don't really know which of the drives was being resync:ed to. Is this
> indicated by the "feature map" (guess 0x2 means partially sync:ed).
0x2 means "the 'recovery_offset' field is valid", which does correlate well
with "is partially sync:ed".
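To make the bit test concrete, here is a minimal sketch in Python of
checking a feature map value such as the one printed by "mdadm --examine".
The constant values are those of the v1 superblock feature bits as I
understand them; the function names are made up for illustration and are
not part of mdadm.

```python
# Feature-map bits of the version-1 md superblock.
# Assumption: these match the kernel's md_p.h definitions.
MD_FEATURE_BITMAP_OFFSET = 0x1    # a write-intent bitmap offset is recorded
MD_FEATURE_RECOVERY_OFFSET = 0x2  # the recovery_offset field is valid
MD_FEATURE_RESHAPE_ACTIVE = 0x4   # a reshape is in progress


def parse_feature_map(text: str) -> int:
    """Parse a hex string such as '0x2' (hypothetical helper)."""
    return int(text.strip(), 16)


def is_partially_recovered(feature_map: int) -> bool:
    """True when recovery_offset is valid, i.e. the device was most
    likely in the middle of being recovered onto."""
    return bool(feature_map & MD_FEATURE_RECOVERY_OFFSET)
```

A device reporting 0x2 (or 0x3, with a bitmap as well) would be the one
that was being recovered to; a device reporting 0x0 or 0x1 would not.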
>
> (6 hrs later: Ok, I physically removed the 0x2 drive and used --assemble
> --force and then I added a different drive and that seemed to work)
>
> I don't know what the default action should be when there is a partially
> resync:ed drive and a drive with lower event count, but I tend to lean
> towards that it should take the drive with the lower event count and
> insert it, and then start sync:ing to the 0x2 drive. This might require
> some new options to mdadm to handle this behaviour?
You might know that nothing has been written to the array since the device
with the lower event count was removed, but md doesn't know that. Any device
with an old event count could hold old data and so cannot be trusted (unless
you assemble with --force, meaning that you are taking responsibility).
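The trust rule can be sketched roughly as follows. This is a simplified
illustration in Python, not md's or mdadm's actual logic: the real code
applies further checks and may tolerate small differences in event count.
The Member type, the function name and the event numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str    # device name, e.g. "sdh"
    events: int  # event count read from the device's superblock


def split_by_events(members, force=False):
    """Return (trusted, stale) lists of members.

    Devices at the highest event count are trusted. Devices with a lower
    count are stale, unless force is set, in which case the caller takes
    responsibility for their contents being current.
    """
    newest = max(m.events for m in members)
    trusted = [m for m in members if m.events == newest]
    stale = [m for m in members if m.events < newest]
    if force:
        trusted, stale = trusted + stale, []
    return trusted, stale
```

For example, given members with event counts 120, 120 and 117, only the
first two are trusted by default; with force=True all three are used.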
My planned way to address this situation is to store a bad-block-list per
device and when we get an unrecoverable failure, record the address in the
bad-block-list and continue as best we can.
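As a very rough sketch of that idea (in Python, with invented names; the
eventual on-disk format and kernel interface are not described here and
may look nothing like this), a per-device list could record the failing
sector and let recovery carry on instead of failing the whole device:

```python
class BadBlockList:
    """Hypothetical per-device record of unreadable sectors."""

    def __init__(self):
        self._sectors = set()

    def record(self, sector: int) -> None:
        self._sectors.add(sector)

    def is_bad(self, sector: int) -> bool:
        return sector in self._sectors


def read_for_recovery(read_sector, bad_blocks, sector):
    """Try to read one sector during recovery.

    read_sector is a caller-supplied function that returns the data or
    raises OSError on an unrecoverable read error. On failure the sector
    is recorded and None is returned, so the caller can continue with
    the next sector rather than kicking the device out of the array.
    """
    try:
        return read_sector(sector)
    except OSError:
        bad_blocks.record(sector)
        return None
```

Later reads of a recorded sector could then be failed individually while
the rest of the device stays in service.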
NeilBrown