From: Martin Wilck <mwilck@arcor.de>
To: NeilBrown <neilb@suse.de>
Cc: Albert Pauw <albert.pauw@gmail.com>, linux-raid@vger.kernel.org
Subject: Re: Bugreport ddf rebuild problems
Date: Tue, 06 Aug 2013 23:26:30 +0200
Message-ID: <52016A06.8070400@arcor.de>
In-Reply-To: <20130806101633.4b8f2374@notabene.brown>
On 08/06/2013 02:16 AM, NeilBrown wrote:
> On Mon, 05 Aug 2013 23:24:28 +0200 Martin Wilck <mwilck@arcor.de> wrote:
>
>> Hi Albert, Neil,
>>
>> I just submitted a new patch series; patch 3/5 integrates your 2nd case
>> as a new unit test and 4/5 should fix it.
>>
>> However @Neil: I am not yet entirely happy with this solution. AFAICS
>> there is a possible race condition here, if a disk fails and mdadm -CR
>> is called to create a new array before the metadata reflecting the
>> failure is written to disk. If a disk failure happens in one array,
>> mdmon will call reconcile_failed() to propagate the failure to other
>> already known arrays in the same container, by writing "faulty" to the
>> sysfs state attribute. It can't do that for a new container though.
>>
>> I thought that process_update() may need to check the kernel state of
>> array members against meta data state when a new VD configuration record
>> is received, but that's impossible because we can't call open() on the
>> respective sysfs files. It could be done in prepare_update(), but that
>> would require major changes, I wanted to ask you first.
>>
>> Another option would be changing manage_new(). But we don't seem to have
>> a suitable metadata handler method to pass the meta data state to the
>> manager....
>>
>> Ideas?
>
> Thanks for the patches - I applied them all.
I don't see them in the public repo yet.
> Is there a race here? When "mdadm -C" looks at the metadata the device will
> either be an active member of another array, or it will be marked faulty.
> Either way mdadm won't use it.
That's right, thanks.
> If the first array was created to use only (say) half of each device and the
> second array was created with a size to fit in the other half of the device
> then it might get interesting.
> "mdadm -C" might see that everything looks good, create the array using the
> second half of that drive that has just failed, and give that info to mdmon.
Yes, I have created a test case for this (10ddf-fail-create-race) which
I am going to submit soon.
> I suspect that ddf_open_new (which currently looks like it is just a stub)
> needs to help out here.
Great idea; I have implemented it. I found that I also needed to freeze
the array in Create(), to keep the kernel from starting a rebuild
before mdmon has checked the correctness of the new array. Please review
that; I'm not 100% sure I got it right.
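
To illustrate why the freeze matters, here is a toy Python model of the
ordering. It is not mdadm code: ToyArray, create(), monitor_check() and
run() are hypothetical names, and the only things borrowed from md are
the sync_action values "idle", "frozen" and "resync".

```python
from dataclasses import dataclass, field


@dataclass
class ToyArray:
    """Hypothetical stand-in for a freshly created md array."""
    members: list
    sync_action: str = "idle"
    events: list = field(default_factory=list)

    def kernel_tick(self):
        # A frozen array never starts a sync on its own; an idle one does.
        if self.sync_action == "idle":
            self.sync_action = "resync"
            self.events.append("kernel: sync started")


def create(members, freeze):
    """Model of Create(): optionally leave the new array frozen."""
    arr = ToyArray(members=list(members))
    if freeze:
        arr.sync_action = "frozen"
    return arr


def monitor_check(arr, failed_in_metadata):
    """Model of the monitor validating the array, then unfreezing it."""
    for dev in arr.members:
        if dev in failed_in_metadata:
            arr.events.append(f"monitor: {dev} marked faulty")
    if arr.sync_action == "frozen":
        arr.sync_action = "idle"


def run(freeze):
    arr = create(["sda", "sdb"], freeze)
    arr.kernel_tick()               # the kernel gets a chance first
    monitor_check(arr, {"sdb"})     # sdb is failed according to metadata
    arr.kernel_tick()
    return arr.events
```

Without the freeze, the model starts syncing before the monitor has
looked at the members; with it, the faulty member is flagged first.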
> When manage_new() gets told about a new array it will collect relevant info
> from sysfs and call ->open_new() to make sure it matches the metadata.
> ddf_open_new should check that all the devices in the array are recorded as
> working in the metadata. If any are failed, it can write 'faulty' to the
> relevant state_fd.
>
> Possibly the same thing can be done generically in manage_new() as you
> suggested. After the new array has been passed over to the monitor thread,
> manage_new() could check if any devices should be failed much like
> reconcile_failed() does and just fail them.
>
> Does that make any sense? Did I miss something?
It makes a lot of sense.
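
The check you describe could look roughly like the following. This is a
Python sketch of the logic only, not the DDF handler: open_new_check(),
the metadata_state dict and the write_state callback are assumptions
made for illustration. Only the value "faulty" comes from this thread.

```python
def open_new_check(members, metadata_state, write_state):
    """Fail every member of a new array that the metadata records as failed.

    members        -- device names in the new array
    metadata_state -- hypothetical dict: device name -> "online" or "failed"
    write_state    -- callable(dev, value), standing in for a write to the
                      device's sysfs state file descriptor
    Returns the list of devices that were failed.
    """
    failed = []
    for dev in members:
        if metadata_state.get(dev) == "failed":
            write_state(dev, "faulty")
            failed.append(dev)
    return failed
```

Devices that the metadata knows nothing about are left alone in this
sketch; whether that is the right policy is a separate question.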
While testing, I found another minor problem case:

 1. A disk fails in an array that uses half of each device.
 2. mdmon activates a spare.
 3. mdadm -C is called, finds old metadata, and allocates an extent at
    offset 0 on the spare.
 4. Create() gets an error writing to the "size" sysfs attribute because
    offset 0 has already been grabbed by the spare recovery.

That's not too bad after all, because the array won't be created. The
user just needs to re-issue the mdadm -C command, which will then succeed
because the metadata should have been written to disk in the meantime.
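
The sequence above can be reproduced with a small first-fit allocator.
This is a Python sketch under simplifying assumptions (extents are
(start, length) pairs in arbitrary units, and both function names are
made up); it is not how mdadm represents extents.

```python
def first_free_offset(used, length, dev_size):
    """Return the lowest offset with `length` free units, or None."""
    pos = 0
    for start, size in sorted(used):
        if start - pos >= length:
            return pos
        pos = max(pos, start + size)
    return pos if dev_size - pos >= length else None


def overlaps(used, start, length):
    """True if [start, start + length) intersects any used extent."""
    return any(start < s + n and s < start + length for s, n in used)


# Spare as seen in the stale on-disk metadata: nothing allocated yet.
stale_view = []
# Spare as the kernel sees it: recovery already occupies the first half.
kernel_view = [(0, 500)]

offset = first_free_offset(stale_view, 500, 1000)
conflict = overlaps(kernel_view, offset, 500)   # step 4: the write fails
retry_offset = first_free_offset(kernel_view, 500, 1000)
```

With the stale view the allocator picks offset 0, which collides with
the recovery; once the metadata is current, the retry lands on the
second half of the device.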
That said, some kind of locking between mdadm and mdmon (mdadm would not
read metadata while mdmon is busy writing it) might be desirable. It
would be even better to do all metadata operations through mdmon, with
mdadm just sending messages to it. That would be a major architectural
change for mdadm, but it would avoid this kind of "different metadata
here and there" problem altogether.
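
One possible shape for such locking, sketched in Python with an advisory
flock(2) on a per-container lock file. Nothing like this exists in mdadm
as far as this thread says: the lock file, metadata_lock() and
try_read_metadata() are all hypothetical, and the sketch assumes Linux
flock semantics.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def metadata_lock(path, exclusive, blocking=True):
    """Hold an advisory lock on `path` for the duration of the block.

    A writer (the mdmon role) would take it exclusively, a reader
    (the mdadm role) shared.
    """
    flags = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    if not blocking:
        flags |= fcntl.LOCK_NB
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Raises BlockingIOError if non-blocking and the lock is busy.
        fcntl.flock(fd, flags)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def try_read_metadata(path):
    """Return True if a reader could get in, False if a writer is busy."""
    try:
        with metadata_lock(path, exclusive=False, blocking=False):
            return True
    except BlockingIOError:
        return False
```

A reader that gets False would wait and retry rather than act on
metadata that is about to change. Routing all metadata operations
through mdmon would make the lock unnecessary.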
Thanks
Martin
>
> Thanks,
> NeilBrown