Re: Bugreport ddf rebuild problems

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Albert Pauw <albert.pauw@gmail.com>
To: Martin Wilck <mwilck@arcor.de>, Neil Brown <neilb@suse.de>,
	linux-raid@vger.kernel.org
Subject: Re: Bugreport ddf rebuild problems
Date: Wed, 07 Aug 2013 20:07:15 +0200	[thread overview]
Message-ID: <52028CD3.5050606@gmail.com> (raw)
In-Reply-To: <52016A06.8070400@arcor.de>

Just to let you guys know that I pulled the latest git version 
(including DDF: Write new conf entries with a single write.)

As for now, it all seems to work rather nicely.

Thanks guys,

Albert


On 08/06/2013 11:26 PM, Martin Wilck wrote:
> On 08/06/2013 02:16 AM, NeilBrown wrote:
>> On Mon, 05 Aug 2013 23:24:28 +0200 Martin Wilck <mwilck@arcor.de> wrote:
>>
>>> Hi Albert, Neil,
>>>
>>> I just submitted a new patch series; patch 3/5 integrates your 2nd case
>>> as a new unit test and 4/5 should fix it.
>>>
>>> However @Neil: I am not yet entirely happy with this solution. AFAICS
>>> there is a possible race condition here, if a disk fails and mdadm -CR
>>> is called to create a new array before the metadata reflecting the
>>> failure is written to disk. If a disk failure happens in one array,
>>> mdmon will call reconcile_failed() to propagate the failure to other
>>> already known arrays in the same container, by writing "faulty" to the
>>> sysfs state attribute. It can't do that for a new container though.
>>>
>>> I thought that process_update() may need to check the kernel state of
>>> array members against meta data state when a new VD configuration record
>>> is received, but that's impossible because we can't call open() on the
>>> respective sysfs files. It could be done in prepare_update(), but that
>>> would require major changes, I wanted to ask you first.
>>>
>>> Another option would be changing manage_new(). But we don't seem to have
>>> a suitable metadata handler method to pass the meta data state to the
>>> manager....
>>>
>>> Ideas?
>> Thanks for the patches - I applied them all.
> I don't see them in the public repo yet.
>
>> Is there a race here?  When "mdadm -C" looks at the metadata the device will
>> either be an active member of another array, or it will be marked faulty.
>> Either way mdadm won't use it.
> That's right, thanks.
>
>> If the first array was created to use only (say) half of each device and the
>> second array was created with a size to fit in the other half of the device
>> then it might get interesting.
>> "mdadm -C" might see that everything looks good, create the array using the
>> second half of that drive that has just failed, and give that info to mdmon.
> Yes, I have created a test case for this (10ddf-fail-create-race) which
> I am going to submit soon.
>
>> I suspect that ddf_open_new (which currently looks like it is just a stub)
>> needs to help out here.
> Great idea, I made an implementation. I found that I needed to freeze
> the array in Create(), too, to avoid the kernel starting a rebuild
> before the mdmon checked the correctness of the new array. Please review
> that, I'm not 100% positive I got it right.
>
>> When manage_new() gets told about a new array it will collect relevant info
>> from sysfs and call ->open_new() to make sure it matches the metadata.
>> ddf_open_new should check that all the devices in the array are recorded as
>> working in the metadata.  If any are failed, it can write 'faulty' to the
>> relevant state_fd.
>>
>> Possibly the same thing can be done generically in manage_new() as you
>> suggested.  After the new array has been passed over to the monitor thread,
>> manage_new() could check if any devices should be failed much like
>> reconcile_failed() does and just fail them.
>>
>> Does that make any sense?  Did I miss something?
> It makes a lot of sense.
>
> While testing, I found another minor problem case:
>
>   1 disk fails in array taking half size
>   2 mdmon activates spare
>   3 mdadm -C is called and finds old meta data, allocates extent at
> offset 0 on the spare
>   4 Create() gets an error writing to the "size" sysfs attribute because
> offset 0 has been grabbed by the spare recovery already
>
> That's not too bad, after all, because the array won't be created. The
> user just needs to re-issue his mdadm -C command which will now succeed
> because the meta data should have been written to disk in the meantime.
>
> That said, some kind of locking between mdadm and mdmon (mdadm won't
> read meta data as long as mdmon is busy writing them) might be
> desirable. It would be even better to do all meta data operations
> through mdmon, mdadm just sending messages to it. That would be a major
> architectural change for mdadm, but it would avoid this kind of
> "different meta data here and there" problem altogether.
>
> Thanks
> Martin
>
>> Thanks,
>> NeilBrown

next prev parent reply	other threads:[~2013-08-07 18:07 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-08-01 19:30 Bugreport ddf rebuild problems Albert Pauw
2013-08-01 19:09 ` Martin Wilck
     [not found]   ` <CAGkViCFT0+qm9YAnAJM7JGgVg0RTJi8=HAYDTMs-mfhXinqdcg@mail.gmail.com>
2013-08-01 21:13     ` Martin Wilck
2013-08-01 22:09       ` Martin Wilck
2013-08-01 22:37         ` Martin Wilck
2013-08-03  9:43           ` Albert Pauw
2013-08-04  9:47             ` Albert Pauw
2013-08-05 16:55               ` Albert Pauw
2013-08-05 21:24                 ` Martin Wilck
2013-08-06  0:16                   ` NeilBrown
2013-08-06 21:26                     ` Martin Wilck
2013-08-06 21:37                       ` Patches related to current discussion mwilck
2013-08-06 21:38                       ` [PATCH 6/9] tests/10ddf-fail-spare: more sophisticated result checks mwilck
2013-08-06 21:38                       ` [PATCH 7/9] tests/10ddf-fail-create-race: test handling of fail/create race mwilck
2013-08-06 21:38                       ` [PATCH 8/9] DDF: ddf_open_new: check device status for new subarray mwilck
2013-08-06 21:38                       ` [PATCH 9/9] Create: set array status to frozen until monitoring starts mwilck
2013-08-08  0:44                         ` NeilBrown
2013-08-08  7:31                           ` Martin Wilck
2013-08-07 18:07                       ` Albert Pauw [this message]
2013-08-08  0:40                       ` Bugreport ddf rebuild problems NeilBrown
     [not found] <CAGkViCHPvbmcehFvACBKVFFCw+DdnjqvK2uNGmvKrFki+n9n-Q@mail.gmail.com>
2013-08-05  6:21 ` NeilBrown
2013-08-05  7:17   ` Albert Pauw

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=52028CD3.5050606@gmail.com \
    --to=albert.pauw@gmail.com \
    --cc=linux-raid@vger.kernel.org \
    --cc=mwilck@arcor.de \
    --cc=neilb@suse.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).