* Re: Bugreport ddf rebuild problems
  From: NeilBrown <neilb@suse.de>
  Date: 2013-08-05  6:21 UTC
  To: Albert Pauw; Cc: linux-raid

On Thu, 1 Aug 2013 21:22:44 +0200 Albert Pauw <albert.pauw@gmail.com> wrote:

> Now I've got it working, I checked a simple rebuild test with a ddf
> container, containing two raid sets (raid 1 and raid 5).
>
> Having 6 loopback devices of 512 MB each, I opened this in a separate
> screen to monitor what happens:
>
>     while true; do clear; cat /proc/mdstat; sleep 1; done

For future reference:

    watch cat /proc/mdstat

works very nicely for this sort of task.

I believe the bug you described is fixed in the current git tree. If you
could confirm, I would appreciate it.

Thanks,
NeilBrown
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-05  7:17 UTC
  To: NeilBrown; Cc: linux-raid

Hi Neil,

I will check the latest git version tonight and let you know.

Thanks to you and Martin Wilck,

Albert

On 5 August 2013 08:21, NeilBrown <neilb@suse.de> wrote:
> For future reference:
>
>     watch cat /proc/mdstat
>
> works very nicely for this sort of task.
>
> I believe the bug you described is fixed in the current git tree. If you
> could confirm, I would appreciate it.
* Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-01 19:30 UTC
  To: linux-raid

Now I've got it working, I checked a simple rebuild test with a ddf
container, containing two raid sets (raid 1 and raid 5).

Having 6 loopback devices of 512 MB each, I opened this in a separate
screen to monitor what happens:

    while true; do clear; cat /proc/mdstat; sleep 1; done

Now here are my commands; watch the rebuild.

    [root@moonlight ~]# mdadm --zero-superblock /dev/loop[1-6]
    [root@moonlight ~]# mdadm -CR /dev/md127 -e ddf -l container -n 5 /dev/loop[1-5]
    mdadm: /dev/loop1 appears to contain an ext2fs file system
           size=153600K  mtime=Thu Jan  1 01:00:00 1970
    mdadm: /dev/loop2 appears to contain an ext2fs file system
           size=153600K  mtime=Thu Jan  1 01:00:00 1970
    mdadm: container /dev/md127 prepared.
    [root@moonlight ~]# mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md0
    mdadm: You haven't given enough devices (real or missing) to create this array
    [root@moonlight ~]# mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md127
    mdadm: array /dev/md0 started.
    [root@moonlight ~]# mdadm -CR /dev/md1 -l raid5 -n 3 /dev/md127
    mdadm: array /dev/md1 started.
    [root@moonlight ~]# mdadm -f /dev/md0 /dev/loop4
    mdadm: set /dev/loop4 faulty in /dev/md0
    [root@moonlight ~]# mdadm -f /dev/md1 /dev/loop2
    mdadm: set /dev/loop2 faulty in /dev/md1
    [root@moonlight ~]# mdadm --add /dev/md127 /dev/loop6
    mdadm: added /dev/loop6

What should happen is that one of the raid sets gets rebuilt with the
added spare. One raid set does get the spare disk (which is good) and
starts rebuilding; the second one, well, have a look:

    [root@moonlight ~]# cat /proc/mdstat
    Personalities : [raid1] [raid6] [raid5] [raid4]
    md1 : active raid5 loop6[3] loop1[2] loop3[0]
          958464 blocks super external:/md127/1 level 5, 512k chunk,
          algorithm 2 [3/3] [UUU]

    md0 : active raid1 loop3[3](S) loop2[2] loop5[0]
          479232 blocks super external:/md127/0 [2/2] [UU]

    md127 : inactive loop6[5](S) loop5[4](S) loop4[3](S) loop3[2](S)
          loop2[1](S) loop1[0](S)
          196608 blocks super external:ddf

Effectively a rebuilt raid 1 set (md0) with three disks, including a
spare, using a failed disk. Looks like something is going wrong here.

Albert
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-01 19:09 UTC
  To: Albert Pauw, linux-raid

Hi Albert,

thanks for reporting this. I can reproduce it. Out of curiosity, is this
a regression?

Martin
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-01 21:13 UTC
  To: Albert Pauw, linux-raid

Can you please try the attached patch?

Regards,
Martin

[Attachment: 0001-DDF-ddf_process_update-delete-removed-disks-from-dli.patch]

From e6b25d31ee79c1191f936e178324c97d2ce68a57 Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@arcor.de>
Date: Fri, 2 Aug 2013 00:11:20 +0200
Subject: [PATCH] DDF: ddf_process_update: delete removed disks from dlist

We currently remove Failed disks that are no longer used by any VD. If
we do that, we must remove them from dlist, too, because otherwise the
same pdnum may occur multiple times in the dlist.
---
 super-ddf.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/super-ddf.c b/super-ddf.c
index 65472a2..f8d5fcd 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -4366,6 +4366,18 @@ static void copy_matching_bvd(struct ddf_super *ddf,
 			conf->sec_elmnt_seq, guid_str(conf->guid));
 }
 
+static void _delete_dl_by_refnum(struct ddf_super *ddf, be32 refnum)
+{
+	struct dl **pdl = &ddf->dlist;
+	while (*pdl) {
+		if (be32_eq((*pdl)->disk.refnum, refnum)) {
+			free(*pdl);
+			*pdl = (*pdl)->next;
+		} else
+			pdl = &(*pdl)->next;
+	}
+}
+
 static void ddf_process_update(struct supertype *st,
 			       struct metadata_update *update)
 {
@@ -4638,9 +4650,15 @@ static void ddf_process_update(struct supertype *st,
 		if (be16_and(ddf->phys->entries[pdnum].state,
 			     cpu_to_be16(DDF_Failed))
 		    && be16_and(ddf->phys->entries[pdnum].state,
-				cpu_to_be16(DDF_Transition)))
-			/* skip this one */;
-		else if (pdnum == pd2)
+				cpu_to_be16(DDF_Transition))) {
+			/* skip this one and remove from dlist */
+			dprintf("%s: %08x no longer used, removing it\n",
+				__func__,
+				be32_to_cpu(ddf->phys->
+					    entries[pdnum].refnum));
+			_delete_dl_by_refnum(
+				ddf, ddf->phys->entries[pdnum].refnum);
+		} else if (pdnum == pd2)
 			pd2++;
 		else {
 			ddf->phys->entries[pd2] =
-- 
1.7.1
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-01 22:09 UTC
  To: Albert Pauw, linux-raid

On 08/01/2013 11:13 PM, Martin Wilck wrote:
> Can you please try the attached patch?

DON'T use it, it's broken. I will send a better one in a minute.

Martin
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-01 22:37 UTC
  To: Albert Pauw, linux-raid

Hi Albert,

On 08/02/2013 12:09 AM, Martin Wilck wrote:
> On 08/01/2013 11:13 PM, Martin Wilck wrote:
>> Can you please try the attached patch?
>
> DON'T use it, it's broken. I will send a better one in a minute.

My latest patch series should fix this. The fix for your problem is 3/4,
and I added your test case as 4/4. A retest is welcome.

Martin
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-03  9:43 UTC
  To: Martin Wilck; Cc: linux-raid

Hi Martin,

I can confirm that patch 3/4 indeed fixed the problem.

Thanks for your quick fix.

Kind regards,

Albert

On 08/02/2013 12:37 AM, Martin Wilck wrote:
> My latest patch series should fix this. The fix for your problem is 3/4,
> and I added your test case as 4/4. A retest is welcome.
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-04  9:47 UTC
  To: Martin Wilck, linux-raid

Hi Martin,

I noticed another problem; it occurs with or without your patch.

Zeroed the superblocks (just in case):

    mdadm --zero-superblock /dev/loop[1-6]

Created the container:

    mdadm -CR /dev/md127 -e ddf -l container -n 5 /dev/loop[1-5]

Created an md device in the container; it used /dev/loop4 and
/dev/loop5, resynced and finished:

    mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md127

I fail one of the disks, and the array rebuilds with one of the
available unused disks in the container:

    mdadm -f /dev/md0 /dev/loop4

I add another md device:

    mdadm -CR /dev/md1 -l raid5 -n 3 /dev/md127

This last one is the odd bit: it is built using the previously failed
disk. It looks like the container is not aware that /dev/loop4 has
failed, and reuses it, which is wrong. The failed status is not kept.

Regards,

Albert
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-05 16:55 UTC
  To: Martin Wilck, linux-raid, Neil Brown

Hi Neil/Martin,

I have pulled the latest git version (up to and including "Makefile:
check that 'run' directory exists"). It works, except for the last
error I noticed, quoted below.

Regards,

Albert

On 08/04/2013 11:47 AM, Albert Pauw wrote:
> I noticed another problem; it occurs with or without your patch.
> [...]
> This last one is the odd bit: it is built using the previously failed
> disk. It looks like the container is not aware that /dev/loop4 has
> failed, and reuses it, which is wrong. The failed status is not kept.
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-05 21:24 UTC
  To: Albert Pauw, Neil Brown; Cc: linux-raid

Hi Albert, Neil,

I just submitted a new patch series; patch 3/5 integrates your 2nd case
as a new unit test, and 4/5 should fix it.

However, @Neil: I am not yet entirely happy with this solution. AFAICS
there is a possible race condition here, if a disk fails and mdadm -CR
is called to create a new array before the metadata reflecting the
failure is written to disk. If a disk failure happens in one array,
mdmon will call reconcile_failed() to propagate the failure to other
already known arrays in the same container, by writing "faulty" to the
sysfs state attribute. It can't do that for a new array, though.

I thought that process_update() might need to check the kernel state of
array members against metadata state when a new VD configuration record
is received, but that's impossible because we can't call open() on the
respective sysfs files. It could be done in prepare_update(), but that
would require major changes, so I wanted to ask you first.

Another option would be changing manage_new(). But we don't seem to have
a suitable metadata handler method to pass the metadata state to the
manager...

Ideas?

Regards,
Martin
* Re: Bugreport ddf rebuild problems
  From: NeilBrown <neilb@suse.de>
  Date: 2013-08-06  0:16 UTC
  To: Martin Wilck; Cc: Albert Pauw, linux-raid

On Mon, 05 Aug 2013 23:24:28 +0200 Martin Wilck <mwilck@arcor.de> wrote:

> I just submitted a new patch series; patch 3/5 integrates your 2nd case
> as a new unit test, and 4/5 should fix it.
>
> However, @Neil: I am not yet entirely happy with this solution. AFAICS
> there is a possible race condition here, if a disk fails and mdadm -CR
> is called to create a new array before the metadata reflecting the
> failure is written to disk. [...]
>
> Ideas?

Thanks for the patches - I applied them all.

Is there a race here? When "mdadm -C" looks at the metadata, the device
will either be an active member of another array, or it will be marked
faulty. Either way mdadm won't use it.

If the first array was created to use only (say) half of each device,
and the second array was created with a size to fit in the other half
of the device, then it might get interesting: "mdadm -C" might see that
everything looks good, create the array using the second half of the
drive that has just failed, and give that info to mdmon.

I suspect that ddf_open_new (which currently looks like it is just a
stub) needs to help out here. When manage_new() gets told about a new
array, it will collect relevant info from sysfs and call ->open_new()
to make sure it matches the metadata. ddf_open_new should check that
all the devices in the array are recorded as working in the metadata.
If any are failed, it can write 'faulty' to the relevant state_fd.

Possibly the same thing can be done generically in manage_new(), as you
suggested. After the new array has been passed over to the monitor
thread, manage_new() could check whether any devices should be failed,
much like reconcile_failed() does, and just fail them.

Does that make any sense? Did I miss something?

Thanks,
NeilBrown
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-06 21:26 UTC
  To: NeilBrown; Cc: Albert Pauw, linux-raid

On 08/06/2013 02:16 AM, NeilBrown wrote:

> Thanks for the patches - I applied them all.

I don't see them in the public repo yet.

> Is there a race here? When "mdadm -C" looks at the metadata, the device
> will either be an active member of another array, or it will be marked
> faulty. Either way mdadm won't use it.

That's right, thanks.

> If the first array was created to use only (say) half of each device,
> and the second array was created with a size to fit in the other half
> of the device, then it might get interesting: "mdadm -C" might see that
> everything looks good, create the array using the second half of the
> drive that has just failed, and give that info to mdmon.

Yes, I have created a test case for this (10ddf-fail-create-race) which
I am going to submit soon.

> I suspect that ddf_open_new (which currently looks like it is just a
> stub) needs to help out here.

Great idea; I made an implementation. I found that I needed to freeze
the array in Create(), too, to avoid the kernel starting a rebuild
before mdmon has checked the correctness of the new array. Please
review that; I'm not 100% positive I got it right.

> Possibly the same thing can be done generically in manage_new(), as you
> suggested. [...]
>
> Does that make any sense? Did I miss something?

It makes a lot of sense.

While testing, I found another minor problem case:

1. A disk fails in an array taking half the size.
2. mdmon activates a spare.
3. mdadm -C is called, finds old metadata, and allocates an extent at
   offset 0 on the spare.
4. Create() gets an error writing to the "size" sysfs attribute,
   because offset 0 has already been grabbed by the spare recovery.

That's not too bad, after all, because the array won't be created. The
user just needs to re-issue the mdadm -C command, which will now
succeed because the metadata should have been written to disk in the
meantime.

That said, some kind of locking between mdadm and mdmon (mdadm won't
read metadata as long as mdmon is busy writing it) might be desirable.
It would be even better to do all metadata operations through mdmon,
with mdadm just sending messages to it. That would be a major
architectural change for mdadm, but it would avoid this kind of
"different metadata here and there" problem altogether.

Thanks,
Martin
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-07 18:07 UTC
  To: Martin Wilck, Neil Brown, linux-raid

Just to let you guys know that I pulled the latest git version
(including "DDF: Write new conf entries with a single write"). As of
now, it all seems to work rather nicely.

Thanks guys,

Albert

On 08/06/2013 11:26 PM, Martin Wilck wrote:
> Great idea; I made an implementation. I found that I needed to freeze
> the array in Create(), too, to avoid the kernel starting a rebuild
> before mdmon has checked the correctness of the new array.
> [rest of the race discussion trimmed]
* Re: Bugreport ddf rebuild problems
  From: NeilBrown <neilb@suse.de>
  Date: 2013-08-08  0:40 UTC
  To: Martin Wilck; Cc: Albert Pauw, linux-raid

On Tue, 06 Aug 2013 23:26:30 +0200 Martin Wilck <mwilck@arcor.de> wrote:

> While testing, I found another minor problem case:
>
> 1. A disk fails in an array taking half the size.
> 2. mdmon activates a spare.
> 3. mdadm -C is called, finds old metadata, and allocates an extent at
>    offset 0 on the spare.
> 4. Create() gets an error writing to the "size" sysfs attribute,
>    because offset 0 has already been grabbed by the spare recovery.
>
> That said, some kind of locking between mdadm and mdmon (mdadm won't
> read metadata as long as mdmon is busy writing it) might be desirable.

I think the way this locking "should" happen is with an O_EXCL lock on
the container. mdadm and mdmon already do this to some extent, but it
is hard to find :-)

The most obvious usage is that mdmon insists on getting an O_EXCL open
before it will exit. --create also gets an O_EXCL open, so mdmon cannot
exit while a new array might be being created. See "When creating a
member, we need to be careful" in Create.c and "No interesting arrays,
or we have been told to" in monitor.c.

So this race could be closed by the manager getting an O_EXCL open on
the container before performing spare management.

Thanks,
NeilBrown