* Re: Bugreport ddf rebuild problems
  From: NeilBrown <neilb@suse.de>
  Date: 2013-08-05  6:21 UTC
  To: Albert Pauw; Cc: linux-raid

On Thu, 1 Aug 2013 21:22:44 +0200 Albert Pauw <albert.pauw@gmail.com> wrote:

> Now I've got it working, I checked a simple rebuild test with a ddf
> container, containing two raid sets (raid 1 and raid 5).
>
> Having 6 loopback devices of 512 MB each, I opened this in a separate
> screen to monitor what happens:
>
>     while true; do clear; cat /proc/mdstat; sleep 1; done

For future reference:

    watch cat /proc/mdstat

works very nicely for this sort of task.

I believe the bug you described is fixed in the current git tree. If you
could confirm, I would appreciate it.

Thanks,
NeilBrown
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-05  7:17 UTC
  To: NeilBrown; Cc: linux-raid

Hi Neil,

I will check the latest git version tonight and let you know.

Thanks to you and Martin Wilck,

Albert

On 5 August 2013 08:21, NeilBrown <neilb@suse.de> wrote:
> For future reference:
>
>     watch cat /proc/mdstat
>
> works very nicely for this sort of task.
>
> I believe the bug you described is fixed in the current git tree. If you
> could confirm, I would appreciate it.
* Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-01 19:30 UTC
  To: linux-raid

Now I've got it working, I checked a simple rebuild test with a ddf
container, containing two raid sets (raid 1 and raid 5).

Having 6 loopback devices of 512 MB each, I opened this in a separate
screen to monitor what happens:

    while true; do clear; cat /proc/mdstat; sleep 1; done

Now here are my commands; watch the rebuild.

    [root@moonlight ~]# mdadm --zero-superblock /dev/loop[1-6]
    [root@moonlight ~]# mdadm -CR /dev/md127 -e ddf -l container -n 5 /dev/loop[1-5]
    mdadm: /dev/loop1 appears to contain an ext2fs file system
           size=153600K  mtime=Thu Jan  1 01:00:00 1970
    mdadm: /dev/loop2 appears to contain an ext2fs file system
           size=153600K  mtime=Thu Jan  1 01:00:00 1970
    mdadm: container /dev/md127 prepared.
    [root@moonlight ~]# mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md0
    mdadm: You haven't given enough devices (real or missing) to create this array
    [root@moonlight ~]# mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md127
    mdadm: array /dev/md0 started.
    [root@moonlight ~]# mdadm -CR /dev/md1 -l raid5 -n 3 /dev/md127
    mdadm: array /dev/md1 started.
    [root@moonlight ~]# mdadm -f /dev/md0 /dev/loop4
    mdadm: set /dev/loop4 faulty in /dev/md0
    [root@moonlight ~]# mdadm -f /dev/md1 /dev/loop2
    mdadm: set /dev/loop2 faulty in /dev/md1
    [root@moonlight ~]# mdadm --add /dev/md127 /dev/loop6
    mdadm: added /dev/loop6

What should happen is that one of the raid sets gets rebuilt with the
added spare. One raid set does get the spare disk (which is good) and
starts rebuilding; the second one, well, have a look:

    [root@moonlight ~]# cat /proc/mdstat
    Personalities : [raid1] [raid6] [raid5] [raid4]
    md1 : active raid5 loop6[3] loop1[2] loop3[0]
          958464 blocks super external:/md127/1 level 5, 512k chunk,
          algorithm 2 [3/3] [UUU]

    md0 : active raid1 loop3[3](S) loop2[2] loop5[0]
          479232 blocks super external:/md127/0 [2/2] [UU]

    md127 : inactive loop6[5](S) loop5[4](S) loop4[3](S) loop3[2](S)
          loop2[1](S) loop1[0](S)
          196608 blocks super external:ddf

Effectively a rebuilt raid 1 set (md0) with three disks, including a
spare, using a failed disk. Looks like something is going wrong here.

Albert
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-01 19:09 UTC
  To: Albert Pauw, linux-raid

Hi Albert,

thanks for reporting this. I can reproduce it. Out of curiosity, is this
a regression?

Martin
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-01 21:13 UTC
  To: Albert Pauw, linux-raid

Can you please try the attached patch?

Regards,
Martin

[Attachment: 0001-DDF-ddf_process_update-delete-removed-disks-from-dli.patch]

From e6b25d31ee79c1191f936e178324c97d2ce68a57 Mon Sep 17 00:00:00 2001
From: Martin Wilck <mwilck@arcor.de>
Date: Fri, 2 Aug 2013 00:11:20 +0200
Subject: [PATCH] DDF: ddf_process_update: delete removed disks from dlist

We currently remove Failed disks that are no longer used by any VD. If
we do that, we must remove them from dlist, too, because otherwise the
same pdnum may occur multiple times in the dlist.
---
 super-ddf.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/super-ddf.c b/super-ddf.c
index 65472a2..f8d5fcd 100644
--- a/super-ddf.c
+++ b/super-ddf.c
@@ -4366,6 +4366,18 @@ static void copy_matching_bvd(struct ddf_super *ddf,
 			conf->sec_elmnt_seq, guid_str(conf->guid));
 }
 
+static void _delete_dl_by_refnum(struct ddf_super *ddf, be32 refnum)
+{
+	struct dl **pdl = &ddf->dlist;
+	while (*pdl) {
+		if (be32_eq((*pdl)->disk.refnum, refnum)) {
+			free(*pdl);
+			*pdl = (*pdl)->next;
+		} else
+			pdl = &(*pdl)->next;
+	}
+}
+
 static void ddf_process_update(struct supertype *st,
 			       struct metadata_update *update)
 {
@@ -4638,9 +4650,15 @@ static void ddf_process_update(struct supertype *st,
 		if (be16_and(ddf->phys->entries[pdnum].state,
 			     cpu_to_be16(DDF_Failed))
 		    && be16_and(ddf->phys->entries[pdnum].state,
-				cpu_to_be16(DDF_Transition)))
-			/* skip this one */;
-		else if (pdnum == pd2)
+				cpu_to_be16(DDF_Transition))) {
+			/* skip this one and remove from dlist */
+			dprintf("%s: %08x no longer used, removing it\n",
+				__func__,
+				be32_to_cpu(ddf->phys->
+					    entries[pdnum].refnum));
+			_delete_dl_by_refnum(
+				ddf, ddf->phys->entries[pdnum].refnum);
+		} else if (pdnum == pd2)
 			pd2++;
 		else {
 			ddf->phys->entries[pd2] =
-- 
1.7.1
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-01 22:09 UTC
  To: Albert Pauw, linux-raid

On 08/01/2013 11:13 PM, Martin Wilck wrote:
> Can you please try the attached patch?

DON'T use it, it's broken. I will send a better one in a minute.

Martin
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-01 22:37 UTC
  To: Albert Pauw, linux-raid

Hi Albert,

On 08/02/2013 12:09 AM, Martin Wilck wrote:
> On 08/01/2013 11:13 PM, Martin Wilck wrote:
>> Can you please try the attached patch?
>
> DON'T use it, it's broken. I will send a better one in a minute.

My latest patch series should fix this. The fix for your problem is 3/4,
and I added your test case as 4/4. A retest is welcome.

Martin
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-03  9:43 UTC
  To: Martin Wilck; Cc: linux-raid

Hi Martin,

I can confirm that patch 3/4 indeed fixed the problem.

Thanks for your quick fix.

Kind regards,

Albert

On 08/02/2013 12:37 AM, Martin Wilck wrote:
> My latest patch series should fix this. The fix for your problem is 3/4,
> and I added your test case as 4/4. A retest is welcome.
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-04  9:47 UTC
  To: Martin Wilck, linux-raid

Hi Martin,

I noticed another problem; it occurs with or without your patch.

Zeroed the superblocks (just in case):

    mdadm --zero-superblock /dev/loop[1-6]

Created the container:

    mdadm -CR /dev/md127 -e ddf -l container -n 5 /dev/loop[1-5]

Created an md device in the container; it used /dev/loop4 and
/dev/loop5, resynced and finished:

    mdadm -CR /dev/md0 -l raid1 -n 2 /dev/md127

I fail one of the disks, and the array rebuilds with one of the
available unused disks in the container:

    mdadm -f /dev/md0 /dev/loop4

I add another md device:

    mdadm -CR /dev/md1 -l raid5 -n 3 /dev/md127

This last one is the odd bit: it is built using the previously failed
disk. It looks like the container is not aware that /dev/loop4 has
failed, and reuses it, which is wrong. The failed status is not kept.

Regards,

Albert
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-05 16:55 UTC
  To: Martin Wilck, linux-raid, Neil Brown

Hi Neil/Martin,

I have pulled the latest git version (up to and including "Makefile:
check that 'run' directory exists"). It works, except for the last
error I noticed, quoted below.

Regards,

Albert

On 08/04/2013 11:47 AM, Albert Pauw wrote:
> I noticed another problem; it occurs with or without your patch.
> [...]
> This last one is the odd bit: it is built using the previously failed
> disk. It looks like the container is not aware that /dev/loop4 has
> failed, and reuses it, which is wrong. The failed status is not kept.
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-05 21:24 UTC
  To: Albert Pauw, Neil Brown; Cc: linux-raid

Hi Albert, Neil,

I just submitted a new patch series; patch 3/5 integrates your 2nd case
as a new unit test, and 4/5 should fix it.

However, @Neil: I am not yet entirely happy with this solution. AFAICS
there is a possible race condition here, if a disk fails and mdadm -CR
is called to create a new array before the metadata reflecting the
failure is written to disk. If a disk failure happens in one array,
mdmon will call reconcile_failed() to propagate the failure to other
already known arrays in the same container, by writing "faulty" to the
sysfs state attribute. It can't do that for a new array, though.

I thought that process_update() might need to check the kernel state of
array members against metadata state when a new VD configuration record
is received, but that's impossible because we can't call open() on the
respective sysfs files. It could be done in prepare_update(), but that
would require major changes, so I wanted to ask you first.

Another option would be changing manage_new(). But we don't seem to have
a suitable metadata handler method to pass the metadata state to the
manager...

Ideas?

Regards,
Martin
* Re: Bugreport ddf rebuild problems
  From: NeilBrown <neilb@suse.de>
  Date: 2013-08-06  0:16 UTC
  To: Martin Wilck; Cc: Albert Pauw, linux-raid

On Mon, 05 Aug 2013 23:24:28 +0200 Martin Wilck <mwilck@arcor.de> wrote:

> I just submitted a new patch series; patch 3/5 integrates your 2nd case
> as a new unit test, and 4/5 should fix it.
>
> However, @Neil: I am not yet entirely happy with this solution. AFAICS
> there is a possible race condition here, if a disk fails and mdadm -CR
> is called to create a new array before the metadata reflecting the
> failure is written to disk. [...]
>
> Ideas?

Thanks for the patches - I applied them all.

Is there a race here? When "mdadm -C" looks at the metadata, the device
will either be an active member of another array, or it will be marked
faulty. Either way mdadm won't use it.

If the first array was created to use only (say) half of each device,
and the second array was created with a size to fit in the other half
of the device, then it might get interesting: "mdadm -C" might see that
everything looks good, create the array using the second half of the
drive that has just failed, and give that info to mdmon.

I suspect that ddf_open_new (which currently looks like it is just a
stub) needs to help out here. When manage_new() gets told about a new
array, it will collect relevant info from sysfs and call ->open_new()
to make sure it matches the metadata. ddf_open_new should check that
all the devices in the array are recorded as working in the metadata.
If any are failed, it can write 'faulty' to the relevant state_fd.

Possibly the same thing can be done generically in manage_new(), as you
suggested. After the new array has been passed over to the monitor
thread, manage_new() could check whether any devices should be failed,
much like reconcile_failed() does, and just fail them.

Does that make any sense? Did I miss something?

Thanks,
NeilBrown
* Re: Bugreport ddf rebuild problems
  From: Martin Wilck <mwilck@arcor.de>
  Date: 2013-08-06 21:26 UTC
  To: NeilBrown; Cc: Albert Pauw, linux-raid

On 08/06/2013 02:16 AM, NeilBrown wrote:

> Thanks for the patches - I applied them all.

I don't see them in the public repo yet.

> Is there a race here? When "mdadm -C" looks at the metadata, the device
> will either be an active member of another array, or it will be marked
> faulty. Either way mdadm won't use it.

That's right, thanks.

> If the first array was created to use only (say) half of each device,
> and the second array was created with a size to fit in the other half
> of the device, then it might get interesting: "mdadm -C" might see that
> everything looks good, create the array using the second half of the
> drive that has just failed, and give that info to mdmon.

Yes, I have created a test case for this (10ddf-fail-create-race) which
I am going to submit soon.

> I suspect that ddf_open_new (which currently looks like it is just a
> stub) needs to help out here.

Great idea; I made an implementation. I found that I needed to freeze
the array in Create(), too, to avoid the kernel starting a rebuild
before mdmon has checked the correctness of the new array. Please
review that; I'm not 100% positive I got it right.

> Possibly the same thing can be done generically in manage_new(), as you
> suggested. [...]
>
> Does that make any sense? Did I miss something?

It makes a lot of sense.

While testing, I found another minor problem case:

1. A disk fails in an array taking half the size.
2. mdmon activates a spare.
3. mdadm -C is called, finds old metadata, and allocates an extent at
   offset 0 on the spare.
4. Create() gets an error writing to the "size" sysfs attribute,
   because offset 0 has already been grabbed by the spare recovery.

That's not too bad, after all, because the array won't be created. The
user just needs to re-issue the mdadm -C command, which will now
succeed because the metadata should have been written to disk in the
meantime.

That said, some kind of locking between mdadm and mdmon (mdadm won't
read metadata as long as mdmon is busy writing it) might be desirable.
It would be even better to do all metadata operations through mdmon,
with mdadm just sending messages to it. That would be a major
architectural change for mdadm, but it would avoid this kind of
"different metadata here and there" problem altogether.

Thanks,
Martin
* Re: Bugreport ddf rebuild problems
  From: Albert Pauw <albert.pauw@gmail.com>
  Date: 2013-08-07 18:07 UTC
  To: Martin Wilck, Neil Brown, linux-raid

Just to let you guys know that I pulled the latest git version
(including "DDF: Write new conf entries with a single write"). As of
now, it all seems to work rather nicely.

Thanks guys,

Albert

On 08/06/2013 11:26 PM, Martin Wilck wrote:
> Great idea; I made an implementation. I found that I needed to freeze
> the array in Create(), too, to avoid the kernel starting a rebuild
> before mdmon has checked the correctness of the new array.
> [rest of the race discussion trimmed]
* Re: Bugreport ddf rebuild problems
  From: NeilBrown <neilb@suse.de>
  Date: 2013-08-08  0:40 UTC
  To: Martin Wilck; Cc: Albert Pauw, linux-raid

On Tue, 06 Aug 2013 23:26:30 +0200 Martin Wilck <mwilck@arcor.de> wrote:

> While testing, I found another minor problem case:
>
> 1. A disk fails in an array taking half the size.
> 2. mdmon activates a spare.
> 3. mdadm -C is called, finds old metadata, and allocates an extent at
>    offset 0 on the spare.
> 4. Create() gets an error writing to the "size" sysfs attribute,
>    because offset 0 has already been grabbed by the spare recovery.
>
> That said, some kind of locking between mdadm and mdmon (mdadm won't
> read metadata as long as mdmon is busy writing it) might be desirable.

I think the way this locking "should" happen is with an O_EXCL lock on
the container. mdadm and mdmon already do this to some extent, but it
is hard to find :-)

The most obvious usage is that mdmon insists on getting an O_EXCL open
before it will exit. --create also gets an O_EXCL open, so mdmon cannot
exit while a new array might be being created. See "When creating a
member, we need to be careful" in Create.c and "No interesting arrays,
or we have been told to" in monitor.c.

So this race could be closed by the manager getting an O_EXCL open on
the container before performing spare management.

Thanks,
NeilBrown