* [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble @ 2011-10-21 13:33 Jes.Sorensen 2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen ` (2 more replies) 0 siblings, 3 replies; 11+ messages in thread From: Jes.Sorensen @ 2011-10-21 13:33 UTC (permalink / raw) To: neilb; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford From: Jes Sorensen <Jes.Sorensen@redhat.com> Hi, When I posted this yesterday I missed a couple of exist cases in the patch changing the locking scheme for Incremental_container(). Not sure how I managed that, but fixed in v2 of the patch. The other two patches remain onchanged. This patch set fixes the problem with mdadm racing udev during during incremental or regular assembly of IMSM raids. There are three patches to the set: 1) Hold the lock while running Incremental_container() to avoid udev (or someone else like a boot script) kicking off an additional mdadm instance to try and assemble the container in parallel. 2) Don't send udev a 'change' event to handle incremental assembly since we are about to do it ourselves anyway. 3) Hold the lock during Assemble() again to avoid udev kicking in behind our backs. With these patches in place, I can no longer reproduce the case I was seeing on Fedora 16 Beta where an array would not come up correctly. Patches are on top of Neil's current git repository. Comments? Cheers, Jes Jes Sorensen (3): Remove race for starting container devices. Don't tell sysfs to launch the container as we are doing it ourselves Hold the map lock while performing Assemble to avoid races with udev Incremental.c | 30 ++++++++++++++---------------- mdadm.c | 6 ++++++ 2 files changed, 20 insertions(+), 16 deletions(-) -- 1.7.4.4 ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/3] Remove race for starting container devices. 2011-10-21 13:33 [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble Jes.Sorensen @ 2011-10-21 13:33 ` Jes.Sorensen 2011-10-22 0:49 ` NeilBrown 2011-10-21 13:33 ` [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves Jes.Sorensen 2011-10-21 13:33 ` [PATCH 3/3] Hold the map lock while performing Assemble to avoid races with udev Jes.Sorensen 2 siblings, 1 reply; 11+ messages in thread From: Jes.Sorensen @ 2011-10-21 13:33 UTC (permalink / raw) To: neilb; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford From: Jes Sorensen <Jes.Sorensen@redhat.com> This moves the lock handling out of Incremental_container() and relies on the caller holding the lock. This prevents conflict with a follow-on mdadm comment which may try and launch the device in parallel. Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> --- Incremental.c | 29 ++++++++++++++--------------- 1 files changed, 14 insertions(+), 15 deletions(-) diff --git a/Incremental.c b/Incremental.c index b90089b..1a9a97b 100644 --- a/Incremental.c +++ b/Incremental.c @@ -139,10 +139,16 @@ int Incremental(char *devname, int verbose, int runstop, rv = st->ss->load_container(st, dfd, NULL); close(dfd); - if (!rv && st->ss->container_content) - return Incremental_container(st, devname, homehost, - verbose, runstop, autof, - freeze_reshape); + if (!rv && st->ss->container_content) { + if (map_lock(&map)) + fprintf(stderr, Name ": failed to get " + "exclusive lock on mapfile\n"); + rv = Incremental_container(st, devname, homehost, + verbose, runstop, autof, + freeze_reshape); + map_unlock(&map); + return rv; + } fprintf(stderr, Name ": %s is not part of an md array.\n", devname); @@ -440,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop, if (info.array.level == LEVEL_CONTAINER) { int devnum = devnum; /* defined and used iff ->external */ /* Try to assemble within the container */ - map_unlock(&map); sysfs_uevent(&info, "change"); if (verbose >= 0) fprintf(stderr, Name @@ -451,9 +456,10 @@ int Incremental(char *devname, int verbose, int runstop, devnum = fd2devnum(mdfd); close(mdfd); sysfs_free(sra); - rv = Incremental(chosen_name, verbose, runstop, - NULL, homehost, require_homehost, autof, - freeze_reshape); + rv = Incremental_container(st, chosen_name, homehost, + verbose, runstop, autof, + freeze_reshape); + map_unlock(&map); if (rv == 1) /* Don't fail the whole -I if a subarray didn't * have enough devices to start yet @@ -1463,16 +1469,12 @@ static int Incremental_container(struct supertype *st, char *devname, trustworthy = FOREIGN; list = st->ss->container_content(st, NULL); - if (map_lock(&map)) - fprintf(stderr, Name ": failed to get exclusive lock on " - "mapfile\n"); /* do not assemble arrays that might have bad blocks */ if (list && list->array.state & (1<<MD_SB_BBM_ERRORS)) { fprintf(stderr, Name ": BBM log found in metadata. " "Cannot activate array(s).\n"); /* free container data and exit */ sysfs_free(list); - map_unlock(&map); return 2; } @@ -1536,7 +1538,6 @@ static int Incremental_container(struct supertype *st, char *devname, fprintf(stderr, Name ": array %s/%s is " "explicitly ignored by mdadm.conf\n", match->container, match->member); - map_unlock(&map); return 2; } if (match) @@ -1552,7 +1553,6 @@ static int Incremental_container(struct supertype *st, char *devname, if (mdfd < 0) { fprintf(stderr, Name ": failed to open %s: %s.\n", chosen_name, strerror(errno)); - map_unlock(&map); return 2; } @@ -1607,7 +1607,6 @@ static int Incremental_container(struct supertype *st, char *devname, close(sfd); } domain_free(domains); - map_unlock(&map); return 0; } -- 1.7.4.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] Remove race for starting container devices. 2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen @ 2011-10-22 0:49 ` NeilBrown 2011-10-22 8:22 ` Jes Sorensen 0 siblings, 1 reply; 11+ messages in thread From: NeilBrown @ 2011-10-22 0:49 UTC (permalink / raw) To: Jes.Sorensen; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford [-- Attachment #1: Type: text/plain, Size: 1659 bytes --] On Fri, 21 Oct 2011 15:33:18 +0200 Jes.Sorensen@redhat.com wrote: > From: Jes Sorensen <Jes.Sorensen@redhat.com> > > This moves the lock handling out of Incremental_container() and relies > on the caller holding the lock. This prevents conflict with a > follow-on mdadm comment which may try and launch the device in > parallel. > > Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> Hi Jes, I have applied the 7 patches you sent. However I have question mark over part of this one: > @@ -451,9 +456,10 @@ int Incremental(char *devname, int verbose, int runstop, > devnum = fd2devnum(mdfd); > close(mdfd); > sysfs_free(sra); > - rv = Incremental(chosen_name, verbose, runstop, > - NULL, homehost, require_homehost, autof, > - freeze_reshape); > + rv = Incremental_container(st, chosen_name, homehost, > + verbose, runstop, autof, > + freeze_reshape); > + map_unlock(&map); > if (rv == 1) > /* Don't fail the whole -I if a subarray didn't > * have enough devices to start yet You have replaced a call to Incremental with a call to Incremental_container. I feel that is significant enough that I would have liked to have seen it discussed in the changelog comment. You seem have have open-coded Incremental and then removed all the bits that you don't need in this case - which is that vast majority of it. But you didn't include the call to ->load_content(). So I have put it back in because I'm sure it must be harmless (Because it was there already) but it may not be needed. If you have thoughts on this, please let me know. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] Remove race for starting container devices. 2011-10-22 0:49 ` NeilBrown @ 2011-10-22 8:22 ` Jes Sorensen 0 siblings, 0 replies; 11+ messages in thread From: Jes Sorensen @ 2011-10-22 8:22 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford On 10/22/11 02:49, NeilBrown wrote: > Hi Jes, > > I have applied the 7 patches you sent. > > However I have question mark over part of this one: > > >> > @@ -451,9 +456,10 @@ int Incremental(char *devname, int verbose, int runstop, >> > devnum = fd2devnum(mdfd); >> > close(mdfd); >> > sysfs_free(sra); >> > - rv = Incremental(chosen_name, verbose, runstop, >> > - NULL, homehost, require_homehost, autof, >> > - freeze_reshape); >> > + rv = Incremental_container(st, chosen_name, homehost, >> > + verbose, runstop, autof, >> > + freeze_reshape); >> > + map_unlock(&map); >> > if (rv == 1) >> > /* Don't fail the whole -I if a subarray didn't >> > * have enough devices to start yet > You have replaced a call to Incremental with a call to Incremental_container. > I feel that is significant enough that I would have liked to have seen it > discussed in the changelog comment. > > You seem have have open-coded Incremental and then removed all the bits that > you don't need in this case - which is that vast majority of it. > But you didn't include the call to ->load_content(). > > So I have put it back in because I'm sure it must be harmless (Because it was > there already) but it may not be needed. > > If you have thoughts on this, please let me know. Hi Neil, There is a perfectly good explanation for this: The reason is that the guy who wrote the initial patch is an idiot! In fact, said idiot spent most of Friday pulling his hairs out trying to figure out why the modified mdadm would boot fine on a systemd based system but fail on a sysvinit based system. The result being the patch bombing you were exposed to yesterday :) Thanks for catching this, much appreciated! Jes ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves 2011-10-21 13:33 [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble Jes.Sorensen 2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen @ 2011-10-21 13:33 ` Jes.Sorensen 2011-12-22 23:48 ` NeilBrown 2011-10-21 13:33 ` [PATCH 3/3] Hold the map lock while performing Assemble to avoid races with udev Jes.Sorensen 2 siblings, 1 reply; 11+ messages in thread From: Jes.Sorensen @ 2011-10-21 13:33 UTC (permalink / raw) To: neilb; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford From: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> --- Incremental.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/Incremental.c b/Incremental.c index 1a9a97b..cff7663 100644 --- a/Incremental.c +++ b/Incremental.c @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop, if (info.array.level == LEVEL_CONTAINER) { int devnum = devnum; /* defined and used iff ->external */ /* Try to assemble within the container */ - sysfs_uevent(&info, "change"); if (verbose >= 0) fprintf(stderr, Name ": container %s now has %d devices\n", -- 1.7.4.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves 2011-10-21 13:33 ` [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves Jes.Sorensen @ 2011-12-22 23:48 ` NeilBrown 2012-01-03 10:24 ` Jes Sorensen 0 siblings, 1 reply; 11+ messages in thread From: NeilBrown @ 2011-12-22 23:48 UTC (permalink / raw) To: Jes.Sorensen; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford [-- Attachment #1: Type: text/plain, Size: 1121 bytes --] On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote: > From: Jes Sorensen <Jes.Sorensen@redhat.com> > > Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> > --- > Incremental.c | 1 - > 1 files changed, 0 insertions(+), 1 deletions(-) > > diff --git a/Incremental.c b/Incremental.c > index 1a9a97b..cff7663 100644 > --- a/Incremental.c > +++ b/Incremental.c > @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop, > if (info.array.level == LEVEL_CONTAINER) { > int devnum = devnum; /* defined and used iff ->external */ > /* Try to assemble within the container */ > - sysfs_uevent(&info, "change"); > if (verbose >= 0) > fprintf(stderr, Name > ": container %s now has %d devices\n", Hi Jes, I've had to revert this patch as it causes a regression. Without the 'change' event the container device doesn't get created in /dev. I'm not sure udev should be running "--incremental" on containers anyway, but if it does it shouldn't cause problems should it? If it does we should fix those problems. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves 2011-12-22 23:48 ` NeilBrown @ 2012-01-03 10:24 ` Jes Sorensen 2012-01-03 15:54 ` Doug Ledford 0 siblings, 1 reply; 11+ messages in thread From: Jes Sorensen @ 2012-01-03 10:24 UTC (permalink / raw) To: NeilBrown; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford On 12/23/11 00:48, NeilBrown wrote: > On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote: > >> From: Jes Sorensen <Jes.Sorensen@redhat.com> >> >> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> >> --- >> Incremental.c | 1 - >> 1 files changed, 0 insertions(+), 1 deletions(-) >> >> diff --git a/Incremental.c b/Incremental.c >> index 1a9a97b..cff7663 100644 >> --- a/Incremental.c >> +++ b/Incremental.c >> @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop, >> if (info.array.level == LEVEL_CONTAINER) { >> int devnum = devnum; /* defined and used iff ->external */ >> /* Try to assemble within the container */ >> - sysfs_uevent(&info, "change"); >> if (verbose >= 0) >> fprintf(stderr, Name >> ": container %s now has %d devices\n", > > Hi Jes, > I've had to revert this patch as it causes a regression. > > Without the 'change' event the container device doesn't get created in /dev. > > I'm not sure udev should be running "--incremental" on containers anyway, but > if it does it shouldn't cause problems should it? If it does we should fix > those problems. Hi Neil, I am wary of this being reverted as it fixed a genuine race condition. On Fedora we don't have the problem with the container device not being created, could it be that your udev scripts are not doing the right thing in this case? Cheers, Jes Thanks for the heads up! ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves 2012-01-03 10:24 ` Jes Sorensen @ 2012-01-03 15:54 ` Doug Ledford 2012-01-04 2:34 ` NeilBrown 0 siblings, 1 reply; 11+ messages in thread From: Doug Ledford @ 2012-01-03 15:54 UTC (permalink / raw) To: Jes Sorensen; +Cc: NeilBrown, linux-raid, lukasz.dorau, adam.kwolek [-- Attachment #1.1: Type: text/plain, Size: 2853 bytes --] On 01/03/2012 05:24 AM, Jes Sorensen wrote: > On 12/23/11 00:48, NeilBrown wrote: >> On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote: >> >>> From: Jes Sorensen <Jes.Sorensen@redhat.com> >>> >>> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> >>> --- >>> Incremental.c | 1 - >>> 1 files changed, 0 insertions(+), 1 deletions(-) >>> >>> diff --git a/Incremental.c b/Incremental.c >>> index 1a9a97b..cff7663 100644 >>> --- a/Incremental.c >>> +++ b/Incremental.c >>> @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop, >>> if (info.array.level == LEVEL_CONTAINER) { >>> int devnum = devnum; /* defined and used iff ->external */ >>> /* Try to assemble within the container */ >>> - sysfs_uevent(&info, "change"); >>> if (verbose >= 0) >>> fprintf(stderr, Name >>> ": container %s now has %d devices\n", >> >> Hi Jes, >> I've had to revert this patch as it causes a regression. >> >> Without the 'change' event the container device doesn't get created in /dev. >> >> I'm not sure udev should be running "--incremental" on containers anyway, but >> if it does it shouldn't cause problems should it? If it does we should fix >> those problems. > > Hi Neil, > > I am wary of this being reverted as it fixed a genuine race condition. > On Fedora we don't have the problem with the container device not being > created, could it be that your udev scripts are not doing the right > thing in this case? I think it's probably worthwhile for us to do a udev rules file comparison. So, I'm attaching the patch we apply to the udev rules file in your distribution before installing it, as also the rules file that handles incremental assembly on Fedora/RHEL. Look them over and see if you want to include any of the things we have there in your version. As to the question of udev running incremental on containers, I think it's fair to say that either udev or mdadm should be doing so, and not both. Whether it should be one or the other is up for debate I think. Doing it in mdadm where this patch pulls it out means that it only happens on add of a new device using incremental assembly. The udev version does it on a change event on the container. The question I have is, if you had mdmon migrate a spare into a container, would it trigger this code in mdadm versus would it trigger the code in the udev rules file? Is there any situation where a change triggered via something other than incremental assembly would result in us needing to run incremental on the container? If the answer is yes, then I would recommend doing things in the udev file versus in the incremental function of mdadm. -- Doug Ledford <dledford@redhat.com> GPG KeyID: 0E572FDD http://people.redhat.com/dledford [-- Attachment #1.2: mdadm.rules --] [-- Type: text/plain, Size: 3416 bytes --] # This file causes block devices with Linux RAID (mdadm) signatures to # automatically cause mdadm to be run. # See udev(8) for syntax # Don't process any events if anaconda is running as anaconda brings up # raid devices manually ENV{ANACONDA}=="?*", GOTO="md_end" # Also don't process disks that are slated to be a multipath device ENV{DM_MULTIPATH_DEVICE_PATH}=="?*", GOTO="md_end" # We process add events on block devices (since they are ready as soon as # they are added to the system), but we must process change events as well # on any dm devices (like LUKS partitions or LVM logical volumes) and on # md devices because both of these first get added, then get brought live # and trigger a change event. The reason we don't process change events # on bare hard disks is because if you stop all arrays on a disk, then # run fdisk on the disk to change the partitions, when fdisk exits it # triggers a change event, and we want to wait until all the fdisks on # all member disks are done before we do anything. Unfortunately, we have # no way of knowing that, so we just have to let those arrays be brought # up manually after fdisk has been run on all of the disks. # First, process all add events (md and dm devices will not really do # anything here, just regular disks, and this also won't get any imsm # array members either) SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \ RUN+="/sbin/mdadm -I $env{DEVNAME}" SUBSYSTEM=="block", ACTION=="remove", ENV{ID_FS_TYPE}=="linux_raid_member", \ RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" # Next, check to make sure the BIOS raid stuff wasn't turned off via cmdline IMPORT{cmdline}="noiswmd" IMPORT{cmdline}="nodmraid" ENV{noiswmd}=="?*", GOTO="md_imsm_inc_end" ENV{nodmraid}=="?*", GOTO="md_imsm_inc_end" SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="isw_raid_member", \ RUN+="/sbin/mdadm -I $env{DEVNAME}" SUBSYSTEM=="block", ACTION=="remove", ENV{ID_FS_TYPE}=="isw_raid_member", \ RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" LABEL="md_imsm_inc_end" # Next make sure that this isn't a dm device we should skip for some reason ENV{DM_UDEV_RULES_VSN}!="?*", GOTO="dm_change_end" ENV{DM_UDEV_DISABLE_DISK_RULES_FLAG}=="1", GOTO="dm_change_end" ENV{DM_SUSPENDED}=="1", GOTO="dm_change_end" KERNEL=="dm-*", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="linux_raid_member", \ ACTION=="change", RUN+="/sbin/mdadm -I $env{DEVNAME}" LABEL="dm_change_end" # Finally catch any nested md raid arrays. If we brought up an md raid # array that's part of another md raid array, it won't be ready to be used # until the change event that occurs when it becomes live KERNEL=="md*", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="linux_raid_member", \ ACTION=="change", RUN+="/sbin/mdadm -I $env{DEVNAME}" # In case the initramfs only started some of the arrays in our container, # run incremental assembly on the container itself. Note: we ran mdadm # on the container in 64-md-raid.rules, and that's how the MD_LEVEL # environment variable is already set. If that disappears from the other # file, we will need to add this line into the middle of the next rule: # IMPORT{program}="/sbin/mdadm -D --export $tempnode", \ SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md*", \ ENV{MD_LEVEL}=="container", RUN+="/sbin/mdadm -I $env{DEVNAME}" LABEL="md_end" [-- Attachment #1.3: mdadm-3.1.3-udev.patch --] [-- Type: text/plain, Size: 1300 bytes --] --- mdadm-3.2.1/udev-md-raid.rules.udev 2011-03-27 22:31:20.000000000 -0400 +++ mdadm-3.2.1/udev-md-raid.rules 2011-03-28 10:14:26.047232843 -0400 @@ -2,11 +2,13 @@ SUBSYSTEM!="block", GOTO="md_end" +# In Fedora we handle the raid components in 65-md-incremental.rules so that +# we can do things like honor anaconda command line options and such # handle potential components of arrays -ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" -ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}" -ENV{ID_FS_TYPE}=="isw_raid_member", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" -ENV{ID_FS_TYPE}=="isw_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}" +#ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" +#ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}" +#ENV{ID_FS_TYPE}=="isw_raid_member", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}" +#ENV{ID_FS_TYPE}=="isw_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}" # handle md arrays ACTION!="add|change", GOTO="md_end" [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 900 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves 2012-01-03 15:54 ` Doug Ledford @ 2012-01-04 2:34 ` NeilBrown 2012-01-06 18:35 ` Doug Ledford 0 siblings, 1 reply; 11+ messages in thread From: NeilBrown @ 2012-01-04 2:34 UTC (permalink / raw) To: Doug Ledford; +Cc: Jes Sorensen, linux-raid, lukasz.dorau, adam.kwolek [-- Attachment #1: Type: text/plain, Size: 6337 bytes --] On Tue, 03 Jan 2012 10:54:41 -0500 Doug Ledford <dledford@redhat.com> wrote: > On 01/03/2012 05:24 AM, Jes Sorensen wrote: > > On 12/23/11 00:48, NeilBrown wrote: > >> On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote: > >> > >>> From: Jes Sorensen <Jes.Sorensen@redhat.com> > >>> > >>> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> > >>> --- > >>> Incremental.c | 1 - > >>> 1 files changed, 0 insertions(+), 1 deletions(-) > >>> > >>> diff --git a/Incremental.c b/Incremental.c > >>> index 1a9a97b..cff7663 100644 > >>> --- a/Incremental.c > >>> +++ b/Incremental.c > >>> @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop, > >>> if (info.array.level == LEVEL_CONTAINER) { > >>> int devnum = devnum; /* defined and used iff ->external */ > >>> /* Try to assemble within the container */ > >>> - sysfs_uevent(&info, "change"); > >>> if (verbose >= 0) > >>> fprintf(stderr, Name > >>> ": container %s now has %d devices\n", > >> > >> Hi Jes, > >> I've had to revert this patch as it causes a regression. > >> > >> Without the 'change' event the container device doesn't get created in /dev. > >> > >> I'm not sure udev should be running "--incremental" on containers anyway, but > >> if it does it shouldn't cause problems should it? If it does we should fix > >> those problems. > > > > Hi Neil, > > > > I am wary of this being reverted as it fixed a genuine race condition. > > On Fedora we don't have the problem with the container device not being > > created, could it be that your udev scripts are not doing the right > > thing in this case? > > I think it's probably worthwhile for us to do a udev rules file > comparison. So, I'm attaching the patch we apply to the udev rules file > in your distribution before installing it, as also the rules file that > handles incremental assembly on Fedora/RHEL. Look them over and see if > you want to include any of the things we have there in your version. > > As to the question of udev running incremental on containers, I think > it's fair to say that either udev or mdadm should be doing so, and not > both. Whether it should be one or the other is up for debate I think. > Doing it in mdadm where this patch pulls it out means that it only > happens on add of a new device using incremental assembly. The udev > version does it on a change event on the container. The question I have > is, if you had mdmon migrate a spare into a container, would it trigger > this code in mdadm versus would it trigger the code in the udev rules > file? Is there any situation where a change triggered via something > other than incremental assembly would result in us needing to run > incremental on the container? If the answer is yes, then I would > recommend doing things in the udev file versus in the incremental > function of mdadm. > > Nothing's ever easy..... I just tried to reproduce the problem I had which lead me to revert that patch, but mdadm didn't behave as I expected. So now I'm running my test suite with the revert removed to see what happens. However I think that if triggering a change event at that point (or at any point) causes a problem, then the problem was already there and needs to be fixed. If there is a race as you say (and I suspect you are right) then we need to put in appropriate locking to make sure the race doesn't cause a problem. Just removing the code that triggers the race isn't really enough. I think that *both* mdadm and udev should be trying to assemble the contents of a container when the container changes. I'm not convinced that everyone will be running udev though most probably will. Again: if this is racy we should identify and lock-out the races. I don't see how your 'migrate a spare' scenario is relevant but I've probably missed something. The scenarios we have been talking about are for assembling a device once all the required components become available. A 'spare' cannot possibly be a required component so it doesn't seem relevant. When a spare is migrated in, mdmon should notice and take whatever action is necessary. This is quite separate from mdadm or udev. And I've found out the details of the regression. It is another race :-( It particularly affects DDF as DDF containers can have a name. When you "mdadm -I /dev/thing" the first device in a DDF container it will firstly "set_array_info" to set up the array, then add the info to the 'map' file. The set_array_info (or probably the "open" before that) will trigger the "ADD" uevent which will run mdadm which should create the /dev/md/WHATEVER link. However if this happens before mdadm updates the 'map' file it won't be able to create the link. So we need to trigger a 'change' event after updating the map file. There are probably other ways to fix this. We could maybe update the map file before creating the array device, but that feels clumsy - error handling would be messy. We could hold the mapfile locked from before creating the device to after updating the map file. Then make sure the mdadm that is run from udev waits for the lock before looking for the name of the array. That is probably better. In any case I think we need to give some serious thought to locking to avoid races between different mdadm. We have some in place already - just need to make sure it is used consistently. Some of the changes you have made to the udev.rules file look worth while. Having two separate files is probably a good idea - one for devices which are array and one for devices which might be members of arrays. The various options for skipping things look a bit painful. i.e. we test for ANACONDA DM_MULTIPATH_DEVICE_PATH noiswmd nodmraid The last two are probably OK. The first two I would rather see something else set a generic "Don't include this device in anything - I've got it under control" setting, and then the mdadm rules file just checks that and ignores as appropriate. It would be good if everything in the rules file in the main distribution were equally appropriate for all distros, and that any change a distro wanted to make could be done in a separate file. Thanks, NeilBrown [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 828 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves 2012-01-04 2:34 ` NeilBrown @ 2012-01-06 18:35 ` Doug Ledford 0 siblings, 0 replies; 11+ messages in thread From: Doug Ledford @ 2012-01-06 18:35 UTC (permalink / raw) To: NeilBrown; +Cc: Jes Sorensen, linux-raid, lukasz.dorau, adam.kwolek [-- Attachment #1: Type: text/plain, Size: 11280 bytes --] On 01/03/2012 09:34 PM, NeilBrown wrote: > On Tue, 03 Jan 2012 10:54:41 -0500 Doug Ledford <dledford@redhat.com> wrote: > >> On 01/03/2012 05:24 AM, Jes Sorensen wrote: >>> On 12/23/11 00:48, NeilBrown wrote: >>>> On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote: >>>> >>>>> From: Jes Sorensen <Jes.Sorensen@redhat.com> >>>>> >>>>> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> >>>>> --- >>>>> Incremental.c | 1 - >>>>> 1 files changed, 0 insertions(+), 1 deletions(-) >>>>> >>>>> diff --git a/Incremental.c b/Incremental.c >>>>> index 1a9a97b..cff7663 100644 >>>>> --- a/Incremental.c >>>>> +++ b/Incremental.c >>>>> @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop, >>>>> if (info.array.level == LEVEL_CONTAINER) { >>>>> int devnum = devnum; /* defined and used iff ->external */ >>>>> /* Try to assemble within the container */ >>>>> - sysfs_uevent(&info, "change"); >>>>> if (verbose >= 0) >>>>> fprintf(stderr, Name >>>>> ": container %s now has %d devices\n", >>>> >>>> Hi Jes, >>>> I've had to revert this patch as it causes a regression. >>>> >>>> Without the 'change' event the container device doesn't get created in /dev. >>>> >>>> I'm not sure udev should be running "--incremental" on containers anyway, but >>>> if it does it shouldn't cause problems should it? If it does we should fix >>>> those problems. >>> >>> Hi Neil, >>> >>> I am wary of this being reverted as it fixed a genuine race condition. >>> On Fedora we don't have the problem with the container device not being >>> created, could it be that your udev scripts are not doing the right >>> thing in this case? >> >> I think it's probably worthwhile for us to do a udev rules file >> comparison. So, I'm attaching the patch we apply to the udev rules file >> in your distribution before installing it, as also the rules file that >> handles incremental assembly on Fedora/RHEL. Look them over and see if >> you want to include any of the things we have there in your version. >> >> As to the question of udev running incremental on containers, I think >> it's fair to say that either udev or mdadm should be doing so, and not >> both. Whether it should be one or the other is up for debate I think. >> Doing it in mdadm where this patch pulls it out means that it only >> happens on add of a new device using incremental assembly. The udev >> version does it on a change event on the container. The question I have >> is, if you had mdmon migrate a spare into a container, would it trigger >> this code in mdadm versus would it trigger the code in the udev rules >> file? Is there any situation where a change triggered via something >> other than incremental assembly would result in us needing to run >> incremental on the container? If the answer is yes, then I would >> recommend doing things in the udev file versus in the incremental >> function of mdadm. >> >> > > Nothing's ever easy..... > > I just tried to reproduce the problem I had which lead me to revert that > patch, but mdadm didn't behave as I expected. Don't you hate it when that happens ;-) > So now I'm running my test > suite with the revert removed to see what happens. > > However I think that if triggering a change event at that point (or at > any point) causes a problem, then the problem was already there and needs to > be fixed. If there is a race as you say (and I suspect you are right) then > we need to put in appropriate locking to make sure the race doesn't cause a > problem. Just removing the code that triggers the race isn't really enough. Agreed. > I think that *both* mdadm and udev should be trying to assemble the contents > of a container when the container changes. I'm not convinced that everyone > will be running udev though most probably will. Again: if this is racy we > should identify and lock-out the races. Agreed. So, I'm going to take this opportunity to make a suggestion. The md map file has shown to be racy over and over again. There are races on incremental assembly, I had to solve races in the udev assembly from the initramfs or else devices belonging to the same array could end up in multiple partially assembled arrays, etc. In the interest of full disclosure, I always thought the md map file was a solution to a non-existent problem (a belief that is only getting stronger by the problem BTW). Therefore, I suggest we do away with the map file entirely. It isn't really necessary and it causes a number of issues (the map file is the only thing that the mdadm binary uses that has to be available from the initramfs...mdmon would still need pre-init love, but not mdadm). What's wrong with the map file? It's needlessly redundant. We don't need to track what arrays we are processing in a file, they are already tracked in the proper place: the kernel. We simply need to look at existing arrays to determine if a new device can be added to an existing array or if we need to create a new one. I propose that mdadm be modified so that on incremental assembly we take a non-exclusive open of /proc/mdstat, compare the uuid of our component device to the uuid of the already existing arrays, if it matches then add the device to the existing array, if none matches then transition to an exclusive open of /proc/mdstat (this is our locking to protect against multiple creations at the same time), once we get the exclusive open rescan the file in case a new array was added between our initial scan and the exclusive open, if there is a new array that matches our uuid (multiple mdadms racing to create the array) then add our device to that array, else create a new array with our device as the first device, drop the exclusive open on /proc/mdstat, do container incremental if needed, then exit. Under that method of operation, we never have to worry about name mapping as we will use the uuid exclusively except when creating the array, so we only have to figure the name out once and then from that point on any device with a uuid that matches the created array will just get added to that array. We might have to modify monitor mode to not hold /proc/mdstat open all the time to make this work though. I would actually prefer some other file besides /proc/mdstat, but I was hoping to use a file that already exists and not one that would have to be created on bootup. Worse case scenario though, we create a lock file for this specific purpose. > I don't see how your 'migrate a spare' scenario is relevant but I've probably > missed something. It was just a thought exercise. The point being: is there any place other than incremental assembly that might cause a transition from unrunnable to runnable and therefore should do incremental on the container. > The scenarios we have been talking about are for assembling a device once all > the required components become available. A 'spare' cannot possibly be a > required component so it doesn't seem relevant. > When a spare is migrated in, mdmon should notice and take whatever action is > necessary. This is quite separate from mdadm or udev. > > And I've found out the details of the regression. It is another race :-( > > It particularly affects DDF as DDF containers can have a name. > > When you "mdadm -I /dev/thing" the first device in a DDF container it will > firstly "set_array_info" to set up the array, then add the info to the 'map' > file. > > The set_array_info (or probably the "open" before that) will trigger the > "ADD" uevent which will run mdadm which should create the /dev/md/WHATEVER > link. > However if this happens before mdadm updates the 'map' file it won't be able > to create the link. Again, map file bad, kernel is authoritative and should be all we need. > So we need to trigger a 'change' event after updating the map file. > > There are probably other ways to fix this. We could maybe update the map > file before creating the array device, but that feels clumsy - error handling > would be messy. > > We could hold the mapfile locked from before creating the device to after > updating the map file. Then make sure the mdadm that is run from udev waits > for the lock before looking for the name of the array. That is probably > better. Or make sure the kernel ioctl for setting the name of the array issues a udev event once the name is set and not before and then create the links based on the kernel data, no need for the map file. > In any case I think we need to give some serious thought to locking to avoid > races between different mdadm. We have some in place already - just need to > make sure it is used consistently. > > > Some of the changes you have made to the udev.rules file look worth while. > Having two separate files is probably a good idea - one for devices which are > array and one for devices which might be members of arrays. > > The various options for skipping things look a bit painful. > i.e. we test for > ANACONDA That's obviously Red Hat specific (well, Red Hat and derivatives). > DM_MULTIPATH_DEVICE_PATH This is LVM generic and so applies to any distro. > noiswmd > nodmraid These two are dracut specific. I don't know which distros are or will pick up dracut, so I don't know how generic this really is. > The last two are probably OK. Cool. > The first two I would rather see something else > set a generic "Don't include this device in anything - I've got it > under control" setting, and then the mdadm rules file just checks that and > ignores as appropriate. I dunno, the Anaconda item and the LVM item really mean two distinct things. In the anaconda case, we aren't ever going to do anything. The anaconda environment setting is global and will be present on all devices we ever look at. The LVM flag is device specific. We are going to see this exact device again, but only after LVM has created the necessary multipath device and we are scanning the multipath device. And our actions in these two cases are pretty unique. For instance, just because we shouldn't process the multipath device doesn't mean the SCSI block device rules shouldn't be run. So it's not like a global "don't touch this device" flag is really even appropriate, we would need a global "md raid don't touch this device" flag instead, at which point you might as well just do it ourselves like we are now. > It would be good if everything in the rules file in the main distribution > were equally appropriate for all distros, and that any change a distro wanted > to make could be done in a separate file. Well, equally applicable to all distros, or at least non-intrusive for other distros. For example, it would be possible to include the various tests in the udev rule file you distribute because even if other distros don't use noiwsraid or nodmraid or any of the others, lack of the item does not negatively affect operation of the rules file. -- Doug Ledford <dledford@redhat.com> GPG KeyID: 0E572FDD http://people.redhat.com/dledford [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 900 bytes --] ^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 3/3] Hold the map lock while performing Assemble to avoid races with udev 2011-10-21 13:33 [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble Jes.Sorensen 2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen 2011-10-21 13:33 ` [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves Jes.Sorensen @ 2011-10-21 13:33 ` Jes.Sorensen 2 siblings, 0 replies; 11+ messages in thread From: Jes.Sorensen @ 2011-10-21 13:33 UTC (permalink / raw) To: neilb; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford From: Jes Sorensen <Jes.Sorensen@redhat.com> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com> --- mdadm.c | 6 ++++++ 1 files changed, 6 insertions(+), 0 deletions(-) diff --git a/mdadm.c b/mdadm.c index ebf1c46..56de7b7 100644 --- a/mdadm.c +++ b/mdadm.c @@ -1279,6 +1279,7 @@ int main(int argc, char *argv[]) } else { struct mddev_ident *a, *array_list = conf_get_ident(NULL); struct mddev_dev *devlist = conf_get_devs(); + struct map_ent *map = NULL; int cnt = 0; int failures, successes; if (devlist == NULL) { @@ -1298,6 +1299,10 @@ int main(int argc, char *argv[]) if (a->autof == 0) a->autof = autof; } + if (map_lock(&map)) + fprintf(stderr, Name " %s: failed to get " + "exclusive lock on mapfile\n", + __func__); do { failures = 0; successes = 0; @@ -1364,6 +1369,7 @@ int main(int argc, char *argv[]) fprintf(stderr, Name ": No arrays found in config file\n"); rv = 1; } + map_unlock(&map); } break; case BUILD: -- 1.7.4.4 ^ permalink raw reply related [flat|nested] 11+ messages in thread
end of thread, other threads:[~2012-01-06 18:35 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-10-21 13:33 [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble Jes.Sorensen 2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen 2011-10-22 0:49 ` NeilBrown 2011-10-22 8:22 ` Jes Sorensen 2011-10-21 13:33 ` [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves Jes.Sorensen 2011-12-22 23:48 ` NeilBrown 2012-01-03 10:24 ` Jes Sorensen 2012-01-03 15:54 ` Doug Ledford 2012-01-04 2:34 ` NeilBrown 2012-01-06 18:35 ` Doug Ledford 2011-10-21 13:33 ` [PATCH 3/3] Hold the map lock while performing Assemble to avoid races with udev Jes.Sorensen
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).