* [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble
@ 2011-10-21 13:33 Jes.Sorensen
2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Jes.Sorensen @ 2011-10-21 13:33 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford
From: Jes Sorensen <Jes.Sorensen@redhat.com>
Hi,
When I posted this yesterday I missed a couple of exist cases in the
patch changing the locking scheme for Incremental_container(). Not
sure how I managed that, but fixed in v2 of the patch. The other two
patches remain onchanged.
This patch set fixes the problem with mdadm racing udev during during
incremental or regular assembly of IMSM raids.
There are three patches to the set:
1) Hold the lock while running Incremental_container() to avoid udev
(or someone else like a boot script) kicking off an additional
mdadm instance to try and assemble the container in parallel.
2) Don't send udev a 'change' event to handle incremental assembly
since we are about to do it ourselves anyway.
3) Hold the lock during Assemble() again to avoid udev kicking in
behind our backs.
With these patches in place, I can no longer reproduce the case I was
seeing on Fedora 16 Beta where an array would not come up correctly.
Patches are on top of Neil's current git repository.
Comments?
Cheers,
Jes
Jes Sorensen (3):
Remove race for starting container devices.
Don't tell sysfs to launch the container as we are doing it ourselves
Hold the map lock while performing Assemble to avoid races with udev
Incremental.c | 30 ++++++++++++++----------------
mdadm.c | 6 ++++++
2 files changed, 20 insertions(+), 16 deletions(-)
--
1.7.4.4
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH 1/3] Remove race for starting container devices.
2011-10-21 13:33 [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble Jes.Sorensen
@ 2011-10-21 13:33 ` Jes.Sorensen
2011-10-22 0:49 ` NeilBrown
2011-10-21 13:33 ` [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves Jes.Sorensen
2011-10-21 13:33 ` [PATCH 3/3] Hold the map lock while performing Assemble to avoid races with udev Jes.Sorensen
2 siblings, 1 reply; 11+ messages in thread
From: Jes.Sorensen @ 2011-10-21 13:33 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford
From: Jes Sorensen <Jes.Sorensen@redhat.com>
This moves the lock handling out of Incremental_container() and relies
on the caller holding the lock. This prevents conflict with a
follow-on mdadm comment which may try and launch the device in
parallel.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
---
Incremental.c | 29 ++++++++++++++---------------
1 files changed, 14 insertions(+), 15 deletions(-)
diff --git a/Incremental.c b/Incremental.c
index b90089b..1a9a97b 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -139,10 +139,16 @@ int Incremental(char *devname, int verbose, int runstop,
rv = st->ss->load_container(st, dfd, NULL);
close(dfd);
- if (!rv && st->ss->container_content)
- return Incremental_container(st, devname, homehost,
- verbose, runstop, autof,
- freeze_reshape);
+ if (!rv && st->ss->container_content) {
+ if (map_lock(&map))
+ fprintf(stderr, Name ": failed to get "
+ "exclusive lock on mapfile\n");
+ rv = Incremental_container(st, devname, homehost,
+ verbose, runstop, autof,
+ freeze_reshape);
+ map_unlock(&map);
+ return rv;
+ }
fprintf(stderr, Name ": %s is not part of an md array.\n",
devname);
@@ -440,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop,
if (info.array.level == LEVEL_CONTAINER) {
int devnum = devnum; /* defined and used iff ->external */
/* Try to assemble within the container */
- map_unlock(&map);
sysfs_uevent(&info, "change");
if (verbose >= 0)
fprintf(stderr, Name
@@ -451,9 +456,10 @@ int Incremental(char *devname, int verbose, int runstop,
devnum = fd2devnum(mdfd);
close(mdfd);
sysfs_free(sra);
- rv = Incremental(chosen_name, verbose, runstop,
- NULL, homehost, require_homehost, autof,
- freeze_reshape);
+ rv = Incremental_container(st, chosen_name, homehost,
+ verbose, runstop, autof,
+ freeze_reshape);
+ map_unlock(&map);
if (rv == 1)
/* Don't fail the whole -I if a subarray didn't
* have enough devices to start yet
@@ -1463,16 +1469,12 @@ static int Incremental_container(struct supertype *st, char *devname,
trustworthy = FOREIGN;
list = st->ss->container_content(st, NULL);
- if (map_lock(&map))
- fprintf(stderr, Name ": failed to get exclusive lock on "
- "mapfile\n");
/* do not assemble arrays that might have bad blocks */
if (list && list->array.state & (1<<MD_SB_BBM_ERRORS)) {
fprintf(stderr, Name ": BBM log found in metadata. "
"Cannot activate array(s).\n");
/* free container data and exit */
sysfs_free(list);
- map_unlock(&map);
return 2;
}
@@ -1536,7 +1538,6 @@ static int Incremental_container(struct supertype *st, char *devname,
fprintf(stderr, Name ": array %s/%s is "
"explicitly ignored by mdadm.conf\n",
match->container, match->member);
- map_unlock(&map);
return 2;
}
if (match)
@@ -1552,7 +1553,6 @@ static int Incremental_container(struct supertype *st, char *devname,
if (mdfd < 0) {
fprintf(stderr, Name ": failed to open %s: %s.\n",
chosen_name, strerror(errno));
- map_unlock(&map);
return 2;
}
@@ -1607,7 +1607,6 @@ static int Incremental_container(struct supertype *st, char *devname,
close(sfd);
}
domain_free(domains);
- map_unlock(&map);
return 0;
}
--
1.7.4.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves
2011-10-21 13:33 [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble Jes.Sorensen
2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen
@ 2011-10-21 13:33 ` Jes.Sorensen
2011-12-22 23:48 ` NeilBrown
2011-10-21 13:33 ` [PATCH 3/3] Hold the map lock while performing Assemble to avoid races with udev Jes.Sorensen
2 siblings, 1 reply; 11+ messages in thread
From: Jes.Sorensen @ 2011-10-21 13:33 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford
From: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
---
Incremental.c | 1 -
1 files changed, 0 insertions(+), 1 deletions(-)
diff --git a/Incremental.c b/Incremental.c
index 1a9a97b..cff7663 100644
--- a/Incremental.c
+++ b/Incremental.c
@@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop,
if (info.array.level == LEVEL_CONTAINER) {
int devnum = devnum; /* defined and used iff ->external */
/* Try to assemble within the container */
- sysfs_uevent(&info, "change");
if (verbose >= 0)
fprintf(stderr, Name
": container %s now has %d devices\n",
--
1.7.4.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* [PATCH 3/3] Hold the map lock while performing Assemble to avoid races with udev
2011-10-21 13:33 [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble Jes.Sorensen
2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen
2011-10-21 13:33 ` [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves Jes.Sorensen
@ 2011-10-21 13:33 ` Jes.Sorensen
2 siblings, 0 replies; 11+ messages in thread
From: Jes.Sorensen @ 2011-10-21 13:33 UTC (permalink / raw)
To: neilb; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford
From: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
---
mdadm.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)
diff --git a/mdadm.c b/mdadm.c
index ebf1c46..56de7b7 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1279,6 +1279,7 @@ int main(int argc, char *argv[])
} else {
struct mddev_ident *a, *array_list = conf_get_ident(NULL);
struct mddev_dev *devlist = conf_get_devs();
+ struct map_ent *map = NULL;
int cnt = 0;
int failures, successes;
if (devlist == NULL) {
@@ -1298,6 +1299,10 @@ int main(int argc, char *argv[])
if (a->autof == 0)
a->autof = autof;
}
+ if (map_lock(&map))
+ fprintf(stderr, Name " %s: failed to get "
+ "exclusive lock on mapfile\n",
+ __func__);
do {
failures = 0;
successes = 0;
@@ -1364,6 +1369,7 @@ int main(int argc, char *argv[])
fprintf(stderr, Name ": No arrays found in config file\n");
rv = 1;
}
+ map_unlock(&map);
}
break;
case BUILD:
--
1.7.4.4
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] Remove race for starting container devices.
2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen
@ 2011-10-22 0:49 ` NeilBrown
2011-10-22 8:22 ` Jes Sorensen
0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2011-10-22 0:49 UTC (permalink / raw)
To: Jes.Sorensen; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford
[-- Attachment #1: Type: text/plain, Size: 1659 bytes --]
On Fri, 21 Oct 2011 15:33:18 +0200 Jes.Sorensen@redhat.com wrote:
> From: Jes Sorensen <Jes.Sorensen@redhat.com>
>
> This moves the lock handling out of Incremental_container() and relies
> on the caller holding the lock. This prevents conflict with a
> follow-on mdadm comment which may try and launch the device in
> parallel.
>
> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Hi Jes,
I have applied the 7 patches you sent.
However I have question mark over part of this one:
> @@ -451,9 +456,10 @@ int Incremental(char *devname, int verbose, int runstop,
> devnum = fd2devnum(mdfd);
> close(mdfd);
> sysfs_free(sra);
> - rv = Incremental(chosen_name, verbose, runstop,
> - NULL, homehost, require_homehost, autof,
> - freeze_reshape);
> + rv = Incremental_container(st, chosen_name, homehost,
> + verbose, runstop, autof,
> + freeze_reshape);
> + map_unlock(&map);
> if (rv == 1)
> /* Don't fail the whole -I if a subarray didn't
> * have enough devices to start yet
You have replaced a call to Incremental with a call to Incremental_container.
I feel that is significant enough that I would have liked to have seen it
discussed in the changelog comment.
You seem have have open-coded Incremental and then removed all the bits that
you don't need in this case - which is that vast majority of it.
But you didn't include the call to ->load_content().
So I have put it back in because I'm sure it must be harmless (Because it was
there already) but it may not be needed.
If you have thoughts on this, please let me know.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 1/3] Remove race for starting container devices.
2011-10-22 0:49 ` NeilBrown
@ 2011-10-22 8:22 ` Jes Sorensen
0 siblings, 0 replies; 11+ messages in thread
From: Jes Sorensen @ 2011-10-22 8:22 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford
On 10/22/11 02:49, NeilBrown wrote:
> Hi Jes,
>
> I have applied the 7 patches you sent.
>
> However I have question mark over part of this one:
>
>
>> > @@ -451,9 +456,10 @@ int Incremental(char *devname, int verbose, int runstop,
>> > devnum = fd2devnum(mdfd);
>> > close(mdfd);
>> > sysfs_free(sra);
>> > - rv = Incremental(chosen_name, verbose, runstop,
>> > - NULL, homehost, require_homehost, autof,
>> > - freeze_reshape);
>> > + rv = Incremental_container(st, chosen_name, homehost,
>> > + verbose, runstop, autof,
>> > + freeze_reshape);
>> > + map_unlock(&map);
>> > if (rv == 1)
>> > /* Don't fail the whole -I if a subarray didn't
>> > * have enough devices to start yet
> You have replaced a call to Incremental with a call to Incremental_container.
> I feel that is significant enough that I would have liked to have seen it
> discussed in the changelog comment.
>
> You seem have have open-coded Incremental and then removed all the bits that
> you don't need in this case - which is that vast majority of it.
> But you didn't include the call to ->load_content().
>
> So I have put it back in because I'm sure it must be harmless (Because it was
> there already) but it may not be needed.
>
> If you have thoughts on this, please let me know.
Hi Neil,
There is a perfectly good explanation for this: The reason is that the
guy who wrote the initial patch is an idiot!
In fact, said idiot spent most of Friday pulling his hairs out trying to
figure out why the modified mdadm would boot fine on a systemd based
system but fail on a sysvinit based system. The result being the patch
bombing you were exposed to yesterday :)
Thanks for catching this, much appreciated!
Jes
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves
2011-10-21 13:33 ` [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves Jes.Sorensen
@ 2011-12-22 23:48 ` NeilBrown
2012-01-03 10:24 ` Jes Sorensen
0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2011-12-22 23:48 UTC (permalink / raw)
To: Jes.Sorensen; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford
[-- Attachment #1: Type: text/plain, Size: 1121 bytes --]
On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote:
> From: Jes Sorensen <Jes.Sorensen@redhat.com>
>
> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
> ---
> Incremental.c | 1 -
> 1 files changed, 0 insertions(+), 1 deletions(-)
>
> diff --git a/Incremental.c b/Incremental.c
> index 1a9a97b..cff7663 100644
> --- a/Incremental.c
> +++ b/Incremental.c
> @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop,
> if (info.array.level == LEVEL_CONTAINER) {
> int devnum = devnum; /* defined and used iff ->external */
> /* Try to assemble within the container */
> - sysfs_uevent(&info, "change");
> if (verbose >= 0)
> fprintf(stderr, Name
> ": container %s now has %d devices\n",
Hi Jes,
I've had to revert this patch as it causes a regression.
Without the 'change' event the container device doesn't get created in /dev.
I'm not sure udev should be running "--incremental" on containers anyway, but
if it does it shouldn't cause problems should it? If it does we should fix
those problems.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves
2011-12-22 23:48 ` NeilBrown
@ 2012-01-03 10:24 ` Jes Sorensen
2012-01-03 15:54 ` Doug Ledford
0 siblings, 1 reply; 11+ messages in thread
From: Jes Sorensen @ 2012-01-03 10:24 UTC (permalink / raw)
To: NeilBrown; +Cc: linux-raid, lukasz.dorau, adam.kwolek, dledford
On 12/23/11 00:48, NeilBrown wrote:
> On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote:
>
>> From: Jes Sorensen <Jes.Sorensen@redhat.com>
>>
>> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
>> ---
>> Incremental.c | 1 -
>> 1 files changed, 0 insertions(+), 1 deletions(-)
>>
>> diff --git a/Incremental.c b/Incremental.c
>> index 1a9a97b..cff7663 100644
>> --- a/Incremental.c
>> +++ b/Incremental.c
>> @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop,
>> if (info.array.level == LEVEL_CONTAINER) {
>> int devnum = devnum; /* defined and used iff ->external */
>> /* Try to assemble within the container */
>> - sysfs_uevent(&info, "change");
>> if (verbose >= 0)
>> fprintf(stderr, Name
>> ": container %s now has %d devices\n",
>
> Hi Jes,
> I've had to revert this patch as it causes a regression.
>
> Without the 'change' event the container device doesn't get created in /dev.
>
> I'm not sure udev should be running "--incremental" on containers anyway, but
> if it does it shouldn't cause problems should it? If it does we should fix
> those problems.
Hi Neil,
I am wary of this being reverted as it fixed a genuine race condition.
On Fedora we don't have the problem with the container device not being
created, could it be that your udev scripts are not doing the right
thing in this case?
Cheers,
Jes
Thanks for the heads up!
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves
2012-01-03 10:24 ` Jes Sorensen
@ 2012-01-03 15:54 ` Doug Ledford
2012-01-04 2:34 ` NeilBrown
0 siblings, 1 reply; 11+ messages in thread
From: Doug Ledford @ 2012-01-03 15:54 UTC (permalink / raw)
To: Jes Sorensen; +Cc: NeilBrown, linux-raid, lukasz.dorau, adam.kwolek
[-- Attachment #1.1: Type: text/plain, Size: 2853 bytes --]
On 01/03/2012 05:24 AM, Jes Sorensen wrote:
> On 12/23/11 00:48, NeilBrown wrote:
>> On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote:
>>
>>> From: Jes Sorensen <Jes.Sorensen@redhat.com>
>>>
>>> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
>>> ---
>>> Incremental.c | 1 -
>>> 1 files changed, 0 insertions(+), 1 deletions(-)
>>>
>>> diff --git a/Incremental.c b/Incremental.c
>>> index 1a9a97b..cff7663 100644
>>> --- a/Incremental.c
>>> +++ b/Incremental.c
>>> @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop,
>>> if (info.array.level == LEVEL_CONTAINER) {
>>> int devnum = devnum; /* defined and used iff ->external */
>>> /* Try to assemble within the container */
>>> - sysfs_uevent(&info, "change");
>>> if (verbose >= 0)
>>> fprintf(stderr, Name
>>> ": container %s now has %d devices\n",
>>
>> Hi Jes,
>> I've had to revert this patch as it causes a regression.
>>
>> Without the 'change' event the container device doesn't get created in /dev.
>>
>> I'm not sure udev should be running "--incremental" on containers anyway, but
>> if it does it shouldn't cause problems should it? If it does we should fix
>> those problems.
>
> Hi Neil,
>
> I am wary of this being reverted as it fixed a genuine race condition.
> On Fedora we don't have the problem with the container device not being
> created, could it be that your udev scripts are not doing the right
> thing in this case?
I think it's probably worthwhile for us to do a udev rules file
comparison. So, I'm attaching the patch we apply to the udev rules file
in your distribution before installing it, as also the rules file that
handles incremental assembly on Fedora/RHEL. Look them over and see if
you want to include any of the things we have there in your version.
As to the question of udev running incremental on containers, I think
it's fair to say that either udev or mdadm should be doing so, and not
both. Whether it should be one or the other is up for debate I think.
Doing it in mdadm where this patch pulls it out means that it only
happens on add of a new device using incremental assembly. The udev
version does it on a change event on the container. The question I have
is, if you had mdmon migrate a spare into a container, would it trigger
this code in mdadm versus would it trigger the code in the udev rules
file? Is there any situation where a change triggered via something
other than incremental assembly would result in us needing to run
incremental on the container? If the answer is yes, then I would
recommend doing things in the udev file versus in the incremental
function of mdadm.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: 0E572FDD
http://people.redhat.com/dledford
[-- Attachment #1.2: mdadm.rules --]
[-- Type: text/plain, Size: 3416 bytes --]
# This file causes block devices with Linux RAID (mdadm) signatures to
# automatically cause mdadm to be run.
# See udev(8) for syntax
# Don't process any events if anaconda is running as anaconda brings up
# raid devices manually
ENV{ANACONDA}=="?*", GOTO="md_end"
# Also don't process disks that are slated to be a multipath device
ENV{DM_MULTIPATH_DEVICE_PATH}=="?*", GOTO="md_end"
# We process add events on block devices (since they are ready as soon as
# they are added to the system), but we must process change events as well
# on any dm devices (like LUKS partitions or LVM logical volumes) and on
# md devices because both of these first get added, then get brought live
# and trigger a change event. The reason we don't process change events
# on bare hard disks is because if you stop all arrays on a disk, then
# run fdisk on the disk to change the partitions, when fdisk exits it
# triggers a change event, and we want to wait until all the fdisks on
# all member disks are done before we do anything. Unfortunately, we have
# no way of knowing that, so we just have to let those arrays be brought
# up manually after fdisk has been run on all of the disks.
# First, process all add events (md and dm devices will not really do
# anything here, just regular disks, and this also won't get any imsm
# array members either)
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="linux_raid_member", \
RUN+="/sbin/mdadm -I $env{DEVNAME}"
SUBSYSTEM=="block", ACTION=="remove", ENV{ID_FS_TYPE}=="linux_raid_member", \
RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
# Next, check to make sure the BIOS raid stuff wasn't turned off via cmdline
IMPORT{cmdline}="noiswmd"
IMPORT{cmdline}="nodmraid"
ENV{noiswmd}=="?*", GOTO="md_imsm_inc_end"
ENV{nodmraid}=="?*", GOTO="md_imsm_inc_end"
SUBSYSTEM=="block", ACTION=="add", ENV{ID_FS_TYPE}=="isw_raid_member", \
RUN+="/sbin/mdadm -I $env{DEVNAME}"
SUBSYSTEM=="block", ACTION=="remove", ENV{ID_FS_TYPE}=="isw_raid_member", \
RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
LABEL="md_imsm_inc_end"
# Next make sure that this isn't a dm device we should skip for some reason
ENV{DM_UDEV_RULES_VSN}!="?*", GOTO="dm_change_end"
ENV{DM_UDEV_DISABLE_DISK_RULES_FLAG}=="1", GOTO="dm_change_end"
ENV{DM_SUSPENDED}=="1", GOTO="dm_change_end"
KERNEL=="dm-*", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="linux_raid_member", \
ACTION=="change", RUN+="/sbin/mdadm -I $env{DEVNAME}"
LABEL="dm_change_end"
# Finally catch any nested md raid arrays. If we brought up an md raid
# array that's part of another md raid array, it won't be ready to be used
# until the change event that occurs when it becomes live
KERNEL=="md*", SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="linux_raid_member", \
ACTION=="change", RUN+="/sbin/mdadm -I $env{DEVNAME}"
# In case the initramfs only started some of the arrays in our container,
# run incremental assembly on the container itself. Note: we ran mdadm
# on the container in 64-md-raid.rules, and that's how the MD_LEVEL
# environment variable is already set. If that disappears from the other
# file, we will need to add this line into the middle of the next rule:
# IMPORT{program}="/sbin/mdadm -D --export $tempnode", \
SUBSYSTEM=="block", ACTION=="add|change", KERNEL=="md*", \
ENV{MD_LEVEL}=="container", RUN+="/sbin/mdadm -I $env{DEVNAME}"
LABEL="md_end"
[-- Attachment #1.3: mdadm-3.1.3-udev.patch --]
[-- Type: text/plain, Size: 1300 bytes --]
--- mdadm-3.2.1/udev-md-raid.rules.udev 2011-03-27 22:31:20.000000000 -0400
+++ mdadm-3.2.1/udev-md-raid.rules 2011-03-28 10:14:26.047232843 -0400
@@ -2,11 +2,13 @@
SUBSYSTEM!="block", GOTO="md_end"
+# In Fedora we handle the raid components in 65-md-incremental.rules so that
+# we can do things like honor anaconda command line options and such
# handle potential components of arrays
-ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
-ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"
-ENV{ID_FS_TYPE}=="isw_raid_member", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
-ENV{ID_FS_TYPE}=="isw_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"
+#ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
+#ENV{ID_FS_TYPE}=="linux_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"
+#ENV{ID_FS_TYPE}=="isw_raid_member", ACTION=="remove", RUN+="/sbin/mdadm -If $name --path $env{ID_PATH}"
+#ENV{ID_FS_TYPE}=="isw_raid_member", ACTION=="add", RUN+="/sbin/mdadm --incremental $env{DEVNAME}"
# handle md arrays
ACTION!="add|change", GOTO="md_end"
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 900 bytes --]
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves
2012-01-03 15:54 ` Doug Ledford
@ 2012-01-04 2:34 ` NeilBrown
2012-01-06 18:35 ` Doug Ledford
0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2012-01-04 2:34 UTC (permalink / raw)
To: Doug Ledford; +Cc: Jes Sorensen, linux-raid, lukasz.dorau, adam.kwolek
[-- Attachment #1: Type: text/plain, Size: 6337 bytes --]
On Tue, 03 Jan 2012 10:54:41 -0500 Doug Ledford <dledford@redhat.com> wrote:
> On 01/03/2012 05:24 AM, Jes Sorensen wrote:
> > On 12/23/11 00:48, NeilBrown wrote:
> >> On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote:
> >>
> >>> From: Jes Sorensen <Jes.Sorensen@redhat.com>
> >>>
> >>> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
> >>> ---
> >>> Incremental.c | 1 -
> >>> 1 files changed, 0 insertions(+), 1 deletions(-)
> >>>
> >>> diff --git a/Incremental.c b/Incremental.c
> >>> index 1a9a97b..cff7663 100644
> >>> --- a/Incremental.c
> >>> +++ b/Incremental.c
> >>> @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop,
> >>> if (info.array.level == LEVEL_CONTAINER) {
> >>> int devnum = devnum; /* defined and used iff ->external */
> >>> /* Try to assemble within the container */
> >>> - sysfs_uevent(&info, "change");
> >>> if (verbose >= 0)
> >>> fprintf(stderr, Name
> >>> ": container %s now has %d devices\n",
> >>
> >> Hi Jes,
> >> I've had to revert this patch as it causes a regression.
> >>
> >> Without the 'change' event the container device doesn't get created in /dev.
> >>
> >> I'm not sure udev should be running "--incremental" on containers anyway, but
> >> if it does it shouldn't cause problems should it? If it does we should fix
> >> those problems.
> >
> > Hi Neil,
> >
> > I am wary of this being reverted as it fixed a genuine race condition.
> > On Fedora we don't have the problem with the container device not being
> > created, could it be that your udev scripts are not doing the right
> > thing in this case?
>
> I think it's probably worthwhile for us to do a udev rules file
> comparison. So, I'm attaching the patch we apply to the udev rules file
> in your distribution before installing it, as also the rules file that
> handles incremental assembly on Fedora/RHEL. Look them over and see if
> you want to include any of the things we have there in your version.
>
> As to the question of udev running incremental on containers, I think
> it's fair to say that either udev or mdadm should be doing so, and not
> both. Whether it should be one or the other is up for debate I think.
> Doing it in mdadm where this patch pulls it out means that it only
> happens on add of a new device using incremental assembly. The udev
> version does it on a change event on the container. The question I have
> is, if you had mdmon migrate a spare into a container, would it trigger
> this code in mdadm versus would it trigger the code in the udev rules
> file? Is there any situation where a change triggered via something
> other than incremental assembly would result in us needing to run
> incremental on the container? If the answer is yes, then I would
> recommend doing things in the udev file versus in the incremental
> function of mdadm.
>
>
Nothing's ever easy.....
I just tried to reproduce the problem I had which lead me to revert that
patch, but mdadm didn't behave as I expected. So now I'm running my test
suite with the revert removed to see what happens.
However I think that if triggering a change event at that point (or at
any point) causes a problem, then the problem was already there and needs to
be fixed. If there is a race as you say (and I suspect you are right) then
we need to put in appropriate locking to make sure the race doesn't cause a
problem. Just removing the code that triggers the race isn't really enough.
I think that *both* mdadm and udev should be trying to assemble the contents
of a container when the container changes. I'm not convinced that everyone
will be running udev though most probably will. Again: if this is racy we
should identify and lock-out the races.
I don't see how your 'migrate a spare' scenario is relevant but I've probably
missed something.
The scenarios we have been talking about are for assembling a device once all
the required components become available. A 'spare' cannot possibly be a
required component so it doesn't seem relevant.
When a spare is migrated in, mdmon should notice and take whatever action is
necessary. This is quite separate from mdadm or udev.
And I've found out the details of the regression. It is another race :-(
It particularly affects DDF as DDF containers can have a name.
When you "mdadm -I /dev/thing" the first device in a DDF container it will
firstly "set_array_info" to set up the array, then add the info to the 'map'
file.
The set_array_info (or probably the "open" before that) will trigger the
"ADD" uevent which will run mdadm which should create the /dev/md/WHATEVER
link.
However if this happens before mdadm updates the 'map' file it won't be able
to create the link.
So we need to trigger a 'change' event after updating the map file.
There are probably other ways to fix this. We could maybe update the map
file before creating the array device, but that feels clumsy - error handling
would be messy.
We could hold the mapfile locked from before creating the device to after
updating the map file. Then make sure the mdadm that is run from udev waits
for the lock before looking for the name of the array. That is probably
better.
In any case I think we need to give some serious thought to locking to avoid
races between different mdadm. We have some in place already - just need to
make sure it is used consistently.
Some of the changes you have made to the udev.rules file look worth while.
Having two separate files is probably a good idea - one for devices which are
array and one for devices which might be members of arrays.
The various options for skipping things look a bit painful.
i.e. we test for
ANACONDA
DM_MULTIPATH_DEVICE_PATH
noiswmd
nodmraid
The last two are probably OK. The first two I would rather see something else
set a generic "Don't include this device in anything - I've got it
under control" setting, and then the mdadm rules file just checks that and
ignores as appropriate.
It would be good if everything in the rules file in the main distribution
were equally appropriate for all distros, and that any change a distro wanted
to make could be done in a separate file.
Thanks,
NeilBrown
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 828 bytes --]
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves
2012-01-04 2:34 ` NeilBrown
@ 2012-01-06 18:35 ` Doug Ledford
0 siblings, 0 replies; 11+ messages in thread
From: Doug Ledford @ 2012-01-06 18:35 UTC (permalink / raw)
To: NeilBrown; +Cc: Jes Sorensen, linux-raid, lukasz.dorau, adam.kwolek
[-- Attachment #1: Type: text/plain, Size: 11280 bytes --]
On 01/03/2012 09:34 PM, NeilBrown wrote:
> On Tue, 03 Jan 2012 10:54:41 -0500 Doug Ledford <dledford@redhat.com> wrote:
>
>> On 01/03/2012 05:24 AM, Jes Sorensen wrote:
>>> On 12/23/11 00:48, NeilBrown wrote:
>>>> On Fri, 21 Oct 2011 15:33:19 +0200 Jes.Sorensen@redhat.com wrote:
>>>>
>>>>> From: Jes Sorensen <Jes.Sorensen@redhat.com>
>>>>>
>>>>> Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
>>>>> ---
>>>>> Incremental.c | 1 -
>>>>> 1 files changed, 0 insertions(+), 1 deletions(-)
>>>>>
>>>>> diff --git a/Incremental.c b/Incremental.c
>>>>> index 1a9a97b..cff7663 100644
>>>>> --- a/Incremental.c
>>>>> +++ b/Incremental.c
>>>>> @@ -446,7 +446,6 @@ int Incremental(char *devname, int verbose, int runstop,
>>>>> if (info.array.level == LEVEL_CONTAINER) {
>>>>> int devnum = devnum; /* defined and used iff ->external */
>>>>> /* Try to assemble within the container */
>>>>> - sysfs_uevent(&info, "change");
>>>>> if (verbose >= 0)
>>>>> fprintf(stderr, Name
>>>>> ": container %s now has %d devices\n",
>>>>
>>>> Hi Jes,
>>>> I've had to revert this patch as it causes a regression.
>>>>
>>>> Without the 'change' event the container device doesn't get created in /dev.
>>>>
>>>> I'm not sure udev should be running "--incremental" on containers anyway, but
>>>> if it does it shouldn't cause problems should it? If it does we should fix
>>>> those problems.
>>>
>>> Hi Neil,
>>>
>>> I am wary of this being reverted as it fixed a genuine race condition.
>>> On Fedora we don't have the problem with the container device not being
>>> created, could it be that your udev scripts are not doing the right
>>> thing in this case?
>>
>> I think it's probably worthwhile for us to do a udev rules file
>> comparison. So, I'm attaching the patch we apply to the udev rules file
>> in your distribution before installing it, as also the rules file that
>> handles incremental assembly on Fedora/RHEL. Look them over and see if
>> you want to include any of the things we have there in your version.
>>
>> As to the question of udev running incremental on containers, I think
>> it's fair to say that either udev or mdadm should be doing so, and not
>> both. Whether it should be one or the other is up for debate I think.
>> Doing it in mdadm where this patch pulls it out means that it only
>> happens on add of a new device using incremental assembly. The udev
>> version does it on a change event on the container. The question I have
>> is, if you had mdmon migrate a spare into a container, would it trigger
>> this code in mdadm versus would it trigger the code in the udev rules
>> file? Is there any situation where a change triggered via something
>> other than incremental assembly would result in us needing to run
>> incremental on the container? If the answer is yes, then I would
>> recommend doing things in the udev file versus in the incremental
>> function of mdadm.
>>
>>
>
> Nothing's ever easy.....
>
> I just tried to reproduce the problem I had which lead me to revert that
> patch, but mdadm didn't behave as I expected.
Don't you hate it when that happens ;-)
> So now I'm running my test
> suite with the revert removed to see what happens.
>
> However I think that if triggering a change event at that point (or at
> any point) causes a problem, then the problem was already there and needs to
> be fixed. If there is a race as you say (and I suspect you are right) then
> we need to put in appropriate locking to make sure the race doesn't cause a
> problem. Just removing the code that triggers the race isn't really enough.
Agreed.
> I think that *both* mdadm and udev should be trying to assemble the contents
> of a container when the container changes. I'm not convinced that everyone
> will be running udev though most probably will. Again: if this is racy we
> should identify and lock-out the races.
Agreed. So, I'm going to take this opportunity to make a suggestion.
The md map file has shown to be racy over and over again. There are
races on incremental assembly, I had to solve races in the udev assembly
from the initramfs or else devices belonging to the same array could end
up in multiple partially assembled arrays, etc. In the interest of full
disclosure, I always thought the md map file was a solution to a
non-existent problem (a belief that is only getting stronger by the
problem BTW).
Therefore, I suggest we do away with the map file entirely. It isn't
really necessary and it causes a number of issues (the map file is the
only thing that the mdadm binary uses that has to be available from the
initramfs...mdmon would still need pre-init love, but not mdadm).
What's wrong with the map file? It's needlessly redundant. We don't
need to track what arrays we are processing in a file, they are already
tracked in the proper place: the kernel. We simply need to look at
existing arrays to determine if a new device can be added to an existing
array or if we need to create a new one.
I propose that mdadm be modified so that on incremental assembly we take
a non-exclusive open of /proc/mdstat, compare the uuid of our component
device to the uuid of the already existing arrays, if it matches then
add the device to the existing array, if none matches then transition to
an exclusive open of /proc/mdstat (this is our locking to protect
against multiple creations at the same time), once we get the exclusive
open rescan the file in case a new array was added between our initial
scan and the exclusive open, if there is a new array that matches our
uuid (multiple mdadms racing to create the array) then add our device to
that array, else create a new array with our device as the first device,
drop the exclusive open on /proc/mdstat, do container incremental if
needed, then exit.
Under that method of operation, we never have to worry about name
mapping as we will use the uuid exclusively except when creating the
array, so we only have to figure the name out once and then from that
point on any device with a uuid that matches the created array will just
get added to that array.
We might have to modify monitor mode to not hold /proc/mdstat open all
the time to make this work though. I would actually prefer some other
file besides /proc/mdstat, but I was hoping to use a file that already
exists and not one that would have to be created on bootup. Worse case
scenario though, we create a lock file for this specific purpose.
> I don't see how your 'migrate a spare' scenario is relevant but I've probably
> missed something.
It was just a thought exercise. The point being: is there any place
other than incremental assembly that might cause a transition from
unrunnable to runnable and therefore should do incremental on the container.
> The scenarios we have been talking about are for assembling a device once all
> the required components become available. A 'spare' cannot possibly be a
> required component so it doesn't seem relevant.
> When a spare is migrated in, mdmon should notice and take whatever action is
> necessary. This is quite separate from mdadm or udev.
>
> And I've found out the details of the regression. It is another race :-(
>
> It particularly affects DDF as DDF containers can have a name.
>
> When you "mdadm -I /dev/thing" the first device in a DDF container it will
> firstly "set_array_info" to set up the array, then add the info to the 'map'
> file.
>
> The set_array_info (or probably the "open" before that) will trigger the
> "ADD" uevent which will run mdadm which should create the /dev/md/WHATEVER
> link.
> However if this happens before mdadm updates the 'map' file it won't be able
> to create the link.
Again, map file bad, kernel is authoritative and should be all we need.
> So we need to trigger a 'change' event after updating the map file.
>
> There are probably other ways to fix this. We could maybe update the map
> file before creating the array device, but that feels clumsy - error handling
> would be messy.
>
> We could hold the mapfile locked from before creating the device to after
> updating the map file. Then make sure the mdadm that is run from udev waits
> for the lock before looking for the name of the array. That is probably
> better.
Or make sure the kernel ioctl for setting the name of the array issues a
udev event once the name is set and not before and then create the links
based on the kernel data, no need for the map file.
> In any case I think we need to give some serious thought to locking to avoid
> races between different mdadm. We have some in place already - just need to
> make sure it is used consistently.
>
>
> Some of the changes you have made to the udev.rules file look worth while.
> Having two separate files is probably a good idea - one for devices which are
> array and one for devices which might be members of arrays.
>
> The various options for skipping things look a bit painful.
> i.e. we test for
> ANACONDA
That's obviously Red Hat specific (well, Red Hat and derivatives).
> DM_MULTIPATH_DEVICE_PATH
This is LVM generic and so applies to any distro.
> noiswmd
> nodmraid
These two are dracut specific. I don't know which distros are or will
pick up dracut, so I don't know how generic this really is.
> The last two are probably OK.
Cool.
> The first two I would rather see something else
> set a generic "Don't include this device in anything - I've got it
> under control" setting, and then the mdadm rules file just checks that and
> ignores as appropriate.
I dunno, the Anaconda item and the LVM item really mean two distinct
things. In the anaconda case, we aren't ever going to do anything. The
anaconda environment setting is global and will be present on all
devices we ever look at. The LVM flag is device specific. We are going
to see this exact device again, but only after LVM has created the
necessary multipath device and we are scanning the multipath device.
And our actions in these two cases are pretty unique. For instance,
just because we shouldn't process the multipath device doesn't mean the
SCSI block device rules shouldn't be run. So it's not like a global
"don't touch this device" flag is really even appropriate, we would need
a global "md raid don't touch this device" flag instead, at which point
you might as well just do it ourselves like we are now.
> It would be good if everything in the rules file in the main distribution
> were equally appropriate for all distros, and that any change a distro wanted
> to make could be done in a separate file.
Well, equally applicable to all distros, or at least non-intrusive for
other distros. For example, it would be possible to include the various
tests in the udev rule file you distribute because even if other distros
don't use noiwsraid or nodmraid or any of the others, lack of the item
does not negatively affect operation of the rules file.
--
Doug Ledford <dledford@redhat.com>
GPG KeyID: 0E572FDD
http://people.redhat.com/dledford
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 900 bytes --]
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2012-01-06 18:35 UTC | newest]
Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-10-21 13:33 [PATCH v2 0/3] Fix mdadm vs udev race in Incremental and Assemble Jes.Sorensen
2011-10-21 13:33 ` [PATCH 1/3] Remove race for starting container devices Jes.Sorensen
2011-10-22 0:49 ` NeilBrown
2011-10-22 8:22 ` Jes Sorensen
2011-10-21 13:33 ` [PATCH 2/3] Don't tell sysfs to launch the container as we are doing it ourselves Jes.Sorensen
2011-12-22 23:48 ` NeilBrown
2012-01-03 10:24 ` Jes Sorensen
2012-01-03 15:54 ` Doug Ledford
2012-01-04 2:34 ` NeilBrown
2012-01-06 18:35 ` Doug Ledford
2011-10-21 13:33 ` [PATCH 3/3] Hold the map lock while performing Assemble to avoid races with udev Jes.Sorensen
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).