[MDADM PATCH 0/2] /dev/md0 can't be released immediatly

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [MDADM PATCH 0/2] /dev/md0 can't be released immediatly
@ 2017-09-19  6:06 Xiao Ni
  2017-09-19  6:06 ` [MDADM PATCH 1/2] Close mdfd before returning main function Xiao Ni
  2017-09-19  6:06 ` [MDADM PATCH 2/2] Give enough time to udev to handle events Xiao Ni
  0 siblings, 2 replies; 8+ messages in thread
From: Xiao Ni @ 2017-09-19  6:06 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, jes.sorensen

Hi all

The device node /dev/md0 can't be released immediatly after the command
mdadm -S /dev/md0 or mdadm -Ss.

We should give enough time to udev to handle udev events.

Xiao Ni (2):
  Close mdfd before returning main function
  Give enough time to udev to handle events

 Create.c |  2 +-
 mdadm.c  | 16 ++++++++++++++++
 2 files changed, 17 insertions(+), 1 deletion(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [MDADM PATCH 1/2] Close mdfd before returning main function
  2017-09-19  6:06 [MDADM PATCH 0/2] /dev/md0 can't be released immediatly Xiao Ni
@ 2017-09-19  6:06 ` Xiao Ni
  2017-10-02 17:27   ` Jes Sorensen
  2017-09-19  6:06 ` [MDADM PATCH 2/2] Give enough time to udev to handle events Xiao Ni
  1 sibling, 1 reply; 8+ messages in thread
From: Xiao Ni @ 2017-09-19  6:06 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, jes.sorensen

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 mdadm.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mdadm.c b/mdadm.c
index d80aab3..7cdcdba 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1734,6 +1734,8 @@ int main(int argc, char *argv[])
 		autodetect();
 		break;
 	}
+	if (mdfd > 0)
+		close(mdfd);
 	exit(rv);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [MDADM PATCH 1/2] Close mdfd before returning main function
  2017-09-19  6:06 ` [MDADM PATCH 1/2] Close mdfd before returning main function Xiao Ni
@ 2017-10-02 17:27   ` Jes Sorensen
  0 siblings, 0 replies; 8+ messages in thread
From: Jes Sorensen @ 2017-10-02 17:27 UTC (permalink / raw)
  To: Xiao Ni, linux-raid; +Cc: neilb

On 09/19/2017 02:06 AM, Xiao Ni wrote:
> Signed-off-by: Xiao Ni <xni@redhat.com>
> ---
>   mdadm.c | 2 ++
>   1 file changed, 2 insertions(+)
> 
> diff --git a/mdadm.c b/mdadm.c
> index d80aab3..7cdcdba 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -1734,6 +1734,8 @@ int main(int argc, char *argv[])
>   		autodetect();
>   		break;
>   	}
> +	if (mdfd > 0)
> +		close(mdfd);
>   	exit(rv);
>   }

While technically not necessary, since exit() will result in file 
descriptors for the process getting closed, I do prefer to have the code 
clean up properly after itself. I have gone ahead and applied this one.

Cheers,
Jes



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [MDADM PATCH 2/2] Give enough time to udev to handle events
  2017-09-19  6:06 [MDADM PATCH 0/2] /dev/md0 can't be released immediatly Xiao Ni
  2017-09-19  6:06 ` [MDADM PATCH 1/2] Close mdfd before returning main function Xiao Ni
@ 2017-09-19  6:06 ` Xiao Ni
  2017-09-19  6:49   ` NeilBrown
  1 sibling, 1 reply; 8+ messages in thread
From: Xiao Ni @ 2017-09-19  6:06 UTC (permalink / raw)
  To: linux-raid; +Cc: neilb, jes.sorensen

After mdadm -S /dev/md0, the device node /dev/md0 still exists. The Remove
events are generated by md_free() -> del_gendisk() -> blk_unregister_queue().
After calling close(mdfd) the Remove events is generated. We should give udev
a little time to handle the event.

I tried usleep(100*1000), but the problem still can be reproduced sometime.
So I choose to sleep(1). Because after close(mdfd) it can be generated CHANGE
events too. So it's ok to choose to sleep(1) to wait udev to handle CHANGE
events.

Signed-off-by: Xiao Ni <xni@redhat.com>
---
 Create.c |  2 +-
 mdadm.c  | 16 +++++++++++++++-
 2 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/Create.c b/Create.c
index 239545f..24e852e 100644
--- a/Create.c
+++ b/Create.c
@@ -1054,7 +1054,7 @@ int Create(struct supertype *st, char *mddev,
 	/* Give udev a moment to process the Change event caused
 	 * by the close.
 	 */
-	usleep(100*1000);
+	sleep(1);
 	udev_unblock();
 	return 0;
 
diff --git a/mdadm.c b/mdadm.c
index 7cdcdba..2905dea 100644
--- a/mdadm.c
+++ b/mdadm.c
@@ -1734,8 +1734,14 @@ int main(int argc, char *argv[])
 		autodetect();
 		break;
 	}
-	if (mdfd > 0)
+
+	if (mdfd > 0) {
 		close(mdfd);
+		/* Give udev a moment to process the udev event caused
+		 * by the close.
+		 */
+		sleep(1);
+	}
 	exit(rv);
 }
 
@@ -1897,6 +1903,10 @@ static int stop_scan(int verbose)
 				else
 					progress = 1;
 				close(mdfd);
+				/* Give udev a moment to process the Remove event caused
+				 * by the close.
+				 */
+				sleep(1);
 			}
 
 			put_md_name(name);
@@ -1997,6 +2007,10 @@ static int misc_list(struct mddev_dev *devlist,
 				break;
 			}
 			close(mdfd);
+			/* Give udev a moment to process the udev event caused
+			 * by the close.
+			 */
+			sleep(1);
 		} else
 			rv |= 1;
 	}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [MDADM PATCH 2/2] Give enough time to udev to handle events
  2017-09-19  6:06 ` [MDADM PATCH 2/2] Give enough time to udev to handle events Xiao Ni
@ 2017-09-19  6:49   ` NeilBrown
  2017-09-30  8:14     ` Xiao Ni
  2017-10-02 17:28     ` Jes Sorensen
  0 siblings, 2 replies; 8+ messages in thread
From: NeilBrown @ 2017-09-19  6:49 UTC (permalink / raw)
  To: Xiao Ni, linux-raid; +Cc: jes.sorensen

[-- Attachment #1: Type: text/plain, Size: 2605 bytes --]

On Tue, Sep 19 2017, Xiao Ni wrote:

> After mdadm -S /dev/md0, the device node /dev/md0 still exists. The Remove
> events are generated by md_free() -> del_gendisk() -> blk_unregister_queue().
> After calling close(mdfd) the Remove events is generated. We should give udev
> a little time to handle the event.
>
> I tried usleep(100*1000), but the problem still can be reproduced sometime.
> So I choose to sleep(1). Because after close(mdfd) it can be generated CHANGE
> events too. So it's ok to choose to sleep(1) to wait udev to handle CHANGE
> events.

I really don't like this approach.  The fact that 1 second works for you
is no guarantee that it will work for everybody.
We have a few sleeps in the code already, but I don't like them either.
Let's try not to add more.

If there is some event that you want to wait for, wait for that event.
e.g. if you want to wait for /dev/md0 to disappear then write a loop:

while /dev/md0 exists
   usleep(1000)

But I'm still not convinced that this is really needed.  If it is, then
maybe some sort of kernel fix would be better.

NeilBrown


>
> Signed-off-by: Xiao Ni <xni@redhat.com>
> ---
>  Create.c |  2 +-
>  mdadm.c  | 16 +++++++++++++++-
>  2 files changed, 16 insertions(+), 2 deletions(-)
>
> diff --git a/Create.c b/Create.c
> index 239545f..24e852e 100644
> --- a/Create.c
> +++ b/Create.c
> @@ -1054,7 +1054,7 @@ int Create(struct supertype *st, char *mddev,
>  	/* Give udev a moment to process the Change event caused
>  	 * by the close.
>  	 */
> -	usleep(100*1000);
> +	sleep(1);
>  	udev_unblock();
>  	return 0;
>  
> diff --git a/mdadm.c b/mdadm.c
> index 7cdcdba..2905dea 100644
> --- a/mdadm.c
> +++ b/mdadm.c
> @@ -1734,8 +1734,14 @@ int main(int argc, char *argv[])
>  		autodetect();
>  		break;
>  	}
> -	if (mdfd > 0)
> +
> +	if (mdfd > 0) {
>  		close(mdfd);
> +		/* Give udev a moment to process the udev event caused
> +		 * by the close.
> +		 */
> +		sleep(1);
> +	}
>  	exit(rv);
>  }
>  
> @@ -1897,6 +1903,10 @@ static int stop_scan(int verbose)
>  				else
>  					progress = 1;
>  				close(mdfd);
> +				/* Give udev a moment to process the Remove event caused
> +				 * by the close.
> +				 */
> +				sleep(1);
>  			}
>  
>  			put_md_name(name);
> @@ -1997,6 +2007,10 @@ static int misc_list(struct mddev_dev *devlist,
>  				break;
>  			}
>  			close(mdfd);
> +			/* Give udev a moment to process the udev event caused
> +			 * by the close.
> +			 */
> +			sleep(1);
>  		} else
>  			rv |= 1;
>  	}
> -- 
> 2.7.4

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [MDADM PATCH 2/2] Give enough time to udev to handle events
  2017-09-19  6:49   ` NeilBrown
@ 2017-09-30  8:14     ` Xiao Ni
  2017-10-02 17:28     ` Jes Sorensen
  1 sibling, 0 replies; 8+ messages in thread
From: Xiao Ni @ 2017-09-30  8:14 UTC (permalink / raw)
  To: NeilBrown; +Cc: linux-raid, jes sorensen



----- Original Message -----
> From: "NeilBrown" <neilb@suse.com>
> To: "Xiao Ni" <xni@redhat.com>, linux-raid@vger.kernel.org
> Cc: "jes sorensen" <jes.sorensen@gmail.com>
> Sent: Tuesday, September 19, 2017 2:49:19 PM
> Subject: Re: [MDADM PATCH 2/2] Give enough time to udev to handle events
> 
> On Tue, Sep 19 2017, Xiao Ni wrote:
> 
> > After mdadm -S /dev/md0, the device node /dev/md0 still exists. The Remove
> > events are generated by md_free() -> del_gendisk() ->
> > blk_unregister_queue().
> > After calling close(mdfd) the Remove events is generated. We should give
> > udev
> > a little time to handle the event.
> >
> > I tried usleep(100*1000), but the problem still can be reproduced sometime.
> > So I choose to sleep(1). Because after close(mdfd) it can be generated
> > CHANGE
> > events too. So it's ok to choose to sleep(1) to wait udev to handle CHANGE
> > events.
> 
> I really don't like this approach.  The fact that 1 second works for you
> is no guarantee that it will work for everybody.
> We have a few sleeps in the code already, but I don't like them either.
> Let's try not to add more.
Hi Neil

Yes, there are already some sleeps during STOP_ARRAY
        count = 25; err = 0;
        while (count && fd >= 0 &&
               (err = ioctl(fd, STOP_ARRAY, NULL)) < 0 && errno == EBUSY) {
                usleep(200000);
                count --;
        }
But it doesn't give time to wait for handling udev event which is generated
by close(mdfd). Because they are asynchronous, so there is a possibility that
the device node still exist after mdadm command exist. 

> 
> If there is some event that you want to wait for, wait for that event.
> e.g. if you want to wait for /dev/md0 to disappear then write a loop:
> 
> while /dev/md0 exists
>    usleep(1000)

Ok, I'll try this.

> 
> But I'm still not convinced that this is really needed.  If it is, then
> maybe some sort of kernel fix would be better.
> 

Hmm the reason I focus on this is that there are some bugs about this.
After some investigation I know it's that we don't wait for udev remove 
event. Because the event is generated by mdadm, so we should guarantee 
the udev event finishes, right?
Do you have a better solution for this? The REMOVE event is created by 
close(mdfd). I don't know how to wait for remove event in kernel space.

Best Regards
Xiao


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [MDADM PATCH 2/2] Give enough time to udev to handle events
  2017-09-19  6:49   ` NeilBrown
  2017-09-30  8:14     ` Xiao Ni
@ 2017-10-02 17:28     ` Jes Sorensen
  2017-10-02 23:00       ` NeilBrown
  1 sibling, 1 reply; 8+ messages in thread
From: Jes Sorensen @ 2017-10-02 17:28 UTC (permalink / raw)
  To: NeilBrown, Xiao Ni, linux-raid

On 09/19/2017 02:49 AM, NeilBrown wrote:
> On Tue, Sep 19 2017, Xiao Ni wrote:
> 
>> After mdadm -S /dev/md0, the device node /dev/md0 still exists. The Remove
>> events are generated by md_free() -> del_gendisk() -> blk_unregister_queue().
>> After calling close(mdfd) the Remove events is generated. We should give udev
>> a little time to handle the event.
>>
>> I tried usleep(100*1000), but the problem still can be reproduced sometime.
>> So I choose to sleep(1). Because after close(mdfd) it can be generated CHANGE
>> events too. So it's ok to choose to sleep(1) to wait udev to handle CHANGE
>> events.
> 
> I really don't like this approach.  The fact that 1 second works for you
> is no guarantee that it will work for everybody.
> We have a few sleeps in the code already, but I don't like them either.
> Let's try not to add more.
> 
> If there is some event that you want to wait for, wait for that event.
> e.g. if you want to wait for /dev/md0 to disappear then write a loop:
> 
> while /dev/md0 exists
>     usleep(1000)
> 
> But I'm still not convinced that this is really needed.  If it is, then
> maybe some sort of kernel fix would be better.

I agree completely - any case where we need a sleep() is a warning that 
there is probably a bigger problem that needs to be addressed.

Cheers,
Jes

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [MDADM PATCH 2/2] Give enough time to udev to handle events
  2017-10-02 17:28     ` Jes Sorensen
@ 2017-10-02 23:00       ` NeilBrown
  0 siblings, 0 replies; 8+ messages in thread
From: NeilBrown @ 2017-10-02 23:00 UTC (permalink / raw)
  To: Jes Sorensen, Xiao Ni, linux-raid

[-- Attachment #1: Type: text/plain, Size: 4635 bytes --]

On Mon, Oct 02 2017, Jes Sorensen wrote:

> On 09/19/2017 02:49 AM, NeilBrown wrote:
>> On Tue, Sep 19 2017, Xiao Ni wrote:
>> 
>>> After mdadm -S /dev/md0, the device node /dev/md0 still exists. The Remove
>>> events are generated by md_free() -> del_gendisk() -> blk_unregister_queue().
>>> After calling close(mdfd) the Remove events is generated. We should give udev
>>> a little time to handle the event.
>>>
>>> I tried usleep(100*1000), but the problem still can be reproduced sometime.
>>> So I choose to sleep(1). Because after close(mdfd) it can be generated CHANGE
>>> events too. So it's ok to choose to sleep(1) to wait udev to handle CHANGE
>>> events.
>> 
>> I really don't like this approach.  The fact that 1 second works for you
>> is no guarantee that it will work for everybody.
>> We have a few sleeps in the code already, but I don't like them either.
>> Let's try not to add more.
>> 
>> If there is some event that you want to wait for, wait for that event.
>> e.g. if you want to wait for /dev/md0 to disappear then write a loop:
>> 
>> while /dev/md0 exists
>>     usleep(1000)
>> 
>> But I'm still not convinced that this is really needed.  If it is, then
>> maybe some sort of kernel fix would be better.
>
> I agree completely - any case where we need a sleep() is a warning that 
> there is probably a bigger problem that needs to be addressed.

We current call sleep:
 In Assemble.c ... to wait for everything to have closed the device
    so that the next open goes through a path in __blkdev_get()
    which calls bd_set_size().
    We would need some change in __blkdev_get() to remove the need for
    this.

 In Create.c  to give udev a chance to ignore the Change event
    caused by closing the device, before the remove the file which
    is causing udev to ignore the events.
    Possible we could system("udevadm settle") insteadm.

 In Grow.c:start_reshape, waiting for MD_RECOVERY_RUNNING to be
     clear (I think).  Maybe kernel could be more clever about this.

 In Grow.c:Grow_continue_command ... I think this is waiting for
    the newly started reshape to be reflected in the metadata... not sure.

 In Manage.c:Manage_stop() - avoiding races with transient use of the
    array which might cause the array to refuse to go inactive.
    We might be able to get the kernel to check is uses are transient,
    and block new ones... not sure.

 In Manage.c:Manage_stop() again, waiting for the "critical section"
    of a reshape to pass.  Could maybe teach kernel to let us poll
    "sync_max".

 In Manage.c:Manage_stop() third time, - waiting for 'sync_action' to
    stablize.  Maybe we can teach the kernel to provide more stable
    values.

 ... fourth time , same as first time.

 In Manage.c:Manage_remove() : wait for 'fail' to be completely
    processed so that 'remove' can happen.  Maybe we should poll() some
    sysfs thing.

 In managemon.c:replace_array() - wait for monitor thread (in mdmon)
    to make progress.  Probably no value in changing this.

 In managemon.c:manage_member() - again, waiting for monitor thread

 In managemon.c: handle_message - and again

 In managemon.c:handle_message again - more waiting.
     Maybe monitor could indicate progress to managemon somehow.

 In mdmon.c:clone_monitor: wait for monitor thread to start up. As above.

 In super-intel.c:get_super_block -  avoid race with mdmon which might be
    writing metadata while mdadm tries to read it.  I wonder if advisory
    locking could be used here.  Do flock locks work on block devices?

 In super-intel.c:load_super_imsm  - as above

 In super-intel.c:wait_for_reshape_imsm - wait for reshape to stablize.
     Maybe similar to third time of Manage_stop.

 In util.c:open_dev_excl() - avoid race with transient O_EXCL .. I
     guess. This isn't well documented.

 In util.c:wait_for() - wait for a device to appear in /dev.  Maybe
     "udevadm settle" is better...

 In util.c:hot_remove_disk() - similar to Manage_remove() above

 In util.c:sys_hot_remove_disk() - same as above.

I'm fairly sure we can improve the kernel so that several of these
can be removed, or replaced with select/poll.  Others probably have to
stay.  using "udevadm settle" might be a good idea, but we would want
to make sure we can reproduce the problem, then be sure it is fixed.

For the current issue, fixing the kernel is probably a good idea, but
creating a "wait_while()" - similar to wait_for(), but waits while a
device exists in /dev - is probably simplest.

NeilBrown
    
 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 832 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2017-10-02 23:00 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-09-19  6:06 [MDADM PATCH 0/2] /dev/md0 can't be released immediatly Xiao Ni
2017-09-19  6:06 ` [MDADM PATCH 1/2] Close mdfd before returning main function Xiao Ni
2017-10-02 17:27   ` Jes Sorensen
2017-09-19  6:06 ` [MDADM PATCH 2/2] Give enough time to udev to handle events Xiao Ni
2017-09-19  6:49   ` NeilBrown
2017-09-30  8:14     ` Xiao Ni
2017-10-02 17:28     ` Jes Sorensen
2017-10-02 23:00       ` NeilBrown

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).