From mboxrd@z Thu Jan  1 00:00:00 1970
From: NeilBrown <neilb@suse.com>
Subject: Re: [MDADM PATCH 2/2] Give enough time to udev to handle events
Date: Tue, 03 Oct 2017 10:00:12 +1100
Message-ID: <8760bx40lf.fsf@notabene.neil.brown.name>
References: <1505801207-10096-1-git-send-email-xni@redhat.com> <1505801207-10096-3-git-send-email-xni@redhat.com> <87r2v3qj2o.fsf@notabene.neil.brown.name> <a34125ee-b8f5-4179-270f-5c66dadd9c4a@gmail.com>
Mime-Version: 1.0
Content-Type: multipart/signed; boundary="=-=-=";
        micalg=pgp-sha256; protocol="application/pgp-signature"
Return-path: <linux-raid-owner@vger.kernel.org>
In-Reply-To: <a34125ee-b8f5-4179-270f-5c66dadd9c4a@gmail.com>
Sender: linux-raid-owner@vger.kernel.org
To: Jes Sorensen <jes.sorensen@gmail.com>, Xiao Ni <xni@redhat.com>, linux-raid@vger.kernel.org
List-Id: linux-raid.ids

--=-=-=
Content-Type: text/plain
Content-Transfer-Encoding: quoted-printable

On Mon, Oct 02 2017, Jes Sorensen wrote:

> On 09/19/2017 02:49 AM, NeilBrown wrote:
>> On Tue, Sep 19 2017, Xiao Ni wrote:
>>=20
>>> After mdadm -S /dev/md0, the device node /dev/md0 still exists. The Rem=
ove
>>> events are generated by md_free() -> del_gendisk() -> blk_unregister_qu=
eue().
>>> After calling close(mdfd) the Remove events is generated. We should giv=
e udev
>>> a little time to handle the event.
>>>
>>> I tried usleep(100*1000), but the problem still can be reproduced somet=
ime.
>>> So I choose to sleep(1). Because after close(mdfd) it can be generated =
CHANGE
>>> events too. So it's ok to choose to sleep(1) to wait udev to handle CHA=
NGE
>>> events.
>>=20
>> I really don't like this approach.  The fact that 1 second works for you
>> is no guarantee that it will work for everybody.
>> We have a few sleeps in the code already, but I don't like them either.
>> Let's try not to add more.
>>=20
>> If there is some event that you want to wait for, wait for that event.
>> e.g. if you want to wait for /dev/md0 to disappear then write a loop:
>>=20
>> while /dev/md0 exists
>>     usleep(1000)
>>=20
>> But I'm still not convinced that this is really needed.  If it is, then
>> maybe some sort of kernel fix would be better.
>
> I agree completely - any case where we need a sleep() is a warning that=20
> there is probably a bigger problem that needs to be addressed.

We current call sleep:
 In Assemble.c ... to wait for everything to have closed the device
    so that the next open goes through a path in __blkdev_get()
    which calls bd_set_size().
    We would need some change in __blkdev_get() to remove the need for
    this.

 In Create.c  to give udev a chance to ignore the Change event
    caused by closing the device, before the remove the file which
    is causing udev to ignore the events.
    Possible we could system("udevadm settle") insteadm.

 In Grow.c:start_reshape, waiting for MD_RECOVERY_RUNNING to be
     clear (I think).  Maybe kernel could be more clever about this.

 In Grow.c:Grow_continue_command ... I think this is waiting for
    the newly started reshape to be reflected in the metadata... not sure.

 In Manage.c:Manage_stop() - avoiding races with transient use of the
    array which might cause the array to refuse to go inactive.
    We might be able to get the kernel to check is uses are transient,
    and block new ones... not sure.

 In Manage.c:Manage_stop() again, waiting for the "critical section"
    of a reshape to pass.  Could maybe teach kernel to let us poll
    "sync_max".

 In Manage.c:Manage_stop() third time, - waiting for 'sync_action' to
    stablize.  Maybe we can teach the kernel to provide more stable
    values.

 ... fourth time , same as first time.

 In Manage.c:Manage_remove() : wait for 'fail' to be completely
    processed so that 'remove' can happen.  Maybe we should poll() some
    sysfs thing.

 In managemon.c:replace_array() - wait for monitor thread (in mdmon)
    to make progress.  Probably no value in changing this.

 In managemon.c:manage_member() - again, waiting for monitor thread

 In managemon.c: handle_message - and again

 In managemon.c:handle_message again - more waiting.
     Maybe monitor could indicate progress to managemon somehow.

 In mdmon.c:clone_monitor: wait for monitor thread to start up. As above.

 In super-intel.c:get_super_block -  avoid race with mdmon which might be
    writing metadata while mdadm tries to read it.  I wonder if advisory
    locking could be used here.  Do flock locks work on block devices?

 In super-intel.c:load_super_imsm  - as above

 In super-intel.c:wait_for_reshape_imsm - wait for reshape to stablize.
     Maybe similar to third time of Manage_stop.

 In util.c:open_dev_excl() - avoid race with transient O_EXCL .. I
     guess. This isn't well documented.

 In util.c:wait_for() - wait for a device to appear in /dev.  Maybe
     "udevadm settle" is better...

 In util.c:hot_remove_disk() - similar to Manage_remove() above

 In util.c:sys_hot_remove_disk() - same as above.

I'm fairly sure we can improve the kernel so that several of these
can be removed, or replaced with select/poll.  Others probably have to
stay.  using "udevadm settle" might be a good idea, but we would want
to make sure we can reproduce the problem, then be sure it is fixed.

For the current issue, fixing the kernel is probably a good idea, but
creating a "wait_while()" - similar to wait_for(), but waits while a
device exists in /dev - is probably simplest.

NeilBrown
=20=20=20=20
=20

--=-=-=
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEG8Yp69OQ2HB7X0l6Oeye3VZigbkFAlnSxP0ACgkQOeye3VZi
gbk/RRAAjDkbFdeSXV1mRUHKRNj6xdWzsSmSlkyJ0QZB+enfGz3NMzkbD3nDWXzz
PO0XLjUghVZcWPn34t0WOV4LNtVF6EPhzk4Rx/nFkqduzd6zTTIXawNOwe61QNBg
Yi9uGu39MxyzMqIq28UiJVg+udrR7s2VHKrrpWaoRT1uvr+y/sMFOt2VensIf+r7
J3ca0EKDm4KHIm3A56dVYr4g4A6X8t7rz7WEF99s2Z4RpbFCgW88QPCDrq6Fly6A
fRrsNWHdCFiF1BLWQT+WS8A/f8sIdDy3c+4BUU4RdKiOqIJf9vNMuJhE3ac+gRgL
mYGz0xSlBokqE+RxSha0U5nEE0EG01QWrLG7DR889SAbveg95jHLcEe500lm2NUr
CKWZEboJe+LlMx1sdqn/tw8rmf/bf5eLdGjsRyd4YEUqmRPJxeQg+4izh3XPsQ9W
fbw81NUM70LwjRyxqEZ3UTIguO0UYFJzBFrCSiQNPbVy1LcsVwa0p7ecBmfik2So
yfKYzjioN0agWp2yVg/FcKBJLbx0doYKWbQcN4PRhQrpIMaZJliveLur2tR1Ncnl
KXaWZJJS4C+wKDYaJRcHCOypF7iKfIWSgEon+w8mhA5dlTyLojhEG25F0xlJBmOq
aoD3l02U9aC9hVUV98qoug8sWS5IMjXHro5IrVQhu+eo2q6bc08=
=8T5w
-----END PGP SIGNATURE-----
--=-=-=--