From: Mariusz Tkaczyk <mariusz.tkaczyk@linux.intel.com>
To: Martin Wilck <mwilck@suse.com>, Song Liu <song@kernel.org>
Cc: Li Xiao Keng <lixiaokeng@huawei.com>,
Jes Sorensen <jes@trained-monkey.org>,
Paul Menzel <pmenzel@molgen.mpg.de>, Coly Li <colyli@suse.de>,
linux-raid@vger.kernel.org, linfeilong <linfeilong@huawei.com>,
louhongxiang@huawei.com,
"liuzhiqiang (I)" <liuzhiqiang26@huawei.com>,
miaoguanqin <miaoguanqin@huawei.com>
Subject: Re: [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export"
Date: Wed, 15 Mar 2023 11:10:27 +0100 [thread overview]
Message-ID: <20230315111027.0000372d@linux.intel.com> (raw)
In-Reply-To: <04a4cc6aac10cd24d5bc0b3485d47f6ccb752eab.camel@suse.com>
On Tue, 14 Mar 2023 17:11:06 +0100
Martin Wilck <mwilck@suse.com> wrote:
> On Tue, 2023-03-14 at 16:59 +0100, Mariusz Tkaczyk wrote:
> > On Tue, 14 Mar 2023 16:04:23 +0100
> > Martin Wilck <mwilck@suse.com> wrote:
> >
> > > On Tue, 2023-03-14 at 22:58 +0800, Li Xiao Keng wrote:
> > > > Hi,
> > > > Here we meet a question. When we add a new disk to a raid, it
> > > > may
> > > > return
> > > > -EBUSY.
> > > > The main process of --add(for example md0, sdf):
> > > > 1.dev_open(sdf)
> > > > 2.add_to_super
> > > > 3.write_init_super
> > > > 4.fsync(fd)
> > > > 5.close(fd)
> > > > 6.ioctl(ADD_NEW_DISK).
> > > > However, there will be some udev(change of sdf) event after
> > > > step5.
> > > > Then
> > > > "/usr/sbin/mdadm --incremental --export $devnode --offroot
> > > > $env{DEVLINKS}"
> > > > will be run, and the sdf will be added to md0. After that, step6
> > > > will
> > > > return
> > > > -EBUSY.
> > > > It is a problem to user. First time adding disk does not
> > > > return
> > > > success
> > > > but disk is actually added. And I have no good idea to deal with
> > > > it.
> > > > Please
> > > > give some great advice.
> > >
> > > I haven't looked at the code in detail, but off the top of my head,
> > > it
> > > should help to execute step 5 after step 6. The close() in step 5
> > > triggers the uevent via inotify; doing it after the ioctl should
> > > avoid
> > > the above problem.
> > Hi,
> > That will result in EBUSY in everytime. mdadm will always handle
> > descriptor and kernel will refuse to add the drive.
>
> Why would it cause EBUSY? Please elaborate. My suggestion would avoid
> the race described by Li Xiao Keng. I only suggested to postpone the
> close(), nothing else. The fsync() would still be done before the
> ioctl, so the metadata should be safely on disk when the ioctl is run.
Because device will be claimed in userspace by mdadm. MD may check if device
is not claimed. I checked bind_rdev_to_array() and I don't find an obvious
answer, so I could be wrong here.
Maybe someone more kernel experienced can speak here? Song, could you look?
I think that the descriptor opened by incremental block kernel from adding the
device to the array but also the same incremental is able to add it later
because metadata is on device. There is no retry in Manage_add() flow so it
must be added by Incremental so the question is if it is already in
array or disk is just claimed and that is the problem.
Eventually, you can test your proposal that should gives us an answer.
>
> This is a recurring pattern. Tools that manipulate block devices must
> be aware that close-after-write triggers an uevent, and should make
> sure that they don't close() such files prematurely.
Agree. Mdadm has this problem, descriptors are opened and closed may times.
>
> > >
> > > Another obvious workaround in mdadm would be to check the state of
> > > the
> > > array in the EBUSY case and find out that the disk had already been
> > > added.
> > >
> > > But again, this was just a high-level guess.
> > >
> > > Martin
> > >
> >
> > Hmm... I'm not a native expert but why we cannot write metadata after
> > adding
> > drive to array? Why kernel can't handle that?
> >
Ok, there is an assumption in kernel that device MUST have metadata.
> > Ideally, we should lock device and block udev- I know that there is
> > flock
> > based API to do that but I'm not sure if flock() won't cause the same
> > problem.
>
> That doesn't work reliably. At least not in general. The mechanmism is
> disabled for for dm devices (e.g. multipath), for example. See
> https://github.com/systemd/systemd/blob/a5c0ad9a9a2964079a19a1db42f79570a3582bee/src/udev/udevd.c#L483
>
Yeah, I know but udev processes the disk, not MD device so the locking
should work. But if we cannot trust it, we shouldn't follow this idea.
>
> > There is also something like "udev-md-raid-creating.rules". Maybe we
> > can reuse
> > it?
> >
>
> Unless I am mistaken, these rules are exactly those that cause the
> issue we are discussing. Since these rules are also part of the mdadm
> package, it might be possible to set some flag under /run that would
> indicate to the rules that auto-assembly should be skipped. But that
> might be racy, too.
Yeah, bad idea. Agree.
New day, new ideas:
- why we cannot let udev to add the device? just write metadata and finish,
udev should handle that because metadata is there.
- incremental uses map_lock() to prevent concurrent updates, I seems to b
missed for --add. That could be a key to prevent the behavior.Incremental is
checking if it can lock the map file. If not, it finishes gracefully. To lock
array we need to read metadata first, so it goes to the first question- is it
a problem that incremental has opened descriptor?
Thanks,
Mariusz
next prev parent reply other threads:[~2023-03-15 10:11 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-14 14:58 [QUESTION] How to fix the race of "mdadm --add" and "mdadm mdadm --incremental --export" Li Xiao Keng
2023-03-14 15:04 ` Martin Wilck
2023-03-14 15:59 ` Mariusz Tkaczyk
2023-03-14 16:11 ` Martin Wilck
2023-03-15 10:10 ` Mariusz Tkaczyk [this message]
2023-03-15 13:10 ` Li Xiao Keng
2023-03-15 14:14 ` Martin Wilck
2023-03-15 14:57 ` Li Xiao Keng
2023-03-15 15:01 ` Martin Wilck
2023-03-16 10:44 ` Mariusz Tkaczyk
2023-03-20 15:36 ` Martin Wilck
2023-03-20 15:51 ` Mariusz Tkaczyk
2023-03-14 16:40 ` Coly Li
2023-03-15 2:25 ` Li Xiao Keng
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230315111027.0000372d@linux.intel.com \
--to=mariusz.tkaczyk@linux.intel.com \
--cc=colyli@suse.de \
--cc=jes@trained-monkey.org \
--cc=linfeilong@huawei.com \
--cc=linux-raid@vger.kernel.org \
--cc=liuzhiqiang26@huawei.com \
--cc=lixiaokeng@huawei.com \
--cc=louhongxiang@huawei.com \
--cc=miaoguanqin@huawei.com \
--cc=mwilck@suse.com \
--cc=pmenzel@molgen.mpg.de \
--cc=song@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).