[PATCH] mkfs: acquire flock before modifying the device superblock

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] mkfs: acquire flock before modifying the device superblock
@ 2022-10-14  8:41 Wu Guanghao
  2022-10-14 15:38 ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Wu Guanghao @ 2022-10-14  8:41 UTC (permalink / raw)
  To: cem; +Cc: linux-xfs, liuzhiqiang (I)

We noticed that systemd has an issue about symlink unreliable caused by
formatting filesystem and systemd operating on same device.
Issue Link: https://github.com/systemd/systemd/issues/23746

According to systemd doc, a BSD flock needs to be acquired before
formatting the device.
Related Link: https://systemd.io/BLOCK_DEVICE_LOCKING/

So we acquire flock after opening the device but before
writing superblock.

Signed-off-by: wuguanghao <wuguanghao3@huawei.com>
---
 mkfs/xfs_mkfs.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
index 9dd0e79c..b83cb043 100644
--- a/mkfs/xfs_mkfs.c
+++ b/mkfs/xfs_mkfs.c
@@ -13,6 +13,7 @@
 #include "libfrog/crc32cselftest.h"
 #include "proto.h"
 #include <ini.h>
+#include <sys/file.h>

 #define TERABYTES(count, blog) ((uint64_t)(count) << (40 - (blog)))
 #define GIGABYTES(count, blog) ((uint64_t)(count) << (30 - (blog)))
@@ -2758,6 +2759,30 @@ _("log stripe unit (%d bytes) is too large (maximum is 256KiB)\n"

 }

+static void
+lock_device(dev_t dev, int flag, char *name)
+{
+       int fd = libxfs_device_to_fd(dev);
+       int readonly = flag & LIBXFS_ISREADONLY;
+
+       if (!readonly && fd > 0)
+               if (flock(fd, LOCK_EX) != 0) {
+                       fprintf(stderr, "%s: failed to get lock.\n", name);
+                       exit(1);
+               }
+}
+
+static void
+lock_devices(struct libxfs_xinit *xi)
+{
+       if (!xi->disfile)
+               lock_device(xi->ddev, xi->dcreat, xi->dname);
+       if (xi->logdev && !xi->lisfile)
+               lock_device(xi->logdev, xi->lcreat, xi->logname);
+       if (xi->rtdev && !xi->risfile)
+               lock_device(xi->rtdev, xi->rcreat, xi->rtname);
+}
+
 static void
 open_devices(
        struct mkfs_params      *cfg,
@@ -4208,6 +4233,7 @@ main(
         * Open and validate the device configurations
         */
        open_devices(&cfg, &xi);
+       lock_devices(&xi);
        validate_overwrite(dfile, force_overwrite);
        validate_datadev(&cfg, &cli);
        validate_logdev(&cfg, &cli, &logfile);
--
2.27.0

^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] mkfs: acquire flock before modifying the device superblock
  2022-10-14  8:41 [PATCH] mkfs: acquire flock before modifying the device superblock Wu Guanghao
@ 2022-10-14 15:38 ` Darrick J. Wong
  2022-10-18  2:45   ` Wu Guanghao
  0 siblings, 1 reply; 5+ messages in thread
From: Darrick J. Wong @ 2022-10-14 15:38 UTC (permalink / raw)
  To: Wu Guanghao; +Cc: cem, linux-xfs, liuzhiqiang (I)

On Fri, Oct 14, 2022 at 04:41:35PM +0800, Wu Guanghao wrote:
> We noticed that systemd has an issue about symlink unreliable caused by
> formatting filesystem and systemd operating on same device.
> Issue Link: https://github.com/systemd/systemd/issues/23746
> 
> According to systemd doc, a BSD flock needs to be acquired before
> formatting the device.
> Related Link: https://systemd.io/BLOCK_DEVICE_LOCKING/

TLDR: udevd wants fs utilities to use advisory file locking to
coordinate (re)writes to block devices to avoid collisions between mkfs
and all the udev magic.

Critically, udev calls flock(LOCK_SH | LOCK_NB) to trylock the device in
shared mode to avoid blocking on fs utilities; if the trylock fails,
they'll move on and try again later.  The old O_EXCL-on-blockdevs trick
will not work for that usecase (I guess) because it's not a shared
reader lock.  It's also not the file locking API.

> So we acquire flock after opening the device but before
> writing superblock.

xfs_db and xfs_repair can write to the filesystem too; shouldn't this
locking apply to them as well?

> Signed-off-by: wuguanghao <wuguanghao3@huawei.com>
> ---
>  mkfs/xfs_mkfs.c | 26 ++++++++++++++++++++++++++
>  1 file changed, 26 insertions(+)
> 
> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> index 9dd0e79c..b83cb043 100644
> --- a/mkfs/xfs_mkfs.c
> +++ b/mkfs/xfs_mkfs.c
> @@ -13,6 +13,7 @@
>  #include "libfrog/crc32cselftest.h"
>  #include "proto.h"
>  #include <ini.h>
> +#include <sys/file.h>
> 
>  #define TERABYTES(count, blog) ((uint64_t)(count) << (40 - (blog)))
>  #define GIGABYTES(count, blog) ((uint64_t)(count) << (30 - (blog)))
> @@ -2758,6 +2759,30 @@ _("log stripe unit (%d bytes) is too large (maximum is 256KiB)\n"
> 
>  }
> 
> +static void
> +lock_device(dev_t dev, int flag, char *name)
> +{
> +       int fd = libxfs_device_to_fd(dev);
> +       int readonly = flag & LIBXFS_ISREADONLY;
> +
> +       if (!readonly && fd > 0)
> +               if (flock(fd, LOCK_EX) != 0) {
> +                       fprintf(stderr, "%s: failed to get lock.\n", name);
> +                       exit(1);
> +               }

So yes, this belongs in libxfs_device_open.

If we're opening the bdevs in readonly mode, shouldn't we take LOCK_SH
to prevent mkfs from colliding with (say) xfs_metadump?

Bonus question: Shouldn't the /kernel/ also effectively be taking
LOCK_SH when it opens the bdevs to mount the filesystem?

--D

> +}
> +
> +static void
> +lock_devices(struct libxfs_xinit *xi)
> +{
> +       if (!xi->disfile)
> +               lock_device(xi->ddev, xi->dcreat, xi->dname);
> +       if (xi->logdev && !xi->lisfile)
> +               lock_device(xi->logdev, xi->lcreat, xi->logname);
> +       if (xi->rtdev && !xi->risfile)
> +               lock_device(xi->rtdev, xi->rcreat, xi->rtname);
> +}
> +
>  static void
>  open_devices(
>         struct mkfs_params      *cfg,
> @@ -4208,6 +4233,7 @@ main(
>          * Open and validate the device configurations
>          */
>         open_devices(&cfg, &xi);
> +       lock_devices(&xi);
>         validate_overwrite(dfile, force_overwrite);
>         validate_datadev(&cfg, &cli);
>         validate_logdev(&cfg, &cli, &logfile);
> --
> 2.27.0

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mkfs: acquire flock before modifying the device superblock
  2022-10-14 15:38 ` Darrick J. Wong
@ 2022-10-18  2:45   ` Wu Guanghao
  2022-10-18 21:09     ` Darrick J. Wong
  0 siblings, 1 reply; 5+ messages in thread
From: Wu Guanghao @ 2022-10-18  2:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: cem, linux-xfs, liuzhiqiang (I)



在 2022/10/14 23:38, Darrick J. Wong 写道:
> On Fri, Oct 14, 2022 at 04:41:35PM +0800, Wu Guanghao wrote:
>> We noticed that systemd has an issue about symlink unreliable caused by
>> formatting filesystem and systemd operating on same device.
>> Issue Link: https://github.com/systemd/systemd/issues/23746
>>
>> According to systemd doc, a BSD flock needs to be acquired before
>> formatting the device.
>> Related Link: https://systemd.io/BLOCK_DEVICE_LOCKING/
> 
> TLDR: udevd wants fs utilities to use advisory file locking to
> coordinate (re)writes to block devices to avoid collisions between mkfs
> and all the udev magic.
> 
> Critically, udev calls flock(LOCK_SH | LOCK_NB) to trylock the device in
> shared mode to avoid blocking on fs utilities; if the trylock fails,
> they'll move on and try again later.  The old O_EXCL-on-blockdevs trick
> will not work for that usecase (I guess) because it's not a shared
> reader lock.  It's also not the file locking API.
> 
>> So we acquire flock after opening the device but before
>> writing superblock.
> 
> xfs_db and xfs_repair can write to the filesystem too; shouldn't this
> locking apply to them as well?
> 
xfs_db is an interactive operation.If a lock is added, the lock may be held
for too long. xfs_repair only repairs the data inside the file system ,so it's
unlikely to conflict with systemd. So these two commands aren't locked.

>> Signed-off-by: wuguanghao <wuguanghao3@huawei.com>
>> ---
>>  mkfs/xfs_mkfs.c | 26 ++++++++++++++++++++++++++
>>  1 file changed, 26 insertions(+)
>>
>> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
>> index 9dd0e79c..b83cb043 100644
>> --- a/mkfs/xfs_mkfs.c
>> +++ b/mkfs/xfs_mkfs.c
>> @@ -13,6 +13,7 @@
>>  #include "libfrog/crc32cselftest.h"
>>  #include "proto.h"
>>  #include <ini.h>
>> +#include <sys/file.h>
>>
>>  #define TERABYTES(count, blog) ((uint64_t)(count) << (40 - (blog)))
>>  #define GIGABYTES(count, blog) ((uint64_t)(count) << (30 - (blog)))
>> @@ -2758,6 +2759,30 @@ _("log stripe unit (%d bytes) is too large (maximum is 256KiB)\n"
>>
>>  }
>>
>> +static void
>> +lock_device(dev_t dev, int flag, char *name)
>> +{
>> +       int fd = libxfs_device_to_fd(dev);
>> +       int readonly = flag & LIBXFS_ISREADONLY;
>> +
>> +       if (!readonly && fd > 0)
>> +               if (flock(fd, LOCK_EX) != 0) {
>> +                       fprintf(stderr, "%s: failed to get lock.\n", name);
>> +                       exit(1);
>> +               }
> 
> So yes, this belongs in libxfs_device_open.
> 
> If we're opening the bdevs in readonly mode, shouldn't we take LOCK_SH
> to prevent mkfs from colliding with (say) xfs_metadump?
> 
> Bonus question: Shouldn't the /kernel/ also effectively be taking
> LOCK_SH when it opens the bdevs to mount the filesystem?

Systemd normally uses "watch" to monitor disks, only in special cases
will the monitoring be released. During the time from the release of
monitoring to the re-opening of monitoring, the flock is used to
ensure that the disk won't be written to by others.
So if the disk isn't modified or the modified content won't trigger
the udev rule, then it should be OK not to lock.

There is still a problem with this solution, systemd only lock the main
block device, not the partition device. So if we're operating on a
partitioned device, the lock won't work. Currently we are still
communicating with systemd.

> --D
> 
>> +}
>> +
>> +static void
>> +lock_devices(struct libxfs_xinit *xi)
>> +{
>> +       if (!xi->disfile)
>> +               lock_device(xi->ddev, xi->dcreat, xi->dname);
>> +       if (xi->logdev && !xi->lisfile)
>> +               lock_device(xi->logdev, xi->lcreat, xi->logname);
>> +       if (xi->rtdev && !xi->risfile)
>> +               lock_device(xi->rtdev, xi->rcreat, xi->rtname);
>> +}
>> +
>>  static void
>>  open_devices(
>>         struct mkfs_params      *cfg,
>> @@ -4208,6 +4233,7 @@ main(
>>          * Open and validate the device configurations
>>          */
>>         open_devices(&cfg, &xi);
>> +       lock_devices(&xi);
>>         validate_overwrite(dfile, force_overwrite);
>>         validate_datadev(&cfg, &cli);
>>         validate_logdev(&cfg, &cli, &logfile);
>> --
>> 2.27.0
> .
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mkfs: acquire flock before modifying the device superblock
  2022-10-18  2:45   ` Wu Guanghao
@ 2022-10-18 21:09     ` Darrick J. Wong
  2022-10-19  1:00       ` Dave Chinner
  0 siblings, 1 reply; 5+ messages in thread
From: Darrick J. Wong @ 2022-10-18 21:09 UTC (permalink / raw)
  To: Wu Guanghao; +Cc: cem, linux-xfs, liuzhiqiang (I)

On Tue, Oct 18, 2022 at 10:45:54AM +0800, Wu Guanghao wrote:
> 
> 
> 在 2022/10/14 23:38, Darrick J. Wong 写道:
> > On Fri, Oct 14, 2022 at 04:41:35PM +0800, Wu Guanghao wrote:
> >> We noticed that systemd has an issue about symlink unreliable caused by
> >> formatting filesystem and systemd operating on same device.
> >> Issue Link: https://github.com/systemd/systemd/issues/23746
> >>
> >> According to systemd doc, a BSD flock needs to be acquired before
> >> formatting the device.
> >> Related Link: https://systemd.io/BLOCK_DEVICE_LOCKING/
> > 
> > TLDR: udevd wants fs utilities to use advisory file locking to
> > coordinate (re)writes to block devices to avoid collisions between mkfs
> > and all the udev magic.
> > 
> > Critically, udev calls flock(LOCK_SH | LOCK_NB) to trylock the device in
> > shared mode to avoid blocking on fs utilities; if the trylock fails,
> > they'll move on and try again later.  The old O_EXCL-on-blockdevs trick
> > will not work for that usecase (I guess) because it's not a shared
> > reader lock.  It's also not the file locking API.
> > 
> >> So we acquire flock after opening the device but before
> >> writing superblock.
> > 
> > xfs_db and xfs_repair can write to the filesystem too; shouldn't this
> > locking apply to them as well?
> > 
> xfs_db is an interactive operation.If a lock is added, the lock may be held
> for too long.

xfs_db can also write to the filesystem; see -x mode.

But first -- let's zoom out here.  You're adding flock() calls to
xfsprogs to coordinate two userspace programs udev and mkfs.xfs.  Why
wouldn't you add the same flock()ing to the rest of the xfs utilities so
that they also don't step on each other?

xfs_mdrestore can also write an XFS image to a block device, so what
makes it special?

> xfs_repair only repairs the data inside the file system ,so it's
> unlikely to conflict with systemd. So these two commands aren't locked.

"Unlikely" isn't good enough -- xfsprogs don't control the udev rules,
which means that a program invoked by a udev rule could read just about
anywhere in the block device.  Hence we need to prevent udev from
getting confused about /any/ block that xfs_repair might write.

(You /do/ know that xfs_db and xfs_repair can rewrite the primary
superblock, right?)

> >> Signed-off-by: wuguanghao <wuguanghao3@huawei.com>
> >> ---
> >>  mkfs/xfs_mkfs.c | 26 ++++++++++++++++++++++++++
> >>  1 file changed, 26 insertions(+)
> >>
> >> diff --git a/mkfs/xfs_mkfs.c b/mkfs/xfs_mkfs.c
> >> index 9dd0e79c..b83cb043 100644
> >> --- a/mkfs/xfs_mkfs.c
> >> +++ b/mkfs/xfs_mkfs.c
> >> @@ -13,6 +13,7 @@
> >>  #include "libfrog/crc32cselftest.h"
> >>  #include "proto.h"
> >>  #include <ini.h>
> >> +#include <sys/file.h>
> >>
> >>  #define TERABYTES(count, blog) ((uint64_t)(count) << (40 - (blog)))
> >>  #define GIGABYTES(count, blog) ((uint64_t)(count) << (30 - (blog)))
> >> @@ -2758,6 +2759,30 @@ _("log stripe unit (%d bytes) is too large (maximum is 256KiB)\n"
> >>
> >>  }
> >>
> >> +static void
> >> +lock_device(dev_t dev, int flag, char *name)
> >> +{
> >> +       int fd = libxfs_device_to_fd(dev);
> >> +       int readonly = flag & LIBXFS_ISREADONLY;
> >> +
> >> +       if (!readonly && fd > 0)
> >> +               if (flock(fd, LOCK_EX) != 0) {
> >> +                       fprintf(stderr, "%s: failed to get lock.\n", name);
> >> +                       exit(1);
> >> +               }
> > 
> > So yes, this belongs in libxfs_device_open.
> > 
> > If we're opening the bdevs in readonly mode, shouldn't we take LOCK_SH
> > to prevent mkfs from colliding with (say) xfs_metadump?
> > 
> > Bonus question: Shouldn't the /kernel/ also effectively be taking
> > LOCK_SH when it opens the bdevs to mount the filesystem?
> 
> Systemd normally uses "watch" to monitor disks, only in special cases
> will the monitoring be released. During the time from the release of
> monitoring to the re-opening of monitoring, the flock is used to
> ensure that the disk won't be written to by others.
> So if the disk isn't modified or the modified content won't trigger
> the udev rule, then it should be OK not to lock.

xfs utilities can't know what kinds of writes will or will not trigger
udev rules, since the sysadmin can install arbitrary udev rules.

> There is still a problem with this solution, systemd only lock the main
> block device, not the partition device. So if we're operating on a
> partitioned device, the lock won't work. Currently we are still
> communicating with systemd.

Er... well, I guess it's good to know that xfs isn't /completely/ behind
the curve here.

--D

> > --D
> > 
> >> +}
> >> +
> >> +static void
> >> +lock_devices(struct libxfs_xinit *xi)
> >> +{
> >> +       if (!xi->disfile)
> >> +               lock_device(xi->ddev, xi->dcreat, xi->dname);
> >> +       if (xi->logdev && !xi->lisfile)
> >> +               lock_device(xi->logdev, xi->lcreat, xi->logname);
> >> +       if (xi->rtdev && !xi->risfile)
> >> +               lock_device(xi->rtdev, xi->rcreat, xi->rtname);
> >> +}
> >> +
> >>  static void
> >>  open_devices(
> >>         struct mkfs_params      *cfg,
> >> @@ -4208,6 +4233,7 @@ main(
> >>          * Open and validate the device configurations
> >>          */
> >>         open_devices(&cfg, &xi);
> >> +       lock_devices(&xi);
> >>         validate_overwrite(dfile, force_overwrite);
> >>         validate_datadev(&cfg, &cli);
> >>         validate_logdev(&cfg, &cli, &logfile);
> >> --
> >> 2.27.0
> > .
> > 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] mkfs: acquire flock before modifying the device superblock
  2022-10-18 21:09     ` Darrick J. Wong
@ 2022-10-19  1:00       ` Dave Chinner
  0 siblings, 0 replies; 5+ messages in thread
From: Dave Chinner @ 2022-10-19  1:00 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Wu Guanghao, cem, linux-xfs, liuzhiqiang (I)

On Tue, Oct 18, 2022 at 02:09:06PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 18, 2022 at 10:45:54AM +0800, Wu Guanghao wrote:
> > 
> > 
> > 在 2022/10/14 23:38, Darrick J. Wong 写道:
> > > On Fri, Oct 14, 2022 at 04:41:35PM +0800, Wu Guanghao wrote:
> > >> We noticed that systemd has an issue about symlink unreliable caused by
> > >> formatting filesystem and systemd operating on same device.
> > >> Issue Link: https://github.com/systemd/systemd/issues/23746
> > >>
> > >> According to systemd doc, a BSD flock needs to be acquired before
> > >> formatting the device.
> > >> Related Link: https://systemd.io/BLOCK_DEVICE_LOCKING/
> > > 
> > > TLDR: udevd wants fs utilities to use advisory file locking to
> > > coordinate (re)writes to block devices to avoid collisions between mkfs
> > > and all the udev magic.
> > > 
> > > Critically, udev calls flock(LOCK_SH | LOCK_NB) to trylock the device in
> > > shared mode to avoid blocking on fs utilities; if the trylock fails,
> > > they'll move on and try again later.  The old O_EXCL-on-blockdevs trick
> > > will not work for that usecase (I guess) because it's not a shared
> > > reader lock.  It's also not the file locking API.
> > > 
> > >> So we acquire flock after opening the device but before
> > >> writing superblock.
> > > 
> > > xfs_db and xfs_repair can write to the filesystem too; shouldn't this
> > > locking apply to them as well?
> > > 
> > xfs_db is an interactive operation.If a lock is added, the lock may be held
> > for too long.
> 
> xfs_db can also write to the filesystem; see -x mode.
> 
> But first -- let's zoom out here.  You're adding flock() calls to
> xfsprogs to coordinate two userspace programs udev and mkfs.xfs.  Why
> wouldn't you add the same flock()ing to the rest of the xfs utilities so
> that they also don't step on each other?
> 
> xfs_mdrestore can also write an XFS image to a block device, so what
> makes it special?
> 
> > xfs_repair only repairs the data inside the file system ,so it's
> > unlikely to conflict with systemd. So these two commands aren't locked.
> 
> "Unlikely" isn't good enough -- xfsprogs don't control the udev rules,
> which means that a program invoked by a udev rule could read just about
> anywhere in the block device.  Hence we need to prevent udev from
> getting confused about /any/ block that xfs_repair might write.
> 
> (You /do/ know that xfs_db and xfs_repair can rewrite the primary
> superblock, right?)

I'm kinda in agreement with Darrick here - we have multiple tools
that can create a new filesystem image on a block device, or modify
an existing filesystem image.

mkfs.xfs, xfs_mdrestore, and xfs_copy can all write new filesystem
images to block devices. Hence if mkfs.xfs needs protection, so do
these other utilities.

xfs_repair, xfs_db, xfs_growfs, etc all modify the filesystem in
ways that udev rules might want to know about - changes to superblock
features, fs geometry, filesystem size, etc - can all come about as
a result of running them.

> > There is still a problem with this solution, systemd only lock the main
> > block device, not the partition device. So if we're operating on a
> > partitioned device, the lock won't work. Currently we are still
> > communicating with systemd.
> 
> Er... well, I guess it's good to know that xfs isn't /completely/ behind
> the curve here.

Well, it does seem kinda arbitrary - this is the first I've heard of
such bdev access rules being introduced.

It also doesn't seem well thought out or executed - userspace
infrastructure makes up a rule for accessing block devices, the rule
doesn't work on common configurations, the rule isn't communicated
or discussed with projects that it directly affects, there's no
obvious plan for hwo to support unsupported configs, nobody really
knows what utilities should obey the rule, etc.

We know we have problems with race conditions caused by udev probing
block devices and eacing with mkfs, mount, unmount, etc causing
exclusive opens to randomly fail (we still see people proposing
"sleep 2" after an unmount to fix spurious "failed" filesystem tests
every so often). So if there's a rule that says "lock the block
device to avoid udev probing races", then why wouldn't we want to
make all the tools we have "play nice" by being able to detect and
avoid udev related race conditions rather than have to work around
spurious failures that result from udev probing races?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2022-10-19  1:01 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-14  8:41 [PATCH] mkfs: acquire flock before modifying the device superblock Wu Guanghao
2022-10-14 15:38 ` Darrick J. Wong
2022-10-18  2:45   ` Wu Guanghao
2022-10-18 21:09     ` Darrick J. Wong
2022-10-19  1:00       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox