* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
2023-06-02 6:34 ` Dave Chinner
@ 2023-06-02 10:53 ` Amir Goldstein
2023-06-02 13:52 ` Christian Brauner
2023-06-02 14:58 ` Theodore Ts'o
2 siblings, 0 replies; 15+ messages in thread
From: Amir Goldstein @ 2023-06-02 10:53 UTC (permalink / raw)
To: Dave Chinner
Cc: Theodore Ts'o, Darrick J. Wong, Christian Brauner,
Jeff Layton, miklos, linux-fsdevel, linux-xfs
On Fri, Jun 2, 2023 at 9:35 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Jun 02, 2023 at 12:27:14AM -0400, Theodore Ts'o wrote:
> > On Thu, Jun 01, 2023 at 06:23:35PM -0700, Darrick J. Wong wrote:
> > > Someone ought to cc Ted since I asked him about this topic this morning
> > > and he said he hadn't noticed it going by...
> > >
> > > > > > In addition the uuid should be set when the filesystem is mounted.
> > > > > > Unless the filesystem implements a dedicated ioctl() - like ext4 - to
> > > > > > change the uuid.
> > > > >
> > > > > IMO, that ext4 functionality is a landmine waiting to be stepped on.
> > > > >
> > > > > > We should not be changing the sb->s_uuid of filesystems dynamically.
> > > >
> > > > Yeah, I kinda agree. If it works for ext4 and it's an ext4 specific
> > > > ioctl then this is fine though.
> > >
> > > Now that Dave's brought up all kinds of questions about other parts of
> > > the kernel using s_uuid for things, I'm starting to think that even ext4
> > > shouldn't be changing its own uuid on the fly.
> >
> > So let's set some context here. The tune2fs program in e2fsprogs has
> > supported changing the UUID for a *very* long time. Specifically,
> > since September 7, 1996 (e2fsprogs version 1.05, when we first added
> > the UUID field in the ext2 superblock).
>
> Yup, and XFS has supported offline changing of the UUID a couple of
> years before that.
>
> > This feature was added from
> > the very beginning since in Large Installation System Administration
> > (LISA) systems, a very common thing to do is to image boot disks from
> > a "golden master", and then afterwards, you want to make sure the file
> > systems on each boot disk have a unique UUID; and this is done via
> > "tune2fs -U random /dev/sdXX". Since I was working at MIT Project
> > Athena at the time, we regularly did this when installing Athena
> > client workstations, and when I added UUID support to ext2, I made
> > sure this feature was well-supported.
>
> See xfs_copy(8). This was a tool originally written, IIRC, in early
> 1995 for physically cloning sparse golden images in the SGI factory
> production line. It was multi-threaded and could write up to 16 scsi
> disks at once with a single ascending LBA order pass. The last thing
> it does is change the UUID of each clone to make them unique.
>
> There's nothing new here - this is all 30 years ago, and we've had
> tools changing filesystems UUIDs for all this time.
>
> > The tune2fs program allows the UUID to be changed while the file system
> > is mounted (with some caveats), which it did by directly modifying the
> > on-disk superblock. Obviously, when it did that, it wouldn't change
> > sb->s_uuid "dynamically", although the next time the file system was
> > mounted, sb->s_uuid would get the new UUID.
>
> Yes, which means for userspace and most of the kernel it's no
> different to "unmount, change UUID, mount". It's effectively an
> offline change, even if the on-disk superblock is changed while the
> filesystem is mounted.
>
> > If overlayfs and IMA are
> > expecting that a file system's UUID would stay constant and persistent
> > --- well, that's not true, and it has always been that way, since
> > there are tools that make it trivially easy for a system administrator
> > to adjust the UUID.
>
> Yes, but that's not the point I've been making. My point is that the
> *online change of sb->s_uuid* that was being proposed for the
> XFS/generic variant of the ext4 online UUID change ioctl is
> completely new, and that's where all the problems start....
>
> > In addition to the LISA context, this feature is also commonly used in
> > various cloud deployments, since when you create a new VM, it
> > typically gets a new root file system, which is copied from a fixed,
> > read-only image. So on a particular hyperscale cloud system, if we
> > didn't do anything special, there could be hundreds of thousands of VMs
> > whose root file system would all have the same UUID, which would mean
> > that the UUID... isn't terribly unique.
>
> Again, nothing new here - we've been using snapshots/clones/reflinks
> for efficient VM storage provisioning for well over 15 years now.
>
> .....
>
> > This is the reason why we added the ext4 ioctl; it was intended for
> > the express use of "tune2fs -U", and like tune2fs -U, it doesn't
> > actually change sb->s_uuid; it only changes the on-disk superblock's
> > UUID. This was mostly because we forgot about sb->s_uuid, to be
> > honest, but it means that regardless of whether "tune2fs -U" directly
> > modifies the block device, or uses the ext4 ioctl, the behaviour with
> > respect to sb->s_uuid is the same; it's not modified when the on-disk
> > uuid is changed.
>
> IOWs, not only was the ext4 functionality poorly thought out, it
> was *poorly implemented*.
>
> So, let's take a step back here - we've done the use case thing to
> death now - and consider what is it we actually need here?
>
> All we need for the hyperscale/VM provisioning use case is for the
> UUID to be changed at first boot/mount time before anything else
> happens.
>
> So why do we need userspace to be involved in that? Indeed,
> all the problems stem from needing to have userspace change the
> UUID.
>
> There's an obvious solution: a newly provisioned filesystem needs to
> change the uuid at first mount. The only issue is the
> kernel/filesystem doesn't know when the first mount is.
>
> Darrick suggested "mount -o setuuid=xxxx" on #xfs earlier, but that
> requires changing userspace init stuff and, well, I hate single use
> case mount options like this.
>
> However, we have a golden image that every client image is cloned
> from. Say we set a special feature bit in that golden image that
> means "need UUID regeneration". Then on the first mount of the
> cloned image after provisioning, the filesystem sees the bit and
> automatically regenerates the UUID without needing any help from
> userspace at all.
>
> Problem solved, yes? We don't need userspace to change the uuid on
> first boot of the newly provisioned VM - the filesystem just makes
> it happen.
>
I like this idea.
> If the "first run" init scripts are set up to run blkid to grab the
> new uuid after mount and update whatever needs to be updated with
> the new root filesystem UUID, then we've moved the entire problem
> out of the VM boot path and back into the provisioning system where
> it should be.
>
Seems to me like libblkid does not check for unknown feature bits:
https://github.com/util-linux/util-linux/blob/01a0a556018694bfaf6b01a5a40f8d0d10641a1f/libblkid/src/superblocks/xfs.c#L173
I wonder how systems will behave when libblkid examines this image
and finds a null UUID, without regarding the feature flag.
This is something that can be fixed in userspace, but may cause complications.
> And then we don't need an ioctl to change UUIDs online, nor do we
> require the VFS, kernel subsystems, userspace infrastructure and
> applications to be capable of handling the UUID of a mounted
> filesystem changing without warning....
>
> > > > > The VFS does not guarantee in any way that it is safe to change the
> > > > > sb->s_uuid (i.e. no locking, no change notifications, no udev
> > > > > events, etc). Various subsystems - both in the kernel and in
> > > > > userspace - use the sb->s_uuid as a canonical and/or persistent
> > > > > filesystem/device identifier and are unprepared to have it change
> > > > > while the filesystem is mounted and active.
> >
> > Note that the last sentence is a bit ambiguous.
>
> Well, yes, because while the UUID is normally persistent, if the
> administrator chooses to modify the UUID while the filesystem is
> unmounted, it will change between mounts. In that case.....
>
> > There is the question
> > of whether sb->s_uuid won't change while the file system is mounted,
> > and then there is the question of whether s_uuid is **persistent**
> > ---- which is to say, that it won't change across mounts or reboots.
> >
> > If there are subsystems like IMA, overlayfs, pnfs, et al., which expect
> > that, I'm sorry, but sysadmin tools to make it trivially easy to
> > change the file system UUID long-predate these other subsystems, and
> > there *are* system administrators --- particularly in the LISA or
> > Cloud context --- who have used "tune2fs -U" for good and proper
> > reasons.
>
> .... it's on the sysadmins to understand they need to regenerate
> anything that is reliant on the old filesystem UUIDs before mounting
> the filesystem again to avoid these issues...
>
For the record, overlayfs looks at s_uuid to try and determine if the
underlying fs was swapped underneath it while overlayfs was offline.
It is sometimes allowed to swap the underlying fs, but overlayfs needs
to know about it.
s_uuid is used as part of a "persistent file handle" in much the same way
that NFS clients use "fsid" for a unique file handle.
For the very basic overlayfs configuration, changing the lower fs uuid
will result in some overlayfs objects changing their inode numbers.
For overlayfs with opt-in index/nfs_export features, after changing the
underlying fs uuid, overlayfs could no longer be mounted with the same
layer configuration and those opt-in features enabled.
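To make the dependency concrete, here is a rough userspace model of a
stored handle that is only honoured while the filesystem UUID matches.
This is not overlayfs code; StoredHandle, encode() and decode() are
hypothetical names made up for illustration.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredHandle:
    """Hypothetical persistent handle: filesystem UUID plus an opaque per-fs handle."""
    fs_uuid: uuid.UUID
    fh: bytes


def encode(fs_uuid: uuid.UUID, fh: bytes) -> StoredHandle:
    """Record which filesystem the opaque handle belongs to."""
    return StoredHandle(fs_uuid, fh)


def decode(handle: StoredHandle, current_fs_uuid: uuid.UUID):
    """Return the opaque handle only if it was recorded against the
    filesystem that is mounted now; otherwise treat it as stale (None)."""
    if handle.fs_uuid != current_fs_uuid:
        return None
    return handle.fh
```

Under this model, every handle recorded before a UUID change decodes
to None afterwards, which is roughly why the index/nfs_export
configurations cannot simply carry on with the same layers.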
Thanks,
Amir.
* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
2023-06-02 6:34 ` Dave Chinner
2023-06-02 10:53 ` Amir Goldstein
@ 2023-06-02 13:52 ` Christian Brauner
2023-06-02 14:23 ` Darrick J. Wong
2023-06-04 22:59 ` Dave Chinner
2023-06-02 14:58 ` Theodore Ts'o
2 siblings, 2 replies; 15+ messages in thread
From: Christian Brauner @ 2023-06-02 13:52 UTC (permalink / raw)
To: Dave Chinner
Cc: Theodore Ts'o, Darrick J. Wong, Amir Goldstein, Jeff Layton,
miklos, linux-fsdevel, linux-xfs
On Fri, Jun 02, 2023 at 04:34:58PM +1000, Dave Chinner wrote:
> On Fri, Jun 02, 2023 at 12:27:14AM -0400, Theodore Ts'o wrote:
> > On Thu, Jun 01, 2023 at 06:23:35PM -0700, Darrick J. Wong wrote:
> > > Someone ought to cc Ted since I asked him about this topic this morning
> > > and he said he hadn't noticed it going by...
> > >
> > > > > > In addition the uuid should be set when the filesystem is mounted.
> > > > > > Unless the filesystem implements a dedicated ioctl() - like ext4 - to
> > > > > > change the uuid.
> > > > >
> > > > > IMO, that ext4 functionality is a landmine waiting to be stepped on.
> > > > >
> > > > > > We should not be changing the sb->s_uuid of filesystems dynamically.
> > > >
> > > > Yeah, I kinda agree. If it works for ext4 and it's an ext4 specific
> > > > ioctl then this is fine though.
> > >
> > > Now that Dave's brought up all kinds of questions about other parts of
> > > the kernel using s_uuid for things, I'm starting to think that even ext4
> > > shouldn't be changing its own uuid on the fly.
> >
> > So let's set some context here. The tune2fs program in e2fsprogs has
> > supported changing the UUID for a *very* long time. Specifically,
> > since September 7, 1996 (e2fsprogs version 1.05, when we first added
> > the UUID field in the ext2 superblock).
>
> Yup, and XFS has supported offline changing of the UUID a couple of
> years before that.
>
> > This feature was added from
> > the very beginning since in Large Installation System Administration
> > (LISA) systems, a very common thing to do is to image boot disks from
> > a "golden master", and then afterwards, you want to make sure the file
> > systems on each boot disk have a unique UUID; and this is done via
> > "tune2fs -U random /dev/sdXX". Since I was working at MIT Project
> > Athena at the time, we regularly did this when installing Athena
> > client workstations, and when I added UUID support to ext2, I made
> > sure this feature was well-supported.
>
> See xfs_copy(8). This was a tool originally written, IIRC, in early
> 1995 for physically cloning sparse golden images in the SGI factory
> production line. It was multi-threaded and could write up to 16 scsi
> disks at once with a single ascending LBA order pass. The last thing
> it does is change the UUID of each clone to make them unique.
>
> There's nothing new here - this is all 30 years ago, and we've had
> tools changing filesystems UUIDs for all this time.
>
> > The tune2fs program allows the UUID to be changed while the file system
> > is mounted (with some caveats), which it did by directly modifying the
> > on-disk superblock. Obviously, when it did that, it wouldn't change
> > sb->s_uuid "dynamically", although the next time the file system was
> > mounted, sb->s_uuid would get the new UUID.
>
> Yes, which means for userspace and most of the kernel it's no
> different to "unmount, change UUID, mount". It's effectively an
> offline change, even if the on-disk superblock is changed while the
> filesystem is mounted.
>
> > If overlayfs and IMA are
> > expecting that a file system's UUID would stay constant and persistent
> > --- well, that's not true, and it has always been that way, since
> > there are tools that make it trivially easy for a system administrator
> > to adjust the UUID.
>
> Yes, but that's not the point I've been making. My point is that the
> *online change of sb->s_uuid* that was being proposed for the
> XFS/generic variant of the ext4 online UUID change ioctl is
> completely new, and that's where all the problems start....
>
> > In addition to the LISA context, this feature is also commonly used in
> > various cloud deployments, since when you create a new VM, it
> > typically gets a new root file system, which is copied from a fixed,
> > read-only image. So on a particular hyperscale cloud system, if we
> > didn't do anything special, there could be hundreds of thousands of VMs
> > whose root file system would all have the same UUID, which would mean
> > that the UUID... isn't terribly unique.
>
> Again, nothing new here - we've been using snapshots/clones/reflinks
> for efficient VM storage provisioning for well over 15 years now.
>
> .....
>
> > This is the reason why we added the ext4 ioctl; it was intended for
> > the express use of "tune2fs -U", and like tune2fs -U, it doesn't
> > actually change sb->s_uuid; it only changes the on-disk superblock's
> > UUID. This was mostly because we forgot about sb->s_uuid, to be
> > honest, but it means that regardless of whether "tune2fs -U" directly
> > modifies the block device, or uses the ext4 ioctl, the behaviour with
> > respect to sb->s_uuid is the same; it's not modified when the on-disk
> > uuid is changed.
>
> IOWs, not only was the ext4 functionality poorly thought out, it
> was *poorly implemented*.
>
> So, let's take a step back here - we've done the use case thing to
> death now - and consider what is it we actually need here?
>
> All we need for the hyperscale/VM provisioning use case is for the
> UUID to be changed at first boot/mount time before anything else
> happens.
>
> So why do we need userspace to be involved in that? Indeed,
> all the problems stem from needing to have userspace change the
> UUID.
>
> There's an obvious solution: a newly provisioned filesystem needs to
> change the uuid at first mount. The only issue is the
> kernel/filesystem doesn't know when the first mount is.
>
> Darrick suggested "mount -o setuuid=xxxx" on #xfs earlier, but that
> requires changing userspace init stuff and, well, I hate single use
> case mount options like this.
>
> However, we have a golden image that every client image is cloned
> from. Say we set a special feature bit in that golden image that
> means "need UUID regeneration". Then on the first mount of the
> cloned image after provisioning, the filesystem sees the bit and
> automatically regenerates the UUID without needing any help from
> userspace at all.
>
> Problem solved, yes? We don't need userspace to change the uuid on
> first boot of the newly provisioned VM - the filesystem just makes
> it happen.
systemd-repart implements the following logic currently: If the GPT
*partition* and *disk* UUIDs are 0 then it will generate new UUIDs
before the first mount.
So for the *filesystem* UUID I think the golden image should either have
the UUID set to zero as well or to a special UUID. Either way, it would
mean the filesystem needs to generate a new UUID when it is mounted the
first time.
If we do this then all filesystems that support this should use the same
value to indicate "generate new UUID".
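A minimal sketch of that rule, assuming the all-zero (nil) UUID is
the shared "generate new UUID" value. NIL_UUID and uuid_for_mount()
are names made up for illustration, not an existing interface.

```python
import uuid
from typing import Tuple

# Assumption: the all-zero UUID is the agreed "generate new UUID" marker.
NIL_UUID = uuid.UUID(int=0)


def uuid_for_mount(on_disk: uuid.UUID) -> Tuple[uuid.UUID, bool]:
    """Return (uuid_to_use, regenerated) for a filesystem being mounted.

    A nil on-disk UUID is replaced by a fresh random one; any other
    value is kept unchanged.
    """
    if on_disk == NIL_UUID:
        return uuid.uuid4(), True
    return on_disk, False
```

The caller would persist the new UUID before exposing it anywhere, so
that the second mount sees a normal UUID and leaves it alone.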
>
> If the "first run" init scripts are set up to run blkid to grab the
> new uuid after mount and update whatever needs to be updated with
> the new root filesystem UUID, then we've moved the entire problem
> out of the VM boot path and back into the provisioning system where
> it should be.
>
> And then we don't need an ioctl to change UUIDs online, nor do we
It also doesn't really help that much. What userspace would need is a
way to regenerate the filesystem UUID before the filesystem is mounted.
It doesn't help that much if you have to mount it first to change it...
* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
2023-06-02 13:52 ` Christian Brauner
@ 2023-06-02 14:23 ` Darrick J. Wong
2023-06-02 15:34 ` Christian Brauner
2023-06-04 22:59 ` Dave Chinner
1 sibling, 1 reply; 15+ messages in thread
From: Darrick J. Wong @ 2023-06-02 14:23 UTC (permalink / raw)
To: Christian Brauner
Cc: Dave Chinner, Theodore Ts'o, Amir Goldstein, Jeff Layton,
miklos, linux-fsdevel, linux-xfs
On Fri, Jun 02, 2023 at 03:52:16PM +0200, Christian Brauner wrote:
> On Fri, Jun 02, 2023 at 04:34:58PM +1000, Dave Chinner wrote:
> > On Fri, Jun 02, 2023 at 12:27:14AM -0400, Theodore Ts'o wrote:
> > > On Thu, Jun 01, 2023 at 06:23:35PM -0700, Darrick J. Wong wrote:
> > > > Someone ought to cc Ted since I asked him about this topic this morning
> > > > and he said he hadn't noticed it going by...
> > > >
> > > > > > > In addition the uuid should be set when the filesystem is mounted.
> > > > > > > Unless the filesystem implements a dedicated ioctl() - like ext4 - to
> > > > > > > change the uuid.
> > > > > >
> > > > > > IMO, that ext4 functionality is a landmine waiting to be stepped on.
> > > > > >
> > > > > > > We should not be changing the sb->s_uuid of filesystems dynamically.
> > > > >
> > > > > Yeah, I kinda agree. If it works for ext4 and it's an ext4 specific
> > > > > ioctl then this is fine though.
> > > >
> > > > Now that Dave's brought up all kinds of questions about other parts of
> > > > the kernel using s_uuid for things, I'm starting to think that even ext4
> > > > shouldn't be changing its own uuid on the fly.
> > >
> > > So let's set some context here. The tune2fs program in e2fsprogs has
> > > supported changing the UUID for a *very* long time. Specifically,
> > > since September 7, 1996 (e2fsprogs version 1.05, when we first added
> > > the UUID field in the ext2 superblock).
> >
> > Yup, and XFS has supported offline changing of the UUID a couple of
> > years before that.
> >
> > > This feature was added from
> > > the very beginning since in Large Installation System Administration
> > > (LISA) systems, a very common thing to do is to image boot disks from
> > > a "golden master", and then afterwards, you want to make sure the file
> > > systems on each boot disk have a unique UUID; and this is done via
> > > "tune2fs -U random /dev/sdXX". Since I was working at MIT Project
> > > Athena at the time, we regularly did this when installing Athena
> > > client workstations, and when I added UUID support to ext2, I made
> > > sure this feature was well-supported.
> >
> > See xfs_copy(8). This was a tool originally written, IIRC, in early
> > 1995 for physically cloning sparse golden images in the SGI factory
> > production line. It was multi-threaded and could write up to 16 scsi
> > disks at once with a single ascending LBA order pass. The last thing
> > it does is change the UUID of each clone to make them unique.
> >
> > There's nothing new here - this is all 30 years ago, and we've had
> > tools changing filesystems UUIDs for all this time.
> >
> > > The tune2fs program allows the UUID to be changed while the file system
> > > is mounted (with some caveats), which it did by directly modifying the
> > > on-disk superblock. Obviously, when it did that, it wouldn't change
> > > sb->s_uuid "dynamically", although the next time the file system was
> > > mounted, sb->s_uuid would get the new UUID.
> >
> > Yes, which means for userspace and most of the kernel it's no
> > different to "unmount, change UUID, mount". It's effectively an
> > offline change, even if the on-disk superblock is changed while the
> > filesystem is mounted.
> >
> > > If overlayfs and IMA are
> > > expecting that a file system's UUID would stay constant and persistent
> > > --- well, that's not true, and it has always been that way, since
> > > there are tools that make it trivially easy for a system administrator
> > > to adjust the UUID.
> >
> > Yes, but that's not the point I've been making. My point is that the
> > *online change of sb->s_uuid* that was being proposed for the
> > XFS/generic variant of the ext4 online UUID change ioctl is
> > completely new, and that's where all the problems start....
> >
> > > In addition to the LISA context, this feature is also commonly used in
> > > various cloud deployments, since when you create a new VM, it
> > > typically gets a new root file system, which is copied from a fixed,
> > > read-only image. So on a particular hyperscale cloud system, if we
> > > didn't do anything special, there could be hundreds of thousands of VMs
> > > whose root file system would all have the same UUID, which would mean
> > > that the UUID... isn't terribly unique.
> >
> > Again, nothing new here - we've been using snapshots/clones/reflinks
> > for efficient VM storage provisioning for well over 15 years now.
> >
> > .....
> >
> > > This is the reason why we added the ext4 ioctl; it was intended for
> > > the express use of "tune2fs -U", and like tune2fs -U, it doesn't
> > > actually change sb->s_uuid; it only changes the on-disk superblock's
> > > UUID. This was mostly because we forgot about sb->s_uuid, to be
> > > honest, but it means that regardless of whether "tune2fs -U" directly
> > > modifies the block device, or uses the ext4 ioctl, the behaviour with
> > > respect to sb->s_uuid is the same; it's not modified when the on-disk
> > > uuid is changed.
...which means that anyone writing out non-ext4 ondisk metadata will now
be doing it with a stale fsuuid. Er... that might just be an ext*
quirk that everyone will have to live with.
> > IOWs, not only was the ext4 functionality poorly thought out, it
> > was *poorly implemented*.
> >
> > So, let's take a step back here - we've done the use case thing to
> > death now - and consider what is it we actually need here?
> >
> > All we need for the hyperscale/VM provisioning use case is for the
> > UUID to be changed at first boot/mount time before anything else
> > happens.
> >
> > So why do we need userspace to be involved in that? Indeed,
> > all the problems stem from needing to have userspace change the
> > UUID.
> >
> > There's an obvious solution: a newly provisioned filesystem needs to
> > change the uuid at first mount. The only issue is the
> > kernel/filesystem doesn't know when the first mount is.
> >
> > Darrick suggested "mount -o setuuid=xxxx" on #xfs earlier, but that
> > requires changing userspace init stuff and, well, I hate single use
> > case mount options like this.
> >
> > However, we have a golden image that every client image is cloned
> > from. Say we set a special feature bit in that golden image that
> > means "need UUID regeneration". Then on the first mount of the
> > cloned image after provisioning, the filesystem sees the bit and
> > automatically regenerates the UUID without needing any help from
> > userspace at all.
> >
> > Problem solved, yes? We don't need userspace to change the uuid on
> > first boot of the newly provisioned VM - the filesystem just makes
> > it happen.
>
> systemd-repart implements the following logic currently: If the GPT
> *partition* and *disk* UUIDs are 0 then it will generate new UUIDs
> before the first mount.
>
> So for the *filesystem* UUID I think the golden image should either have
> the UUID set to zero as well or to a special UUID. Either way, it would
> mean the filesystem needs to generate a new UUID when it is mounted the
> first time.
>
> If we do this then all filesystems that support this should use the same
> value to indicate "generate new UUID".
Curiously, I noticed that blkid doesn't report the xfs uuid if it's all
zeroes:
# mkfs.xfs -f /dev/loop0 -m uuid=00000000-0000-0000-0000-000000000000
# blkid /dev/loop0
/dev/loop0: BLOCK_SIZE="512" TYPE="xfs"
Nor does udev create symlinks:
# ls /dev/disk/by-uuid/0*
ls: cannot access '/dev/disk/by-uuid/0*': No such file or directory
Nor does mounting by uuid work:
# mount UUID=00000000-0000-0000-0000-000000000000 /tmp/x
mount: /tmp/x: can't find UUID=00000000-0000-0000-0000-000000000000.
So I wonder if xfs even really needs a new superblock bit at all --
mounting via uuid doesn't work in the zeroed-uuid case, and the kernel
could indeed generate a new one at mount time before it populates
s_uuid, etc. Then the initscripts can re-run blkid (or xfs_info) to
extract the new uuid and update config files as needed.
Though, the first-mount uuid would still break anything recorded in the
non-xfs metadata by the image creating system (such as evm attributes).
But at least that's on the image creator people to know that.
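For anyone who wants to poke at this without blkid, here is a rough
sketch that pulls the UUID straight out of the primary XFS superblock
and mimics the behaviour shown above by reporting nothing for an
all-zero UUID. It only models what the commands above show and is not
libblkid's code; xfs_reported_uuid() is a made-up name. The offset
follows from the on-disk layout: magic (4 bytes), blocksize (4),
dblocks (8), rblocks (8), rextents (8), then the 16-byte UUID.

```python
import uuid

XFS_MAGIC = b"XFSB"
# magic(4) + blocksize(4) + dblocks(8) + rblocks(8) + rextents(8)
SB_UUID_OFFSET = 32


def xfs_reported_uuid(sb: bytes):
    """Return the UUID stored in an XFS primary superblock, or None if it
    is all zeroes (modelling a prober that refuses to report a nil UUID)."""
    if len(sb) < SB_UUID_OFFSET + 16 or sb[:4] != XFS_MAGIC:
        raise ValueError("not an XFS superblock")
    fs_uuid = uuid.UUID(bytes=sb[SB_UUID_OFFSET:SB_UUID_OFFSET + 16])
    return None if fs_uuid.int == 0 else fs_uuid


# Possible use against a device or image (needs read permission):
#   with open("/dev/loop0", "rb") as dev:
#       print(xfs_reported_uuid(dev.read(512)))
```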
> >
> > If the "first run" init scripts are set up to run blkid to grab the
> > new uuid after mount and update whatever needs to be updated with
> > the new root filesystem UUID, then we've moved the entire problem
> > out of the VM boot path and back into the provisioning system where
> > it should be.
> >
> > And then we don't need an ioctl to change UUIDs online, nor do we
>
> It also doesn't really help that much. What userspace would need is a
> way to regenerate the filesystem UUID before the filesystem is mounted.
> It doesn't help that much if you have to mount it first to change it...
<shrug> Well it's the rootfs where we want to change the uuid at
first-run time, and all the config info that needs updating is inside
the rootfs anyway. If someone needs mount-by-uuid for the rootfs during
the first run or they require a specific uuid, they can still run
xfs_admin from within the initramfs.
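As a sketch of the "update the config info" step, assuming the
first-run script already knows the old and new UUIDs as strings (for
instance from blkid output), something like the following could
rewrite fstab-style text. retarget_fstab() is a hypothetical helper,
not part of any existing tool.

```python
def retarget_fstab(fstab_text: str, old_uuid: str, new_uuid: str) -> str:
    """Rewrite fstab-style text so entries mounted by UUID=<old_uuid>
    refer to <new_uuid> instead. Comments and other entries are kept."""
    old_spec = f"UUID={old_uuid}".lower()
    out = []
    for line in fstab_text.splitlines(keepends=True):
        fields = line.split()
        # Only the first field (the device spec) of a non-comment line counts.
        if fields and not fields[0].startswith("#") and fields[0].lower() == old_spec:
            line = line.replace(fields[0], f"UUID={new_uuid}", 1)
        out.append(line)
    return "".join(out)
```

A real first-run hook would also have to cover whatever else embeds
the UUID (bootloader config, initramfs), which this does not attempt.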
--D
* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
2023-06-02 14:23 ` Darrick J. Wong
@ 2023-06-02 15:34 ` Christian Brauner
0 siblings, 0 replies; 15+ messages in thread
From: Christian Brauner @ 2023-06-02 15:34 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Dave Chinner, Theodore Ts'o, Amir Goldstein, Jeff Layton,
miklos, linux-fsdevel, linux-xfs
On Fri, Jun 02, 2023 at 07:23:29AM -0700, Darrick J. Wong wrote:
> On Fri, Jun 02, 2023 at 03:52:16PM +0200, Christian Brauner wrote:
> > On Fri, Jun 02, 2023 at 04:34:58PM +1000, Dave Chinner wrote:
> > > On Fri, Jun 02, 2023 at 12:27:14AM -0400, Theodore Ts'o wrote:
> > > > On Thu, Jun 01, 2023 at 06:23:35PM -0700, Darrick J. Wong wrote:
> > > > > Someone ought to cc Ted since I asked him about this topic this morning
> > > > > and he said he hadn't noticed it going by...
> > > > >
> > > > > > > > In addition the uuid should be set when the filesystem is mounted.
> > > > > > > > Unless the filesystem implements a dedicated ioctl() - like ext4 - to
> > > > > > > > change the uuid.
> > > > > > >
> > > > > > > IMO, that ext4 functionality is a landmine waiting to be stepped on.
> > > > > > >
> > > > > > > > We should not be changing the sb->s_uuid of filesystems dynamically.
> > > > > >
> > > > > > Yeah, I kinda agree. If it works for ext4 and it's an ext4 specific
> > > > > > ioctl then this is fine though.
> > > > >
> > > > > Now that Dave's brought up all kinds of questions about other parts of
> > > > > the kernel using s_uuid for things, I'm starting to think that even ext4
> > > > > shouldn't be changing its own uuid on the fly.
> > > >
> > > > So let's set some context here. The tune2fs program in e2fsprogs has
> > > > supported changing the UUID for a *very* long time. Specifically,
> > > > since September 7, 1996 (e2fsprogs version 1.05, when we first added
> > > > the UUID field in the ext2 superblock).
> > >
> > > Yup, and XFS has supported offline changing of the UUID a couple of
> > > years before that.
> > >
> > > > This feature was added from
> > > > the very beginning since in Large Installation System Administration
> > > > (LISA) systems, a very common thing to do is to image boot disks from
> > > > a "golden master", and then afterwards, you want to make sure the file
> > > > systems on each boot disk have a unique UUID; and this is done via
> > > > "tune2fs -U random /dev/sdXX". Since I was working at MIT Project
> > > > Athena at the time, we regularly did this when installing Athena
> > > > client workstations, and when I added UUID support to ext2, I made
> > > > sure this feature was well-supported.
> > >
> > > See xfs_copy(8). This was a tool originally written, IIRC, in early
> > > 1995 for physically cloning sparse golden images in the SGI factory
> > > production line. It was multi-threaded and could write up to 16 scsi
> > > disks at once with a single ascending LBA order pass. The last thing
> > > it does is change the UUID of each clone to make them unique.
> > >
> > > There's nothing new here - this is all 30 years ago, and we've had
> > > tools changing filesystems UUIDs for all this time.
> > >
> > > > The tune2fs program allows the UUID to be changed while the file system
> > > > is mounted (with some caveats), which it did by directly modifying the
> > > > on-disk superblock. Obviously, when it did that, it wouldn't change
> > > > sb->s_uuid "dynamically", although the next time the file system was
> > > > mounted, sb->s_uuid would get the new UUID.
> > >
> > > Yes, which means for userspace and most of the kernel it's no
> > > different to "unmount, change UUID, mount". It's effectively an
> > > offline change, even if the on-disk superblock is changed while the
> > > filesystem is mounted.
> > >
> > > > If overlayfs and IMA are
> > > > expecting that a file system's UUID would stay constant and persistent
> > > > --- well, that's not true, and it has always been that way, since
> > > > there are tools that make it trivially easy for a system administrator
> > > > to adjust the UUID.
> > >
> > > Yes, but that's not the point I've been making. My point is that the
> > > *online change of sb->s_uuid* that was being proposed for the
> > > XFS/generic variant of the ext4 online UUID change ioctl is
> > > completely new, and that's where all the problems start....
> > >
> > > > In addition to the LISA context, this feature is also commonly used in
> > > > various cloud deployments, since when you create a new VM, it
> > > > typically gets a new root file system, which is copied from a fixed,
> > > > read-only image. So on a particular hyperscale cloud system, if we
> > > > didn't do anything special, there could be hundreds of thousands of VMs
> > > > whose root file system would all have the same UUID, which would mean
> > > > that the UUID... isn't terribly unique.
> > >
> > > Again, nothing new here - we've been using snapshots/clones/reflinks
> > > for efficient VM storage provisioning for well over 15 years now.
> > >
> > > .....
> > >
> > > > This is the reason why we added the ext4 ioctl; it was intended for
> > > > the express use of "tune2fs -U", and like tune2fs -U, it doesn't
> > > > actually change sb->s_uuid; it only changes the on-disk superblock's
> > > > UUID. This was mostly because we forgot about sb->s_uuid, to be
> > > > honest, but it means that regardless of whether "tune2fs -U" directly
> > > > modifies the block device, or uses the ext4 ioctl, the behaviour with
> > > > respect to sb->s_uuid is the same; it's not modified when the on-disk
> > > > uuid is changed.
>
> ...which means that anyone writing out non-ext4 ondisk metadata will now
> be doing it with a stale fsuuid. Er... that might just be an ext*
> quirk that everyone will have to live with.
>
> > > > IOWs, not only was the ext4 functionality poorly thought out, it
> > > was *poorly implemented*.
> > >
> > > So, let's take a step back here - we've done the use case thing to
> > > death now - and consider what is it we actually need here?
> > >
> > > > All we need for the hyperscale/VM provisioning use case is for the
> > > > UUID to be changed at first boot/mount time before anything else
> > > happens.
> > >
> > > So why do we need userspace to be involved in that? Indeed,
> > > all the problems stem from needing to have userspace change the
> > > UUID.
> > >
> > > There's an obvious solution: a newly provisioned filesystem needs to
> > > change the uuid at first mount. The only issue is the
> > > kernel/filesystem doesn't know when the first mount is.
> > >
> > > Darrick suggested "mount -o setuuid=xxxx" on #xfs earlier, but that
> > > requires changing userspace init stuff and, well, I hate single use
> > > case mount options like this.
> > >
> > > However, we have a golden image that every client image is cloned
> > > from. Say we set a special feature bit in that golden image that
> > > means "need UUID regeneration". Then on the first mount of the
> > > cloned image after provisioning, the filesystem sees the bit and
> > > > automatically regenerates the UUID without needing any help from
> > > userspace at all.
> > >
> > > Problem solved, yes? We don't need userspace to change the uuid on
> > > first boot of the newly provisioned VM - the filesystem just makes
> > > it happen.
> >
> > systemd-repart implements the following logic currently: If the GPT
> > *partition* and *disk* UUIDs are 0 then it will generate new UUIDs
> > before the first mount.
> >
> > So for the *filesystem* UUID I think the golden image should either have
> > the UUID set to zero as well or to a special UUID. Either way, it would
> > mean the filesystem needs to generate a new UUID when it is mounted the
> > first time.
> >
> > If we do this then all filesystems that support this should use the same
> > value to indicate "generate new UUID".
>
> Curiously, I noticed that blkid doesn't report the xfs uuid if it's all
> zeroes:
>
> # mkfs.xfs -f /dev/loop0 -m uuid=00000000-0000-0000-0000-000000000000
>
> # blkid /dev/loop0
> /dev/loop0: BLOCK_SIZE="512" TYPE="xfs"
You should use blkid -p btw, because without -p blkid checks a cache,
which can be stale.
>
> Nor does udev create symlinks:
>
> # ls /dev/disk/by-uuid/0*
> ls: cannot access '/dev/disk/by-uuid/0*': No such file or directory
Yeah, it can't because there's no uuid and zero is treated as "not set".
>
> Nor does mounting by uuid work:
>
> # mount UUID=00000000-0000-0000-0000-000000000000 /tmp/x
> mount: /tmp/x: can't find UUID=00000000-0000-0000-0000-000000000000.
>
> So I wonder if xfs even really needs a new superblock bit at all --
> mounting via uuid doesn't work in the zeroed-uuid case, and the kernel
> could indeed generate a new one at mount time before it populates
> s_uuid, etc. Then the initscripts can re-run blkid (or xfs_info) to
> extract the new uuid and update config files as needed.
Yeah, that's my proposal and it's closely mirrored on what we did for
systemd-repart:

  6. Similarly, all existing partitions for which configuration files
     exist and which currently have an all-zero identifying UUID will be
     assigned a new UUID. This UUID is cryptographically hashed from a
     common seed value together with the partition type UUID (and a
     counter in case multiple partitions of the same type are defined),
     see below. The same is done for all partitions that are created anew.
     These assignments are done in memory only, too, the disk is not
     updated yet.

  7. Similarly, if the disk's volume UUID is all zeroes it is also
     initialized, also cryptographically hashed from the same common seed
     value. This is done in memory only too.

  [...]

  9. The new partition table is finally written to disk. The kernel is
     asked to reread the partition table.

https://www.freedesktop.org/software/systemd/man/systemd-repart.service.html
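
To make the "hashed from a common seed" idea concrete, here is a rough
Python sketch. It is not systemd-repart's code: the HMAC-SHA256
construction, the names derive_uuid() and assign_if_unset(), and the
counter encoding are assumptions for illustration. Only the overall
idea (all-zero means "not set", and the new value is derived
deterministically from a seed, the type UUID and a counter) comes from
the text quoted above.

```python
import hashlib
import hmac
import uuid

ZERO_UUID = uuid.UUID(int=0)


def derive_uuid(seed: bytes, type_uuid: uuid.UUID, counter: int = 0) -> uuid.UUID:
    """Hypothetical derivation: HMAC-SHA256 keyed by the seed over type UUID + counter."""
    msg = type_uuid.bytes + counter.to_bytes(8, "little")
    digest = hmac.new(seed, msg, hashlib.sha256).digest()
    # Use the first 16 bytes and stamp them as a version 4, RFC 4122 variant UUID.
    return uuid.UUID(bytes=digest[:16], version=4)


def assign_if_unset(current: uuid.UUID, seed: bytes,
                    type_uuid: uuid.UUID, counter: int = 0) -> uuid.UUID:
    """Only an all-zero UUID is treated as 'not set' and replaced."""
    if current == ZERO_UUID:
        return derive_uuid(seed, type_uuid, counter)
    return current
```

The same seed always yields the same UUID, which is what makes the
result reproducible across runs; a random uuid4() would not have that
property.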
>
> Though, the first-mount uuid would still break anything recorded in the
> non-xfs metadata by the image creating system (such as evm attributes).
> But at least that's on the image creator people to know that.
Sure, but that's a generic userspace problem for any identifier relying
on or derived from the filesystem uuid. IOW, that's not really our
concern imho.
* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
From: Dave Chinner @ 2023-06-04 22:59 UTC (permalink / raw)
To: Christian Brauner
Cc: Theodore Ts'o, Darrick J. Wong, Amir Goldstein, Jeff Layton,
miklos, linux-fsdevel, linux-xfs
On Fri, Jun 02, 2023 at 03:52:16PM +0200, Christian Brauner wrote:
> On Fri, Jun 02, 2023 at 04:34:58PM +1000, Dave Chinner wrote:
> > On Fri, Jun 02, 2023 at 12:27:14AM -0400, Theodore Ts'o wrote:
> > > On Thu, Jun 01, 2023 at 06:23:35PM -0700, Darrick J. Wong wrote:
> > There's an obvious solution: a newly provisioned filesystem needs to
> > change the uuid at first mount. The only issue is the
> > kernel/filesystem doesn't know when the first mount is.
> >
> > Darrick suggested "mount -o setuuid=xxxx" on #xfs earlier, but that
> > requires changing userspace init stuff and, well, I hate single use
> > case mount options like this.
> >
> > However, we have a golden image that every client image is cloned
> > from. Say we set a special feature bit in that golden image that
> > means "need UUID regeneration". Then on the first mount of the
> > cloned image after provisioning, the filesystem sees the bit and
> > automatically regenerates the UUID without needing any help from
> > userspace at all.
> >
> > Problem solved, yes? We don't need userspace to change the uuid on
> > first boot of the newly provisioned VM - the filesystem just makes
> > it happen.
>
> systemd-repart implements the following logic currently: If the GPT
> *partition* and *disk* UUIDs are 0 then it will generate new UUIDs
> before the first mount.
>
> So for the *filesystem* UUID I think the golden image should either have
> the UUID set to zero as well or to a special UUID. Either way, it would
> mean the filesystem needs to generate a new UUID when it is mounted the
> first time.
>
> If we do this then all filesystems that support this should use the same
> value to indicate "generate new UUID".

Ok, the main problem here is that no existing filesystem
implementation considers a zero UUID special. If you do this
on an existing kernel, it won't do anything and will not throw any
errors. Now we have the problem that userspace infrastructure can't
rely on the kernel telling it that it doesn't support the
functionality it is relying on. i.e. we have a mounted filesystem
and now userspace has to detect and handle the fact it still needs
to change the filesystem UUID.

Further, if this is not handled properly, every root filesystem
having a zero or duplicate "special" UUID is a landmine for OS
kernel upgrades to trip over. i.e. upgrade from old, unsupported to
new supported kernel and the next boot regens the UUID unexpectedly
and breaks anything relying on the old UUID.

Hence the point of using a feature bit is that the kernel will
refuse to mount the filesystem if it does not understand the feature
bit. This way we have a hard image deployment testing failure that people
building and deploying images will notice. Hence they can configure
the build scripts to use the correct "change uuid" mechanism
with older OS releases and can take appropriate action when building
"legacy OS" images.

Yes, distros and vendors can backport the feature bit support if
they want, and then deployment of up-to-date older OS releases will
work with this new infrastructure correctly. But that is not
guaranteed to happen, so we really need a hard failure for
unsupported kernels.

So, yeah, I really do think this needs to be driven by a filesystem
feature bit, not retrospectively defining a special UUID value to
trigger this upgrade behaviour...
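
The difference between the two signalling schemes can be shown with a
toy model. This is a Python sketch, not kernel code: FEAT_REGEN_UUID,
mount() and the dict-based superblock are all hypothetical names made
up for illustration.

```python
import uuid

ZERO_UUID = uuid.UUID(int=0)
FEAT_REGEN_UUID = 0x1  # hypothetical incompat feature bit


class MountRefused(Exception):
    pass


def mount(sb: dict, supported_incompat: int, zero_uuid_is_special: bool) -> dict:
    """Toy mount path; sb is {'uuid': UUID, 'incompat': int}."""
    unknown = sb["incompat"] & ~supported_incompat
    if unknown:
        # Hard failure: a kernel that does not know the bit cannot mount.
        raise MountRefused(f"unknown incompat feature bits {unknown:#x}")
    regen = bool(sb["incompat"] & FEAT_REGEN_UUID) or (
        zero_uuid_is_special and sb["uuid"] == ZERO_UUID)
    if regen:
        # Regenerate the UUID and clear the one-shot bit.
        return {"uuid": uuid.uuid4(),
                "incompat": sb["incompat"] & ~FEAT_REGEN_UUID}
    return dict(sb)
```

With supported_incompat=0 (an "old kernel"), a zero-UUID image mounts
silently with the UUID untouched, while an image carrying the bit is
refused. With zero_uuid_is_special=True (a "new kernel" using the
special-value scheme), a pre-existing zero-UUID filesystem gets a new
UUID on the next mount, which is the upgrade landmine described above.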
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
From: Christian Brauner @ 2023-06-05 11:37 UTC (permalink / raw)
To: Dave Chinner
Cc: Theodore Ts'o, Darrick J. Wong, Amir Goldstein, Jeff Layton,
miklos, linux-fsdevel, linux-xfs
On Mon, Jun 05, 2023 at 08:59:33AM +1000, Dave Chinner wrote:
> On Fri, Jun 02, 2023 at 03:52:16PM +0200, Christian Brauner wrote:
> > On Fri, Jun 02, 2023 at 04:34:58PM +1000, Dave Chinner wrote:
> > > On Fri, Jun 02, 2023 at 12:27:14AM -0400, Theodore Ts'o wrote:
> > > > On Thu, Jun 01, 2023 at 06:23:35PM -0700, Darrick J. Wong wrote:
> > > There's an obvious solution: a newly provisioned filesystem needs to
> > > change the uuid at first mount. The only issue is the
> > > kernel/filesystem doesn't know when the first mount is.
> > >
> > > Darrick suggested "mount -o setuuid=xxxx" on #xfs earlier, but that
> > > requires changing userspace init stuff and, well, I hate single use
> > > case mount options like this.
> > >
> > > However, we have a golden image that every client image is cloned
> > > from. Say we set a special feature bit in that golden image that
> > > means "need UUID regeneration". Then on the first mount of the
> > > cloned image after provisioning, the filesystem sees the bit and
> > > > automatically regenerates the UUID without needing any help from
> > > userspace at all.
> > >
> > > Problem solved, yes? We don't need userspace to change the uuid on
> > > first boot of the newly provisioned VM - the filesystem just makes
> > > it happen.
> >
> > systemd-repart implements the following logic currently: If the GPT
> > *partition* and *disk* UUIDs are 0 then it will generate new UUIDs
> > before the first mount.
> >
> > So for the *filesystem* UUID I think the golden image should either have
> > the UUID set to zero as well or to a special UUID. Either way, it would
> > mean the filesystem needs to generate a new UUID when it is mounted the
> > first time.
> >
> > If we do this then all filesystems that support this should use the same
> > value to indicate "generate new UUID".
>
> Ok, the main problem here is that all existing filesystem
> implementations don't consider a zero UUID special. If you do this
> on an existing kernel, it won't do anything and will not throw any
> errors. Now we have the problem that userspace infrastructure can't
> rely on the kernel telling it that it doesn't support the
> functionality it is relying on. i.e. we have a mounted filesystem
> and now userspace has to detect and handle the fact it still needs
> to change the filesystem UUID.
>
> Further, if this is not handled properly, every root filesystem
> having a zero or duplicate "special" UUID is a landmine for OS
> kernel upgrades to trip over. i.e. upgrade from old, unsupported to
> new supported kernel and the next boot regens the UUID unexpectedly
> and breaks anything relying on the old UUID.
>
> Hence the point of using a feature bit is that the kernel will
> refuse to mount the filesystem if it does not understand the feature
> bit. This way we have a hard image deployment testing failure that people
> building and deploying images will notice. Hence they can configure
> the build scripts to use the correct "change uuid" mechanism
> with older OS releases and can take appropriate action when building
> "legacy OS" images.
>
> Yes, distros and vendors can backport the feature bit support if
> they want, and then deployment of up-to-date older OS releases will
> work with this new infrastructure correctly. But that is not
> guaranteed to happen, so we really need a hard failure for
> unsupported kernels.
>
> So, yeah, I really do think this needs to be driven by a filesystem
> feature bit, not retrospectively defining a special UUID value to
> trigger this upgrade behaviour...

Using a zero/special UUID would have made this usable for most
filesystems, which would allow userspace to detect this more easily.
Using a filesystem feature bit makes this a lot more fragmented between
filesystems.

But letting a filesystem refuse to be mounted on older kernels when the
feature bit is set and unknown can be quite useful. So this is also fine
by me.

So, the protocol should be to create a filesystem with a zero UUID and
the new feature bit set. At the first mount the UUID will be generated.

Only thing I would really love to see is a short blurb about this in
Documentation/filesystems/uuid.rst so we have a reference point for how
we expect this to work and how a filesystem should implement this.
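
As a starting point for such a blurb, the protocol could be sketched
roughly like this. This is illustrative Python, not a proposed
implementation; Superblock, needs_uuid_regen and the function names are
made up, and the boolean stands in for whatever on-disk feature bit a
filesystem would actually use.

```python
import copy
import uuid
from dataclasses import dataclass

ZERO_UUID = uuid.UUID(int=0)


@dataclass
class Superblock:
    fs_uuid: uuid.UUID = ZERO_UUID
    needs_uuid_regen: bool = True  # stands in for the on-disk feature bit


def mkfs_golden_image() -> Superblock:
    """Golden image: zero UUID plus the 'regenerate on first mount' bit."""
    return Superblock()


def mount(sb: Superblock) -> Superblock:
    """Regenerate the UUID once, then clear the bit so later mounts are no-ops."""
    if sb.needs_uuid_regen:
        sb.fs_uuid = uuid.uuid4()
        sb.needs_uuid_regen = False
    return sb
```

Two clones of the same golden image end up with different UUIDs after
their first mount, and remounting a clone leaves its UUID alone because
the bit has been cleared.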
* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
From: Theodore Ts'o @ 2023-06-05 14:36 UTC (permalink / raw)
To: Christian Brauner
Cc: Dave Chinner, Darrick J. Wong, Amir Goldstein, Jeff Layton,
miklos, linux-fsdevel, linux-xfs
On Mon, Jun 05, 2023 at 01:37:40PM +0200, Christian Brauner wrote:
> Using a zero/special UUID would have made this usable for most
> filesystems which allows userspace to more easily detect this. Using a
> filesystem feature bit makes this a lot more fragmented between
> filesystems.

Not all file systems have feature bits. So I'd suggest that how this
is done should be a file system specific implementation detail. With a
newer kernel, having a file system set the UUID to a random value if it
is all zeros when it is mounted should be relatively simple.

However, there are some questions this brings up. What should the
semantics be if a file system creates a file system-level snapshot ---
should the UUID be refreshed? What if it is a block-level file system
snapshot using LVM --- should the UUID be refreshed in that case?

As I've been trying to point out, exactly what the semantics of a file
system level UUID are has never been well defined, and it's not clear what
various subsystems are trying to *do* with the UUID. And given
what can happen with mount name spaces, bind mounts, etc., we should
ask whether the assumptions they are making with respect to the UUID are
in fact something we should be encouraging.

> But allowing to refuse being mounted on older kernels when the feature
> bit is set and unknown can be quite useful. So this is also fine by me.

This pretty much guarantees people won't use the feature for a while.
People complain when a file system can't be mounted. Using a feature
bit is also very likely to mean that you won't be able to run an older
fsck on that file system --- for what users would complain is no
good reason. And arguably, they would be right to complain.
- Ted
* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
From: Dave Chinner @ 2023-06-06 0:54 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Christian Brauner, Darrick J. Wong, Amir Goldstein, Jeff Layton,
miklos, linux-fsdevel, linux-xfs
On Mon, Jun 05, 2023 at 10:36:38AM -0400, Theodore Ts'o wrote:
> On Mon, Jun 05, 2023 at 01:37:40PM +0200, Christian Brauner wrote:
> > Using a zero/special UUID would have made this usable for most
> > filesystems which allows userspace to more easily detect this. Using a
> > filesystem feature bit makes this a lot more fragmented between
> > filesystems.
>
> Not all file systems have feature bits. So I'd suggest that how this
> is done should be a file system specific implementation detail. With a
> newer kernel, having a file system set the UUID to a random value if it
> is all zeros when it is mounted should be relatively simple.

Sure, but this is a *fs implementation detail*, not a user API
requirement.

If the filesystem has feature bits, then it should use them, not
rely on zero UUID values because existing filesystems and/or images
could have zero values in them and the user may not want them to be
regenerated on mount. That's a retrospective change of on-disk
format behaviour, and hence requires feature bits to manage....

> However, there are some questions this brings up. What should the
> semantics be if a file system creates a file system-level snapshot ---
> should the UUID be refreshed? What if it is a block-level file system
> snapshot using LVM --- should the UUID be refreshed in that case?

Engage your brain, Ted. Existing workflows with snapshots are
completely unchanged by this proposal. If you take a device level
snapshot and then want to mount it, you have to change the UUID
before it gets mounted.

Indeed, XFS will refuse to mount filesystems with duplicate UUIDs;
the admin has been forced to run xfs admin tools to regenerate the
UUID before mounting the snapshot image for the past 20+ years. Or
for pure read-only snapshots, they need to use "-o
ro,norecovery,nouuid" to allow a pure read-only mount with a
duplicate UUID. The "nouuid" mount option has been around for almost
22 years:

  commit 813e9410043e88b474b8b2b43c8d8e52ea90f155
  Author: Steve Lord <lord@sgi.com>
  Date:   Fri Jun 29 22:29:47 2001 +0000

      Add nouuid mount option

Either way, the admin has to manage UUIDs for device level
snapshots, and there is no change in that at all.

IOWs, there is no change to existing workflows because they already
require UUIDs to be directly manipulated by the user before or at
mount time for correct behaviour.
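
For anyone unfamiliar with the duplicate UUID check, a toy model of the
behaviour described above looks something like this. It is a Python
sketch, not the XFS implementation; MountTable and DuplicateUUID are
hypothetical names, and the handling of "nouuid" here (skip the check
and do not record the UUID) is an assumption of the sketch.

```python
import uuid


class DuplicateUUID(Exception):
    pass


class MountTable:
    """Toy model: tracks the UUIDs of currently mounted filesystems."""

    def __init__(self) -> None:
        self._uuids = set()

    def mount(self, fs_uuid: uuid.UUID, nouuid: bool = False) -> None:
        if nouuid:
            # Skip the duplicate check entirely and do not record the UUID.
            return
        if fs_uuid in self._uuids:
            raise DuplicateUUID(str(fs_uuid))
        self._uuids.add(fs_uuid)
```

Mounting an origin volume and then its device-level snapshot (same
UUID) raises DuplicateUUID. The snapshot mounts either after its UUID
has been regenerated, or with nouuid=True.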
> As I've been trying to point out, exactly what the semantics of a file
> system level UUID are has never been well defined, and it's not clear what
> various subsystems are trying to *do* with the UUID. And given
> what can happen with mount name spaces, bind mounts, etc., we should
> ask whether the assumptions they are making with respect to the UUID are
> in fact something we should be encouraging.

We can't put that genie back in the bottle.

But it does raise a further interesting question about sb->s_uuid:
is one uuid sufficient for a superblock? We have two specific use
cases here:

1. A uuid that uniquely identifies every filesystem (e.g. blkid,
   pnfs, /dev/disk/by-uuid/, etc)

2. A persistent, unchanging uuid that can be used to key persistent
   objects to the underlying filesystem (overlay, security xattrs,
   etc) regardless of snapshots, cloning, dedupe, etc.

We already have a solution to that problem in XFS: sbp->sb_uuid
is for case #1, sbp->sb_metauuid is for case #2 as every metadata
block in the filesystem is keyed with sbp->sb_metauuid. Both start out
the same at mkfs time, but if we then regenerate the filesystem
uuid, then only sbp->sb_uuid is changed. We do not rewrite metadata
with the new uuid; doing so would break snapshot/clone/dedupe in
shared filesystem images.

This is one of the things that the XFS online UUID change proposal
added - it allowed for userspace to query the sbp->sb_metauuid in
addition to the sbp->sb_uuid so that userspace init script
orchestration could make use of it for persistent userspace filesystem
objects rather than the sbp->s_uuid identifier....
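
The two-UUID arrangement can be illustrated with a small model. This is
a Python sketch with made-up types (ToyXfs, MetaBlock) and helper
names; it borrows the field names used above but is not the XFS
on-disk format or code.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class MetaBlock:
    owner_uuid: uuid.UUID  # stamped into the block when it is written


@dataclass
class ToyXfs:
    sb_uuid: uuid.UUID      # case #1: user-visible identity, may change
    sb_metauuid: uuid.UUID  # case #2: persistent, keys all metadata
    blocks: list = field(default_factory=list)


def mkfs() -> ToyXfs:
    u = uuid.uuid4()
    return ToyXfs(sb_uuid=u, sb_metauuid=u)  # both start out identical


def write_block(fs: ToyXfs) -> None:
    fs.blocks.append(MetaBlock(owner_uuid=fs.sb_metauuid))


def regen_uuid(fs: ToyXfs) -> None:
    fs.sb_uuid = uuid.uuid4()  # metadata is not rewritten


def verify(fs: ToyXfs) -> bool:
    return all(b.owner_uuid == fs.sb_metauuid for b in fs.blocks)
```

After regen_uuid(), verify() still passes for blocks written both
before and after the change, because verification never looks at
sb_uuid.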
> > But allowing to refuse being mounted on older kernels when the feature
> > bit is set and unknown can be quite useful. So this is also fine by me.
>
> This pretty much guarantees people won't use the feature for a while.

Perfectly fine by me. Those that need it will backport/upgrade both
userspace and kernels immediately, and they reap the benefits
immediately. Everyone else gets it as distros roll out with the
functionality enabled and fully supported across the toolchain.

This is how all new feature additions work, so I'm not sure why you
think this is a reason not to use a feature bit...

> People complain when a file system can't be mounted. Using a feature
> bit is also very likely to mean that you won't be able to run an older
> fsck on that file system --- for what users would complain is no
> good reason. And arguably, they would be right to complain.

In general, yes, but this is *not a general case*.

If you have a golden image with the feature bit set, why would you
ever run a fsck that doesn't support the feature bit on it? You have
to have a tool chain that supports the feature bit to set it in the
first place.

And if the feature bit is set, then you must be running client kernels
that support it (and will clear it on first mount), so once the
client system is running, the feature bit will never be set and so
the toolchain in the client OS just doesn't matter at all.

There is literally no other use case for this feature, so arguing
about generalities that simply don't apply to the specific use case
really isn't that helpful.

As a result, I don't see that there are any concerns about using a
feature bit at all, yet I see substantial benefit from not
retrospectively redefining a special on-disk UUID value that
silently drives new kernel behaviour.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
From: Theodore Ts'o @ 2023-06-02 14:58 UTC (permalink / raw)
To: Dave Chinner
Cc: Darrick J. Wong, Christian Brauner, Amir Goldstein, Jeff Layton,
miklos, linux-fsdevel, linux-xfs
On Fri, Jun 02, 2023 at 04:34:58PM +1000, Dave Chinner wrote:
> IOWs, not only was the ext4 functionality poorly thought out, it
> was *poorly implemented*.

Shrug. It's 100% compatible with "tune2fs -U <uuid>" which existed
prior to sb->s_uuid and /proc/XXX/mountstats, and which has allowed
on-line, mounted changes of the UUID. So as far as I'm concerned,
it's "working as intended". It fixed a real bug where racing
resize2fs and tune2fs -U in separate systemd unit files could result
in superblock checksum failures.

It doesn't make any changes to how on-line "tune2fs -U <uuid>"
functioned, because s_uuid wasn't terribly well
defined (and "tune2fs -U" predates it in any case). Originally s_uuid
was just to allow /proc/XXX/mountstats to expose the UUID, but at this
point, I don't think anyone has a complete understanding of the
assumptions that overlayfs, IMA, and other userspace utilities make
about how the file system UUID should be used and what it denotes.

> However, we have a golden image that every client image is cloned
> from. Say we set a special feature bit in that golden image that
> means "need UUID regeneration". Then on the first mount of the
> cloned image after provisioning, the filesystem sees the bit and
> automatically regenerates the UUID without needing any help from
> userspace at all.
> Problem solved, yes? We don't need userspace to change the uuid on
> first boot of the newly provisioned VM - the filesystem just makes
> it happen.

I agree that's a good design --- and ten years from now, when all of the
users using old versions of RHEL have finally migrated off to a
version of some enterprise linux that supports this new feature, the
cloud agents which are using "tune2fs -U <uuid>" or "xfs_admin -U
<uuid>" can stop relying on it and switch to this new scheme.

What we could do is to make it easy to determine whether the kernel
supports the "UUID regeneration" feature, and whether the file system
had its UUID regenerated (because some cloud images generated using an
older distro's installer won't request the UUID regeneration), so that
cloud agents (which are typically installed as a daemon that starts
out of an init.d or systemd unit file) will know whether or not they
need to fall back to the userspace UUID regeneration.
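
The agent-side decision could be as simple as the following. This is a
Python sketch; how the two booleans would actually be obtained from the
kernel is exactly the open question, so both inputs and the function
name are hypothetical. The commands it returns are the existing
userspace tools.

```python
from typing import List, Optional


def fallback_command(fstype: str, device: str,
                     kernel_supports_regen: bool,
                     uuid_was_regenerated: bool) -> Optional[List[str]]:
    """Return the userspace command to run, or None if the kernel already did it."""
    if kernel_supports_regen and uuid_was_regenerated:
        return None
    if fstype in ("ext2", "ext3", "ext4"):
        return ["tune2fs", "-U", "random", device]
    if fstype == "xfs":
        return ["xfs_admin", "-U", "generate", device]
    raise ValueError(f"no known UUID change tool for {fstype}")
```

Note that a kernel which supports the feature is not enough on its own:
an image built by an older installer never requested regeneration, so
the agent still has to fall back in that case.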

For cloud agents which are installed as a one-shot executable run out
of the initramfs, we might be able to change the UUID before the root
file system is mounted. Of course, there are those userspace setups
where the use of an initramfs is optional or not used at all.

So for the short-term, we're going to be stuck with userspace mediated
UUID changes, and there are going to be userspace or kernel
subsystems that are going to be surprised when the UUID changes out from
under them. So having some kind of documentation which describes how
various subsystems are using the file system UUID, and whether they
are getting it from sb->s_uuid, /proc/XXX/mountstats, or some other
source, would probably be useful. After all, system
administrators' access to "tune2fs -U" and "xfs_admin -U" isn't going
away, and if we're saying "it's up to them to understand the
implications", it's nice if we document the gotchas. :-)

- Ted
* Re: uuid ioctl - was: Re: [PATCH] overlayfs: Trigger file re-evaluation by IMA / EVM after writes
From: Dave Chinner @ 2023-06-04 22:35 UTC (permalink / raw)
To: Theodore Ts'o
Cc: Darrick J. Wong, Christian Brauner, Amir Goldstein, Jeff Layton,
miklos, linux-fsdevel, linux-xfs
On Fri, Jun 02, 2023 at 10:58:16AM -0400, Theodore Ts'o wrote:
> On Fri, Jun 02, 2023 at 04:34:58PM +1000, Dave Chinner wrote:
> > However, we have a golden image that every client image is cloned
> > from. Say we set a special feature bit in that golden image that
> > means "need UUID regeneration". Then on the first mount of the
> > cloned image after provisioning, the filesystem sees the bit and
> > automatically regenerates the UUID without needing any help from
> > userspace at all.
>
> > Problem solved, yes? We don't need userspace to change the uuid on
> > first boot of the newly provisioned VM - the filesystem just makes
> > it happen.
>
> I agree that's a good design --- and ten years from now, when all of the
> users using old versions of RHEL have finally migrated off to a
> version of some enterprise linux that supports this new feature, the
> cloud agents which are using "tune2fs -U <uuid>" or "xfs_admin -U
> <uuid>" can stop relying on it and switch to this new scheme.

We're talking about building new infrastructure - regardless
of anything else in this discussion, existing software will always
do what existing software does.

As low level infrastructure designers, we have to think *10 years
ahead* and design for when the feature will be widespread. Designing
infrastructure with "we need a fix right now" in mind almost always
ends with poor results because the focus is "this thing right now"
instead of "how will this work when this gets deployed world-wide by
everyone"....

ext4 developers and the hyperscalers that employ them made a bad
decision due to short-termism. It's only right that the wider
community pushes back against propagating that bad decision into
generic code that everyone will have to live with for the next 20+
years.

We can do better. We *should* be doing better.

> So for the short-term, we're going to be stuck with userspace mediated
> UUID changes, and there are going to be userspace or kernel

No, "we" aren't stuck with whacky dynamic runtime ext4 UUID changes.
*ext4 developers* and _hyperscalers that have deployed this on ext4_
are stuck with this awful stuff.

Everyone else gets to learn from the mistakes that have been made,
and "we" will end up with a generic solution that is better and will
work on all filesystems that support UUIDs, including ext4.

-Dave.
--
Dave Chinner
david@fromorbit.com