Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* [PATCH v4 05/10] man/man2/fsmount.2: document "new" mount API
From: Aleksa Sarai @ 2025-09-19  1:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner, Aleksa Sarai
In-Reply-To: <20250919-new-mount-api-v4-0-1261201ab562@cyphar.com>

This is loosely based on the original documentation written by David
Howells and later maintained by Christian Brauner, but has been
rewritten to be more from a user perspective (as well as fixing a few
critical mistakes).

Co-authored-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Co-authored-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man/man2/fsmount.2 | 231 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 231 insertions(+)

diff --git a/man/man2/fsmount.2 b/man/man2/fsmount.2
new file mode 100644
index 0000000000000000000000000000000000000000..c054c04376975c620aec08b76ad5151d8b6ae2ed
--- /dev/null
+++ b/man/man2/fsmount.2
@@ -0,0 +1,231 @@
+.\" Copyright, the authors of the Linux man-pages project
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH fsmount 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+fsmount \- instantiate mount object from filesystem context
+.SH LIBRARY
+Standard C library
+.RI ( libc ,\~ \-lc )
+.SH SYNOPSIS
+.nf
+.B #include <sys/mount.h>
+.P
+.BI "int fsmount(int " fsfd ", unsigned int " flags ", \
+unsigned int " attr_flags );
+.fi
+.SH DESCRIPTION
+The
+.BR fsmount ()
+system call is part of
+the suite of file descriptor based mount facilities in Linux.
+.P
+.BR fsmount ()
+creates a new detached mount object
+for the root of the new filesystem instance
+referenced by the filesystem context file descriptor
+.IR fsfd .
+A new file descriptor
+associated with the detached mount object
+is then returned.
+In order to create a mount object with
+.BR fsmount (),
+the calling process must have the
+.BR \%CAP_SYS_ADMIN
+capability.
+.P
+The filesystem context must have been created with a call to
+.BR fsopen (2)
+and then had a filesystem instance instantiated with a call to
+.BR fsconfig (2)
+with
+.B \%FSCONFIG_CMD_CREATE
+or
+.B \%FSCONFIG_CMD_CREATE_EXCL
+in order to be in the correct state
+for this operation
+(the "awaiting-mount" mode in kernel-developer parlance).
+.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
+Unlike
+.BR open_tree (2)
+with
+.BR \%OPEN_TREE_CLONE,
+.BR fsmount ()
+can only be called once
+in the lifetime of a filesystem context
+to produce a mount object.
+.P
+As with file descriptors returned from
+.BR open_tree (2)
+called with
+.BR OPEN_TREE_CLONE ,
+the returned file descriptor
+can then be used with
+.BR move_mount (2),
+.BR mount_setattr (2),
+or other such system calls to do further mount operations.
+This mount object will be unmounted and destroyed
+when the file descriptor is closed
+if it was not otherwise attached to a mount point
+by calling
+.BR move_mount (2).
+(Note that the unmount operation on
+.BR close (2)
+is lazy\[em]akin to calling
+.BR umount2 (2)
+with
+.BR MOUNT_DETACH ;
+any existing open references to files
+from the mount object
+will continue to work,
+and the mount object will only be completely destroyed
+once it ceases to be busy.)
+The returned file descriptor
+also acts the same as one produced by
+.BR open (2)
+with
+.BR O_PATH ,
+meaning it can also be used as a
+.I dirfd
+argument
+to "*at()" system calls.
+.P
+.I flags
+controls the creation of the returned file descriptor.
+A value for
+.I flags
+is constructed by bitwise ORing
+zero or more of the following constants:
+.RS
+.TP
+.B FSMOUNT_CLOEXEC
+Set the close-on-exec
+.RB ( FD_CLOEXEC )
+flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.RE
+.P
+.I attr_flags
+specifies mount attributes
+which will be applied to the created mount object,
+in the form of
+.BI \%MOUNT_ATTR_ *
+flags.
+The flags are interpreted as though
+.BR mount_setattr (2)
+was called with
+.I attr.attr_set
+set to the same value as
+.IR attr_flags .
+.BI \%MOUNT_ATTR_ *
+flags which would require
+specifying additional fields in
+.BR mount_attr (2type)
+(such as
+.BR \%MOUNT_ATTR_IDMAP )
+are not valid flag values for
+.IR attr_flags .
+.P
+If the
+.BR fsmount ()
+operation is successful,
+the filesystem context
+associated with the file descriptor
+.I fsfd
+is reset
+and placed into reconfiguration mode,
+as if it were just returned by
+.BR fspick (2).
+You may continue to use
+.BR fsconfig (2)
+with the now-reset filesystem context,
+including issuing the
+.B \%FSCONFIG_CMD_RECONFIGURE
+command
+to reconfigure the filesystem instance.
+.SH RETURN VALUE
+On success, a new file descriptor is returned.
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBUSY
+The filesystem context associated with
+.I fsfd
+is not in the right state
+to be used by
+.BR fsmount ().
+.TP
+.B EINVAL
+.I flags
+had an invalid flag set.
+.TP
+.B EINVAL
+.I attr_flags
+had an invalid
+.BI MOUNT_ATTR_ *
+flag set.
+.TP
+.B EMFILE
+The calling process has too many open files to create more.
+.TP
+.B ENFILE
+The system has too many open files to create more.
+.TP
+.B ENOSPC
+The "anonymous" mount namespace
+necessary to contain the new mount object
+could not be allocated,
+as doing so would exceed
+the configured per-user limit on
+the number of mount namespaces in the current user namespace.
+(See also
+.BR namespaces (7).)
+.TP
+.B ENOMEM
+The kernel could not allocate sufficient memory to complete the operation.
+.TP
+.B EPERM
+The calling process does not have the required
+.B CAP_SYS_ADMIN
+capability.
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux 5.2.
+.\" commit 93766fbd2696c2c4453dd8e1070977e9cd4e6b6d
+.\" commit 400913252d09f9cfb8cce33daee43167921fc343
+glibc 2.36.
+.SH EXAMPLES
+.in +4n
+.EX
+int fsfd, mntfd, tmpfd;
+\&
+fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
+fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NODEV | MOUNT_ATTR_NOEXEC);
+\&
+/* Create a new file without attaching the mount object. */
+int tmpfd = openat(mntfd, "tmpfile", O_CREAT | O_EXCL | O_RDWR, 0600);
+unlinkat(mntfd, "tmpfile", 0);
+\&
+/* Attach the mount object to "/tmp". */
+move_mount(mntfd, "", AT_FDCWD, "/tmp", MOVE_MOUNT_F_EMPTY_PATH);
+.EE
+.in
+.SH SEE ALSO
+.BR fsconfig (2),
+.BR fsopen (2),
+.BR fspick (2),
+.BR mount (2),
+.BR mount_setattr (2),
+.BR move_mount (2),
+.BR open_tree (2),
+.BR mount_namespaces (7)
+

-- 
2.51.0


^ permalink raw reply related

* [PATCH v4 04/10] man/man2/fsconfig.2: document "new" mount API
From: Aleksa Sarai @ 2025-09-19  1:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner, Aleksa Sarai
In-Reply-To: <20250919-new-mount-api-v4-0-1261201ab562@cyphar.com>

This is loosely based on the original documentation written by David
Howells and later maintained by Christian Brauner, but has been
rewritten to be more from a user perspective (as well as fixing a few
critical mistakes).

Co-authored-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Co-authored-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man/man2/fsconfig.2 | 727 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 727 insertions(+)

diff --git a/man/man2/fsconfig.2 b/man/man2/fsconfig.2
new file mode 100644
index 0000000000000000000000000000000000000000..5a18e08c700ac93aa22c341b4134944ee3c38d0b
--- /dev/null
+++ b/man/man2/fsconfig.2
@@ -0,0 +1,727 @@
+.\" Copyright, the authors of the Linux man-pages project
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH fsconfig 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+fsconfig \- configure new or existing filesystem context
+.SH LIBRARY
+Standard C library
+.RI ( libc ,\~ \-lc )
+.SH SYNOPSIS
+.nf
+.B #include <sys/mount.h>
+.P
+.BI "int fsconfig(int " fd ", unsigned int " cmd ,
+.BI "             const char *_Nullable " key ,
+.BI "             const void *_Nullable " value ", int " aux );
+.fi
+.SH DESCRIPTION
+The
+.BR fsconfig ()
+system call is part of
+the suite of file descriptor based mount facilities in Linux.
+.P
+.BR fsconfig ()
+is used to supply parameters to
+and issue commands against
+the filesystem configuration context
+associated with the file descriptor
+.IR fd .
+Filesystem configuration contexts can be created with
+.BR fsopen (2)
+or be instantiated from an extant filesystem instance with
+.BR fspick (2).
+.P
+The
+.I cmd
+argument indicates the command to be issued.
+Some commands supply parameters to the context
+(equivalent to mount options specified with
+.BR mount (8)),
+while others are meta-operations on the filesystem context.
+The list of valid
+.I cmd
+values are:
+.RS
+.TP
+.B FSCONFIG_SET_FLAG
+Set the flag parameter named by
+.IR key .
+.I value
+must be NULL,
+and
+.I aux
+must be 0.
+.TP
+.B FSCONFIG_SET_STRING
+Set the string parameter named by
+.I key
+to the value specified by
+.IR value .
+.I value
+points to a null-terminated string,
+and
+.I aux
+must be 0.
+.TP
+.B FSCONFIG_SET_BINARY
+Set the blob parameter named by
+.I key
+to the contents of the binary blob
+specified by
+.IR value .
+.I value
+points to
+the start of a buffer
+that is
+.I aux
+bytes in length.
+.TP
+.B FSCONFIG_SET_FD
+Set the file parameter named by
+.I key
+to the open file description
+referenced by the file descriptor
+.IR aux .
+.I value
+must be NULL.
+.IP
+You may also use
+.B \%FSCONFIG_SET_STRING
+for file parameters,
+with
+.I value
+set to a null-terminated string
+containing a base-10 representation
+of the file descriptor number.
+This mechanism is primarily intended for compatibility
+with older
+.BR mount (2)-based
+programs,
+and only works for parameters
+that
+.I only
+accept file descriptor arguments.
+.TP
+.B FSCONFIG_SET_PATH
+Set the path parameter named by
+.I key
+to the object at a provided path,
+resolved in a similar manner to
+.BR openat (2).
+.I value
+points to a null-terminated pathname string,
+and
+.I aux
+is equivalent to the
+.I dirfd
+argument to
+.BR openat (2).
+See
+.BR openat (2)
+for an explanation of the need for
+.BR \%FSCONFIG_SET_PATH .
+.IP
+You may also use
+.B \%FSCONFIG_SET_STRING
+for path parameters,
+the behaviour of which is equivalent to
+.B \%FSCONFIG_SET_PATH
+with
+.I aux
+set to
+.BR \%AT_FDCWD .
+.TP
+.B FSCONFIG_SET_PATH_EMPTY
+As with
+.BR \%FSCONFIG_SET_PATH ,
+except that if
+.I value
+is an empty string,
+the file descriptor specified by
+.I aux
+is operated on directly
+and may be any type of file
+(not just a directory).
+This is equivalent to the behaviour of
+.B \%AT_EMPTY_PATH
+with most "*at()" system calls.
+If
+.I aux
+is
+.BR \%AT_FDCWD ,
+the parameter will be set to
+the current working directory
+of the calling process.
+.TP
+.B FSCONFIG_CMD_CREATE
+This command instructs the filesystem driver
+to instantiate an instance of the filesystem in the kernel
+with the parameters specified in the filesystem configuration context.
+.I key
+and
+.I value
+must be NULL,
+and
+.I aux
+must be 0.
+.IP
+This command can only be issued once
+in the lifetime of a filesystem context.
+If the operation succeeds,
+the filesystem context
+associated with file descriptor
+.I fd
+now references the created filesystem instance,
+and is placed into a special "awaiting-mount" mode
+that allows you to use
+.BR fsmount (2)
+to create a mount object from the filesystem instance.
+.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
+If the operation fails,
+in most cases
+the filesystem context is placed in a failed mode
+and cannot be used for any further
+.BR fsconfig ()
+operations
+(though you may still retrieve diagnostic messages
+through the message retrieval interface,
+as described in
+the corresponding subsection of
+.BR fsopen (2)).
+.IP
+This command can only be issued against
+filesystem configuration contexts
+that were created with
+.BR fsopen (2).
+In order to create a filesystem instance,
+the calling process must have the
+.B \%CAP_SYS_ADMIN
+capability.
+.IP
+An important thing to be aware of is that
+the Linux kernel will
+.I silently
+reuse extant filesystem instances
+depending on the filesystem type
+and the configured parameters
+(each filesystem driver has
+its own policy for
+how filesystem instances are reused).
+This means that
+the filesystem instance "created" by
+.B \%FSCONFIG_CMD_CREATE
+may, in fact, be a reference
+to an extant filesystem instance in the kernel.
+(For reference,
+this behaviour also applies to
+.BR mount (2).)
+.IP
+One side-effect of this behaviour is that
+if an extant filesystem instance is reused,
+.I all
+parameters configured
+for this filesystem configuration context
+are
+.I silently ignored
+(with the exception of the
+.I ro
+and
+.I rw
+flag parameters;
+if the state of the read-only flag in the
+extant filesystem instance and the filesystem configuration context
+do not match, this operation will return
+.BR EBUSY ).
+This also means that
+.BR \%FSCONFIG_CMD_RECONFIGURE
+commands issued against
+the "created" filesystem instance
+will also affect any mount objects associated with
+the extant filesystem instance.
+.IP
+Programs that need to ensure
+that they create a new filesystem instance
+with specific parameters
+(notably, security-related parameters
+such as
+.I acl
+to enable POSIX ACLs\[em]as described in
+.BR acl (5))
+should use
+.B \%FSCONFIG_CMD_CREATE_EXCL
+instead.
+.TP
+.BR FSCONFIG_CMD_CREATE_EXCL " (since Linux 6.6)"
+.\" commit 22ed7ecdaefe0cac0c6e6295e83048af60435b13
+.\" commit 84ab1277ce5a90a8d1f377707d662ac43cc0918a
+As with
+.BR \%FSCONFIG_CMD_CREATE ,
+except that the kernel is instructed
+to not reuse extant filesystem instances.
+If the operation
+would be forced to
+reuse an extant filesystem instance,
+this operation will return
+.B EBUSY
+instead.
+.IP
+As a result (unlike
+.BR \%FSCONFIG_CMD_CREATE ),
+if this operation succeeds
+then the calling process can be sure that
+all of the parameters successfully configured with
+.BR fsconfig ()
+will actually be applied
+to the created filesystem instance.
+.TP
+.B FSCONFIG_CMD_RECONFIGURE
+This command instructs the filesystem driver
+to apply the parameters specified in the filesystem configuration context
+to the extant filesystem instance
+referenced by the filesystem configuration context.
+.I key
+and
+.I value
+must be NULL,
+and
+.I aux
+must be 0.
+.IP
+This is primarily intended for use with
+.BR fspick (2),
+but may also be used to modify
+the parameters of a filesystem instance
+after
+.BR \%FSCONFIG_CMD_CREATE
+was used to create it
+and a mount object was created using
+.BR fsmount (2).
+In order to reconfigure an extant filesystem instance,
+the calling process must have the
+.B CAP_SYS_ADMIN
+capability.
+.IP
+If the operation succeeds,
+the filesystem context is reset
+but remains in reconfiguration mode
+and thus can be reused for subsequent
+.B \%FSCONFIG_CMD_RECONFIGURE
+commands.
+If the operation fails,
+in most cases
+the filesystem context is placed in a failed mode
+and cannot be used for any further
+.BR fsconfig ()
+operations
+(though you may still retrieve diagnostic messages
+through the message retrieval interface,
+as described in
+the corresponding subsection of
+.BR fsopen (2)).
+.RE
+.P
+Parameters specified with
+.BI FSCONFIG_SET_ *
+do not take effect
+until a corresponding
+.B \%FSCONFIG_CMD_CREATE
+or
+.B \%FSCONFIG_CMD_RECONFIGURE
+command is issued.
+.SH RETURN VALUE
+On success,
+.BR fsconfig ()
+returns 0.
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+If an error occurs, the filesystem driver may provide
+additional information about the error
+through the message retrieval interface for filesystem configuration contexts.
+This additional information can be retrieved at any time by calling
+.BR read (2)
+on the filesystem instance or filesystem configuration context
+referenced by the file descriptor
+.IR fd .
+(See the "Message retrieval interface" subsection in
+.BR fsopen (2)
+for more details on the message format.)
+.P
+Even after an error occurs,
+the filesystem configuration context is
+.I not
+invalidated,
+and thus can still be used with other
+.BR fsconfig ()
+commands.
+This means that users can probe support for filesystem parameters
+on a per-parameter basis,
+and adjust which parameters they wish to set.
+.P
+The error values given below result from
+filesystem type independent errors.
+Each filesystem type may have its own special errors
+and its own special behavior.
+See the Linux kernel source code for details.
+.TP
+.B EACCES
+A component of a path
+provided as a path parameter
+was not searchable.
+(See also
+.BR path_resolution (7).)
+.TP
+.B EACCES
+.B \%FSCONFIG_CMD_CREATE
+was attempted
+for a read-only filesystem
+without specifying the
+.RB ' ro '
+flag parameter.
+.TP
+.B EACCES
+A specified block device parameter
+is located on a filesystem
+mounted with the
+.B \%MS_NODEV
+option.
+.TP
+.B EBADF
+The file descriptor given by
+.I fd
+(or possibly by
+.IR aux ,
+depending on the command)
+is invalid.
+.TP
+.B EBUSY
+The filesystem context associated with
+.I fd
+is in the wrong state
+for the given command.
+.TP
+.B EBUSY
+The filesystem instance cannot be reconfigured as read-only
+with
+.B \%FSCONFIG_CMD_RECONFIGURE
+because some programs
+still hold files open for writing.
+.TP
+.B EBUSY
+A new filesystem instance was requested with
+.B \%FSCONFIG_CMD_CREATE_EXCL
+but a matching superblock already existed.
+.TP
+.B EFAULT
+One of the pointer arguments
+points to a location
+outside the calling process's accessible address space.
+.TP
+.B EINVAL
+.I fd
+does not refer to
+a filesystem configuration context
+or filesystem instance.
+.TP
+.B EINVAL
+One of the values of
+.IR name ,
+.IR value ,
+and/or
+.I aux
+were set to a non-zero value when
+.I cmd
+required that they be zero
+(or NULL).
+.TP
+.B EINVAL
+The parameter named by
+.I name
+cannot be set
+using the type specified with
+.IR cmd .
+.TP
+.B EINVAL
+One of the source parameters
+referred to
+an invalid superblock.
+.TP
+.B ELOOP
+Too many links encountered
+during pathname resolution
+of a path argument.
+.TP
+.B ENAMETOOLONG
+A path argument was longer than
+.BR PATH_MAX .
+.TP
+.B ENOENT
+A path argument had a non-existent component.
+.TP
+.B ENOENT
+A path argument is an empty string,
+but
+.I cmd
+is not
+.BR \%FSCONFIG_SET_PATH_EMPTY .
+.TP
+.B ENOMEM
+The kernel could not allocate sufficient memory to complete the operation.
+.TP
+.B ENOTBLK
+The parameter named by
+.I name
+must be a block device,
+but the provided parameter value was not a block device.
+.TP
+.B ENOTDIR
+A component of the path prefix
+of a path argument
+was not a directory.
+.TP
+.B EOPNOTSUPP
+The command given by
+.I cmd
+is not valid.
+.TP
+.B ENXIO
+The major number
+of a block device parameter
+is out of range.
+.TP
+.B EPERM
+The command given by
+.I cmd
+was
+.BR \%FSCONFIG_CMD_CREATE ,
+.BR \%FSCONFIG_CMD_CREATE_EXCL ,
+or
+.BR \%FSCONFIG_CMD_RECONFIGURE ,
+but the calling process does not have the required
+.B \%CAP_SYS_ADMIN
+capability.
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux 5.2.
+.\" commit ecdab150fddb42fe6a739335257949220033b782
+.\" commit 400913252d09f9cfb8cce33daee43167921fc343
+glibc 2.36.
+.SH NOTES
+.SS Generic filesystem parameters
+Each filesystem driver is responsible for
+parsing most parameters specified with
+.BR fsconfig (),
+meaning that individual filesystems
+may have very different behaviour
+when encountering parameters with the same name.
+In general,
+you should not assume that the behaviour of
+.BR fsconfig ()
+when specifying a parameter to one filesystem type
+will match the behaviour of the same parameter
+with a different filesystem type.
+.P
+However,
+the following generic parameters
+apply to all filesystems and have unified behaviour.
+They are set using the listed
+.BI \%FSCONFIG_SET_ *
+command.
+.TP
+\fIro\fP and \fIrw\fP (\fB\%FSCONFIG_SET_FLAG\fP)
+Configure whether the filesystem instance is read-only.
+.TP
+\fIdirsync\fP (\fB\%FSCONFIG_SET_FLAG\fP)
+Make directory changes on this filesystem instance synchronous.
+.TP
+\fIsync\fP and \fIasync\fP (\fB\%FSCONFIG_SET_FLAG\fP)
+Configure whether writes on this filesystem instance
+will be made synchronous
+(as though the
+.B O_SYNC
+flag to
+.BR open (2)
+was specified for
+all file opens in this filesystem instance).
+.TP
+\fIlazytime\fP and \fInolazytime\fP (\fB\%FSCONFIG_SET_FLAG\fP)
+Configure whether to reduce on-disk updates
+of inode timestamps on this filesystem instance
+(as described in the
+.B \%MS_LAZYTIME
+section of
+.BR mount (2)).
+.TP
+\fImand\fP and \fInomand\fP (\fB\%FSCONFIG_SET_FLAG\fP)
+Configure whether the filesystem instance should permit mandatory locking.
+Since Linux 5.15,
+.\" commit f7e33bdbd6d1bdf9c3df8bba5abcf3399f957ac3
+mandatory locking has been deprecated
+and setting this flag is a no-op.
+.TP
+\fIsource\fP (\fB\%FSCONFIG_SET_STRING\fP)
+This parameter is equivalent to the
+.I source
+parameter passed to
+.BR mount (2)
+for the same filesystem type,
+and is usually the pathname of a block device
+containing the filesystem.
+This parameter may only be set once
+per filesystem configuration context transaction.
+.P
+In addition,
+any filesystem parameters associated with
+Linux Security Modules (LSMs)
+are also generic with respect to the underlying filesystem.
+See the documentation for the LSM you wish to configure for more details.
+.SH CAVEATS
+.SS Filesystem parameter types
+As a result of
+each filesystem driver being responsible for
+parsing most parameters specified with
+.BR fsconfig (),
+some filesystem drivers
+may have unintuitive behaviour
+with regards to which
+.BI \%FSCONFIG_SET_ *
+commands are permitted
+to configure a given parameter.
+.P
+In order for
+filesystem parameters to be backwards compatible with
+.BR mount (2),
+they must be parseable as strings;
+this almost universally means that
+.B \%FSCONFIG_SET_STRING
+can also be used to configure them.
+.\" Aleksa Sarai
+.\"   Theoretically, a filesystem could check fc->oldapi and refuse
+.\"   FSCONFIG_SET_STRING if the operation is coming from the new API, but no
+.\"   filesystems do this (and probably never will).
+However, other
+.BI \%FSCONFIG_SET_ *
+commands need to be opted into
+by each filesystem driver's parameter parser.
+.P
+One of the most user-visible instances of
+this inconsistency is that
+many filesystems do not support
+configuring path parameters with
+.B \%FSCONFIG_SET_PATH
+(despite the name),
+which can lead to somewhat confusing
+.B EINVAL
+errors.
+(For example, the generic
+.I source
+parameter\[em]which is usually a path\[em]can only be configured
+with
+.BR \%FSCONFIG_SET_STRING .)
+.P
+When writing programs that use
+.BR fsconfig ()
+to configure parameters
+with commands other than
+.BR \%FSCONFIG_SET_STRING ,
+users should verify
+that the
+.BI \%FSCONFIG_SET_ *
+commands used to configure each parameter
+are supported by the corresponding filesystem driver.
+.\" Aleksa Sarai
+.\"   While this (quite confusing) inconsistency in behaviour is true today
+.\"   (and has been true since this was merged), this appears to mostly be an
+.\"   unintended consequence of filesystem drivers hand-coding fsparam parsing.
+.\"   Path parameters are the most eggregious causes of confusion. Hopefully we
+.\"   can make this no longer the case in a future kernel.
+.SH EXAMPLES
+To illustrate the different kinds of flags that can be configured with
+.BR fsconfig (),
+here are a few examples of some different filesystems being created:
+.P
+.in +4n
+.EX
+int fsfd, mntfd;
+\&
+fsfd = fsopen("tmpfs", FSOPEN_CLOEXEC);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "inode64", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "uid", "1234", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "huge", "never", 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "casefold", NULL, 0);
+fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOEXEC);
+move_mount(mntfd, "", AT_FDCWD, "/tmp", MOVE_MOUNT_F_EMPTY_PATH);
+\&
+fsfd = fsopen("erofs", FSOPEN_CLOEXEC);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "/dev/loop0", 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
+fsconfig(fsfd, FSCONFIG_CMD_CREATE_EXCL, NULL, NULL, 0);
+mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_NOSUID);
+move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
+.EE
+.in
+.P
+Usually,
+specifying the same parameter named by
+.I key
+multiple times with
+.BR fsconfig ()
+causes the parameter value to be replaced.
+However, some filesystems may have unique behaviour:
+.P
+.in +4n
+.EX
+\&
+int fsfd, mntfd;
+int lowerdirfd = open("/o/ctr/lower1", O_DIRECTORY | O_CLOEXEC);
+\&
+fsfd = fsopen("overlay", FSOPEN_CLOEXEC);
+/* "lowerdir+" appends to the lower dir stack each time. */
+fsconfig(fsfd, FSCONFIG_SET_FD, "lowerdir+", NULL, lowerdirfd);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower2", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower3", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "lowerdir+", "/o/ctr/lower4", 0);
+.\" fsconfig(fsfd, FSCONFIG_SET_PATH, "lowerdir+", "/o/ctr/lower5", AT_FDCWD);
+.\" fsconfig(fsfd, FSCONFIG_SET_PATH_EMPTY, "lowerdir+", "", lowerdirfd);
+.\" Aleksa Sarai: Hopefully these will also be supported in the future.
+fsconfig(fsfd, FSCONFIG_SET_STRING, "xino", "auto", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "nfs_export", "off", 0);
+fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
+move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
+.EE
+.in
+.P
+And here is an example of how
+.BR fspick (2)
+can be used with
+.BR fsconfig ()
+to reconfigure the parameters
+of an extant filesystem instance
+attached to
+.IR /proc :
+.P
+.in +4n
+.EX
+int fsfd = fspick(AT_FDCWD, "/proc", FSPICK_CLOEXEC);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "hidepid", "ptraceable", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "subset", "pid", 0);
+fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
+.EE
+.in
+.SH SEE ALSO
+.BR fsmount (2),
+.BR fsopen (2),
+.BR fspick (2),
+.BR mount (2),
+.BR mount_setattr (2),
+.BR move_mount (2),
+.BR open_tree (2),
+.BR mount_namespaces (7)
+

-- 
2.51.0


^ permalink raw reply related

* [PATCH v4 03/10] man/man2/fspick.2: document "new" mount API
From: Aleksa Sarai @ 2025-09-19  1:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner, Aleksa Sarai
In-Reply-To: <20250919-new-mount-api-v4-0-1261201ab562@cyphar.com>

This is loosely based on the original documentation written by David
Howells and later maintained by Christian Brauner, but has been
rewritten to be more from a user perspective (as well as fixing a few
critical mistakes).

Co-authored-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Co-authored-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man/man2/fspick.2 | 342 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 342 insertions(+)

diff --git a/man/man2/fspick.2 b/man/man2/fspick.2
new file mode 100644
index 0000000000000000000000000000000000000000..1f87293f44658adeb7ab7cffebcac3174888f040
--- /dev/null
+++ b/man/man2/fspick.2
@@ -0,0 +1,342 @@
+.\" Copyright, the authors of the Linux man-pages project
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH fspick 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+fspick \- select filesystem for reconfiguration
+.SH LIBRARY
+Standard C library
+.RI ( libc ,\~ \-lc )
+.SH SYNOPSIS
+.nf
+.BR "#include <fcntl.h>" "          /* Definition of " AT_* " constants */"
+.B #include <sys/mount.h>
+.P
+.BI "int fspick(int " dirfd ", const char *" path ", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR fspick ()
+system call is part of
+the suite of file descriptor based mount facilities in Linux.
+.P
+.BR fspick()
+creates a new filesystem configuration context
+for the extant filesystem instance
+associated with the path described by
+.IR dirfd
+and
+.IR path ,
+places it into reconfiguration mode
+(similar to
+.BR mount (8)
+with the
+.I -o remount
+option).
+A new file descriptor
+associated with the filesystem configuration context
+is then returned.
+The calling process must have the
+.BR CAP_SYS_ADMIN
+capability in order to create a new filesystem configuration context.
+.P
+The resultant file descriptor can be used with
+.BR fsconfig (2)
+to specify the desired set of changes to
+filesystem parameters of the filesystem instance.
+Once the desired set of changes have been configured,
+the changes can be effectuated by calling
+.BR fsconfig (2)
+with the
+.B \%FSCONFIG_CMD_RECONFIGURE
+command.
+Please note that\[em]in contrast to
+the behaviour of
+.B MS_REMOUNT
+with
+.BR mount (2)\[em] fspick ()
+instantiates the filesystem configuration context
+with a copy of
+the extant filesystem's filesystem parameters,
+meaning that a subsequent
+.B \%FSCONFIG_CMD_RECONFIGURE
+operation
+will only update filesystem parameters
+explicitly modified with
+.BR fsconfig (2).
+.P
+As with "*at()" system calls,
+.BR fspick ()
+uses the
+.I dirfd
+argument in conjunction with the
+.I path
+argument to determine the path to operate on, as follows:
+.IP \[bu] 3
+If the pathname given in
+.I path
+is absolute, then
+.I dirfd
+is ignored.
+.IP \[bu]
+If the pathname given in
+.I path
+is relative and
+.I dirfd
+is the special value
+.BR \%AT_FDCWD ,
+then
+.I path
+is interpreted relative to
+the current working directory
+of the calling process (like
+.BR open (2)).
+.IP \[bu]
+If the pathname given in
+.I path
+is relative,
+then it is interpreted relative to
+the directory referred to by the file descriptor
+.I dirfd
+(rather than relative to
+the current working directory
+of the calling process,
+as is done by
+.BR open (2)
+for a relative pathname).
+In this case,
+.I dirfd
+must be a directory
+that was opened for reading
+.RB ( O_RDONLY )
+or using the
+.B O_PATH
+flag.
+.IP \[bu]
+If
+.I path
+is an empty string,
+and
+.I flags
+contains
+.BR \%FSPICK_EMPTY_PATH ,
+then the file descriptor
+.I dirfd
+is operated on directly.
+In this case,
+.I dirfd
+may refer to any type of file,
+not just a directory.
+.P
+See
+.BR openat (2)
+for an explanation of why the
+.I dirfd
+argument is useful.
+.P
+.I flags
+can be used to control aspects of how
+.I path
+is resolved and
+properties of the returned file descriptor.
+A value for
+.I flags
+is constructed by bitwise ORing
+zero or more of the following constants:
+.RS
+.TP
+.B FSPICK_CLOEXEC
+Set the close-on-exec
+.RB ( FD_CLOEXEC )
+flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.TP
+.B FSPICK_EMPTY_PATH
+If
+.I path
+is an empty string,
+operate on the file referred to by
+.I dirfd
+(which may have been obtained from
+.BR open (2),
+.BR fsmount (2),
+or
+.BR open_tree (2)).
+In this case,
+.I dirfd
+may refer to any type of file,
+not just a directory.
+If
+.I dirfd
+is
+.BR \%AT_FDCWD ,
+.BR fspick ()
+will operate on the current working directory
+of the calling process.
+.TP
+.B FSPICK_SYMLINK_NOFOLLOW
+Do not follow symbolic links
+in the terminal component of
+.IR path .
+If
+.I path
+references a symbolic link,
+the returned filesystem context will reference
+the filesystem that the symbolic link itself resides on.
+.TP
+.B FSPICK_NO_AUTOMOUNT
+Do not automount any automount points encountered
+while resolving
+.IR path .
+This allows you to reconfigure an automount point,
+rather than the location that would be mounted.
+This flag has no effect if
+the automount point has already been mounted over.
+.RE
+.P
+As with filesystem contexts created with
+.BR fsopen (2),
+the file descriptor returned by
+.BR fspick ()
+may be queried for message strings at any time by calling
+.BR read (2)
+on the file descriptor.
+(See the "Message retrieval interface" subsection in
+.BR fsopen (2)
+for more details on the message format.)
+.SH RETURN VALUE
+On success, a new file descriptor is returned.
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EACCES
+Search permission is denied
+for one of the directories
+in the path prefix of
+.IR path .
+(See also
+.BR path_resolution (7).)
+.TP
+.B EBADF
+.I path
+is relative but
+.I dirfd
+is neither
+.B \%AT_FDCWD
+nor a valid file descriptor.
+.TP
+.B EFAULT
+.I path
+is NULL
+or a pointer to a location
+outside the calling process's accessible address space.
+.TP
+.B EINVAL
+Invalid flag specified in
+.IR flags .
+.TP
+.B ELOOP
+Too many symbolic links encountered when resolving
+.IR path .
+.TP
+.B EMFILE
+The calling process has too many open files to create more.
+.TP
+.B ENAMETOOLONG
+.I path
+is longer than
+.BR PATH_MAX .
+.TP
+.B ENFILE
+The system has too many open files to create more.
+.TP
+.B ENOENT
+A component of
+.I path
+does not exist,
+or is a dangling symbolic link.
+.TP
+.B ENOENT
+.I path
+is an empty string, but
+.B \%FSPICK_EMPTY_PATH
+is not specified in
+.IR flags .
+.TP
+.B ENOTDIR
+A component of the path prefix of
+.I path
+is not a directory;
+or
+.I path
+is relative and
+.I dirfd
+is a file descriptor referring to a file other than a directory.
+.TP
+.B ENOMEM
+The kernel could not allocate sufficient memory to complete the operation.
+.TP
+.B EPERM
+The calling process does not have the required
+.B \%CAP_SYS_ADMIN
+capability.
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux 5.2.
+.\" commit cf3cba4a429be43e5527a3f78859b1bfd9ebc5fb
+.\" commit 400913252d09f9cfb8cce33daee43167921fc343
+glibc 2.36.
+.SH EXAMPLES
+The following example sets the read-only flag
+on the filesystem instance referenced by
+the mount object attached at
+.IR /tmp .
+.P
+.in +4n
+.EX
+int fsfd = fspick(AT_FDCWD, "/tmp", FSPICK_CLOEXEC);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
+fsconfig(fsfd, FSCONFIG_CMD_RECONFIGURE, NULL, NULL, 0);
+.EE
+.in
+.P
+The above procedure is roughly equivalent to
+the following mount operation using
+.BR mount (2):
+.P
+.in +4n
+.EX
+mount(NULL, "/tmp", NULL, MS_REMOUNT | MS_RDONLY, NULL);
+.EE
+.in
+.P
+With the notable caveat that
+in this example,
+.BR mount (2)
+will clear all other filesystem parameters
+(such as
+.B MS_NOSUID
+or
+.BR MS_NOEXEC );
+.BR fsconfig (2)
+will only modify the
+.I ro
+parameter.
+.SH SEE ALSO
+.BR fsconfig (2),
+.BR fsmount (2),
+.BR fsopen (2),
+.BR mount (2),
+.BR mount_setattr (2),
+.BR move_mount (2),
+.BR open_tree (2),
+.BR mount_namespaces (7)
+

-- 
2.51.0


^ permalink raw reply related

* [PATCH v4 02/10] man/man2/fsopen.2: document "new" mount API
From: Aleksa Sarai @ 2025-09-19  1:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner, Aleksa Sarai
In-Reply-To: <20250919-new-mount-api-v4-0-1261201ab562@cyphar.com>

This is loosely based on the original documentation written by David
Howells and later maintained by Christian Brauner, but has been
rewritten to be more from a user perspective (as well as fixing a few
critical mistakes).

Co-authored-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Co-authored-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man/man2/fsopen.2 | 384 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 384 insertions(+)

diff --git a/man/man2/fsopen.2 b/man/man2/fsopen.2
new file mode 100644
index 0000000000000000000000000000000000000000..7cdbeac7d64b7e5c969dee619a039ec947d1e981
--- /dev/null
+++ b/man/man2/fsopen.2
@@ -0,0 +1,384 @@
+.\" Copyright, the authors of the Linux man-pages project
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH fsopen 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+fsopen \- create a new filesystem context
+.SH LIBRARY
+Standard C library
+.RI ( libc ,\~ \-lc )
+.SH SYNOPSIS
+.nf
+.B #include <sys/mount.h>
+.P
+.BI "int fsopen(const char *" fsname ", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR fsopen ()
+system call is part of
+the suite of file descriptor based mount facilities in Linux.
+.P
+.BR fsopen ()
+creates a blank filesystem configuration context within the kernel
+for the filesystem named by
+.I fsname
+and places it into creation mode.
+A new file descriptor
+associated with the filesystem configuration context
+is then returned.
+The calling process must have the
+.B \%CAP_SYS_ADMIN
+capability in order to create a new filesystem configuration context.
+.P
+A filesystem configuration context is
+an in-kernel representation of a pending transaction,
+containing a set of configuration parameters that are to be applied
+when creating a new instance of a filesystem
+(or modifying the configuration of an existing filesystem instance,
+such as when using
+.BR fspick (2)).
+.P
+After obtaining a filesystem configuration context with
+.BR fsopen (),
+the general workflow for operating on the context looks like the following:
+.IP (1) 5
+Pass the filesystem context file descriptor to
+.BR fsconfig (2)
+to specify any desired filesystem parameters.
+This may be done as many times as necessary.
+.IP (2)
+Pass the same filesystem context file descriptor to
+.BR fsconfig (2)
+with
+.B \%FSCONFIG_CMD_CREATE
+to create an instance of the configured filesystem.
+.IP (3)
+Pass the same filesystem context file descriptor to
+.BR fsmount (2)
+to create a new detached mount object for
+the root of the filesystem instance,
+which is then attached to a new file descriptor.
+(This also places the filesystem context file descriptor into
+reconfiguration mode,
+similar to the mode produced by
+.BR fspick (2).)
+Once a mount object has been created with
+.BR fsmount (2),
+the filesystem context file descriptor can be safely closed.
+.IP (4)
+Now that a mount object has been created,
+you may
+.RS
+.IP (4.1) 7
+use the detached mount object file descriptor as a
+.I dirfd
+argument to "*at()" system calls; and/or
+.IP (4.2) 7
+attach the mount object to a mount point
+by passing the mount object file descriptor to
+.BR move_mount (2).
+This will also prevent the mount object from
+being unmounted and destroyed when
+the mount object file descriptor is closed.
+.RE
+.IP
+The mount object file descriptor will
+remain associated with the mount object
+even after doing the above operations,
+so you may repeatedly use the mount object file descriptor with
+.BR move_mount (2)
+and/or "*at()" system calls
+as many times as necessary.
+.P
+A filesystem context will move between different modes
+throughout its lifecycle
+(such as the creation phase
+when created with
+.BR fsopen (),
+the reconfiguration phase
+when an existing filesystem instance is selected with
+.BR fspick (2),
+and the intermediate "awaiting-mount" phase
+.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
+between
+.BR \%FSCONFIG_CMD_CREATE
+and
+.BR fsmount (2)),
+which has an impact on
+what operations are permitted on the filesystem context.
+.P
+The file descriptor returned by
+.BR fsopen ()
+also acts as a channel for filesystem drivers to
+provide more comprehensive diagnostic information
+than is normally provided through the standard
+.BR errno (3)
+interface for system calls.
+If an error occurs at any time during the workflow mentioned above,
+calling
+.BR read (2)
+on the filesystem context file descriptor
+will retrieve any ancillary information about the encountered errors.
+(See the "Message retrieval interface" section
+for more details on the message format.)
+.P
+.I flags
+can be used to control aspects of
+the creation of the filesystem configuration context file descriptor.
+A value for
+.I flags
+is constructed by bitwise ORing
+zero or more of the following constants:
+.RS
+.TP
+.B FSOPEN_CLOEXEC
+Set the close-on-exec
+.RB ( FD_CLOEXEC )
+flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.RE
+.P
+A list of filesystems supported by the running kernel
+(and thus a list of valid values for
+.IR fsname )
+can be obtained from
+.IR /proc/filesystems .
+(See also
+.BR proc_filesystems (5).)
+.SS Message retrieval interface
+When doing operations on a filesystem configuration context,
+the filesystem driver may choose to provide
+ancillary information to userspace
+in the form of message strings.
+.P
+The filesystem context file descriptors returned by
+.BR fsopen ()
+and
+.BR fspick (2)
+may be queried for message strings at any time by calling
+.BR read (2)
+on the file descriptor.
+Each call to
+.BR read (2)
+will return a single message,
+prefixed to indicate its class:
+.RS
+.TP
+\fBe\fP <\fImessage\fP>
+An error message was logged.
+This is usually associated with an error being returned
+from the corresponding system call which triggered this message.
+.TP
+\fBw\fP <\fImessage\fP>
+A warning message was logged.
+.TP
+\fBi\fP <\fImessage\fP>
+An informational message was logged.
+.RE
+.P
+Messages are removed from the queue as they are read.
+Note that the message queue has limited depth,
+so it is possible for messages to get lost.
+If there are no messages in the message queue,
+.B read(2)
+will return \-1 and
+.I errno
+will be set to
+.BR \%ENODATA .
+If the
+.I buf
+argument to
+.BR read (2)
+is not large enough to contain the entire message,
+.BR read (2)
+will return \-1 and
+.I errno
+will be set to
+.BR \%EMSGSIZE .
+(See BUGS.)
+.P
+If there are multiple filesystem contexts
+referencing the same filesystem instance
+(such as if you call
+.BR fspick (2)
+multiple times for the same mount),
+each one gets its own independent message queue.
+This does not apply to multiple file descriptors that are
+tied to the same underlying open file description
+(such as those created with
+.BR dup (2)).
+.P
+Message strings will usually be prefixed by
+the name of the filesystem or kernel subsystem
+that logged the message,
+though this may not always be the case.
+See the Linux kernel source code for details.
+.SH RETURN VALUE
+On success, a new file descriptor is returned.
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EFAULT
+.I fsname
+is NULL
+or a pointer to a location
+outside the calling process's accessible address space.
+.TP
+.B EINVAL
+.I flags
+had an invalid flag set.
+.TP
+.B EMFILE
+The calling process has too many open files to create more.
+.TP
+.B ENFILE
+The system has too many open files to create more.
+.TP
+.B ENODEV
+The filesystem named by
+.I fsname
+is not supported by the kernel.
+.TP
+.B ENOMEM
+The kernel could not allocate sufficient memory to complete the operation.
+.TP
+.B EPERM
+The calling process does not have the required
+.B \%CAP_SYS_ADMIN
+capability.
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux 5.2.
+.\" commit 24dcb3d90a1f67fe08c68a004af37df059d74005
+.\" commit 400913252d09f9cfb8cce33daee43167921fc343
+glibc 2.36.
+.SH BUGS
+.SS Message retrieval interface and \fB\%EMSGSIZE\fP
+As described in the "Message retrieval interface" subsection above,
+calling
+.BR read (2)
+with too small a buffer to contain
+the next pending message in the message queue
+for the filesystem configuration context
+will cause
+.BR read (2)
+to return \-1 and set
+.BR errno (3)
+to
+.BR \%EMSGSIZE .
+.P
+However,
+this failed operation still
+consumes the message from the message queue.
+This effectively discards the message silently,
+as no data is copied into the
+.BR read (2)
+buffer.
+.P
+Programs should take care to ensure that
+their buffers are sufficiently large
+to contain any reasonable message string,
+in order to avoid silently losing valuable diagnostic information.
+.\" Aleksa Sarai
+.\"   This unfortunate behaviour has existed since this feature was merged, but
+.\"   I have sent a patchset which will finally fix it.
+.\"   <https://lore.kernel.org/r/20250807-fscontext-log-cleanups-v3-1-8d91d6242dc3@cyphar.com/>
+.SH EXAMPLES
+To illustrate the workflow for creating a new mount,
+the following is an example of how to mount an
+.BR ext4 (5)
+filesystem stored on
+.I /dev/sdb1
+onto
+.IR /mnt .
+.P
+.in +4n
+.EX
+int fsfd, mntfd;
+\&
+fsfd = fsopen("ext4", FSOPEN_CLOEXEC);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_PATH, "source", "/dev/sdb1", AT_FDCWD);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "iversion", NULL, 0)
+fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
+move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
+.EE
+.in
+.P
+First,
+an ext4 configuration context is created and attached to the file descriptor
+.IR fsfd .
+Then, a series of parameters
+(such as the source of the filesystem)
+are provided using
+.BR fsconfig (2),
+followed by the filesystem instance being created with
+.BR \%FSCONFIG_CMD_CREATE .
+.BR fsmount (2)
+is then used to create a new mount object attached to the file descriptor
+.IR mntfd ,
+which is then attached to the intended mount point using
+.BR move_mount (2).
+.P
+The above procedure is functionally equivalent to
+the following mount operation using
+.BR mount (2):
+.P
+.in +4n
+.EX
+mount("/dev/sdb1", "/mnt", "ext4", MS_RELATIME,
+      "ro,noatime,acl,user_xattr,iversion");
+.EE
+.in
+.P
+And here's an example of creating a mount object
+of an NFS server share
+and setting a Smack security module label.
+However, instead of attaching it to a mount point,
+the program uses the mount object directly
+to open a file from the NFS share.
+.P
+.in +4n
+.EX
+int fsfd, mntfd, fd;
+\&
+fsfd = fsopen("nfs", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "example.com/pub/linux", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "nfsvers", "3", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "rsize", "65536", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "wsize", "65536", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "smackfsdef", "foolabel", 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "rdma", NULL, 0);
+fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+mntfd = fsmount(fsfd, 0, MOUNT_ATTR_NODEV);
+fd = openat(mntfd, "src/linux-5.2.tar.xz", O_RDONLY);
+.EE
+.in
+.P
+Unlike the previous example,
+this operation has no trivial equivalent with
+.BR mount (2),
+as it was not previously possible to create a mount object
+that is not attached to any mount point.
+.SH SEE ALSO
+.BR fsconfig (2),
+.BR fsmount (2),
+.BR fspick (2),
+.BR mount (2),
+.BR mount_setattr (2),
+.BR move_mount (2),
+.BR open_tree (2),
+.BR mount_namespaces (7)

-- 
2.51.0


^ permalink raw reply related

* [PATCH v4 01/10] man/man2/mount_setattr.2: move mount_attr struct to mount_attr(2type)
From: Aleksa Sarai @ 2025-09-19  1:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner, Aleksa Sarai
In-Reply-To: <20250919-new-mount-api-v4-0-1261201ab562@cyphar.com>

As with open_how(2type), it makes sense to move this to a separate man
page.  In addition, future man pages added in this patchset will want to
reference mount_attr(2type).

Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man/man2/mount_setattr.2      | 17 ++++--------
 man/man2type/mount_attr.2type | 61 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 66 insertions(+), 12 deletions(-)

diff --git a/man/man2/mount_setattr.2 b/man/man2/mount_setattr.2
index 586633f48e894bf8f2823aa7755c96adcddea6a6..4b55f6d2e09d00d9bc4b3a085f310b1b459f34e8 100644
--- a/man/man2/mount_setattr.2
+++ b/man/man2/mount_setattr.2
@@ -114,18 +114,11 @@ .SH DESCRIPTION
 .I attr
 argument of
 .BR mount_setattr ()
-is a structure of the following form:
-.P
-.in +4n
-.EX
-struct mount_attr {
-    __u64 attr_set;     /* Mount properties to set */
-    __u64 attr_clr;     /* Mount properties to clear */
-    __u64 propagation;  /* Mount propagation type */
-    __u64 userns_fd;    /* User namespace file descriptor */
-};
-.EE
-.in
+is a pointer to a
+.I mount_attr
+structure,
+described in
+.BR mount_attr (2type).
 .P
 The
 .I attr_set
diff --git a/man/man2type/mount_attr.2type b/man/man2type/mount_attr.2type
new file mode 100644
index 0000000000000000000000000000000000000000..f5c4f48be46ec1e6c0d3a211b6724a1e95311a41
--- /dev/null
+++ b/man/man2type/mount_attr.2type
@@ -0,0 +1,61 @@
+.\" Copyright, the authors of the Linux man-pages project
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH mount_attr 2type (date) "Linux man-pages (unreleased)"
+.SH NAME
+mount_attr \- what mount properties to set and clear
+.SH LIBRARY
+Linux kernel headers
+.SH SYNOPSIS
+.EX
+.B #include <sys/mount.h>
+.P
+.B struct mount_attr {
+.BR "    u64 attr_set;" "     /* Mount properties to set */"
+.BR "    u64 attr_clr;" "     /* Mount properties to clear */"
+.BR "    u64 propagation;" "  /* Mount propagation type */"
+.BR "    u64 userns_fd;" "    /* User namespace file descriptor */"
+    /* ... */
+.B };
+.EE
+.SH DESCRIPTION
+Specifies which mount properties should be changed with
+.BR mount_setattr (2).
+.P
+The fields are as follows:
+.TP
+.I .attr_set
+This field specifies which
+.BI MOUNT_ATTR_ *
+attribute flags to set.
+.TP
+.I .attr_clr
+This field specifies which
+.BI MOUNT_ATTR_ *
+attribute flags to clear.
+.TP
+.I .propagation
+This field specifies what mount propagation will be applied.
+The valid values of this field are the same propagation types described in
+.BR mount_namespaces (7).
+.TP
+.I .userns_fd
+This field specifies a file descriptor that indicates which user namespace to
+use as a reference for ID-mapped mounts with
+.BR MOUNT_ATTR_IDMAP .
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux 5.12.
+.\" commit 2a1867219c7b27f928e2545782b86daaf9ad50bd
+glibc 2.36.
+.P
+Extra fields may be appended to the structure,
+with a zero value in a new field resulting in
+the kernel behaving as though that extension field was not present.
+Therefore, a user
+.I must
+zero-fill this structure on initialization.
+.SH SEE ALSO
+.BR mount_setattr (2)

-- 
2.51.0


^ permalink raw reply related

* [PATCH v4 00/10] man2: document "new" mount API
From: Aleksa Sarai @ 2025-09-19  1:59 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner, Aleksa Sarai

Back in 2019, the new mount API was merged[1]. David Howells then set
about writing man pages for these new APIs, and sent some patches back
in 2020[2].

Unfortunately, these patches were never merged, which meant that these
APIs were practically undocumented for many years -- arguably this has
been a contributing factor to the relatively slow adoption of these new
(far better) APIs. For instance, I have often discovered that many folks
are unaware of the read(2)-based message retrieval interface provided by
filesystem context file descriptors.

In 2024, Christian Brauner adapted David Howell's original man pages
into the easier-to-edit Markdown format and published them on GitHub[3].
These have been maintained since, including updated information on new
features added since David Howells's 2020 draft pages (such as
MOVE_MOUNT_BENEATH).

While this was a welcome improvement to the previous status quo (that
had lasted over 6 years), speaking personally my experience is that not
having access to these man pages from the terminal has been a fairly
common painpoint.

So, this is a modern version of the man pages for these APIs, in the
hopes that we can finally (6 years later) get proper documentation for
these APIs in the man-pages project.

One important thing to note is that most of these were re-written by me,
with very minimal copying from the versions available from Christian[2].
The reasons for this are two-fold:

 * Both Howells's original version and Christian's maintained versions
   contain crucial mistakes that I have been bitten by in the past (the
   most obvious being that all of these APIs were merged in Linux 5.2,
   but the man pages all claim they were merged in different versions.)

 * As the man pages appear to have been written from Howells's
   perspective while implementing them, some of the wording is a little
   too tied to the implementation (or appears to describe features that
   don't really exist in the merged versions of these APIs).

 * The original versions of the man-pages lacked bigger-picture
   explanations of the reasoning behind the API, which would make it
   easier for readers to understand what operations are doing.

I decided that the best way to resolve these issues is to rewrite them
from the perspective of an actual user of these APIs (me), and check
that we do not repeat the mistakes I found in the originals. I have also
done my best to resolve the issues raised by Michael Kerrisk on the
original patchset sent by Howells[1].

In addition, I have also included a man page for open_tree_attr(2) (as a
subsection of the new open_tree(2) man page), which was merged in Linux
6.15.

[1]: https://lore.kernel.org/all/20190507204921.GL23075@ZenIV.linux.org.uk/
[2]: https://lore.kernel.org/linux-man/159680892602.29015.6551860260436544999.stgit@warthog.procyon.org.uk/
[3]: https://github.com/brauner/man-pages-md

Co-authored-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Co-authored-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v4:
- `sed -i s|\\% |\\%|g`.
- Remove unneeded quotes in SYNOPSIS. [Alejandro Colomar]
- open_tree(2): fix leftover confusing usages of "attach" when referring
  to file descriptors being associated with mount objects.
- open_tree(2): rename "Anonymous mount namespaces" NOTES subsection to
  the far more informative "Mount propagation" and clean up the wording
  a little.
- open_tree_attr(2): add a code comment about
  <https://lore.kernel.org/all/20250808-open_tree_attr-bugfix-idmap-v1-0-0ec7bc05646c@cyphar.com/>
- {fsconfig,open_tree_attr}(2): use _Nullable.
- {fsmount,open_tree}(2): mention the the unmount-on-close behaviour is
  actually lazy (a-la MNT_DETACH).
- {fsconfig,mount_setattr}(2): improve "mount attributes and filesystem
  parameters" wording to make it clearer that superblock and mount flags
  are sibling properties, not the same thing.
- open_tree(2): mention that any mount propagation events while the mount
  object is detached are completely lost -- i.e., they don't get replayed once
  you attach the mount somewhere.
- fsconfig(2): fix minor grammatical / missing joining word issues.
- fsconfig(2): fix final leftover `.IR A " and " B` cases.
- fsconfig(2): explain that failed fsconfig(FSCONFIG_CMD_*) operations render
  the filesystem context invalid.
- fsconfig(2): rework the description of superblock reuse, as the previous text
  was very wrong. (Though there has been discussion about changing this
  behaviour...)
- fsconfig(2): remove misleading wording in FSCONFIG_CMD_CREATE_EXCL about how
  we are requesting a new filesystem instance -- in theory filesystems could
  take this request into account but in practice none do (and it seems unlikely
  any ever will).
- fsconfig(2): mention that key, value, and aux must be 0 or NULL for
  FSCONFIG_CMD_RECONF.
- fsmount(2): fix usage of "filesystem instance" in relation to fsmount() and
  open_tree() comparison. [Askar Safin]
- move_mount(2): "as attached" -> "as a detached" [Askar Safin]
- fspick(2): add note about filesystem parameter list being copied rather than
  reset with FSCONFIG_CMD_RECONFIGURE. [Askar Safin]
- v3: <https://lore.kernel.org/r/20250809-new-mount-api-v3-0-f61405c80f34@cyphar.com>

Changes in v3:
- `sed -i s|Co-developed-by|Co-authored-by|g`. [Alejandro Colomar]
  - Add Signed-off-by for co-authors. [Christian Brauner]
- `sed -i s|needs-mount|awaiting-mount|g`, to match the kernel parlance.
- Fix VERSIONS/HISTORY mixup in mount_attr(2type) that was copied from
  open_how(2type). [Alejandro Colomar]
- Fix incorrect .BR usage in SYNOPSIS.
- Some more semantic newlines fixes. [Alejandro Colomar]
- Minor fixes suggested by Alejandro. [Alejandro Colomar]
- open_tree_attr(2): heavily reword everything to be better formatted
  and more explicit about its behaviour.
- open_tree(2): write proper explanatory paragraphs for the EXAMPLES.
- mount_setattr(2): fix stray doublequote in SYNOPSIS. [Askar Safin]
- fsopen(2): rework structure of the DESCRIPTION introduction.
- fsopen(2): explicitly say that read(2) errors in the message retrieval
  interface are actual errors, not return 0. [Askar Safin]
- fsopen(2): add BUGS section to describe the unfortunate -ENODATA
  message dropping behaviour that should be fixed by
  <https://lore.kernel.org/r/20250807-fscontext-log-cleanups-v3-0-8d91d6242dc3@cyphar.com/>.
- fsconfig(2): add a NOTES subsection about generic filesystem
  parameters.
- fsconfig(2): add comment about the weirdness surrounding
  FSCONFIG_SET_PATH.
- {fspick,open_tree}(2): Correct AT_NO_AUTOMOUNT description (copied
  from David, who probably copied it from statx(2)) -- AT_NO_AUTOMOUNT
  applies to all path components, not just the final one. [Christian
  Brauner]
- statx(2): fix AT_NO_AUTOMOUNT documentation.
- open_tree(2): swap open(2) reference for openat(2) when saying that
  the result is identical. [Askar Safin]
- fsmount(2): fix DESCRIPTION introduction, and rework attr_flags
  description to better reference mount_setattr(2).
- {fsopen,fspick,fsmount,open_tree}(2): don't use "attach" when talking
  about the file descriptors we return that reference in-kernel objects,
  to avoid confusing readers with mount object attachment status.
- fsconfig(2): remove pidns argument example, as it was kind of unclear
  and referenced kernel features not yet merged.
- fsconfig(2): remove rambling FSCONFIG_SET_PATH_EMPTY text (which
  mostly describes an academic issue that doesn't apply to any existing
  filesystem), and instead add a CAVEATS section which touches on the
  weird type behaviour of fsconfig(2).
- v2: <https://lore.kernel.org/r/20250807-new-mount-api-v2-0-558a27b8068c@cyphar.com>

Changes in v2:
- `make -R lint-man`. [Alejandro Colomar]
- `sed -i s|Glibc|glibc|g`. [Alejandro Colomar]
- `sed -i s|pathname|path|g` [Alejandro Colomar]
- Clean up macro usage, example code, and synopsis. [Alejandro Colomar]
- Try to use semantic newlines. [Alejandro Colomar]
- Make sure the usage of "filesystem context", "filesystem instance",
  and "mount object" are consistent. [Askar Safin]
- Avoid referring to these syscalls without an "at" suffix as "*at()
  syscalls". [Askar Safin]
- Use \% to avoid hyphenation of constants. [Askar Safin, G. Branden Robinson]
- Add a new subsection to mount_setattr(2) to describe the distinction
  between mount attributes and filesystem parameters.
- (Under protest) double-space-after-period formatted commit messages.
- v1: <https://lore.kernel.org/r/20250806-new-mount-api-v1-0-8678f56c6ee0@cyphar.com>

---
Aleksa Sarai (10):
      man/man2/mount_setattr.2: move mount_attr struct to mount_attr(2type)
      man/man2/fsopen.2: document "new" mount API
      man/man2/fspick.2: document "new" mount API
      man/man2/fsconfig.2: document "new" mount API
      man/man2/fsmount.2: document "new" mount API
      man/man2/move_mount.2: document "new" mount API
      man/man2/open_tree.2: document "new" mount API
      man/man2/mount_setattr.2: mirror opening sentence from fsopen(2)
      man/man2/open_tree{,_attr}.2: document new open_tree_attr() API
      man/man2/{fsconfig,mount_setattr}.2: add note about attribute-parameter distinction

 man/man2/fsconfig.2           | 739 ++++++++++++++++++++++++++++++++++++++++++
 man/man2/fsmount.2            | 231 +++++++++++++
 man/man2/fsopen.2             | 384 ++++++++++++++++++++++
 man/man2/fspick.2             | 342 +++++++++++++++++++
 man/man2/mount_setattr.2      |  63 +++-
 man/man2/move_mount.2         | 646 ++++++++++++++++++++++++++++++++++++
 man/man2/open_tree.2          | 638 ++++++++++++++++++++++++++++++++++++
 man/man2/open_tree_attr.2     |   1 +
 man/man2type/mount_attr.2type |  61 ++++
 9 files changed, 3092 insertions(+), 13 deletions(-)
---
base-commit: e86f9fd0c279f593242969a2fbb5ef379272d89d
change-id: 20250802-new-mount-api-436db984f432


Kind regards,
-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


^ permalink raw reply

* Re: [PATCH RESEND 00/62] initrd: remove classic initrd support
From: Askar Safin @ 2025-09-18 19:58 UTC (permalink / raw)
  To: nschichan
  Cc: akpm, andy.shevchenko, axboe, brauner, cyphar, devicetree,
	ecurtin, email2tema, graf, gregkh, hca, hch, hsiangkao, initramfs,
	jack, julian.stecklina, kees, linux-acpi, linux-alpha, linux-api,
	linux-arch, linux-arm-kernel, linux-block, linux-csky, linux-doc,
	linux-efi, linux-ext4, linux-fsdevel, linux-hexagon, linux-kernel,
	linux-m68k, linux-mips, linux-openrisc, linux-parisc, linux-riscv,
	linux-s390, linux-sh, linux-snps-arc, linux-um, linuxppc-dev,
	loongarch, mcgrof, mingo, monstr, mzxreary, patches, rob,
	safinaskar, sparclinux, thomas.weissschuh, thorsten.blum,
	torvalds, tytso, viro, x86
In-Reply-To: <20250918152830.438554-1-nschichan@freebox.fr>

> When booting with root=/dev/ram0 in the kernel commandline,
> handle_initrd() where the deprecation message resides is never called,
> which is rather unfortunate (init/do_mounts_initrd.c):

Yes, this is unfortunate.

I personally still think that initrd should be removed.

I suggest using workaround I described in cover letter.

Also, for unknown reasons I didn't get your letter in my inbox.
(Not even in spam folder.) I ocasionally found it on lore.kernel.org .

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH v21 4/8] fork: Add shadow stack support to clone3()
From: Mark Brown @ 2025-09-18 17:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: brauner, Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Shuah Khan, linux-kernel,
	Catalin Marinas, jannh, Andrew Morton, Yury Khrustalev,
	Wilco Dijkstra, linux-kselftest, linux-api, Kees Cook,
	Adhemerval Zanella Netto
In-Reply-To: <aMwtdtRHT7oHhYLf@willie-the-truck>

[-- Attachment #1: Type: text/plain, Size: 1265 bytes --]

On Thu, Sep 18, 2025 at 05:04:06PM +0100, Will Deacon wrote:
> On Thu, Sep 18, 2025 at 01:38:53PM +0100, Will Deacon wrote:

> > It would be great if Christian could give this the thumbs up, given that
> > it changes clone3(). I think the architecture parts are all ready at this
> > point.

> ah, I may have spoken too soon :/

Well, there's also the fact that this is based on the vfs tree (or would
have conflicts with it).

> Catalin pointed me at this glibc thread:

> https://marc.info/?l=glibc-alpha&m=175811917427562

> which sounds like they're not entirely on board with the new ABI.

I think we're getting there on that one, and the main thing they're
asking for is the ability to reuse the GCS after the thread has exited
which would be orthogonal to this stuff.  I see Catalin replied on the
glibc side so I'll direct most of my reply there.

It would be really helpful to get a clear idea of where we're going with
this series, it's been almost landed for an incredibly long time and
having it in that state is getting disruptive to doing cleanup to try to
factor code out of the arches especially with the RISC-V stuff also up
in the air.  I do think the issues glibc have with this are orthogonal
to the changes here so hopefully this can go as is.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v2 3/3] ext4: implemet new ioctls to set and get superblock parameters
From: Darrick J. Wong @ 2025-09-18 16:57 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250916-tune2fs-v2-3-d594dc7486f0@mit.edu>

On Tue, Sep 16, 2025 at 11:22:49PM -0400, Theodore Ts'o via B4 Relay wrote:
> From: Theodore Ts'o <tytso@mit.edu>
> 
> Implement the EXT4_IOC_GET_TUNE_SB_PARAM and
> EXT4_IOC_SET_TUNE_SB_PARAM ioctls, which allow certains superblock
> parameters to be set while the file system is mounted, without needing
> write access to the block device.
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Looks good to me now,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
>  fs/ext4/ioctl.c           | 312 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/ext4.h |  53 +++++++++++++
>  2 files changed, 358 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 84e3c73952d72e436429489f5fc8b7ae1c01c7a1..a93a7baae990cc5580d2ddb3ffcc72fe15246978 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -27,14 +27,16 @@
>  #include "fsmap.h"
>  #include <trace/events/ext4.h>
>  
> -typedef void ext4_update_sb_callback(struct ext4_super_block *es,
> -				       const void *arg);
> +typedef void ext4_update_sb_callback(struct ext4_sb_info *sbi,
> +				     struct ext4_super_block *es,
> +				     const void *arg);
>  
>  /*
>   * Superblock modification callback function for changing file system
>   * label
>   */
> -static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
> +static void ext4_sb_setlabel(struct ext4_sb_info *sbi,
> +			     struct ext4_super_block *es, const void *arg)
>  {
>  	/* Sanity check, this should never happen */
>  	BUILD_BUG_ON(sizeof(es->s_volume_name) < EXT4_LABEL_MAX);
> @@ -46,7 +48,8 @@ static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
>   * Superblock modification callback function for changing file system
>   * UUID.
>   */
> -static void ext4_sb_setuuid(struct ext4_super_block *es, const void *arg)
> +static void ext4_sb_setuuid(struct ext4_sb_info *sbi,
> +			    struct ext4_super_block *es, const void *arg)
>  {
>  	memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE);
>  }
> @@ -71,7 +74,7 @@ int ext4_update_primary_sb(struct super_block *sb, handle_t *handle,
>  		goto out_err;
>  
>  	lock_buffer(bh);
> -	func(es, arg);
> +	func(sbi, es, arg);
>  	ext4_superblock_csum_set(sb);
>  	unlock_buffer(bh);
>  
> @@ -149,7 +152,7 @@ static int ext4_update_backup_sb(struct super_block *sb,
>  		unlock_buffer(bh);
>  		goto out_bh;
>  	}
> -	func(es, arg);
> +	func(EXT4_SB(sb), es, arg);
>  	if (ext4_has_feature_metadata_csum(sb))
>  		es->s_checksum = ext4_superblock_csum(es);
>  	set_buffer_uptodate(bh);
> @@ -1230,6 +1233,295 @@ static int ext4_ioctl_setuuid(struct file *filp,
>  	return ret;
>  }
>  
> +
> +#define TUNE_OPS_SUPPORTED (EXT4_TUNE_FL_ERRORS_BEHAVIOR |    \
> +	EXT4_TUNE_FL_MNT_COUNT | EXT4_TUNE_FL_MAX_MNT_COUNT | \
> +	EXT4_TUNE_FL_CHECKINTRVAL | EXT4_TUNE_FL_LAST_CHECK_TIME | \
> +	EXT4_TUNE_FL_RESERVED_BLOCKS | EXT4_TUNE_FL_RESERVED_UID | \
> +	EXT4_TUNE_FL_RESERVED_GID | EXT4_TUNE_FL_DEFAULT_MNT_OPTS | \
> +	EXT4_TUNE_FL_DEF_HASH_ALG | EXT4_TUNE_FL_RAID_STRIDE | \
> +	EXT4_TUNE_FL_RAID_STRIPE_WIDTH | EXT4_TUNE_FL_MOUNT_OPTS | \
> +	EXT4_TUNE_FL_FEATURES | EXT4_TUNE_FL_EDIT_FEATURES | \
> +	EXT4_TUNE_FL_FORCE_FSCK | EXT4_TUNE_FL_ENCODING | \
> +	EXT4_TUNE_FL_ENCODING_FLAGS)
> +
> +#define EXT4_TUNE_SET_COMPAT_SUPP \
> +		(EXT4_FEATURE_COMPAT_DIR_INDEX |	\
> +		 EXT4_FEATURE_COMPAT_STABLE_INODES)
> +#define EXT4_TUNE_SET_INCOMPAT_SUPP \
> +		(EXT4_FEATURE_INCOMPAT_EXTENTS |	\
> +		 EXT4_FEATURE_INCOMPAT_EA_INODE |	\
> +		 EXT4_FEATURE_INCOMPAT_ENCRYPT |	\
> +		 EXT4_FEATURE_INCOMPAT_CSUM_SEED |	\
> +		 EXT4_FEATURE_INCOMPAT_LARGEDIR |	\
> +		 EXT4_FEATURE_INCOMPAT_CASEFOLD)
> +#define EXT4_TUNE_SET_RO_COMPAT_SUPP \
> +		(EXT4_FEATURE_RO_COMPAT_LARGE_FILE |	\
> +		 EXT4_FEATURE_RO_COMPAT_DIR_NLINK |	\
> +		 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE |	\
> +		 EXT4_FEATURE_RO_COMPAT_PROJECT |	\
> +		 EXT4_FEATURE_RO_COMPAT_VERITY)
> +
> +#define EXT4_TUNE_CLEAR_COMPAT_SUPP (0)
> +#define EXT4_TUNE_CLEAR_INCOMPAT_SUPP (0)
> +#define EXT4_TUNE_CLEAR_RO_COMPAT_SUPP (0)
> +
> +#define SB_ENC_SUPP_MASK (SB_ENC_STRICT_MODE_FL |	\
> +			  SB_ENC_NO_COMPAT_FALLBACK_FL)
> +
> +static int ext4_ioctl_get_tune_sb(struct ext4_sb_info *sbi,
> +				  struct ext4_tune_sb_params __user *params)
> +{
> +	struct ext4_tune_sb_params ret;
> +	struct ext4_super_block *es = sbi->s_es;
> +
> +	memset(&ret, 0, sizeof(ret));
> +	ret.set_flags = TUNE_OPS_SUPPORTED;
> +	ret.errors_behavior = le16_to_cpu(es->s_errors);
> +	ret.mnt_count = le16_to_cpu(es->s_mnt_count);
> +	ret.max_mnt_count = le16_to_cpu(es->s_max_mnt_count);
> +	ret.checkinterval = le32_to_cpu(es->s_checkinterval);
> +	ret.last_check_time = le32_to_cpu(es->s_lastcheck);
> +	ret.reserved_blocks = ext4_r_blocks_count(es);
> +	ret.blocks_count = ext4_blocks_count(es);
> +	ret.reserved_uid = ext4_get_resuid(es);
> +	ret.reserved_gid = ext4_get_resgid(es);
> +	ret.default_mnt_opts = le32_to_cpu(es->s_default_mount_opts);
> +	ret.def_hash_alg = es->s_def_hash_version;
> +	ret.raid_stride = le16_to_cpu(es->s_raid_stride);
> +	ret.raid_stripe_width = le32_to_cpu(es->s_raid_stripe_width);
> +	ret.encoding = le16_to_cpu(es->s_encoding);
> +	ret.encoding_flags = le16_to_cpu(es->s_encoding_flags);
> +	strscpy_pad(ret.mount_opts, es->s_mount_opts);
> +	ret.feature_compat = le32_to_cpu(es->s_feature_compat);
> +	ret.feature_incompat = le32_to_cpu(es->s_feature_incompat);
> +	ret.feature_ro_compat = le32_to_cpu(es->s_feature_ro_compat);
> +	ret.set_feature_compat_mask = EXT4_TUNE_SET_COMPAT_SUPP;
> +	ret.set_feature_incompat_mask = EXT4_TUNE_SET_INCOMPAT_SUPP;
> +	ret.set_feature_ro_compat_mask = EXT4_TUNE_SET_RO_COMPAT_SUPP;
> +	ret.clear_feature_compat_mask = EXT4_TUNE_CLEAR_COMPAT_SUPP;
> +	ret.clear_feature_incompat_mask = EXT4_TUNE_CLEAR_INCOMPAT_SUPP;
> +	ret.clear_feature_ro_compat_mask = EXT4_TUNE_CLEAR_RO_COMPAT_SUPP;
> +	if (copy_to_user(params, &ret, sizeof(ret)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +static void ext4_sb_setparams(struct ext4_sb_info *sbi,
> +			      struct ext4_super_block *es, const void *arg)
> +{
> +	const struct ext4_tune_sb_params *params = arg;
> +
> +	if (params->set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR)
> +		es->s_errors = cpu_to_le16(params->errors_behavior);
> +	if (params->set_flags & EXT4_TUNE_FL_MNT_COUNT)
> +		es->s_mnt_count = cpu_to_le16(params->mnt_count);
> +	if (params->set_flags & EXT4_TUNE_FL_MAX_MNT_COUNT)
> +		es->s_max_mnt_count = cpu_to_le16(params->max_mnt_count);
> +	if (params->set_flags & EXT4_TUNE_FL_CHECKINTRVAL)
> +		es->s_checkinterval = cpu_to_le32(params->checkinterval);
> +	if (params->set_flags & EXT4_TUNE_FL_LAST_CHECK_TIME)
> +		es->s_lastcheck = cpu_to_le32(params->last_check_time);
> +	if (params->set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) {
> +		ext4_fsblk_t blk = params->reserved_blocks;
> +
> +		es->s_r_blocks_count_lo = cpu_to_le32((u32)blk);
> +		es->s_r_blocks_count_hi = cpu_to_le32(blk >> 32);
> +	}
> +	if (params->set_flags & EXT4_TUNE_FL_RESERVED_UID) {
> +		int uid = params->reserved_uid;
> +
> +		es->s_def_resuid = cpu_to_le16(uid & 0xFFFF);
> +		es->s_def_resuid_hi = cpu_to_le16(uid >> 16);
> +	}
> +	if (params->set_flags & EXT4_TUNE_FL_RESERVED_GID) {
> +		int gid = params->reserved_gid;
> +
> +		es->s_def_resgid = cpu_to_le16(gid & 0xFFFF);
> +		es->s_def_resgid_hi = cpu_to_le16(gid >> 16);
> +	}
> +	if (params->set_flags & EXT4_TUNE_FL_DEFAULT_MNT_OPTS)
> +		es->s_default_mount_opts = cpu_to_le32(params->default_mnt_opts);
> +	if (params->set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
> +		es->s_def_hash_version = params->def_hash_alg;
> +	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIDE)
> +		es->s_raid_stride = cpu_to_le16(params->raid_stride);
> +	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIPE_WIDTH)
> +		es->s_raid_stripe_width =
> +			cpu_to_le32(params->raid_stripe_width);
> +	if (params->set_flags & EXT4_TUNE_FL_ENCODING)
> +		es->s_encoding = cpu_to_le16(params->encoding);
> +	if (params->set_flags & EXT4_TUNE_FL_ENCODING_FLAGS)
> +		es->s_encoding_flags = cpu_to_le16(params->encoding_flags);
> +	strscpy_pad(es->s_mount_opts, params->mount_opts);
> +	if (params->set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
> +		es->s_feature_compat |=
> +			cpu_to_le32(params->set_feature_compat_mask);
> +		es->s_feature_incompat |=
> +			cpu_to_le32(params->set_feature_incompat_mask);
> +		es->s_feature_ro_compat |=
> +			cpu_to_le32(params->set_feature_ro_compat_mask);
> +		es->s_feature_compat &=
> +			~cpu_to_le32(params->clear_feature_compat_mask);
> +		es->s_feature_incompat &=
> +			~cpu_to_le32(params->clear_feature_incompat_mask);
> +		es->s_feature_ro_compat &=
> +			~cpu_to_le32(params->clear_feature_ro_compat_mask);
> +		if (params->set_feature_compat_mask &
> +		    EXT4_FEATURE_COMPAT_DIR_INDEX)
> +			es->s_def_hash_version = sbi->s_def_hash_version;
> +		if (params->set_feature_incompat_mask &
> +		    EXT4_FEATURE_INCOMPAT_CSUM_SEED)
> +			es->s_checksum_seed = cpu_to_le32(sbi->s_csum_seed);
> +	}
> +	if (params->set_flags & EXT4_TUNE_FL_FORCE_FSCK)
> +		es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
> +}
> +
> +static int ext4_ioctl_set_tune_sb(struct file *filp,
> +				  struct ext4_tune_sb_params __user *in)
> +{
> +	struct ext4_tune_sb_params params;
> +	struct super_block *sb = file_inode(filp)->i_sb;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_super_block *es = sbi->s_es;
> +	int enabling_casefold = 0;
> +	int ret;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	if (copy_from_user(&params, in, sizeof(params)))
> +		return -EFAULT;
> +
> +	if ((params.set_flags & ~TUNE_OPS_SUPPORTED) != 0)
> +		return -EOPNOTSUPP;
> +
> +	if ((params.set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR) &&
> +	    (params.errors_behavior > EXT4_ERRORS_PANIC))
> +		return -EINVAL;
> +
> +	if ((params.set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) &&
> +	    (params.reserved_blocks > ext4_blocks_count(sbi->s_es) / 2))
> +		return -EINVAL;
> +	if ((params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG) &&
> +	    ((params.def_hash_alg > DX_HASH_LAST) ||
> +	     (params.def_hash_alg == DX_HASH_SIPHASH)))
> +		return -EINVAL;
> +	if ((params.set_flags & EXT4_TUNE_FL_FEATURES) &&
> +	    (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES))
> +		return -EINVAL;
> +
> +	if (params.set_flags & EXT4_TUNE_FL_FEATURES) {
> +		params.set_feature_compat_mask =
> +			params.feature_compat &
> +			~le32_to_cpu(es->s_feature_compat);
> +		params.set_feature_incompat_mask =
> +			params.feature_incompat &
> +			~le32_to_cpu(es->s_feature_incompat);
> +		params.set_feature_ro_compat_mask =
> +			params.feature_ro_compat &
> +			~le32_to_cpu(es->s_feature_ro_compat);
> +		params.clear_feature_compat_mask =
> +			~params.feature_compat &
> +			le32_to_cpu(es->s_feature_compat);
> +		params.clear_feature_incompat_mask =
> +			~params.feature_incompat &
> +			le32_to_cpu(es->s_feature_incompat);
> +		params.clear_feature_ro_compat_mask =
> +			~params.feature_ro_compat &
> +			le32_to_cpu(es->s_feature_ro_compat);
> +		params.set_flags |= EXT4_TUNE_FL_EDIT_FEATURES;
> +	}
> +	if (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
> +		if ((params.set_feature_compat_mask &
> +		     ~EXT4_TUNE_SET_COMPAT_SUPP) ||
> +		    (params.set_feature_incompat_mask &
> +		     ~EXT4_TUNE_SET_INCOMPAT_SUPP) ||
> +		    (params.set_feature_ro_compat_mask &
> +		     ~EXT4_TUNE_SET_RO_COMPAT_SUPP) ||
> +		    (params.clear_feature_compat_mask &
> +		     ~EXT4_TUNE_CLEAR_COMPAT_SUPP) ||
> +		    (params.clear_feature_incompat_mask &
> +		     ~EXT4_TUNE_CLEAR_INCOMPAT_SUPP) ||
> +		    (params.clear_feature_ro_compat_mask &
> +		     ~EXT4_TUNE_CLEAR_RO_COMPAT_SUPP))
> +			return -EOPNOTSUPP;
> +
> +		/*
> +		 * Filter out the features that are already set from
> +		 * the set_mask.
> +		 */
> +		params.set_feature_compat_mask &=
> +			~le32_to_cpu(es->s_feature_compat);
> +		params.set_feature_incompat_mask &=
> +			~le32_to_cpu(es->s_feature_incompat);
> +		params.set_feature_ro_compat_mask &=
> +			~le32_to_cpu(es->s_feature_ro_compat);
> +		if ((params.set_feature_incompat_mask &
> +		     EXT4_FEATURE_INCOMPAT_CASEFOLD)) {
> +			enabling_casefold = 1;
> +			if (!(params.set_flags & EXT4_TUNE_FL_ENCODING)) {
> +				params.encoding = EXT4_ENC_UTF8_12_1;
> +				params.set_flags |= EXT4_TUNE_FL_ENCODING;
> +			}
> +			if (!(params.set_flags & EXT4_TUNE_FL_ENCODING_FLAGS)) {
> +				params.encoding_flags = 0;
> +				params.set_flags |= EXT4_TUNE_FL_ENCODING_FLAGS;
> +			}
> +		}
> +		if ((params.set_feature_compat_mask &
> +		     EXT4_FEATURE_COMPAT_DIR_INDEX)) {
> +			uuid_t	uu;
> +
> +			memcpy(&uu, sbi->s_hash_seed, UUID_SIZE);
> +			if (uuid_is_null(&uu))
> +				generate_random_uuid((char *)
> +						     &sbi->s_hash_seed);
> +			if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
> +				sbi->s_def_hash_version = params.def_hash_alg;
> +			else if (sbi->s_def_hash_version == 0)
> +				sbi->s_def_hash_version = DX_HASH_HALF_MD4;
> +			if (!(es->s_flags &
> +			      cpu_to_le32(EXT2_FLAGS_UNSIGNED_HASH)) &&
> +			    !(es->s_flags &
> +			      cpu_to_le32(EXT2_FLAGS_SIGNED_HASH))) {
> +#ifdef __CHAR_UNSIGNED__
> +				sbi->s_hash_unsigned = 3;
> +#else
> +				sbi->s_hash_unsigned = 0;
> +#endif
> +			}
> +		}
> +	}
> +	if (params.set_flags & EXT4_TUNE_FL_ENCODING) {
> +		if (!enabling_casefold)
> +			return -EINVAL;
> +		if (params.encoding == 0)
> +			params.encoding = EXT4_ENC_UTF8_12_1;
> +		else if (params.encoding != EXT4_ENC_UTF8_12_1)
> +			return -EINVAL;
> +	}
> +	if (params.set_flags & EXT4_TUNE_FL_ENCODING_FLAGS) {
> +		if (!enabling_casefold)
> +			return -EINVAL;
> +		if (params.encoding_flags & ~SB_ENC_SUPP_MASK)
> +			return -EINVAL;
> +	}
> +
> +	ret = mnt_want_write_file(filp);
> +	if (ret)
> +		return ret;
> +
> +	ret = ext4_update_superblocks_fn(sb, ext4_sb_setparams, &params);
> +	mnt_drop_write_file(filp);
> +
> +	if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
> +		sbi->s_def_hash_version = params.def_hash_alg;
> +
> +	return ret;
> +}
> +
>  static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  {
>  	struct inode *inode = file_inode(filp);
> @@ -1616,6 +1908,11 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  		return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg);
>  	case EXT4_IOC_SETFSUUID:
>  		return ext4_ioctl_setuuid(filp, (const void __user *)arg);
> +	case EXT4_IOC_GET_TUNE_SB_PARAM:
> +		return ext4_ioctl_get_tune_sb(EXT4_SB(sb),
> +					      (void __user *)arg);
> +	case EXT4_IOC_SET_TUNE_SB_PARAM:
> +		return ext4_ioctl_set_tune_sb(filp, (void __user *)arg);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -1703,7 +2000,8 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  }
>  #endif
>  
> -static void set_overhead(struct ext4_super_block *es, const void *arg)
> +static void set_overhead(struct ext4_sb_info *sbi,
> +			 struct ext4_super_block *es, const void *arg)
>  {
>  	es->s_overhead_clusters = cpu_to_le32(*((unsigned long *) arg));
>  }
> diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h
> index 1c4c2dd29112cda9f7dc91d917492cffc33ee524..411dcc1e4a35c8c6a10f3768d17b8cc50cff4c34 100644
> --- a/include/uapi/linux/ext4.h
> +++ b/include/uapi/linux/ext4.h
> @@ -33,6 +33,8 @@
>  #define EXT4_IOC_CHECKPOINT		_IOW('f', 43, __u32)
>  #define EXT4_IOC_GETFSUUID		_IOR('f', 44, struct fsuuid)
>  #define EXT4_IOC_SETFSUUID		_IOW('f', 44, struct fsuuid)
> +#define EXT4_IOC_GET_TUNE_SB_PARAM	_IOR('f', 45, struct ext4_tune_sb_params)
> +#define EXT4_IOC_SET_TUNE_SB_PARAM	_IOW('f', 46, struct ext4_tune_sb_params)
>  
>  #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
>  
> @@ -108,6 +110,57 @@ struct ext4_new_group_input {
>  	__u16 unused;
>  };
>  
> +struct ext4_tune_sb_params {
> +	__u32 set_flags;
> +	__u32 checkinterval;
> +	__u16 errors_behavior;
> +	__u16 mnt_count;
> +	__u16 max_mnt_count;
> +	__u16 raid_stride;
> +	__u64 last_check_time;
> +	__u64 reserved_blocks;
> +	__u64 blocks_count;
> +	__u32 default_mnt_opts;
> +	__u32 reserved_uid;
> +	__u32 reserved_gid;
> +	__u32 raid_stripe_width;
> +	__u16 encoding;
> +	__u16 encoding_flags;
> +	__u8  def_hash_alg;
> +	__u8  pad_1;
> +	__u16 pad_2;
> +	__u32 feature_compat;
> +	__u32 feature_incompat;
> +	__u32 feature_ro_compat;
> +	__u32 set_feature_compat_mask;
> +	__u32 set_feature_incompat_mask;
> +	__u32 set_feature_ro_compat_mask;
> +	__u32 clear_feature_compat_mask;
> +	__u32 clear_feature_incompat_mask;
> +	__u32 clear_feature_ro_compat_mask;
> +	__u8  mount_opts[64];
> +	__u8  pad[64];
> +};
> +
> +#define EXT4_TUNE_FL_ERRORS_BEHAVIOR	0x00000001
> +#define EXT4_TUNE_FL_MNT_COUNT		0x00000002
> +#define EXT4_TUNE_FL_MAX_MNT_COUNT	0x00000004
> +#define EXT4_TUNE_FL_CHECKINTRVAL	0x00000008
> +#define EXT4_TUNE_FL_LAST_CHECK_TIME	0x00000010
> +#define EXT4_TUNE_FL_RESERVED_BLOCKS	0x00000020
> +#define EXT4_TUNE_FL_RESERVED_UID	0x00000040
> +#define EXT4_TUNE_FL_RESERVED_GID	0x00000080
> +#define EXT4_TUNE_FL_DEFAULT_MNT_OPTS	0x00000100
> +#define EXT4_TUNE_FL_DEF_HASH_ALG	0x00000200
> +#define EXT4_TUNE_FL_RAID_STRIDE	0x00000400
> +#define EXT4_TUNE_FL_RAID_STRIPE_WIDTH	0x00000800
> +#define EXT4_TUNE_FL_MOUNT_OPTS		0x00001000
> +#define EXT4_TUNE_FL_FEATURES		0x00002000
> +#define EXT4_TUNE_FL_EDIT_FEATURES	0x00004000
> +#define EXT4_TUNE_FL_FORCE_FSCK		0x00008000
> +#define EXT4_TUNE_FL_ENCODING		0x00010000
> +#define EXT4_TUNE_FL_ENCODING_FLAGS	0x00020000
> +
>  /*
>   * Returned by EXT4_IOC_GET_ES_CACHE as an additional possible flag.
>   * It indicates that the entry in extent status cache is for a hole.
> 
> -- 
> 2.51.0
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v2 1/3] ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
From: Darrick J. Wong @ 2025-09-18 16:55 UTC (permalink / raw)
  To: Jan Kara; +Cc: tytso, linux-ext4, linux-api, stable
In-Reply-To: <ik565ztp7s7zyhiog6n52zxylybtkkael6opfrmtvtf6su34iw@zho5dlhfvibq>

On Wed, Sep 17, 2025 at 06:05:06PM +0200, Jan Kara wrote:
> On Tue 16-09-25 23:22:47, Theodore Ts'o via B4 Relay wrote:
> > From: Theodore Ts'o <tytso@mit.edu>
> > 
> > Unlike other strings in the ext4 superblock, we rely on tune2fs to
> > make sure s_mount_opts is NUL terminated.  Harden
> > parse_apply_sb_mount_options() by treating s_mount_opts as a potential
> > __nonstring.
> > 
> > Cc: stable@vger.kernel.org
> > Fixes: 8b67f04ab9de ("ext4: Add mount options in superblock")
> > Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> 
> Looks good. Feel free to add:
> 
> Reviewed-by: Jan Kara <jack@suse.cz>

Looks fine to me too,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> 
> 								Honza
> 
> > ---
> >  fs/ext4/super.c | 17 +++++------------
> >  1 file changed, 5 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 699c15db28a82f26809bf68533454a242596f0fd..94c98446c84f9a4614971d246ca7f001de610a8a 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -2460,7 +2460,7 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
> >  					struct ext4_fs_context *m_ctx)
> >  {
> >  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> > -	char *s_mount_opts = NULL;
> > +	char s_mount_opts[65];
> >  	struct ext4_fs_context *s_ctx = NULL;
> >  	struct fs_context *fc = NULL;
> >  	int ret = -ENOMEM;
> > @@ -2468,15 +2468,11 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
> >  	if (!sbi->s_es->s_mount_opts[0])
> >  		return 0;
> >  
> > -	s_mount_opts = kstrndup(sbi->s_es->s_mount_opts,
> > -				sizeof(sbi->s_es->s_mount_opts),
> > -				GFP_KERNEL);
> > -	if (!s_mount_opts)
> > -		return ret;
> > +	strscpy_pad(s_mount_opts, sbi->s_es->s_mount_opts);
> >  
> >  	fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL);
> >  	if (!fc)
> > -		goto out_free;
> > +		return -ENOMEM;
> >  
> >  	s_ctx = kzalloc(sizeof(struct ext4_fs_context), GFP_KERNEL);
> >  	if (!s_ctx)
> > @@ -2508,11 +2504,8 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
> >  	ret = 0;
> >  
> >  out_free:
> > -	if (fc) {
> > -		ext4_fc_free(fc);
> > -		kfree(fc);
> > -	}
> > -	kfree(s_mount_opts);
> > +	ext4_fc_free(fc);
> > +	kfree(fc);
> >  	return ret;
> >  }
> >  
> > 
> > -- 
> > 2.51.0
> > 
> > 
> > 
> -- 
> Jan Kara <jack@suse.com>
> SUSE Labs, CR
> 

^ permalink raw reply

* Re: [PATCH v21 4/8] fork: Add shadow stack support to clone3()
From: Will Deacon @ 2025-09-18 16:04 UTC (permalink / raw)
  To: Mark Brown, brauner
  Cc: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Shuah Khan, linux-kernel,
	Catalin Marinas, jannh, Andrew Morton, Yury Khrustalev,
	Wilco Dijkstra, linux-kselftest, linux-api, Kees Cook
In-Reply-To: <aMv9XRq_sAQbQwjI@willie-the-truck>

On Thu, Sep 18, 2025 at 01:38:53PM +0100, Will Deacon wrote:
> On Tue, Sep 16, 2025 at 12:12:09AM +0100, Mark Brown wrote:
> > Unlike with the normal stack there is no API for configuring the shadow
> > stack for a new thread, instead the kernel will dynamically allocate a
> > new shadow stack with the same size as the normal stack. This appears to
> > be due to the shadow stack series having been in development since
> > before the more extensible clone3() was added rather than anything more
> > deliberate.
> > 
> > Add a parameter to clone3() specifying a shadow stack pointer to use
> > for the new thread, this is inconsistent with the way we specify the
> > normal stack but during review concerns were expressed about having to
> > identify where the shadow stack pointer should be placed especially in
> > cases where the shadow stack has been previously active.  If no shadow
> > stack is specified then the existing implicit allocation behaviour is
> > maintained.
> > 
> > If a shadow stack pointer is specified then it is required to have an
> > architecture defined token placed on the stack, this will be consumed by
> > the new task, the shadow stack is specified by pointing to this token.  If
> > no valid token is present then this will be reported with -EINVAL.  This
> > token prevents new threads being created pointing at the shadow stack of
> > an existing running thread.  On architectures with support for userspace
> > pivoting of shadow stacks it is expected that the same format and placement
> > of tokens will be used, this is the case for arm64 and x86.
> > 
> > If the architecture does not support shadow stacks the shadow stack
> > pointer must be not be specified, architectures that do support the
> > feature are expected to enforce the same requirement on individual
> > systems that lack shadow stack support.
> > 
> > Update the existing arm64 and x86 implementations to pay attention to
> > the newly added arguments, in order to maintain compatibility we use the
> > existing behaviour if no shadow stack is specified. Since we are now
> > using more fields from the kernel_clone_args we pass that into the
> > shadow stack code rather than individual fields.
> > 
> > Portions of the x86 architecture code were written by Rick Edgecombe.
> > 
> > Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
> > Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> > Signed-off-by: Mark Brown <broonie@kernel.org>
> > ---
> >  arch/arm64/mm/gcs.c              | 47 +++++++++++++++++++-
> >  arch/x86/include/asm/shstk.h     | 11 +++--
> >  arch/x86/kernel/process.c        |  2 +-
> >  arch/x86/kernel/shstk.c          | 53 ++++++++++++++++++++---
> >  include/asm-generic/cacheflush.h | 11 +++++
> >  include/linux/sched/task.h       | 17 ++++++++
> >  include/uapi/linux/sched.h       |  9 ++--
> >  kernel/fork.c                    | 93 ++++++++++++++++++++++++++++++++++------
> >  8 files changed, 217 insertions(+), 26 deletions(-)
> 
> It would be great if Christian could give this the thumbs up, given that
> it changes clone3(). I think the architecture parts are all ready at this
> point.

ah, I may have spoken too soon :/

Catalin pointed me at this glibc thread:

https://marc.info/?l=glibc-alpha&m=175811917427562

which sounds like they're not entirely on board with the new ABI.

Will

^ permalink raw reply

* Re: [PATCH RESEND 00/62] initrd: remove classic initrd support
From: Nicolas Schichan @ 2025-09-18 15:28 UTC (permalink / raw)
  To: safinaskar
  Cc: akpm, andy.shevchenko, axboe, brauner, cyphar, devicetree,
	ecurtin, email2tema, graf, gregkh, hca, hch, hsiangkao, initramfs,
	jack, julian.stecklina, kees, linux-acpi, linux-alpha, linux-api,
	linux-arch, linux-arm-kernel, linux-block, linux-csky, linux-doc,
	linux-efi, linux-ext4, linux-fsdevel, linux-hexagon, linux-kernel,
	linux-m68k, linux-mips, linux-openrisc, linux-parisc, linux-riscv,
	linux-s390, linux-sh, linux-snps-arc, linux-um, linuxppc-dev,
	loongarch, mcgrof, mingo, monstr, mzxreary, patches, rob,
	sparclinux, thomas.weissschuh, thorsten.blum, torvalds, tytso,
	viro, x86, nschichan
In-Reply-To: <20250913003842.41944-1-safinaskar@gmail.com>

Hello,

> Intro
> ====
> This patchset removes classic initrd (initial RAM disk) support,
> which was deprecated in 2020.

This serie came a bit as a surprise, because even though the message
notifying of the initrd deprecation was added in July 2020, the message
was never displayed on our kernels.

When booting with root=/dev/ram0 in the kernel commandline,
handle_initrd() where the deprecation message resides is never called,
which is rather unfortunate (init/do_mounts_initrd.c):

	if (rd_load_image("/initrd.image") && ROOT_DEV != Root_RAM0) {
		init_unlink("/initrd.image");
		handle_initrd(root_device_name); // shows the deprecation msg
		return true;
	}

It is likely we are not the alone booting with that particular
configuration, so other people are probably going to be surprised when
initrd support is removed, because they never saw the deprecation
message.

We do depend on initrd support a lot on our embedded platforms (more
than a million devices with a yearlyish upgrade to the latest
kernel). If it eventually becomes removed this is going to impact us.

We use an initrd squashfs4 image, because coming from a time where
embedded flash devices were fragile, we avoid having the root
filesystem directly mounted (even when read only) on the flash
block/mtd device, and have the bootloader load the root filesystem as
an initrd.

We use a squashfs4 because we can mount it and keep it compressed. The
kernel would decompress data on demand in the page cache, and evict it
as needed.

Regards,

-- 
Nicolas Schichan

^ permalink raw reply

* Re: [PATCH v21 4/8] fork: Add shadow stack support to clone3()
From: Will Deacon @ 2025-09-18 12:38 UTC (permalink / raw)
  To: Mark Brown, brauner
  Cc: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Shuah Khan, linux-kernel,
	Catalin Marinas, jannh, Andrew Morton, Yury Khrustalev,
	Wilco Dijkstra, linux-kselftest, linux-api, Kees Cook
In-Reply-To: <20250916-clone3-shadow-stack-v21-4-910493527013@kernel.org>

On Tue, Sep 16, 2025 at 12:12:09AM +0100, Mark Brown wrote:
> Unlike with the normal stack there is no API for configuring the shadow
> stack for a new thread, instead the kernel will dynamically allocate a
> new shadow stack with the same size as the normal stack. This appears to
> be due to the shadow stack series having been in development since
> before the more extensible clone3() was added rather than anything more
> deliberate.
> 
> Add a parameter to clone3() specifying a shadow stack pointer to use
> for the new thread, this is inconsistent with the way we specify the
> normal stack but during review concerns were expressed about having to
> identify where the shadow stack pointer should be placed especially in
> cases where the shadow stack has been previously active.  If no shadow
> stack is specified then the existing implicit allocation behaviour is
> maintained.
> 
> If a shadow stack pointer is specified then it is required to have an
> architecture defined token placed on the stack, this will be consumed by
> the new task, the shadow stack is specified by pointing to this token.  If
> no valid token is present then this will be reported with -EINVAL.  This
> token prevents new threads being created pointing at the shadow stack of
> an existing running thread.  On architectures with support for userspace
> pivoting of shadow stacks it is expected that the same format and placement
> of tokens will be used, this is the case for arm64 and x86.
> 
> If the architecture does not support shadow stacks the shadow stack
> pointer must be not be specified, architectures that do support the
> feature are expected to enforce the same requirement on individual
> systems that lack shadow stack support.
> 
> Update the existing arm64 and x86 implementations to pay attention to
> the newly added arguments, in order to maintain compatibility we use the
> existing behaviour if no shadow stack is specified. Since we are now
> using more fields from the kernel_clone_args we pass that into the
> shadow stack code rather than individual fields.
> 
> Portions of the x86 architecture code were written by Rick Edgecombe.
> 
> Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Mark Brown <broonie@kernel.org>
> ---
>  arch/arm64/mm/gcs.c              | 47 +++++++++++++++++++-
>  arch/x86/include/asm/shstk.h     | 11 +++--
>  arch/x86/kernel/process.c        |  2 +-
>  arch/x86/kernel/shstk.c          | 53 ++++++++++++++++++++---
>  include/asm-generic/cacheflush.h | 11 +++++
>  include/linux/sched/task.h       | 17 ++++++++
>  include/uapi/linux/sched.h       |  9 ++--
>  kernel/fork.c                    | 93 ++++++++++++++++++++++++++++++++++------
>  8 files changed, 217 insertions(+), 26 deletions(-)

It would be great if Christian could give this the thumbs up, given that
it changes clone3(). I think the architecture parts are all ready at this
point.

Will

^ permalink raw reply

* Re: [PATCH 00/62] initrd: remove classic initrd support
From: Andy Lutomirski @ 2025-09-17 18:00 UTC (permalink / raw)
  To: Rob Landley
  Cc: Askar Safin, linux-fsdevel, linux-kernel, Linus Torvalds,
	Greg Kroah-Hartman, Christian Brauner, Al Viro, Jan Kara,
	Christoph Hellwig, Jens Axboe, Andy Shevchenko, Aleksa Sarai,
	Thomas Weißschuh, Julian Stecklina, Gao Xiang, Art Nikpal,
	Andrew Morton, Eric Curtin, Alexander Graf, Lennart Poettering,
	linux-arch, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-m68k, linux-mips,
	linux-openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, x86, Ingo Molnar,
	linux-block, initramfs, linux-api, linux-doc, linux-efi,
	linux-ext4, Theodore Y . Ts'o, linux-acpi, Michal Simek,
	devicetree, Luis Chamberlain, Kees Cook, Thorsten Blum,
	Heiko Carstens, patches
In-Reply-To: <0342fbda-9901-4293-afa7-ba6085eb1688@landley.net>

On Mon, Sep 15, 2025 at 10:09 AM Rob Landley <rob@landley.net> wrote:

> While you're at it, could you fix static/builtin initramfs so PID 1 has
> a valid stdin/stdout/stderr?
>
> A static initramfs won't create /dev/console if the embedded initramfs
> image doesn't contain it, which a non-root build can't mknod, so the
> kernel plumbing won't see it dev in the directory we point it at unless
> we build with root access.

I have no current insight as to whether there's a kernel issue here,
but why are you trying to put actual device nodes in an actual
filesystem as part of a build process?  It's extremely straightforward
to emit devices nodes in cpio format, and IMO it's far *more*
straightforward to do that than to make a whole directory, try to get
all the modes right, and cpio it up.

I wrote an absolutely trivial tool for this several years ago:

https://github.com/amluto/virtme/blob/master/virtme/cpiowriter.py

it would be barely more complicated to strip the trailer off an cpio
file from some other source, add some device nodes, and stick the
trailer back on.  But it's also really, really, really easy to emit an
entire, functioning cpio-formatted initramfs from plain user code with
no filesystem manipulation at all.  This also makes that portion of
the build reproducible, which is worth quite a bit IMO.

--Andy

^ permalink raw reply

* Re: [PATCH v2 3/3] ext4: implemet new ioctls to set and get superblock parameters
From: Jan Kara @ 2025-09-17 16:22 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250916-tune2fs-v2-3-d594dc7486f0@mit.edu>

On Tue 16-09-25 23:22:49, Theodore Ts'o via B4 Relay wrote:
> From: Theodore Ts'o <tytso@mit.edu>
> 
> Implement the EXT4_IOC_GET_TUNE_SB_PARAM and
> EXT4_IOC_SET_TUNE_SB_PARAM ioctls, which allow certains superblock
> parameters to be set while the file system is mounted, without needing
> write access to the block device.
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza
> ---
>  fs/ext4/ioctl.c           | 312 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>  include/uapi/linux/ext4.h |  53 +++++++++++++
>  2 files changed, 358 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 84e3c73952d72e436429489f5fc8b7ae1c01c7a1..a93a7baae990cc5580d2ddb3ffcc72fe15246978 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -27,14 +27,16 @@
>  #include "fsmap.h"
>  #include <trace/events/ext4.h>
>  
> -typedef void ext4_update_sb_callback(struct ext4_super_block *es,
> -				       const void *arg);
> +typedef void ext4_update_sb_callback(struct ext4_sb_info *sbi,
> +				     struct ext4_super_block *es,
> +				     const void *arg);
>  
>  /*
>   * Superblock modification callback function for changing file system
>   * label
>   */
> -static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
> +static void ext4_sb_setlabel(struct ext4_sb_info *sbi,
> +			     struct ext4_super_block *es, const void *arg)
>  {
>  	/* Sanity check, this should never happen */
>  	BUILD_BUG_ON(sizeof(es->s_volume_name) < EXT4_LABEL_MAX);
> @@ -46,7 +48,8 @@ static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
>   * Superblock modification callback function for changing file system
>   * UUID.
>   */
> -static void ext4_sb_setuuid(struct ext4_super_block *es, const void *arg)
> +static void ext4_sb_setuuid(struct ext4_sb_info *sbi,
> +			    struct ext4_super_block *es, const void *arg)
>  {
>  	memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE);
>  }
> @@ -71,7 +74,7 @@ int ext4_update_primary_sb(struct super_block *sb, handle_t *handle,
>  		goto out_err;
>  
>  	lock_buffer(bh);
> -	func(es, arg);
> +	func(sbi, es, arg);
>  	ext4_superblock_csum_set(sb);
>  	unlock_buffer(bh);
>  
> @@ -149,7 +152,7 @@ static int ext4_update_backup_sb(struct super_block *sb,
>  		unlock_buffer(bh);
>  		goto out_bh;
>  	}
> -	func(es, arg);
> +	func(EXT4_SB(sb), es, arg);
>  	if (ext4_has_feature_metadata_csum(sb))
>  		es->s_checksum = ext4_superblock_csum(es);
>  	set_buffer_uptodate(bh);
> @@ -1230,6 +1233,295 @@ static int ext4_ioctl_setuuid(struct file *filp,
>  	return ret;
>  }
>  
> +
> +#define TUNE_OPS_SUPPORTED (EXT4_TUNE_FL_ERRORS_BEHAVIOR |    \
> +	EXT4_TUNE_FL_MNT_COUNT | EXT4_TUNE_FL_MAX_MNT_COUNT | \
> +	EXT4_TUNE_FL_CHECKINTRVAL | EXT4_TUNE_FL_LAST_CHECK_TIME | \
> +	EXT4_TUNE_FL_RESERVED_BLOCKS | EXT4_TUNE_FL_RESERVED_UID | \
> +	EXT4_TUNE_FL_RESERVED_GID | EXT4_TUNE_FL_DEFAULT_MNT_OPTS | \
> +	EXT4_TUNE_FL_DEF_HASH_ALG | EXT4_TUNE_FL_RAID_STRIDE | \
> +	EXT4_TUNE_FL_RAID_STRIPE_WIDTH | EXT4_TUNE_FL_MOUNT_OPTS | \
> +	EXT4_TUNE_FL_FEATURES | EXT4_TUNE_FL_EDIT_FEATURES | \
> +	EXT4_TUNE_FL_FORCE_FSCK | EXT4_TUNE_FL_ENCODING | \
> +	EXT4_TUNE_FL_ENCODING_FLAGS)
> +
> +#define EXT4_TUNE_SET_COMPAT_SUPP \
> +		(EXT4_FEATURE_COMPAT_DIR_INDEX |	\
> +		 EXT4_FEATURE_COMPAT_STABLE_INODES)
> +#define EXT4_TUNE_SET_INCOMPAT_SUPP \
> +		(EXT4_FEATURE_INCOMPAT_EXTENTS |	\
> +		 EXT4_FEATURE_INCOMPAT_EA_INODE |	\
> +		 EXT4_FEATURE_INCOMPAT_ENCRYPT |	\
> +		 EXT4_FEATURE_INCOMPAT_CSUM_SEED |	\
> +		 EXT4_FEATURE_INCOMPAT_LARGEDIR |	\
> +		 EXT4_FEATURE_INCOMPAT_CASEFOLD)
> +#define EXT4_TUNE_SET_RO_COMPAT_SUPP \
> +		(EXT4_FEATURE_RO_COMPAT_LARGE_FILE |	\
> +		 EXT4_FEATURE_RO_COMPAT_DIR_NLINK |	\
> +		 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE |	\
> +		 EXT4_FEATURE_RO_COMPAT_PROJECT |	\
> +		 EXT4_FEATURE_RO_COMPAT_VERITY)
> +
> +#define EXT4_TUNE_CLEAR_COMPAT_SUPP (0)
> +#define EXT4_TUNE_CLEAR_INCOMPAT_SUPP (0)
> +#define EXT4_TUNE_CLEAR_RO_COMPAT_SUPP (0)
> +
> +#define SB_ENC_SUPP_MASK (SB_ENC_STRICT_MODE_FL |	\
> +			  SB_ENC_NO_COMPAT_FALLBACK_FL)
> +
> +static int ext4_ioctl_get_tune_sb(struct ext4_sb_info *sbi,
> +				  struct ext4_tune_sb_params __user *params)
> +{
> +	struct ext4_tune_sb_params ret;
> +	struct ext4_super_block *es = sbi->s_es;
> +
> +	memset(&ret, 0, sizeof(ret));
> +	ret.set_flags = TUNE_OPS_SUPPORTED;
> +	ret.errors_behavior = le16_to_cpu(es->s_errors);
> +	ret.mnt_count = le16_to_cpu(es->s_mnt_count);
> +	ret.max_mnt_count = le16_to_cpu(es->s_max_mnt_count);
> +	ret.checkinterval = le32_to_cpu(es->s_checkinterval);
> +	ret.last_check_time = le32_to_cpu(es->s_lastcheck);
> +	ret.reserved_blocks = ext4_r_blocks_count(es);
> +	ret.blocks_count = ext4_blocks_count(es);
> +	ret.reserved_uid = ext4_get_resuid(es);
> +	ret.reserved_gid = ext4_get_resgid(es);
> +	ret.default_mnt_opts = le32_to_cpu(es->s_default_mount_opts);
> +	ret.def_hash_alg = es->s_def_hash_version;
> +	ret.raid_stride = le16_to_cpu(es->s_raid_stride);
> +	ret.raid_stripe_width = le32_to_cpu(es->s_raid_stripe_width);
> +	ret.encoding = le16_to_cpu(es->s_encoding);
> +	ret.encoding_flags = le16_to_cpu(es->s_encoding_flags);
> +	strscpy_pad(ret.mount_opts, es->s_mount_opts);
> +	ret.feature_compat = le32_to_cpu(es->s_feature_compat);
> +	ret.feature_incompat = le32_to_cpu(es->s_feature_incompat);
> +	ret.feature_ro_compat = le32_to_cpu(es->s_feature_ro_compat);
> +	ret.set_feature_compat_mask = EXT4_TUNE_SET_COMPAT_SUPP;
> +	ret.set_feature_incompat_mask = EXT4_TUNE_SET_INCOMPAT_SUPP;
> +	ret.set_feature_ro_compat_mask = EXT4_TUNE_SET_RO_COMPAT_SUPP;
> +	ret.clear_feature_compat_mask = EXT4_TUNE_CLEAR_COMPAT_SUPP;
> +	ret.clear_feature_incompat_mask = EXT4_TUNE_CLEAR_INCOMPAT_SUPP;
> +	ret.clear_feature_ro_compat_mask = EXT4_TUNE_CLEAR_RO_COMPAT_SUPP;
> +	if (copy_to_user(params, &ret, sizeof(ret)))
> +		return -EFAULT;
> +	return 0;
> +}
> +
> +static void ext4_sb_setparams(struct ext4_sb_info *sbi,
> +			      struct ext4_super_block *es, const void *arg)
> +{
> +	const struct ext4_tune_sb_params *params = arg;
> +
> +	if (params->set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR)
> +		es->s_errors = cpu_to_le16(params->errors_behavior);
> +	if (params->set_flags & EXT4_TUNE_FL_MNT_COUNT)
> +		es->s_mnt_count = cpu_to_le16(params->mnt_count);
> +	if (params->set_flags & EXT4_TUNE_FL_MAX_MNT_COUNT)
> +		es->s_max_mnt_count = cpu_to_le16(params->max_mnt_count);
> +	if (params->set_flags & EXT4_TUNE_FL_CHECKINTRVAL)
> +		es->s_checkinterval = cpu_to_le32(params->checkinterval);
> +	if (params->set_flags & EXT4_TUNE_FL_LAST_CHECK_TIME)
> +		es->s_lastcheck = cpu_to_le32(params->last_check_time);
> +	if (params->set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) {
> +		ext4_fsblk_t blk = params->reserved_blocks;
> +
> +		es->s_r_blocks_count_lo = cpu_to_le32((u32)blk);
> +		es->s_r_blocks_count_hi = cpu_to_le32(blk >> 32);
> +	}
> +	if (params->set_flags & EXT4_TUNE_FL_RESERVED_UID) {
> +		int uid = params->reserved_uid;
> +
> +		es->s_def_resuid = cpu_to_le16(uid & 0xFFFF);
> +		es->s_def_resuid_hi = cpu_to_le16(uid >> 16);
> +	}
> +	if (params->set_flags & EXT4_TUNE_FL_RESERVED_GID) {
> +		int gid = params->reserved_gid;
> +
> +		es->s_def_resgid = cpu_to_le16(gid & 0xFFFF);
> +		es->s_def_resgid_hi = cpu_to_le16(gid >> 16);
> +	}
> +	if (params->set_flags & EXT4_TUNE_FL_DEFAULT_MNT_OPTS)
> +		es->s_default_mount_opts = cpu_to_le32(params->default_mnt_opts);
> +	if (params->set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
> +		es->s_def_hash_version = params->def_hash_alg;
> +	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIDE)
> +		es->s_raid_stride = cpu_to_le16(params->raid_stride);
> +	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIPE_WIDTH)
> +		es->s_raid_stripe_width =
> +			cpu_to_le32(params->raid_stripe_width);
> +	if (params->set_flags & EXT4_TUNE_FL_ENCODING)
> +		es->s_encoding = cpu_to_le16(params->encoding);
> +	if (params->set_flags & EXT4_TUNE_FL_ENCODING_FLAGS)
> +		es->s_encoding_flags = cpu_to_le16(params->encoding_flags);
> +	strscpy_pad(es->s_mount_opts, params->mount_opts);
> +	if (params->set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
> +		es->s_feature_compat |=
> +			cpu_to_le32(params->set_feature_compat_mask);
> +		es->s_feature_incompat |=
> +			cpu_to_le32(params->set_feature_incompat_mask);
> +		es->s_feature_ro_compat |=
> +			cpu_to_le32(params->set_feature_ro_compat_mask);
> +		es->s_feature_compat &=
> +			~cpu_to_le32(params->clear_feature_compat_mask);
> +		es->s_feature_incompat &=
> +			~cpu_to_le32(params->clear_feature_incompat_mask);
> +		es->s_feature_ro_compat &=
> +			~cpu_to_le32(params->clear_feature_ro_compat_mask);
> +		if (params->set_feature_compat_mask &
> +		    EXT4_FEATURE_COMPAT_DIR_INDEX)
> +			es->s_def_hash_version = sbi->s_def_hash_version;
> +		if (params->set_feature_incompat_mask &
> +		    EXT4_FEATURE_INCOMPAT_CSUM_SEED)
> +			es->s_checksum_seed = cpu_to_le32(sbi->s_csum_seed);
> +	}
> +	if (params->set_flags & EXT4_TUNE_FL_FORCE_FSCK)
> +		es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
> +}
> +
> +static int ext4_ioctl_set_tune_sb(struct file *filp,
> +				  struct ext4_tune_sb_params __user *in)
> +{
> +	struct ext4_tune_sb_params params;
> +	struct super_block *sb = file_inode(filp)->i_sb;
> +	struct ext4_sb_info *sbi = EXT4_SB(sb);
> +	struct ext4_super_block *es = sbi->s_es;
> +	int enabling_casefold = 0;
> +	int ret;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	if (copy_from_user(&params, in, sizeof(params)))
> +		return -EFAULT;
> +
> +	if ((params.set_flags & ~TUNE_OPS_SUPPORTED) != 0)
> +		return -EOPNOTSUPP;
> +
> +	if ((params.set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR) &&
> +	    (params.errors_behavior > EXT4_ERRORS_PANIC))
> +		return -EINVAL;
> +
> +	if ((params.set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) &&
> +	    (params.reserved_blocks > ext4_blocks_count(sbi->s_es) / 2))
> +		return -EINVAL;
> +	if ((params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG) &&
> +	    ((params.def_hash_alg > DX_HASH_LAST) ||
> +	     (params.def_hash_alg == DX_HASH_SIPHASH)))
> +		return -EINVAL;
> +	if ((params.set_flags & EXT4_TUNE_FL_FEATURES) &&
> +	    (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES))
> +		return -EINVAL;
> +
> +	if (params.set_flags & EXT4_TUNE_FL_FEATURES) {
> +		params.set_feature_compat_mask =
> +			params.feature_compat &
> +			~le32_to_cpu(es->s_feature_compat);
> +		params.set_feature_incompat_mask =
> +			params.feature_incompat &
> +			~le32_to_cpu(es->s_feature_incompat);
> +		params.set_feature_ro_compat_mask =
> +			params.feature_ro_compat &
> +			~le32_to_cpu(es->s_feature_ro_compat);
> +		params.clear_feature_compat_mask =
> +			~params.feature_compat &
> +			le32_to_cpu(es->s_feature_compat);
> +		params.clear_feature_incompat_mask =
> +			~params.feature_incompat &
> +			le32_to_cpu(es->s_feature_incompat);
> +		params.clear_feature_ro_compat_mask =
> +			~params.feature_ro_compat &
> +			le32_to_cpu(es->s_feature_ro_compat);
> +		params.set_flags |= EXT4_TUNE_FL_EDIT_FEATURES;
> +	}
> +	if (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
> +		if ((params.set_feature_compat_mask &
> +		     ~EXT4_TUNE_SET_COMPAT_SUPP) ||
> +		    (params.set_feature_incompat_mask &
> +		     ~EXT4_TUNE_SET_INCOMPAT_SUPP) ||
> +		    (params.set_feature_ro_compat_mask &
> +		     ~EXT4_TUNE_SET_RO_COMPAT_SUPP) ||
> +		    (params.clear_feature_compat_mask &
> +		     ~EXT4_TUNE_CLEAR_COMPAT_SUPP) ||
> +		    (params.clear_feature_incompat_mask &
> +		     ~EXT4_TUNE_CLEAR_INCOMPAT_SUPP) ||
> +		    (params.clear_feature_ro_compat_mask &
> +		     ~EXT4_TUNE_CLEAR_RO_COMPAT_SUPP))
> +			return -EOPNOTSUPP;
> +
> +		/*
> +		 * Filter out the features that are already set from
> +		 * the set_mask.
> +		 */
> +		params.set_feature_compat_mask &=
> +			~le32_to_cpu(es->s_feature_compat);
> +		params.set_feature_incompat_mask &=
> +			~le32_to_cpu(es->s_feature_incompat);
> +		params.set_feature_ro_compat_mask &=
> +			~le32_to_cpu(es->s_feature_ro_compat);
> +		if ((params.set_feature_incompat_mask &
> +		     EXT4_FEATURE_INCOMPAT_CASEFOLD)) {
> +			enabling_casefold = 1;
> +			if (!(params.set_flags & EXT4_TUNE_FL_ENCODING)) {
> +				params.encoding = EXT4_ENC_UTF8_12_1;
> +				params.set_flags |= EXT4_TUNE_FL_ENCODING;
> +			}
> +			if (!(params.set_flags & EXT4_TUNE_FL_ENCODING_FLAGS)) {
> +				params.encoding_flags = 0;
> +				params.set_flags |= EXT4_TUNE_FL_ENCODING_FLAGS;
> +			}
> +		}
> +		if ((params.set_feature_compat_mask &
> +		     EXT4_FEATURE_COMPAT_DIR_INDEX)) {
> +			uuid_t	uu;
> +
> +			memcpy(&uu, sbi->s_hash_seed, UUID_SIZE);
> +			if (uuid_is_null(&uu))
> +				generate_random_uuid((char *)
> +						     &sbi->s_hash_seed);
> +			if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
> +				sbi->s_def_hash_version = params.def_hash_alg;
> +			else if (sbi->s_def_hash_version == 0)
> +				sbi->s_def_hash_version = DX_HASH_HALF_MD4;
> +			if (!(es->s_flags &
> +			      cpu_to_le32(EXT2_FLAGS_UNSIGNED_HASH)) &&
> +			    !(es->s_flags &
> +			      cpu_to_le32(EXT2_FLAGS_SIGNED_HASH))) {
> +#ifdef __CHAR_UNSIGNED__
> +				sbi->s_hash_unsigned = 3;
> +#else
> +				sbi->s_hash_unsigned = 0;
> +#endif
> +			}
> +		}
> +	}
> +	if (params.set_flags & EXT4_TUNE_FL_ENCODING) {
> +		if (!enabling_casefold)
> +			return -EINVAL;
> +		if (params.encoding == 0)
> +			params.encoding = EXT4_ENC_UTF8_12_1;
> +		else if (params.encoding != EXT4_ENC_UTF8_12_1)
> +			return -EINVAL;
> +	}
> +	if (params.set_flags & EXT4_TUNE_FL_ENCODING_FLAGS) {
> +		if (!enabling_casefold)
> +			return -EINVAL;
> +		if (params.encoding_flags & ~SB_ENC_SUPP_MASK)
> +			return -EINVAL;
> +	}
> +
> +	ret = mnt_want_write_file(filp);
> +	if (ret)
> +		return ret;
> +
> +	ret = ext4_update_superblocks_fn(sb, ext4_sb_setparams, &params);
> +	mnt_drop_write_file(filp);
> +
> +	if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
> +		sbi->s_def_hash_version = params.def_hash_alg;
> +
> +	return ret;
> +}
> +
>  static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  {
>  	struct inode *inode = file_inode(filp);
> @@ -1616,6 +1908,11 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
>  		return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg);
>  	case EXT4_IOC_SETFSUUID:
>  		return ext4_ioctl_setuuid(filp, (const void __user *)arg);
> +	case EXT4_IOC_GET_TUNE_SB_PARAM:
> +		return ext4_ioctl_get_tune_sb(EXT4_SB(sb),
> +					      (void __user *)arg);
> +	case EXT4_IOC_SET_TUNE_SB_PARAM:
> +		return ext4_ioctl_set_tune_sb(filp, (void __user *)arg);
>  	default:
>  		return -ENOTTY;
>  	}
> @@ -1703,7 +2000,8 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
>  }
>  #endif
>  
> -static void set_overhead(struct ext4_super_block *es, const void *arg)
> +static void set_overhead(struct ext4_sb_info *sbi,
> +			 struct ext4_super_block *es, const void *arg)
>  {
>  	es->s_overhead_clusters = cpu_to_le32(*((unsigned long *) arg));
>  }
> diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h
> index 1c4c2dd29112cda9f7dc91d917492cffc33ee524..411dcc1e4a35c8c6a10f3768d17b8cc50cff4c34 100644
> --- a/include/uapi/linux/ext4.h
> +++ b/include/uapi/linux/ext4.h
> @@ -33,6 +33,8 @@
>  #define EXT4_IOC_CHECKPOINT		_IOW('f', 43, __u32)
>  #define EXT4_IOC_GETFSUUID		_IOR('f', 44, struct fsuuid)
>  #define EXT4_IOC_SETFSUUID		_IOW('f', 44, struct fsuuid)
> +#define EXT4_IOC_GET_TUNE_SB_PARAM	_IOR('f', 45, struct ext4_tune_sb_params)
> +#define EXT4_IOC_SET_TUNE_SB_PARAM	_IOW('f', 46, struct ext4_tune_sb_params)
>  
>  #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
>  
> @@ -108,6 +110,57 @@ struct ext4_new_group_input {
>  	__u16 unused;
>  };
>  
> +struct ext4_tune_sb_params {
> +	__u32 set_flags;
> +	__u32 checkinterval;
> +	__u16 errors_behavior;
> +	__u16 mnt_count;
> +	__u16 max_mnt_count;
> +	__u16 raid_stride;
> +	__u64 last_check_time;
> +	__u64 reserved_blocks;
> +	__u64 blocks_count;
> +	__u32 default_mnt_opts;
> +	__u32 reserved_uid;
> +	__u32 reserved_gid;
> +	__u32 raid_stripe_width;
> +	__u16 encoding;
> +	__u16 encoding_flags;
> +	__u8  def_hash_alg;
> +	__u8  pad_1;
> +	__u16 pad_2;
> +	__u32 feature_compat;
> +	__u32 feature_incompat;
> +	__u32 feature_ro_compat;
> +	__u32 set_feature_compat_mask;
> +	__u32 set_feature_incompat_mask;
> +	__u32 set_feature_ro_compat_mask;
> +	__u32 clear_feature_compat_mask;
> +	__u32 clear_feature_incompat_mask;
> +	__u32 clear_feature_ro_compat_mask;
> +	__u8  mount_opts[64];
> +	__u8  pad[64];
> +};
> +
> +#define EXT4_TUNE_FL_ERRORS_BEHAVIOR	0x00000001
> +#define EXT4_TUNE_FL_MNT_COUNT		0x00000002
> +#define EXT4_TUNE_FL_MAX_MNT_COUNT	0x00000004
> +#define EXT4_TUNE_FL_CHECKINTRVAL	0x00000008
> +#define EXT4_TUNE_FL_LAST_CHECK_TIME	0x00000010
> +#define EXT4_TUNE_FL_RESERVED_BLOCKS	0x00000020
> +#define EXT4_TUNE_FL_RESERVED_UID	0x00000040
> +#define EXT4_TUNE_FL_RESERVED_GID	0x00000080
> +#define EXT4_TUNE_FL_DEFAULT_MNT_OPTS	0x00000100
> +#define EXT4_TUNE_FL_DEF_HASH_ALG	0x00000200
> +#define EXT4_TUNE_FL_RAID_STRIDE	0x00000400
> +#define EXT4_TUNE_FL_RAID_STRIPE_WIDTH	0x00000800
> +#define EXT4_TUNE_FL_MOUNT_OPTS		0x00001000
> +#define EXT4_TUNE_FL_FEATURES		0x00002000
> +#define EXT4_TUNE_FL_EDIT_FEATURES	0x00004000
> +#define EXT4_TUNE_FL_FORCE_FSCK		0x00008000
> +#define EXT4_TUNE_FL_ENCODING		0x00010000
> +#define EXT4_TUNE_FL_ENCODING_FLAGS	0x00020000
> +
>  /*
>   * Returned by EXT4_IOC_GET_ES_CACHE as an additional possible flag.
>   * It indicates that the entry in extent status cache is for a hole.
> 
> -- 
> 2.51.0
> 
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 2/3] ext4: add support for 32-bit default reserved uid and gid values
From: Jan Kara @ 2025-09-17 16:10 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250916-tune2fs-v2-2-d594dc7486f0@mit.edu>

On Tue 16-09-25 23:22:48, Theodore Ts'o via B4 Relay wrote:
> From: Theodore Ts'o <tytso@mit.edu>
> 
> Support for specifying the default user id and group id that is
> allowed to use the reserved block space was added way back when Linux
> only supported 16-bit uid's and gid's.  (Yeah, that long ago.)  It's
> not a commonly used feature, but let's add support for 32-bit user and
> group id's.
> 
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> ---
>  fs/ext4/ext4.h  | 16 +++++++++++++++-
>  fs/ext4/super.c |  8 ++++----
>  2 files changed, 19 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 01a6e2de7fc3ef0e20b039d3200b9c9bd656f59f..4bfcd5f0c74fda30db4009ee28fbee00a2f6b76f 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1442,7 +1442,9 @@ struct ext4_super_block {
>  	__le16  s_encoding;		/* Filename charset encoding */
>  	__le16  s_encoding_flags;	/* Filename charset encoding flags */
>  	__le32  s_orphan_file_inum;	/* Inode for tracking orphan inodes */
> -	__le32	s_reserved[94];		/* Padding to the end of the block */
> +	__le16	s_def_resuid_hi;
> +	__le16	s_def_resgid_hi;
> +	__le32	s_reserved[93];		/* Padding to the end of the block */
>  	__le32	s_checksum;		/* crc32c(superblock) */
>  };
>  
> @@ -1812,6 +1814,18 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
>  		 ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
>  }
>  
> +static inline int ext4_get_resuid(struct ext4_super_block *es)
> +{
> +	return(le16_to_cpu(es->s_def_resuid) |
> +	       (le16_to_cpu(es->s_def_resuid_hi) << 16));
> +}

I'd prefer a style like:

	return le16_to_cpu(es->s_def_resuid) |
	       (le16_to_cpu(es->s_def_resuid_hi) << 16);

but whatever...

> @@ -5270,8 +5270,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  
>  	ext4_set_def_opts(sb, es);
>  
> -	sbi->s_resuid = make_kuid(&init_user_ns, le16_to_cpu(es->s_def_resuid));
> -	sbi->s_resgid = make_kgid(&init_user_ns, le16_to_cpu(es->s_def_resgid));
> +	sbi->s_resuid = make_kuid(&init_user_ns, ext4_get_resuid(es));
> +	sbi->s_resgid = make_kgid(&init_user_ns, ext4_get_resuid(es));
						^^^^ ext4_get_resgid() here.

>  	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
>  	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
>  	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* Re: [PATCH v2 1/3] ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
From: Jan Kara @ 2025-09-17 16:05 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api, stable
In-Reply-To: <20250916-tune2fs-v2-1-d594dc7486f0@mit.edu>

On Tue 16-09-25 23:22:47, Theodore Ts'o via B4 Relay wrote:
> From: Theodore Ts'o <tytso@mit.edu>
> 
> Unlike other strings in the ext4 superblock, we rely on tune2fs to
> make sure s_mount_opts is NUL terminated.  Harden
> parse_apply_sb_mount_options() by treating s_mount_opts as a potential
> __nonstring.
> 
> Cc: stable@vger.kernel.org
> Fixes: 8b67f04ab9de ("ext4: Add mount options in superblock")
> Signed-off-by: Theodore Ts'o <tytso@mit.edu>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/super.c | 17 +++++------------
>  1 file changed, 5 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 699c15db28a82f26809bf68533454a242596f0fd..94c98446c84f9a4614971d246ca7f001de610a8a 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -2460,7 +2460,7 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
>  					struct ext4_fs_context *m_ctx)
>  {
>  	struct ext4_sb_info *sbi = EXT4_SB(sb);
> -	char *s_mount_opts = NULL;
> +	char s_mount_opts[65];
>  	struct ext4_fs_context *s_ctx = NULL;
>  	struct fs_context *fc = NULL;
>  	int ret = -ENOMEM;
> @@ -2468,15 +2468,11 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
>  	if (!sbi->s_es->s_mount_opts[0])
>  		return 0;
>  
> -	s_mount_opts = kstrndup(sbi->s_es->s_mount_opts,
> -				sizeof(sbi->s_es->s_mount_opts),
> -				GFP_KERNEL);
> -	if (!s_mount_opts)
> -		return ret;
> +	strscpy_pad(s_mount_opts, sbi->s_es->s_mount_opts);
>  
>  	fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL);
>  	if (!fc)
> -		goto out_free;
> +		return -ENOMEM;
>  
>  	s_ctx = kzalloc(sizeof(struct ext4_fs_context), GFP_KERNEL);
>  	if (!s_ctx)
> @@ -2508,11 +2504,8 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
>  	ret = 0;
>  
>  out_free:
> -	if (fc) {
> -		ext4_fc_free(fc);
> -		kfree(fc);
> -	}
> -	kfree(s_mount_opts);
> +	ext4_fc_free(fc);
> +	kfree(fc);
>  	return ret;
>  }
>  
> 
> -- 
> 2.51.0
> 
> 
> 
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply

* [PATCH v2 0/3] ext4: Add support for mounted updates to the superblock via an ioctl
From: Theodore Ts'o via B4 Relay @ 2025-09-17  3:22 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api, stable

This patch series enables a future version of tune2fs to be able to
modify certain parts of the ext4 superblock without to write to the
block device.

The first patch fixes a potential buffer overrun caused by a
maliciously moified superblock.  The second patch adds support for
32-bit uid and gid's which can have access to the reserved blocks pool.
The last patch adds the ioctl's which will be used by tune2fs.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
Changes in v2:
- fix bugs that were detected using sparse
- remove tune (unsafe) ability to clear certain compat faatures
- add the ability to set the encoding and encoding flags for case folding
- Link to v1: https://lore.kernel.org/r/20250908-tune2fs-v1-0-e3a6929f3355@mit.edu

---
Theodore Ts'o (3):
      ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
      ext4: add support for 32-bit default reserved uid and gid values
      ext4: implemet new ioctls to set and get superblock parameters

 fs/ext4/ext4.h            |  16 +++-
 fs/ext4/ioctl.c           | 312 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c           |  25 +++----
 include/uapi/linux/ext4.h |  53 +++++++++++++
 4 files changed, 382 insertions(+), 24 deletions(-)
---
base-commit: b320789d6883cc00ac78ce83bccbfe7ed58afcf0
change-id: 20250830-tune2fs-3376beb72403

Best regards,
-- 
Theodore Ts'o <tytso@mit.edu>



^ permalink raw reply

* [PATCH v2 1/3] ext4: avoid potential buffer over-read in parse_apply_sb_mount_options()
From: Theodore Ts'o via B4 Relay @ 2025-09-17  3:22 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api, stable
In-Reply-To: <20250916-tune2fs-v2-0-d594dc7486f0@mit.edu>

From: Theodore Ts'o <tytso@mit.edu>

Unlike other strings in the ext4 superblock, we rely on tune2fs to
make sure s_mount_opts is NUL terminated.  Harden
parse_apply_sb_mount_options() by treating s_mount_opts as a potential
__nonstring.

Cc: stable@vger.kernel.org
Fixes: 8b67f04ab9de ("ext4: Add mount options in superblock")
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/super.c | 17 +++++------------
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 699c15db28a82f26809bf68533454a242596f0fd..94c98446c84f9a4614971d246ca7f001de610a8a 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2460,7 +2460,7 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
 					struct ext4_fs_context *m_ctx)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(sb);
-	char *s_mount_opts = NULL;
+	char s_mount_opts[65];
 	struct ext4_fs_context *s_ctx = NULL;
 	struct fs_context *fc = NULL;
 	int ret = -ENOMEM;
@@ -2468,15 +2468,11 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
 	if (!sbi->s_es->s_mount_opts[0])
 		return 0;
 
-	s_mount_opts = kstrndup(sbi->s_es->s_mount_opts,
-				sizeof(sbi->s_es->s_mount_opts),
-				GFP_KERNEL);
-	if (!s_mount_opts)
-		return ret;
+	strscpy_pad(s_mount_opts, sbi->s_es->s_mount_opts);
 
 	fc = kzalloc(sizeof(struct fs_context), GFP_KERNEL);
 	if (!fc)
-		goto out_free;
+		return -ENOMEM;
 
 	s_ctx = kzalloc(sizeof(struct ext4_fs_context), GFP_KERNEL);
 	if (!s_ctx)
@@ -2508,11 +2504,8 @@ static int parse_apply_sb_mount_options(struct super_block *sb,
 	ret = 0;
 
 out_free:
-	if (fc) {
-		ext4_fc_free(fc);
-		kfree(fc);
-	}
-	kfree(s_mount_opts);
+	ext4_fc_free(fc);
+	kfree(fc);
 	return ret;
 }
 

-- 
2.51.0



^ permalink raw reply related

* [PATCH v2 2/3] ext4: add support for 32-bit default reserved uid and gid values
From: Theodore Ts'o via B4 Relay @ 2025-09-17  3:22 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250916-tune2fs-v2-0-d594dc7486f0@mit.edu>

From: Theodore Ts'o <tytso@mit.edu>

Support for specifying the default user id and group id that is
allowed to use the reserved block space was added way back when Linux
only supported 16-bit uid's and gid's.  (Yeah, that long ago.)  It's
not a commonly used feature, but let's add support for 32-bit user and
group id's.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/ext4.h  | 16 +++++++++++++++-
 fs/ext4/super.c |  8 ++++----
 2 files changed, 19 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 01a6e2de7fc3ef0e20b039d3200b9c9bd656f59f..4bfcd5f0c74fda30db4009ee28fbee00a2f6b76f 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1442,7 +1442,9 @@ struct ext4_super_block {
 	__le16  s_encoding;		/* Filename charset encoding */
 	__le16  s_encoding_flags;	/* Filename charset encoding flags */
 	__le32  s_orphan_file_inum;	/* Inode for tracking orphan inodes */
-	__le32	s_reserved[94];		/* Padding to the end of the block */
+	__le16	s_def_resuid_hi;
+	__le16	s_def_resgid_hi;
+	__le32	s_reserved[93];		/* Padding to the end of the block */
 	__le32	s_checksum;		/* crc32c(superblock) */
 };
 
@@ -1812,6 +1814,18 @@ static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 		 ino <= le32_to_cpu(EXT4_SB(sb)->s_es->s_inodes_count));
 }
 
+static inline int ext4_get_resuid(struct ext4_super_block *es)
+{
+	return(le16_to_cpu(es->s_def_resuid) |
+	       (le16_to_cpu(es->s_def_resuid_hi) << 16));
+}
+
+static inline int ext4_get_resgid(struct ext4_super_block *es)
+{
+	return(le16_to_cpu(es->s_def_resgid) |
+	       (le16_to_cpu(es->s_def_resgid_hi) << 16));
+}
+
 /*
  * Returns: sbi->field[index]
  * Used to access an array element from the following sbi fields which require
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 94c98446c84f9a4614971d246ca7f001de610a8a..0256c8f7c6cee2b8d9295f2fa9a7acd904382e83 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2951,11 +2951,11 @@ static int _ext4_show_options(struct seq_file *seq, struct super_block *sb,
 	}
 
 	if (nodefs || !uid_eq(sbi->s_resuid, make_kuid(&init_user_ns, EXT4_DEF_RESUID)) ||
-	    le16_to_cpu(es->s_def_resuid) != EXT4_DEF_RESUID)
+	    ext4_get_resuid(es) != EXT4_DEF_RESUID)
 		SEQ_OPTS_PRINT("resuid=%u",
 				from_kuid_munged(&init_user_ns, sbi->s_resuid));
 	if (nodefs || !gid_eq(sbi->s_resgid, make_kgid(&init_user_ns, EXT4_DEF_RESGID)) ||
-	    le16_to_cpu(es->s_def_resgid) != EXT4_DEF_RESGID)
+	    ext4_get_resgid(es) != EXT4_DEF_RESGID)
 		SEQ_OPTS_PRINT("resgid=%u",
 				from_kgid_munged(&init_user_ns, sbi->s_resgid));
 	def_errors = nodefs ? -1 : le16_to_cpu(es->s_errors);
@@ -5270,8 +5270,8 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 
 	ext4_set_def_opts(sb, es);
 
-	sbi->s_resuid = make_kuid(&init_user_ns, le16_to_cpu(es->s_def_resuid));
-	sbi->s_resgid = make_kgid(&init_user_ns, le16_to_cpu(es->s_def_resgid));
+	sbi->s_resuid = make_kuid(&init_user_ns, ext4_get_resuid(es));
+	sbi->s_resgid = make_kgid(&init_user_ns, ext4_get_resuid(es));
 	sbi->s_commit_interval = JBD2_DEFAULT_MAX_COMMIT_AGE * HZ;
 	sbi->s_min_batch_time = EXT4_DEF_MIN_BATCH_TIME;
 	sbi->s_max_batch_time = EXT4_DEF_MAX_BATCH_TIME;

-- 
2.51.0



^ permalink raw reply related

* [PATCH v2 3/3] ext4: implemet new ioctls to set and get superblock parameters
From: Theodore Ts'o via B4 Relay @ 2025-09-17  3:22 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4, linux-api
In-Reply-To: <20250916-tune2fs-v2-0-d594dc7486f0@mit.edu>

From: Theodore Ts'o <tytso@mit.edu>

Implement the EXT4_IOC_GET_TUNE_SB_PARAM and
EXT4_IOC_SET_TUNE_SB_PARAM ioctls, which allow certains superblock
parameters to be set while the file system is mounted, without needing
write access to the block device.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
---
 fs/ext4/ioctl.c           | 312 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 include/uapi/linux/ext4.h |  53 +++++++++++++
 2 files changed, 358 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 84e3c73952d72e436429489f5fc8b7ae1c01c7a1..a93a7baae990cc5580d2ddb3ffcc72fe15246978 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -27,14 +27,16 @@
 #include "fsmap.h"
 #include <trace/events/ext4.h>
 
-typedef void ext4_update_sb_callback(struct ext4_super_block *es,
-				       const void *arg);
+typedef void ext4_update_sb_callback(struct ext4_sb_info *sbi,
+				     struct ext4_super_block *es,
+				     const void *arg);
 
 /*
  * Superblock modification callback function for changing file system
  * label
  */
-static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
+static void ext4_sb_setlabel(struct ext4_sb_info *sbi,
+			     struct ext4_super_block *es, const void *arg)
 {
 	/* Sanity check, this should never happen */
 	BUILD_BUG_ON(sizeof(es->s_volume_name) < EXT4_LABEL_MAX);
@@ -46,7 +48,8 @@ static void ext4_sb_setlabel(struct ext4_super_block *es, const void *arg)
  * Superblock modification callback function for changing file system
  * UUID.
  */
-static void ext4_sb_setuuid(struct ext4_super_block *es, const void *arg)
+static void ext4_sb_setuuid(struct ext4_sb_info *sbi,
+			    struct ext4_super_block *es, const void *arg)
 {
 	memcpy(es->s_uuid, (__u8 *)arg, UUID_SIZE);
 }
@@ -71,7 +74,7 @@ int ext4_update_primary_sb(struct super_block *sb, handle_t *handle,
 		goto out_err;
 
 	lock_buffer(bh);
-	func(es, arg);
+	func(sbi, es, arg);
 	ext4_superblock_csum_set(sb);
 	unlock_buffer(bh);
 
@@ -149,7 +152,7 @@ static int ext4_update_backup_sb(struct super_block *sb,
 		unlock_buffer(bh);
 		goto out_bh;
 	}
-	func(es, arg);
+	func(EXT4_SB(sb), es, arg);
 	if (ext4_has_feature_metadata_csum(sb))
 		es->s_checksum = ext4_superblock_csum(es);
 	set_buffer_uptodate(bh);
@@ -1230,6 +1233,295 @@ static int ext4_ioctl_setuuid(struct file *filp,
 	return ret;
 }
 
+
+#define TUNE_OPS_SUPPORTED (EXT4_TUNE_FL_ERRORS_BEHAVIOR |    \
+	EXT4_TUNE_FL_MNT_COUNT | EXT4_TUNE_FL_MAX_MNT_COUNT | \
+	EXT4_TUNE_FL_CHECKINTRVAL | EXT4_TUNE_FL_LAST_CHECK_TIME | \
+	EXT4_TUNE_FL_RESERVED_BLOCKS | EXT4_TUNE_FL_RESERVED_UID | \
+	EXT4_TUNE_FL_RESERVED_GID | EXT4_TUNE_FL_DEFAULT_MNT_OPTS | \
+	EXT4_TUNE_FL_DEF_HASH_ALG | EXT4_TUNE_FL_RAID_STRIDE | \
+	EXT4_TUNE_FL_RAID_STRIPE_WIDTH | EXT4_TUNE_FL_MOUNT_OPTS | \
+	EXT4_TUNE_FL_FEATURES | EXT4_TUNE_FL_EDIT_FEATURES | \
+	EXT4_TUNE_FL_FORCE_FSCK | EXT4_TUNE_FL_ENCODING | \
+	EXT4_TUNE_FL_ENCODING_FLAGS)
+
+#define EXT4_TUNE_SET_COMPAT_SUPP \
+		(EXT4_FEATURE_COMPAT_DIR_INDEX |	\
+		 EXT4_FEATURE_COMPAT_STABLE_INODES)
+#define EXT4_TUNE_SET_INCOMPAT_SUPP \
+		(EXT4_FEATURE_INCOMPAT_EXTENTS |	\
+		 EXT4_FEATURE_INCOMPAT_EA_INODE |	\
+		 EXT4_FEATURE_INCOMPAT_ENCRYPT |	\
+		 EXT4_FEATURE_INCOMPAT_CSUM_SEED |	\
+		 EXT4_FEATURE_INCOMPAT_LARGEDIR |	\
+		 EXT4_FEATURE_INCOMPAT_CASEFOLD)
+#define EXT4_TUNE_SET_RO_COMPAT_SUPP \
+		(EXT4_FEATURE_RO_COMPAT_LARGE_FILE |	\
+		 EXT4_FEATURE_RO_COMPAT_DIR_NLINK |	\
+		 EXT4_FEATURE_RO_COMPAT_EXTRA_ISIZE |	\
+		 EXT4_FEATURE_RO_COMPAT_PROJECT |	\
+		 EXT4_FEATURE_RO_COMPAT_VERITY)
+
+#define EXT4_TUNE_CLEAR_COMPAT_SUPP (0)
+#define EXT4_TUNE_CLEAR_INCOMPAT_SUPP (0)
+#define EXT4_TUNE_CLEAR_RO_COMPAT_SUPP (0)
+
+#define SB_ENC_SUPP_MASK (SB_ENC_STRICT_MODE_FL |	\
+			  SB_ENC_NO_COMPAT_FALLBACK_FL)
+
+static int ext4_ioctl_get_tune_sb(struct ext4_sb_info *sbi,
+				  struct ext4_tune_sb_params __user *params)
+{
+	struct ext4_tune_sb_params ret;
+	struct ext4_super_block *es = sbi->s_es;
+
+	memset(&ret, 0, sizeof(ret));
+	ret.set_flags = TUNE_OPS_SUPPORTED;
+	ret.errors_behavior = le16_to_cpu(es->s_errors);
+	ret.mnt_count = le16_to_cpu(es->s_mnt_count);
+	ret.max_mnt_count = le16_to_cpu(es->s_max_mnt_count);
+	ret.checkinterval = le32_to_cpu(es->s_checkinterval);
+	ret.last_check_time = le32_to_cpu(es->s_lastcheck);
+	ret.reserved_blocks = ext4_r_blocks_count(es);
+	ret.blocks_count = ext4_blocks_count(es);
+	ret.reserved_uid = ext4_get_resuid(es);
+	ret.reserved_gid = ext4_get_resgid(es);
+	ret.default_mnt_opts = le32_to_cpu(es->s_default_mount_opts);
+	ret.def_hash_alg = es->s_def_hash_version;
+	ret.raid_stride = le16_to_cpu(es->s_raid_stride);
+	ret.raid_stripe_width = le32_to_cpu(es->s_raid_stripe_width);
+	ret.encoding = le16_to_cpu(es->s_encoding);
+	ret.encoding_flags = le16_to_cpu(es->s_encoding_flags);
+	strscpy_pad(ret.mount_opts, es->s_mount_opts);
+	ret.feature_compat = le32_to_cpu(es->s_feature_compat);
+	ret.feature_incompat = le32_to_cpu(es->s_feature_incompat);
+	ret.feature_ro_compat = le32_to_cpu(es->s_feature_ro_compat);
+	ret.set_feature_compat_mask = EXT4_TUNE_SET_COMPAT_SUPP;
+	ret.set_feature_incompat_mask = EXT4_TUNE_SET_INCOMPAT_SUPP;
+	ret.set_feature_ro_compat_mask = EXT4_TUNE_SET_RO_COMPAT_SUPP;
+	ret.clear_feature_compat_mask = EXT4_TUNE_CLEAR_COMPAT_SUPP;
+	ret.clear_feature_incompat_mask = EXT4_TUNE_CLEAR_INCOMPAT_SUPP;
+	ret.clear_feature_ro_compat_mask = EXT4_TUNE_CLEAR_RO_COMPAT_SUPP;
+	if (copy_to_user(params, &ret, sizeof(ret)))
+		return -EFAULT;
+	return 0;
+}
+
+static void ext4_sb_setparams(struct ext4_sb_info *sbi,
+			      struct ext4_super_block *es, const void *arg)
+{
+	const struct ext4_tune_sb_params *params = arg;
+
+	if (params->set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR)
+		es->s_errors = cpu_to_le16(params->errors_behavior);
+	if (params->set_flags & EXT4_TUNE_FL_MNT_COUNT)
+		es->s_mnt_count = cpu_to_le16(params->mnt_count);
+	if (params->set_flags & EXT4_TUNE_FL_MAX_MNT_COUNT)
+		es->s_max_mnt_count = cpu_to_le16(params->max_mnt_count);
+	if (params->set_flags & EXT4_TUNE_FL_CHECKINTRVAL)
+		es->s_checkinterval = cpu_to_le32(params->checkinterval);
+	if (params->set_flags & EXT4_TUNE_FL_LAST_CHECK_TIME)
+		es->s_lastcheck = cpu_to_le32(params->last_check_time);
+	if (params->set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) {
+		ext4_fsblk_t blk = params->reserved_blocks;
+
+		es->s_r_blocks_count_lo = cpu_to_le32((u32)blk);
+		es->s_r_blocks_count_hi = cpu_to_le32(blk >> 32);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_RESERVED_UID) {
+		int uid = params->reserved_uid;
+
+		es->s_def_resuid = cpu_to_le16(uid & 0xFFFF);
+		es->s_def_resuid_hi = cpu_to_le16(uid >> 16);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_RESERVED_GID) {
+		int gid = params->reserved_gid;
+
+		es->s_def_resgid = cpu_to_le16(gid & 0xFFFF);
+		es->s_def_resgid_hi = cpu_to_le16(gid >> 16);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_DEFAULT_MNT_OPTS)
+		es->s_default_mount_opts = cpu_to_le32(params->default_mnt_opts);
+	if (params->set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
+		es->s_def_hash_version = params->def_hash_alg;
+	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIDE)
+		es->s_raid_stride = cpu_to_le16(params->raid_stride);
+	if (params->set_flags & EXT4_TUNE_FL_RAID_STRIPE_WIDTH)
+		es->s_raid_stripe_width =
+			cpu_to_le32(params->raid_stripe_width);
+	if (params->set_flags & EXT4_TUNE_FL_ENCODING)
+		es->s_encoding = cpu_to_le16(params->encoding);
+	if (params->set_flags & EXT4_TUNE_FL_ENCODING_FLAGS)
+		es->s_encoding_flags = cpu_to_le16(params->encoding_flags);
+	strscpy_pad(es->s_mount_opts, params->mount_opts);
+	if (params->set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
+		es->s_feature_compat |=
+			cpu_to_le32(params->set_feature_compat_mask);
+		es->s_feature_incompat |=
+			cpu_to_le32(params->set_feature_incompat_mask);
+		es->s_feature_ro_compat |=
+			cpu_to_le32(params->set_feature_ro_compat_mask);
+		es->s_feature_compat &=
+			~cpu_to_le32(params->clear_feature_compat_mask);
+		es->s_feature_incompat &=
+			~cpu_to_le32(params->clear_feature_incompat_mask);
+		es->s_feature_ro_compat &=
+			~cpu_to_le32(params->clear_feature_ro_compat_mask);
+		if (params->set_feature_compat_mask &
+		    EXT4_FEATURE_COMPAT_DIR_INDEX)
+			es->s_def_hash_version = sbi->s_def_hash_version;
+		if (params->set_feature_incompat_mask &
+		    EXT4_FEATURE_INCOMPAT_CSUM_SEED)
+			es->s_checksum_seed = cpu_to_le32(sbi->s_csum_seed);
+	}
+	if (params->set_flags & EXT4_TUNE_FL_FORCE_FSCK)
+		es->s_state |= cpu_to_le16(EXT4_ERROR_FS);
+}
+
+static int ext4_ioctl_set_tune_sb(struct file *filp,
+				  struct ext4_tune_sb_params __user *in)
+{
+	struct ext4_tune_sb_params params;
+	struct super_block *sb = file_inode(filp)->i_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	struct ext4_super_block *es = sbi->s_es;
+	int enabling_casefold = 0;
+	int ret;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (copy_from_user(&params, in, sizeof(params)))
+		return -EFAULT;
+
+	if ((params.set_flags & ~TUNE_OPS_SUPPORTED) != 0)
+		return -EOPNOTSUPP;
+
+	if ((params.set_flags & EXT4_TUNE_FL_ERRORS_BEHAVIOR) &&
+	    (params.errors_behavior > EXT4_ERRORS_PANIC))
+		return -EINVAL;
+
+	if ((params.set_flags & EXT4_TUNE_FL_RESERVED_BLOCKS) &&
+	    (params.reserved_blocks > ext4_blocks_count(sbi->s_es) / 2))
+		return -EINVAL;
+	if ((params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG) &&
+	    ((params.def_hash_alg > DX_HASH_LAST) ||
+	     (params.def_hash_alg == DX_HASH_SIPHASH)))
+		return -EINVAL;
+	if ((params.set_flags & EXT4_TUNE_FL_FEATURES) &&
+	    (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES))
+		return -EINVAL;
+
+	if (params.set_flags & EXT4_TUNE_FL_FEATURES) {
+		params.set_feature_compat_mask =
+			params.feature_compat &
+			~le32_to_cpu(es->s_feature_compat);
+		params.set_feature_incompat_mask =
+			params.feature_incompat &
+			~le32_to_cpu(es->s_feature_incompat);
+		params.set_feature_ro_compat_mask =
+			params.feature_ro_compat &
+			~le32_to_cpu(es->s_feature_ro_compat);
+		params.clear_feature_compat_mask =
+			~params.feature_compat &
+			le32_to_cpu(es->s_feature_compat);
+		params.clear_feature_incompat_mask =
+			~params.feature_incompat &
+			le32_to_cpu(es->s_feature_incompat);
+		params.clear_feature_ro_compat_mask =
+			~params.feature_ro_compat &
+			le32_to_cpu(es->s_feature_ro_compat);
+		params.set_flags |= EXT4_TUNE_FL_EDIT_FEATURES;
+	}
+	if (params.set_flags & EXT4_TUNE_FL_EDIT_FEATURES) {
+		if ((params.set_feature_compat_mask &
+		     ~EXT4_TUNE_SET_COMPAT_SUPP) ||
+		    (params.set_feature_incompat_mask &
+		     ~EXT4_TUNE_SET_INCOMPAT_SUPP) ||
+		    (params.set_feature_ro_compat_mask &
+		     ~EXT4_TUNE_SET_RO_COMPAT_SUPP) ||
+		    (params.clear_feature_compat_mask &
+		     ~EXT4_TUNE_CLEAR_COMPAT_SUPP) ||
+		    (params.clear_feature_incompat_mask &
+		     ~EXT4_TUNE_CLEAR_INCOMPAT_SUPP) ||
+		    (params.clear_feature_ro_compat_mask &
+		     ~EXT4_TUNE_CLEAR_RO_COMPAT_SUPP))
+			return -EOPNOTSUPP;
+
+		/*
+		 * Filter out the features that are already set from
+		 * the set_mask.
+		 */
+		params.set_feature_compat_mask &=
+			~le32_to_cpu(es->s_feature_compat);
+		params.set_feature_incompat_mask &=
+			~le32_to_cpu(es->s_feature_incompat);
+		params.set_feature_ro_compat_mask &=
+			~le32_to_cpu(es->s_feature_ro_compat);
+		if ((params.set_feature_incompat_mask &
+		     EXT4_FEATURE_INCOMPAT_CASEFOLD)) {
+			enabling_casefold = 1;
+			if (!(params.set_flags & EXT4_TUNE_FL_ENCODING)) {
+				params.encoding = EXT4_ENC_UTF8_12_1;
+				params.set_flags |= EXT4_TUNE_FL_ENCODING;
+			}
+			if (!(params.set_flags & EXT4_TUNE_FL_ENCODING_FLAGS)) {
+				params.encoding_flags = 0;
+				params.set_flags |= EXT4_TUNE_FL_ENCODING_FLAGS;
+			}
+		}
+		if ((params.set_feature_compat_mask &
+		     EXT4_FEATURE_COMPAT_DIR_INDEX)) {
+			uuid_t	uu;
+
+			memcpy(&uu, sbi->s_hash_seed, UUID_SIZE);
+			if (uuid_is_null(&uu))
+				generate_random_uuid((char *)
+						     &sbi->s_hash_seed);
+			if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
+				sbi->s_def_hash_version = params.def_hash_alg;
+			else if (sbi->s_def_hash_version == 0)
+				sbi->s_def_hash_version = DX_HASH_HALF_MD4;
+			if (!(es->s_flags &
+			      cpu_to_le32(EXT2_FLAGS_UNSIGNED_HASH)) &&
+			    !(es->s_flags &
+			      cpu_to_le32(EXT2_FLAGS_SIGNED_HASH))) {
+#ifdef __CHAR_UNSIGNED__
+				sbi->s_hash_unsigned = 3;
+#else
+				sbi->s_hash_unsigned = 0;
+#endif
+			}
+		}
+	}
+	if (params.set_flags & EXT4_TUNE_FL_ENCODING) {
+		if (!enabling_casefold)
+			return -EINVAL;
+		if (params.encoding == 0)
+			params.encoding = EXT4_ENC_UTF8_12_1;
+		else if (params.encoding != EXT4_ENC_UTF8_12_1)
+			return -EINVAL;
+	}
+	if (params.set_flags & EXT4_TUNE_FL_ENCODING_FLAGS) {
+		if (!enabling_casefold)
+			return -EINVAL;
+		if (params.encoding_flags & ~SB_ENC_SUPP_MASK)
+			return -EINVAL;
+	}
+
+	ret = mnt_want_write_file(filp);
+	if (ret)
+		return ret;
+
+	ret = ext4_update_superblocks_fn(sb, ext4_sb_setparams, &params);
+	mnt_drop_write_file(filp);
+
+	if (params.set_flags & EXT4_TUNE_FL_DEF_HASH_ALG)
+		sbi->s_def_hash_version = params.def_hash_alg;
+
+	return ret;
+}
+
 static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 {
 	struct inode *inode = file_inode(filp);
@@ -1616,6 +1908,11 @@ static long __ext4_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
 		return ext4_ioctl_getuuid(EXT4_SB(sb), (void __user *)arg);
 	case EXT4_IOC_SETFSUUID:
 		return ext4_ioctl_setuuid(filp, (const void __user *)arg);
+	case EXT4_IOC_GET_TUNE_SB_PARAM:
+		return ext4_ioctl_get_tune_sb(EXT4_SB(sb),
+					      (void __user *)arg);
+	case EXT4_IOC_SET_TUNE_SB_PARAM:
+		return ext4_ioctl_set_tune_sb(filp, (void __user *)arg);
 	default:
 		return -ENOTTY;
 	}
@@ -1703,7 +2000,8 @@ long ext4_compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 }
 #endif
 
-static void set_overhead(struct ext4_super_block *es, const void *arg)
+static void set_overhead(struct ext4_sb_info *sbi,
+			 struct ext4_super_block *es, const void *arg)
 {
 	es->s_overhead_clusters = cpu_to_le32(*((unsigned long *) arg));
 }
diff --git a/include/uapi/linux/ext4.h b/include/uapi/linux/ext4.h
index 1c4c2dd29112cda9f7dc91d917492cffc33ee524..411dcc1e4a35c8c6a10f3768d17b8cc50cff4c34 100644
--- a/include/uapi/linux/ext4.h
+++ b/include/uapi/linux/ext4.h
@@ -33,6 +33,8 @@
 #define EXT4_IOC_CHECKPOINT		_IOW('f', 43, __u32)
 #define EXT4_IOC_GETFSUUID		_IOR('f', 44, struct fsuuid)
 #define EXT4_IOC_SETFSUUID		_IOW('f', 44, struct fsuuid)
+#define EXT4_IOC_GET_TUNE_SB_PARAM	_IOR('f', 45, struct ext4_tune_sb_params)
+#define EXT4_IOC_SET_TUNE_SB_PARAM	_IOW('f', 46, struct ext4_tune_sb_params)
 
 #define EXT4_IOC_SHUTDOWN _IOR('X', 125, __u32)
 
@@ -108,6 +110,57 @@ struct ext4_new_group_input {
 	__u16 unused;
 };
 
+struct ext4_tune_sb_params {
+	__u32 set_flags;
+	__u32 checkinterval;
+	__u16 errors_behavior;
+	__u16 mnt_count;
+	__u16 max_mnt_count;
+	__u16 raid_stride;
+	__u64 last_check_time;
+	__u64 reserved_blocks;
+	__u64 blocks_count;
+	__u32 default_mnt_opts;
+	__u32 reserved_uid;
+	__u32 reserved_gid;
+	__u32 raid_stripe_width;
+	__u16 encoding;
+	__u16 encoding_flags;
+	__u8  def_hash_alg;
+	__u8  pad_1;
+	__u16 pad_2;
+	__u32 feature_compat;
+	__u32 feature_incompat;
+	__u32 feature_ro_compat;
+	__u32 set_feature_compat_mask;
+	__u32 set_feature_incompat_mask;
+	__u32 set_feature_ro_compat_mask;
+	__u32 clear_feature_compat_mask;
+	__u32 clear_feature_incompat_mask;
+	__u32 clear_feature_ro_compat_mask;
+	__u8  mount_opts[64];
+	__u8  pad[64];
+};
+
+#define EXT4_TUNE_FL_ERRORS_BEHAVIOR	0x00000001
+#define EXT4_TUNE_FL_MNT_COUNT		0x00000002
+#define EXT4_TUNE_FL_MAX_MNT_COUNT	0x00000004
+#define EXT4_TUNE_FL_CHECKINTRVAL	0x00000008
+#define EXT4_TUNE_FL_LAST_CHECK_TIME	0x00000010
+#define EXT4_TUNE_FL_RESERVED_BLOCKS	0x00000020
+#define EXT4_TUNE_FL_RESERVED_UID	0x00000040
+#define EXT4_TUNE_FL_RESERVED_GID	0x00000080
+#define EXT4_TUNE_FL_DEFAULT_MNT_OPTS	0x00000100
+#define EXT4_TUNE_FL_DEF_HASH_ALG	0x00000200
+#define EXT4_TUNE_FL_RAID_STRIDE	0x00000400
+#define EXT4_TUNE_FL_RAID_STRIPE_WIDTH	0x00000800
+#define EXT4_TUNE_FL_MOUNT_OPTS		0x00001000
+#define EXT4_TUNE_FL_FEATURES		0x00002000
+#define EXT4_TUNE_FL_EDIT_FEATURES	0x00004000
+#define EXT4_TUNE_FL_FORCE_FSCK		0x00008000
+#define EXT4_TUNE_FL_ENCODING		0x00010000
+#define EXT4_TUNE_FL_ENCODING_FLAGS	0x00020000
+
 /*
  * Returned by EXT4_IOC_GET_ES_CACHE as an additional possible flag.
  * It indicates that the entry in extent status cache is for a hole.

-- 
2.51.0



^ permalink raw reply related

* Re: [PATCH RESEND 00/62] initrd: remove classic initrd support
From: Jessica Clarke @ 2025-09-16 17:08 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Eric Curtin, Alexander Graf, Rob Landley, Lennart Poettering,
	linux-arch, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-m68k, linux-mips,
	linux-openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, x86, Ingo Molnar,
	linux-block, initramfs, linux-api, linux-doc, linux-efi,
	linux-ext4, Theodore Y . Ts'o, linux-acpi, Michal Simek,
	devicetree, Luis Chamberlain, Kees Cook, Thorsten Blum,
	Heiko Carstens, patches
In-Reply-To: <20250913003842.41944-1-safinaskar@gmail.com>

On 13 Sep 2025, at 01:37, Askar Safin <safinaskar@gmail.com> wrote:
> [...]
> For example, I renamed the following global variables:
> 
> __initramfs_start
> __initramfs_size
> [...]
> 
> to:
> 
> __builtin_initramfs_start
> __builtin_initramfs_size

I strongly suggest picking different names given __builtin_foo is the
naming scheme used for GNU C builtins/intrinsics. I leave you and
others to bikeshed that one.

Jessica


^ permalink raw reply

* Re: [PATCH v21 4/8] fork: Add shadow stack support to clone3()
From: Yury Khrustalev @ 2025-09-16 12:29 UTC (permalink / raw)
  To: Mark Brown
  Cc: Rick P. Edgecombe, Deepak Gupta, Szabolcs Nagy, H.J. Lu,
	Florian Weimer, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, H. Peter Anvin, Peter Zijlstra, Juri Lelli,
	Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
	Mel Gorman, Valentin Schneider, Christian Brauner, Shuah Khan,
	linux-kernel, Catalin Marinas, Will Deacon, jannh, Andrew Morton,
	Wilco Dijkstra, linux-kselftest, linux-api, Kees Cook
In-Reply-To: <20250916-clone3-shadow-stack-v21-4-910493527013@kernel.org>

On Tue, Sep 16, 2025 at 12:12:09AM +0100, Mark Brown wrote:
> Unlike with the normal stack there is no API for configuring the shadow
> stack for a new thread, instead the kernel will dynamically allocate a
> new shadow stack with the same size as the normal stack. This appears to
> be due to the shadow stack series having been in development since
> before the more extensible clone3() was added rather than anything more
> deliberate.
> 
> Add a parameter to clone3() specifying a shadow stack pointer to use
> for the new thread, this is inconsistent with the way we specify the
> normal stack but during review concerns were expressed about having to
> identify where the shadow stack pointer should be placed especially in
> cases where the shadow stack has been previously active.  If no shadow
> stack is specified then the existing implicit allocation behaviour is
> maintained.
> 
> ...
> 
> Update the existing arm64 and x86 implementations to pay attention to
> the newly added arguments, in order to maintain compatibility we use the
> existing behaviour if no shadow stack is specified. Since we are now
> using more fields from the kernel_clone_args we pass that into the
> shadow stack code rather than individual fields.
> 
> Portions of the x86 architecture code were written by Rick Edgecombe.
> 
> Acked-by: Yury Khrustalev <yury.khrustalev@arm.com>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> Signed-off-by: Mark Brown <broonie@kernel.org>

I've tested this version together with my Glibc patches [1]. No issues
to report.

Tested-by: Yury Khrustalev <yury.khrustalev@arm.com>

[1]: https://inbox.sourceware.org/libc-alpha/20250916122757.2341920-1-yury.khrustalev@arm.com/

> ---
>  arch/arm64/mm/gcs.c              | 47 +++++++++++++++++++-
>  arch/x86/include/asm/shstk.h     | 11 +++--
>  arch/x86/kernel/process.c        |  2 +-
>  arch/x86/kernel/shstk.c          | 53 ++++++++++++++++++++---
>  include/asm-generic/cacheflush.h | 11 +++++
>  include/linux/sched/task.h       | 17 ++++++++
>  include/uapi/linux/sched.h       |  9 ++--
>  kernel/fork.c                    | 93 ++++++++++++++++++++++++++++++++++------
>  8 files changed, 217 insertions(+), 26 deletions(-)
> 
> diff --git a/arch/arm64/mm/gcs.c b/arch/arm64/mm/gcs.c
> index 3abcbf9adb5c..fd1d5a6655de 100644
> --- a/arch/arm64/mm/gcs.c
> +++ b/arch/arm64/mm/gcs.c
> @@ -43,8 +43,23 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
>  {
>  	unsigned long addr, size;
>  
> -	if (!system_supports_gcs())
> +	if (!system_supports_gcs()) {
> +		if (args->shstk_token)
> +			return -EINVAL;
> +
>  		return 0;
> +	}
> +
> +	/*
> +	 * If the user specified a GCS then use it, otherwise fall
> +	 * back to a default allocation strategy. Validation is done
> +	 * in arch_shstk_validate_clone().
> +	 */
> +	if (args->shstk_token) {
> +		tsk->thread.gcs_base = 0;
> +		tsk->thread.gcs_size = 0;
> +		return 0;
> +	}
>  
>  	if (!task_gcs_el0_enabled(tsk))
>  		return 0;
> @@ -68,6 +83,36 @@ int gcs_alloc_thread_stack(struct task_struct *tsk,
>  	return 0;
>  }
>  
> +static bool gcs_consume_token(struct vm_area_struct *vma, struct page *page,
> +			      unsigned long user_addr)
> +{
> +	u64 expected = GCS_CAP(user_addr);
> +	u64 *token = page_address(page) + offset_in_page(user_addr);
> +
> +	if (!cmpxchg_to_user_page(vma, page, user_addr, token, expected, 0))
> +		return false;
> +	set_page_dirty_lock(page);
> +
> +	return true;
> +}
> +
> +int arch_shstk_validate_clone(struct task_struct *tsk,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args)
> +{
> +	unsigned long gcspr_el0;
> +	int ret = 0;
> +
> +	gcspr_el0 = args->shstk_token;
> +	if (!gcs_consume_token(vma, page, gcspr_el0))
> +		return -EINVAL;
> +
> +	tsk->thread.gcspr_el0 = gcspr_el0 + sizeof(u64);
> +
> +	return ret;
> +}
> +
>  SYSCALL_DEFINE3(map_shadow_stack, unsigned long, addr, unsigned long, size, unsigned int, flags)
>  {
>  	unsigned long alloc_size;
> diff --git a/arch/x86/include/asm/shstk.h b/arch/x86/include/asm/shstk.h
> index 0f50e0125943..827e983430aa 100644
> --- a/arch/x86/include/asm/shstk.h
> +++ b/arch/x86/include/asm/shstk.h
> @@ -6,6 +6,7 @@
>  #include <linux/types.h>
>  
>  struct task_struct;
> +struct kernel_clone_args;
>  struct ksignal;
>  
>  #ifdef CONFIG_X86_USER_SHADOW_STACK
> @@ -16,8 +17,8 @@ struct thread_shstk {
>  
>  long shstk_prctl(struct task_struct *task, int option, unsigned long arg2);
>  void reset_thread_features(void);
> -unsigned long shstk_alloc_thread_stack(struct task_struct *p, u64 clone_flags,
> -				       unsigned long stack_size);
> +unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> +				       const struct kernel_clone_args *args);
>  void shstk_free(struct task_struct *p);
>  int setup_signal_shadow_stack(struct ksignal *ksig);
>  int restore_signal_shadow_stack(void);
> @@ -28,8 +29,10 @@ static inline long shstk_prctl(struct task_struct *task, int option,
>  			       unsigned long arg2) { return -EINVAL; }
>  static inline void reset_thread_features(void) {}
>  static inline unsigned long shstk_alloc_thread_stack(struct task_struct *p,
> -						     u64 clone_flags,
> -						     unsigned long stack_size) { return 0; }
> +						     const struct kernel_clone_args *args)
> +{
> +	return 0;
> +}
>  static inline void shstk_free(struct task_struct *p) {}
>  static inline int setup_signal_shadow_stack(struct ksignal *ksig) { return 0; }
>  static inline int restore_signal_shadow_stack(void) { return 0; }
> diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
> index e3a3987b0c4f..e8c8447cc0fd 100644
> --- a/arch/x86/kernel/process.c
> +++ b/arch/x86/kernel/process.c
> @@ -209,7 +209,7 @@ int copy_thread(struct task_struct *p, const struct kernel_clone_args *args)
>  	 * is disabled, new_ssp will remain 0, and fpu_clone() will know not to
>  	 * update it.
>  	 */
> -	new_ssp = shstk_alloc_thread_stack(p, clone_flags, args->stack_size);
> +	new_ssp = shstk_alloc_thread_stack(p, args);
>  	if (IS_ERR_VALUE(new_ssp))
>  		return PTR_ERR((void *)new_ssp);
>  
> diff --git a/arch/x86/kernel/shstk.c b/arch/x86/kernel/shstk.c
> index 5eba6c5a6775..56893523b6f2 100644
> --- a/arch/x86/kernel/shstk.c
> +++ b/arch/x86/kernel/shstk.c
> @@ -191,18 +191,61 @@ void reset_thread_features(void)
>  	current->thread.features_locked = 0;
>  }
>  
> -unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, u64 clone_flags,
> -				       unsigned long stack_size)
> +int arch_shstk_validate_clone(struct task_struct *t,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args)
> +{
> +	void *maddr = page_address(page);
> +	unsigned long token;
> +	int offset;
> +	u64 expected;
> +
> +	/*
> +	 * kernel_clone_args() verification assures token address is 8
> +	 * byte aligned.
> +	 */
> +	token = args->shstk_token;
> +	expected = (token + SS_FRAME_SIZE) | BIT(0);
> +	offset = offset_in_page(token);
> +
> +	if (!cmpxchg_to_user_page(vma, page, token, (unsigned long *)(maddr + offset),
> +				  expected, 0))
> +		return -EINVAL;
> +	set_page_dirty_lock(page);
> +
> +	return 0;
> +}
> +
> +unsigned long shstk_alloc_thread_stack(struct task_struct *tsk,
> +				       const struct kernel_clone_args *args)
>  {
>  	struct thread_shstk *shstk = &tsk->thread.shstk;
> +	u64 clone_flags = args->flags;
>  	unsigned long addr, size;
>  
>  	/*
>  	 * If shadow stack is not enabled on the new thread, skip any
> -	 * switch to a new shadow stack.
> +	 * implicit switch to a new shadow stack and reject attempts to
> +	 * explicitly specify one.
>  	 */
> -	if (!features_enabled(ARCH_SHSTK_SHSTK))
> +	if (!features_enabled(ARCH_SHSTK_SHSTK)) {
> +		if (args->shstk_token)
> +			return (unsigned long)ERR_PTR(-EINVAL);
> +
>  		return 0;
> +	}
> +
> +	/*
> +	 * If the user specified a shadow stack then use it, otherwise
> +	 * fall back to a default allocation strategy. Validation is
> +	 * done in arch_shstk_validate_clone().
> +	 */
> +	if (args->shstk_token) {
> +		shstk->base = 0;
> +		shstk->size = 0;
> +		return args->shstk_token + 8;
> +	}
>  
>  	/*
>  	 * For CLONE_VFORK the child will share the parents shadow stack.
> @@ -222,7 +265,7 @@ unsigned long shstk_alloc_thread_stack(struct task_struct *tsk, u64 clone_flags,
>  	if (!(clone_flags & CLONE_VM))
>  		return 0;
>  
> -	size = adjust_shstk_size(stack_size);
> +	size = adjust_shstk_size(args->stack_size);
>  	addr = alloc_shstk(0, size, 0, false);
>  	if (IS_ERR_VALUE(addr))
>  		return addr;
> diff --git a/include/asm-generic/cacheflush.h b/include/asm-generic/cacheflush.h
> index 7ee8a179d103..96cc0c7a5c90 100644
> --- a/include/asm-generic/cacheflush.h
> +++ b/include/asm-generic/cacheflush.h
> @@ -124,4 +124,15 @@ static inline void flush_cache_vunmap(unsigned long start, unsigned long end)
>  	} while (0)
>  #endif
>  
> +#ifndef cmpxchg_to_user_page
> +#define cmpxchg_to_user_page(vma, page, vaddr, ptr, old, new)  \
> +({							  \
> +	bool ret;						  \
> +								  \
> +	ret = try_cmpxchg(ptr, &old, new);			  \
> +	flush_icache_user_page(vma, page, vaddr, sizeof(*ptr));	  \
> +	ret;							  \
> +})
> +#endif
> +
>  #endif /* _ASM_GENERIC_CACHEFLUSH_H */
> diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
> index 34d6a0e108c3..96fb485ff2dd 100644
> --- a/include/linux/sched/task.h
> +++ b/include/linux/sched/task.h
> @@ -16,6 +16,7 @@ struct task_struct;
>  struct rusage;
>  union thread_union;
>  struct css_set;
> +struct vm_area_struct;
>  
>  /* All the bits taken by the old clone syscall. */
>  #define CLONE_LEGACY_FLAGS 0xffffffffULL
> @@ -44,6 +45,7 @@ struct kernel_clone_args {
>  	struct cgroup *cgrp;
>  	struct css_set *cset;
>  	unsigned int kill_seq;
> +	unsigned long shstk_token;
>  };
>  
>  /*
> @@ -226,4 +228,19 @@ static inline void task_unlock(struct task_struct *p)
>  
>  DEFINE_GUARD(task_lock, struct task_struct *, task_lock(_T), task_unlock(_T))
>  
> +#ifdef CONFIG_ARCH_HAS_USER_SHADOW_STACK
> +int arch_shstk_validate_clone(struct task_struct *p,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args);
> +#else
> +static inline int arch_shstk_validate_clone(struct task_struct *p,
> +					    struct vm_area_struct *vma,
> +					    struct page *page,
> +					    struct kernel_clone_args *args)
> +{
> +	return 0;
> +}
> +#endif
> +
>  #endif /* _LINUX_SCHED_TASK_H */
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 359a14cc76a4..7e18e7b3df3a 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -84,6 +84,7 @@
>   *                kernel's limit of nested PID namespaces.
>   * @cgroup:       If CLONE_INTO_CGROUP is specified set this to
>   *                a file descriptor for the cgroup.
> + * @shstk_token:  Pointer to shadow stack token at top of stack.
>   *
>   * The structure is versioned by size and thus extensible.
>   * New struct members must go at the end of the struct and
> @@ -101,12 +102,14 @@ struct clone_args {
>  	__aligned_u64 set_tid;
>  	__aligned_u64 set_tid_size;
>  	__aligned_u64 cgroup;
> +	__aligned_u64 shstk_token;
>  };
>  #endif
>  
> -#define CLONE_ARGS_SIZE_VER0 64 /* sizeof first published struct */
> -#define CLONE_ARGS_SIZE_VER1 80 /* sizeof second published struct */
> -#define CLONE_ARGS_SIZE_VER2 88 /* sizeof third published struct */
> +#define CLONE_ARGS_SIZE_VER0  64 /* sizeof first published struct */
> +#define CLONE_ARGS_SIZE_VER1  80 /* sizeof second published struct */
> +#define CLONE_ARGS_SIZE_VER2  88 /* sizeof third published struct */
> +#define CLONE_ARGS_SIZE_VER3  96 /* sizeof fourth published struct */
>  
>  /*
>   * Scheduling policies
> diff --git a/kernel/fork.c b/kernel/fork.c
> index d6e1fb11eff9..0dbfc8828e08 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1907,6 +1907,51 @@ static bool need_futex_hash_allocate_default(u64 clone_flags)
>  	return true;
>  }
>  
> +static int shstk_validate_clone(struct task_struct *p,
> +				struct kernel_clone_args *args)
> +{
> +	struct mm_struct *mm;
> +	struct vm_area_struct *vma;
> +	struct page *page;
> +	unsigned long addr;
> +	int ret;
> +
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK))
> +		return 0;
> +
> +	if (!args->shstk_token)
> +		return 0;
> +
> +	mm = get_task_mm(p);
> +	if (!mm)
> +		return -EFAULT;
> +
> +	mmap_read_lock(mm);
> +
> +	addr = untagged_addr_remote(mm, args->shstk_token);
> +	page = get_user_page_vma_remote(mm, addr, FOLL_FORCE | FOLL_WRITE,
> +					&vma);
> +	if (IS_ERR(page)) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	if (!(vma->vm_flags & VM_SHADOW_STACK) ||
> +	    !(vma->vm_flags & VM_WRITE)) {
> +		ret = -EFAULT;
> +		goto out_page;
> +	}
> +
> +	ret = arch_shstk_validate_clone(p, vma, page, args);
> +
> +out_page:
> +	put_page(page);
> +out:
> +	mmap_read_unlock(mm);
> +	mmput(mm);
> +	return ret;
> +}
> +
>  /*
>   * This creates a new process as a copy of the old one,
>   * but does not actually start it yet.
> @@ -2182,6 +2227,9 @@ __latent_entropy struct task_struct *copy_process(
>  	if (retval)
>  		goto bad_fork_cleanup_namespaces;
>  	retval = copy_thread(p, args);
> +	if (retval)
> +		goto bad_fork_cleanup_io;
> +	retval = shstk_validate_clone(p, args);
>  	if (retval)
>  		goto bad_fork_cleanup_io;
>  
> @@ -2763,7 +2811,9 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		     CLONE_ARGS_SIZE_VER1);
>  	BUILD_BUG_ON(offsetofend(struct clone_args, cgroup) !=
>  		     CLONE_ARGS_SIZE_VER2);
> -	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER2);
> +	BUILD_BUG_ON(offsetofend(struct clone_args, shstk_token) !=
> +		     CLONE_ARGS_SIZE_VER3);
> +	BUILD_BUG_ON(sizeof(struct clone_args) != CLONE_ARGS_SIZE_VER3);
>  
>  	if (unlikely(usize > PAGE_SIZE))
>  		return -E2BIG;
> @@ -2796,16 +2846,17 @@ static noinline int copy_clone_args_from_user(struct kernel_clone_args *kargs,
>  		return -EINVAL;
>  
>  	*kargs = (struct kernel_clone_args){
> -		.flags		= args.flags,
> -		.pidfd		= u64_to_user_ptr(args.pidfd),
> -		.child_tid	= u64_to_user_ptr(args.child_tid),
> -		.parent_tid	= u64_to_user_ptr(args.parent_tid),
> -		.exit_signal	= args.exit_signal,
> -		.stack		= args.stack,
> -		.stack_size	= args.stack_size,
> -		.tls		= args.tls,
> -		.set_tid_size	= args.set_tid_size,
> -		.cgroup		= args.cgroup,
> +		.flags			= args.flags,
> +		.pidfd			= u64_to_user_ptr(args.pidfd),
> +		.child_tid		= u64_to_user_ptr(args.child_tid),
> +		.parent_tid		= u64_to_user_ptr(args.parent_tid),
> +		.exit_signal		= args.exit_signal,
> +		.stack			= args.stack,
> +		.stack_size		= args.stack_size,
> +		.tls			= args.tls,
> +		.set_tid_size		= args.set_tid_size,
> +		.cgroup			= args.cgroup,
> +		.shstk_token		= args.shstk_token,
>  	};
>  
>  	if (args.set_tid &&
> @@ -2846,6 +2897,24 @@ static inline bool clone3_stack_valid(struct kernel_clone_args *kargs)
>  	return true;
>  }
>  
> +/**
> + * clone3_shadow_stack_valid - check and prepare shadow stack
> + * @kargs: kernel clone args
> + *
> + * Verify that shadow stacks are only enabled if supported.
> + */
> +static inline bool clone3_shadow_stack_valid(struct kernel_clone_args *kargs)
> +{
> +	if (!kargs->shstk_token)
> +		return true;
> +
> +	if (!IS_ALIGNED(kargs->shstk_token, sizeof(void *)))
> +		return false;
> +
> +	/* Fail if the kernel wasn't built with shadow stacks */
> +	return IS_ENABLED(CONFIG_ARCH_HAS_USER_SHADOW_STACK);
> +}
> +
>  static bool clone3_args_valid(struct kernel_clone_args *kargs)
>  {
>  	/* Verify that no unknown flags are passed along. */
> @@ -2868,7 +2937,7 @@ static bool clone3_args_valid(struct kernel_clone_args *kargs)
>  	    kargs->exit_signal)
>  		return false;
>  
> -	if (!clone3_stack_valid(kargs))
> +	if (!clone3_stack_valid(kargs) || !clone3_shadow_stack_valid(kargs))
>  		return false;
>  
>  	return true;
> 
> -- 
> 2.47.2
> 

^ permalink raw reply

* Re: [PATCH RESEND 28/62] init: alpha, arc, arm, arm64, csky, m68k, microblaze, mips, nios2, openrisc, parisc, powerpc, s390, sh, sparc, um, x86, xtensa: rename initrd_{start,end} to virt_external_initramfs_{start,end}
From: Rob Herring @ 2025-09-16  3:09 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Eric Curtin, Alexander Graf, Rob Landley, Lennart Poettering,
	linux-arch, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-m68k, linux-mips,
	linux-openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, x86, Ingo Molnar,
	linux-block, initramfs, linux-api, linux-doc, linux-efi,
	linux-ext4, Theodore Y . Ts'o, linux-acpi, Michal Simek,
	devicetree, Luis Chamberlain, Kees Cook, Thorsten Blum,
	Heiko Carstens, patches
In-Reply-To: <20250913003842.41944-29-safinaskar@gmail.com>

On Sat, Sep 13, 2025 at 12:38:07AM +0000, Askar Safin wrote:
> Rename initrd_start to virt_external_initramfs_start and
> initrd_end to virt_external_initramfs_end.

There's not really any point in listing every arch in the subject.

Rob

^ permalink raw reply

* Re: [PATCH 00/62] initrd: remove classic initrd support
From: Dave Young @ 2025-09-16  1:48 UTC (permalink / raw)
  To: Askar Safin
  Cc: linux-fsdevel, linux-kernel, Linus Torvalds, Greg Kroah-Hartman,
	Christian Brauner, Al Viro, Jan Kara, Christoph Hellwig,
	Jens Axboe, Andy Shevchenko, Aleksa Sarai, Thomas Weißschuh,
	Julian Stecklina, Gao Xiang, Art Nikpal, Andrew Morton,
	Eric Curtin, Alexander Graf, Rob Landley, Lennart Poettering,
	linux-arch, linux-alpha, linux-snps-arc, linux-arm-kernel,
	linux-csky, linux-hexagon, loongarch, linux-m68k, linux-mips,
	linux-openrisc, linux-parisc, linuxppc-dev, linux-riscv,
	linux-s390, linux-sh, sparclinux, linux-um, x86, Ingo Molnar,
	linux-block, initramfs, linux-api, linux-doc, linux-efi,
	linux-ext4, Theodore Y . Ts'o, linux-acpi, Michal Simek,
	devicetree, Luis Chamberlain, Kees Cook, Thorsten Blum,
	Heiko Carstens, patches
In-Reply-To: <20250912223937.3735076-1-safinaskar@zohomail.com>

Hi,

On Sat, 13 Sept 2025 at 06:42, Askar Safin <safinaskar@zohomail.com> wrote:
>
> Intro
> ====
> This patchset removes classic initrd (initial RAM disk) support,
> which was deprecated in 2020.
> Initramfs still stays, and RAM disk itself (brd) still stays, too.

There is one initrd use case in my mind, it can be extended to co-work
with overlayfs as a kernel built-in solution for initrd(compressed fs
image)+overlayfs.   Currently we can use compressed fs images
(squashfs or erofs) within initramfs,  and kernel loop mount together
with overlayfs, this works fine but extra pre-mount phase is needed.

Thanks
Dave


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox