Linux userland API discussions
* [PATCH v5 1/8] man/man2/fsopen.2: document "new" mount API
From: Aleksa Sarai @ 2025-09-24 15:31 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner, Aleksa Sarai
In-Reply-To: <20250925-new-mount-api-v5-0-028fb88023f2@cyphar.com>

This is loosely based on the original documentation written by David
Howells and later maintained by Christian Brauner, but has been
rewritten from a user's perspective (and fixes a few critical
mistakes).

Co-authored-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Co-authored-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
 man/man2/fsopen.2 | 385 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 385 insertions(+)
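
(Not part of the patch.)  The EXAMPLES in this page have no error
checks, so here is a minimal sketch of the same workflow with them
added, printing any queued messages on failure.  It assumes raw
syscall(2) invocation (so that it builds against a glibc older than
2.36) and kernel headers from Linux 5.2 or later.  The helper names
mount_ro() and dump_messages() are made up for illustration, and the
4096-byte buffer is a guess at "large enough", not a documented limit.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>          /* AT_FDCWD */
#include <linux/mount.h>    /* FSOPEN_*, FSCONFIG_*, FSMOUNT_*, MOVE_MOUNT_* */
#include <stdio.h>
#include <sys/syscall.h>    /* SYS_* constants */
#include <unistd.h>

/* Hypothetical helper: print queued messages until read(2) stops
 * returning data (normally because the queue is empty). */
static void dump_messages(int fsfd)
{
    char buf[4096];  /* assumed large, to reduce the risk of EMSGSIZE */
    ssize_t n;

    while ((n = read(fsfd, buf, sizeof(buf))) > 0)
        fprintf(stderr, "fscontext: %.*s\n", (int)n, buf);
}

/* Hypothetical helper: mount 'source' read-only on 'target'.
 * Returns 0 on success, or -1 with errno set on failure. */
static int mount_ro(const char *fstype, const char *source,
                    const char *target)
{
    int fsfd, mntfd = -1, ret = -1, saved;

    fsfd = (int)syscall(SYS_fsopen, fstype, FSOPEN_CLOEXEC);
    if (fsfd < 0)
        return -1;

    /* Configure the context, then create the filesystem instance. */
    if (syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0) < 0 ||
        syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_STRING,
                "source", source, 0) < 0 ||
        syscall(SYS_fsconfig, fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0) < 0)
        goto out;

    /* Create a detached mount object for the new instance. */
    mntfd = (int)syscall(SYS_fsmount, fsfd, FSMOUNT_CLOEXEC, 0);
    if (mntfd < 0)
        goto out;

    /* Attach the mount object to the mount point. */
    if (syscall(SYS_move_mount, mntfd, "", AT_FDCWD, target,
                MOVE_MOUNT_F_EMPTY_PATH) == 0)
        ret = 0;
out:
    saved = errno;              /* dump_messages() may clobber errno */
    if (ret < 0)
        dump_messages(fsfd);
    if (mntfd >= 0)
        close(mntfd);
    close(fsfd);
    errno = saved;
    return ret;
}
```

A caller would use it as mount_ro("ext4", "/dev/sdb1", "/mnt") and
check for -1.  Without CAP_SYS_ADMIN the fsopen() step fails with EPERM
before any messages can be queued.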

diff --git a/man/man2/fsopen.2 b/man/man2/fsopen.2
new file mode 100644
index 0000000000000000000000000000000000000000..7fbc6c3d28e2e741cd9003c105621b4242abd487
--- /dev/null
+++ b/man/man2/fsopen.2
@@ -0,0 +1,385 @@
+.\" Copyright, the authors of the Linux man-pages project
+.\"
+.\" SPDX-License-Identifier: Linux-man-pages-copyleft
+.\"
+.TH fsopen 2 (date) "Linux man-pages (unreleased)"
+.SH NAME
+fsopen \- create a new filesystem context
+.SH LIBRARY
+Standard C library
+.RI ( libc ,\~ \-lc )
+.SH SYNOPSIS
+.nf
+.B #include <sys/mount.h>
+.P
+.BI "int fsopen(const char *" fsname ", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR fsopen ()
+system call is part of
+the suite of file-descriptor-based mount facilities in Linux.
+.P
+.BR fsopen ()
+creates a blank filesystem configuration context within the kernel
+for the filesystem named by
+.I fsname
+and places it into creation mode.
+A new file descriptor
+associated with the filesystem configuration context
+is then returned.
+The calling process must have the
+.B \%CAP_SYS_ADMIN
+capability in order to create a new filesystem configuration context.
+.P
+A filesystem configuration context is
+an in-kernel representation of a pending transaction,
+containing a set of configuration parameters that are to be applied
+when creating a new instance of a filesystem
+(or modifying the configuration of an existing filesystem instance,
+such as when using
+.BR fspick (2)).
+.P
+After obtaining a filesystem configuration context with
+.BR fsopen (),
+the general workflow for operating on the context looks like the following:
+.IP (1) 5
+Pass the filesystem context file descriptor to
+.BR fsconfig (2)
+to specify any desired filesystem parameters.
+This may be done as many times as necessary.
+.IP (2)
+Pass the same filesystem context file descriptor to
+.BR fsconfig (2)
+with
+.B \%FSCONFIG_CMD_CREATE
+to create an instance of the configured filesystem.
+.IP (3)
+Pass the same filesystem context file descriptor to
+.BR fsmount (2)
+to create a new detached mount object for
+the root of the filesystem instance,
+which is then attached to a new file descriptor.
+(This also places the filesystem context file descriptor into
+reconfiguration mode,
+similar to the mode produced by
+.BR fspick (2).)
+Once a mount object has been created with
+.BR fsmount (2),
+the filesystem context file descriptor can be safely closed.
+.IP (4)
+Now that a mount object has been created,
+you may
+.RS
+.IP \[bu] 3
+use the detached mount object file descriptor as a
+.I dirfd
+argument to "*at()" system calls;
+and/or
+.IP \[bu]
+attach the mount object to a mount point
+by passing the mount object file descriptor to
+.BR move_mount (2).
+This will also prevent the mount object from
+being unmounted and destroyed when
+the mount object file descriptor is closed.
+.RE
+.IP
+The mount object file descriptor will
+remain associated with the mount object
+even after doing the above operations,
+so you may repeatedly use the mount object file descriptor with
+.BR move_mount (2)
+and/or "*at()" system calls
+as many times as necessary.
+.P
+A filesystem context will move between different modes
+throughout its lifecycle
+(such as the creation phase
+when created with
+.BR fsopen (),
+the reconfiguration phase
+when an existing filesystem instance is selected with
+.BR fspick (2),
+and the intermediate "awaiting-mount" phase
+.\" FS_CONTEXT_AWAITING_MOUNT is the term the kernel uses for this.
+between
+.B \%FSCONFIG_CMD_CREATE
+and
+.BR fsmount (2)),
+which has an impact on
+what operations are permitted on the filesystem context.
+.P
+The file descriptor returned by
+.BR fsopen ()
+also acts as a channel for filesystem drivers to
+provide more comprehensive diagnostic information
+than is normally provided through the standard
+.BR errno (3)
+interface for system calls.
+If an error occurs at any time during the workflow mentioned above,
+calling
+.BR read (2)
+on the filesystem context file descriptor
+will retrieve any ancillary information about the encountered errors.
+(See the "Message retrieval interface" subsection
+for more details on the message format.)
+.P
+.I flags
+can be used to control aspects of
+the creation of the filesystem configuration context file descriptor.
+A value for
+.I flags
+is constructed by bitwise ORing
+zero or more of the following constants:
+.RS
+.TP
+.B FSOPEN_CLOEXEC
+Set the close-on-exec
+.RB ( FD_CLOEXEC )
+flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.RE
+.P
+A list of filesystems supported by the running kernel
+(and thus a list of valid values for
+.IR fsname )
+can be obtained from
+.IR /proc/filesystems .
+(See also
+.BR proc_filesystems (5).)
+.SS Message retrieval interface
+When doing operations on a filesystem configuration context,
+the filesystem driver may choose to provide
+ancillary information to userspace
+in the form of message strings.
+.P
+The filesystem context file descriptors returned by
+.BR fsopen ()
+and
+.BR fspick (2)
+may be queried for message strings at any time by calling
+.BR read (2)
+on the file descriptor.
+Each call to
+.BR read (2)
+will return a single message,
+prefixed to indicate its class:
+.RS
+.TP
+.BI e\~ message
+An error message was logged.
+This is usually associated with an error being returned
+from the corresponding system call which triggered this message.
+.TP
+.BI w\~ message
+A warning message was logged.
+.TP
+.BI i\~ message
+An informational message was logged.
+.RE
+.P
+Messages are removed from the queue as they are read.
+Note that the message queue has limited depth,
+so it is possible for messages to get lost.
+If there are no messages in the message queue,
+.BR read (2)
+will return \-1 and
+.I errno
+will be set to
+.BR \%ENODATA .
+If the
+.I buf
+argument to
+.BR read (2)
+is not large enough to contain the entire message,
+.BR read (2)
+will return \-1 and
+.I errno
+will be set to
+.BR \%EMSGSIZE .
+(See BUGS.)
+.P
+If there are multiple filesystem contexts
+referencing the same filesystem instance
+(such as if you call
+.BR fspick (2)
+multiple times for the same mount),
+each one gets its own independent message queue.
+This does not apply to multiple file descriptors that are
+tied to the same underlying open file description
+(such as those created with
+.BR dup (2)).
+.P
+Message strings will usually be prefixed by
+the name of the filesystem or kernel subsystem
+that logged the message,
+though this may not always be the case.
+See the Linux kernel source code for details.
+.SH RETURN VALUE
+On success, a new file descriptor is returned.
+On error, \-1 is returned, and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EFAULT
+.I fsname
+is NULL
+or a pointer to a location
+outside the calling process's accessible address space.
+.TP
+.B EINVAL
+.I flags
+had an invalid flag set.
+.TP
+.B EMFILE
+The calling process has too many open files to create more.
+.TP
+.B ENFILE
+The system has too many open files to create more.
+.TP
+.B ENODEV
+The filesystem named by
+.I fsname
+is not supported by the kernel.
+.TP
+.B ENOMEM
+The kernel could not allocate sufficient memory to complete the operation.
+.TP
+.B EPERM
+The calling process does not have the required
+.B \%CAP_SYS_ADMIN
+capability.
+.SH STANDARDS
+Linux.
+.SH HISTORY
+Linux 5.2.
+.\" commit 24dcb3d90a1f67fe08c68a004af37df059d74005
+.\" commit 400913252d09f9cfb8cce33daee43167921fc343
+glibc 2.36.
+.SH BUGS
+.SS Message retrieval interface and \fB\%EMSGSIZE\fP
+As described in the "Message retrieval interface" subsection above,
+calling
+.BR read (2)
+with too small a buffer to contain
+the next pending message in the message queue
+for the filesystem configuration context
+will cause
+.BR read (2)
+to return \-1 and set
+.BR errno (3)
+to
+.BR \%EMSGSIZE .
+.P
+However,
+this failed operation still
+consumes the message from the message queue.
+This effectively discards the message silently,
+as no data is copied into the
+.BR read (2)
+buffer.
+.P
+Programs should take care to ensure that
+their buffers are sufficiently large
+to contain any reasonable message string,
+in order to avoid silently losing valuable diagnostic information.
+.\" Aleksa Sarai
+.\"   This unfortunate behaviour has existed since this feature was merged, but
+.\"   I have sent a patchset which will finally fix it.
+.\"   <https://lore.kernel.org/r/20250807-fscontext-log-cleanups-v3-1-8d91d6242dc3@cyphar.com/>
+.SH EXAMPLES
+To illustrate the workflow for creating a new mount,
+the following is an example of how to mount an
+.BR ext4 (5)
+filesystem stored on
+.I /dev/sdb1
+onto
+.IR /mnt .
+.P
+.in +4n
+.EX
+int fsfd, mntfd;
+\&
+fsfd = fsopen("ext4", FSOPEN_CLOEXEC);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "ro", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_PATH, "source", "/dev/sdb1", AT_FDCWD);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "noatime", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "acl", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "user_xattr", NULL, 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "iversion", NULL, 0);
+fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, MOUNT_ATTR_RELATIME);
+move_mount(mntfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
+.EE
+.in
+.P
+First,
+an ext4 configuration context is created and attached to the file descriptor
+.IR fsfd .
+Then, a series of parameters
+(such as the source of the filesystem)
+are provided using
+.BR fsconfig (2),
+followed by the filesystem instance being created with
+.BR \%FSCONFIG_CMD_CREATE .
+.BR fsmount (2)
+is then used to create a new mount object attached to the file descriptor
+.IR mntfd ,
+which is then attached to the intended mount point using
+.BR move_mount (2).
+.P
+The above procedure is functionally equivalent to
+the following mount operation using
+.BR mount (2):
+.P
+.in +4n
+.EX
+mount("/dev/sdb1", "/mnt", "ext4", MS_RELATIME,
+      "ro,noatime,acl,user_xattr,iversion");
+.EE
+.in
+.P
+And here's an example of creating a mount object
+of an NFS server share
+and setting a Smack security module label.
+However, instead of attaching it to a mount point,
+the program uses the mount object directly
+to open a file from the NFS share.
+.P
+.in +4n
+.EX
+int fsfd, mntfd, fd;
+\&
+fsfd = fsopen("nfs", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "source", "example.com/pub", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "nfsvers", "3", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "rsize", "65536", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "wsize", "65536", 0);
+fsconfig(fsfd, FSCONFIG_SET_STRING, "smackfsdef", "foolabel", 0);
+fsconfig(fsfd, FSCONFIG_SET_FLAG, "rdma", NULL, 0);
+fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
+mntfd = fsmount(fsfd, 0, MOUNT_ATTR_NODEV);
+fd = openat(mntfd, "src/linux-5.2.tar.xz", O_RDONLY);
+.EE
+.in
+.P
+Unlike the previous example,
+this operation has no trivial equivalent with
+.BR mount (2),
+as it was not previously possible to create a mount object
+that is not attached to any mount point.
+.SH SEE ALSO
+.BR fsconfig (2),
+.BR fsmount (2),
+.BR fspick (2),
+.BR mount (2),
+.BR mount_setattr (2),
+.BR move_mount (2),
+.BR open_tree (2),
+.BR mount_namespaces (7)

-- 
2.51.0


* [PATCH v5 0/8] man2: document "new" mount API
From: Aleksa Sarai @ 2025-09-24 15:31 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner, Aleksa Sarai

Back in 2019, the new mount API was merged[1]. David Howells then set
about writing man pages for these new APIs, and sent some patches back
in 2020[2].

Unfortunately, these patches were never merged, which meant that these
APIs were practically undocumented for many years -- arguably this has
been a contributing factor to the relatively slow adoption of these new
(far better) APIs. For instance, I have often discovered that many folks
are unaware of the read(2)-based message retrieval interface provided by
filesystem context file descriptors.
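
For readers who have not seen that interface, here is a rough sketch of
draining the message queue as the fsopen(2) page in this series
describes it: each read(2) returns one message with a one-letter class
prefix, and ENODATA means the queue is empty.  The names
fs_msg_classify() and fs_msg_drain() are hypothetical helpers, the
single space after the prefix letter is assumed from how the page
renders the format, and the 4096-byte buffer is a guess at "large
enough".

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Message classes, following the e/w/i prefixes in fsopen(2). */
enum fs_msg_class { FS_MSG_ERROR, FS_MSG_WARNING, FS_MSG_INFO, FS_MSG_UNKNOWN };

/* Hypothetical helper: split "<class> <text>" into class and text.
 * 'msg' need not be NUL-terminated; 'len' is the read(2) return value. */
static enum fs_msg_class fs_msg_classify(const char *msg, size_t len,
                                         const char **text, size_t *text_len)
{
    enum fs_msg_class cls;

    /* Default: hand back the whole buffer as the text. */
    *text = msg;
    *text_len = len;

    if (len < 2 || msg[1] != ' ')
        return FS_MSG_UNKNOWN;

    switch (msg[0]) {
    case 'e': cls = FS_MSG_ERROR; break;
    case 'w': cls = FS_MSG_WARNING; break;
    case 'i': cls = FS_MSG_INFO; break;
    default:  return FS_MSG_UNKNOWN;
    }

    *text = msg + 2;
    *text_len = len - 2;
    return cls;
}

typedef void (*fs_msg_cb)(enum fs_msg_class cls, const char *text,
                          size_t len, void *arg);

/* Hypothetical helper: read messages until the queue is empty.
 * Returns the number of messages seen, or -1 on a read(2) error. */
static int fs_msg_drain(int fd, fs_msg_cb cb, void *arg)
{
    char buf[4096];  /* assumed large, to reduce the risk of EMSGSIZE */
    int count = 0;

    for (;;) {
        const char *text;
        size_t len;
        enum fs_msg_class cls;
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n < 0) {
            if (errno == ENODATA)
                return count;   /* queue is empty */
            if (errno == EINTR)
                continue;
            return -1;          /* includes EMSGSIZE: that message is lost */
        }
        if (n == 0)
            return count;       /* defensive: treat end-of-file as empty */

        cls = fs_msg_classify(buf, (size_t)n, &text, &len);
        if (cb != NULL)
            cb(cls, text, len, arg);
        count++;
    }
}
```

Calling fs_msg_drain() right after a failed fsconfig(2) or fsmount(2)
is the obvious use.  Messages are consumed as they are read, so the
callback is the only chance to log them.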

In 2024, Christian Brauner adapted David Howells's original man pages
into the easier-to-edit Markdown format and published them on GitHub[3].
These have been maintained since, including updated information on new
features added since David Howells's 2020 draft pages (such as
MOVE_MOUNT_BENEATH).

While this was a welcome improvement over the previous status quo (which
had lasted over 6 years), in my personal experience not having access to
these man pages from the terminal has been a fairly common pain point.

So, this is a modern version of the man pages for these APIs, in the
hopes that we can finally (6 years later) get proper documentation for
these APIs in the man-pages project.

One important thing to note is that most of these were re-written by me,
with very minimal copying from the versions available from Christian[3].
The reasons for this are three-fold:

 * Both Howells's original version and Christian's maintained versions
   contain crucial mistakes that I have been bitten by in the past (the
   most obvious being that all of these APIs were merged in Linux 5.2,
   but the man pages all claim they were merged in different versions.)

 * As the man pages appear to have been written from Howells's
   perspective while implementing them, some of the wording is a little
   too tied to the implementation (or appears to describe features that
   don't really exist in the merged versions of these APIs).

 * The original versions of the man-pages lacked bigger-picture
   explanations of the reasoning behind the API, which would make it
   easier for readers to understand what operations are doing.

I decided that the best way to resolve these issues is to rewrite them
from the perspective of an actual user of these APIs (me), and check
that we do not repeat the mistakes I found in the originals. I have also
done my best to resolve the issues raised by Michael Kerrisk on the
original patchset sent by Howells[2].

In addition, I have also included a man page for open_tree_attr(2) (as a
subsection of the new open_tree(2) man page), which was merged in Linux
6.15.

[1]: https://lore.kernel.org/all/20190507204921.GL23075@ZenIV.linux.org.uk/
[2]: https://lore.kernel.org/linux-man/159680892602.29015.6551860260436544999.stgit@warthog.procyon.org.uk/
[3]: https://github.com/brauner/man-pages-md

Co-authored-by: David Howells <dhowells@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Co-authored-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
---
Changes in v5:
- `sed -i s|file descriptor based|file-descriptor-based|`.
  [Alejandro Colomar]
- fsconfig(2): use bullets instead of ordered list for workflow
  description. [Alejandro Colomar]
- mount_setattr(2): fix minor wording nit in new attribute-parameter
  subsection.
- fsopen(2): remove brackets around "message" for message retrieval
  interface description. [Alejandro Colomar]
- {move_mount,fspick}(2): fix remaining incorrect no-automount text.
  [Askar Safin]
- {fsmount,open_tree}(2): `sed -i s|MOUNT_DETACH|MNT_DETACH|g`.
  [Askar Safin]
- mount_setattr(2): fix copy-paste snafu in attribute-parameter
  subsection. [Askar Safin]
- *: clean `make -R build-catman-troff`. [Alejandro Colomar]
- *: switch to \[em]\c where appropriate.
- open_tree(2): clean up MNT_DETACH-on-close description and make it
  slightly more prominent. [Alejandro Colomar]
- open_tree(2): mention the distinction from open(O_PATH) with regards
  to automounts. Askar suggested it be put in the section about
  ~OPEN_TREE_CLONE, but the change in behaviour also applies to
  OPEN_TREE_CLONE and it looked awkward to include it in the
  dentry_open() case because O_PATH only gets mentioned in the following
  paragraph (where I've put the text now). [Askar Safin]
- {move_mount,open_tree{,_attr}}(2): fix column-width-related "make -R
  check" failures.
- *: fix remaining "make -R lint" failures.
- open_tree_attr(2): add example using MOUNT_ATTR_IDMAP.
- v4: <https://lore.kernel.org/r/20250919-new-mount-api-v4-0-1261201ab562@cyphar.com>

Changes in v4:
- `sed -i s|\\% |\\%|g`.
- Remove unneeded quotes in SYNOPSIS. [Alejandro Colomar]
- open_tree(2): fix leftover confusing usages of "attach" when referring
  to file descriptors being associated with mount objects.
- open_tree(2): rename "Anonymous mount namespaces" NOTES subsection to
  the far more informative "Mount propagation" and clean up the wording
  a little.
- open_tree_attr(2): add a code comment about
  <https://lore.kernel.org/all/20250808-open_tree_attr-bugfix-idmap-v1-0-0ec7bc05646c@cyphar.com/>
- {fsconfig,open_tree_attr}(2): use _Nullable.
- {fsmount,open_tree}(2): mention that the unmount-on-close behaviour is
  actually lazy (a-la MNT_DETACH).
- {fsconfig,mount_setattr}(2): improve "mount attributes and filesystem
  parameters" wording to make it clearer that superblock and mount flags
  are sibling properties, not the same thing.
- open_tree(2): mention that any mount propagation events while the
  mount object is detached are completely lost -- i.e., they don't get
  replayed once you attach the mount somewhere.
- fsconfig(2): fix minor grammatical / missing joining word issues.
- fsconfig(2): fix final leftover `.IR A " and " B` cases.
- fsconfig(2): explain that failed fsconfig(FSCONFIG_CMD_*) operations
  render the filesystem context invalid.
- fsconfig(2): rework the description of superblock reuse, as the
  previous text was very wrong. (Though there has been discussion about
  changing this behaviour...)
- fsconfig(2): remove misleading wording in FSCONFIG_CMD_CREATE_EXCL
  about how we are requesting a new filesystem instance -- in theory
  filesystems could take this request into account but in practice none
  do (and it seems unlikely any ever will).
- fsconfig(2): mention that key, value, and aux must be 0 or NULL for
  FSCONFIG_CMD_RECONFIGURE.
- fsmount(2): fix usage of "filesystem instance" in relation to
  fsmount() and open_tree() comparison. [Askar Safin]
- move_mount(2): "as attached" -> "as a detached" [Askar Safin]
- fspick(2): add note about filesystem parameter list being copied
  rather than reset with FSCONFIG_CMD_RECONFIGURE. [Askar Safin]
- v3: <https://lore.kernel.org/r/20250809-new-mount-api-v3-0-f61405c80f34@cyphar.com>

Changes in v3:
- `sed -i s|Co-developed-by|Co-authored-by|g`. [Alejandro Colomar]
  - Add Signed-off-by for co-authors. [Christian Brauner]
- `sed -i s|needs-mount|awaiting-mount|g`, to match the kernel parlance.
- Fix VERSIONS/HISTORY mixup in mount_attr(2type) that was copied from
  open_how(2type). [Alejandro Colomar]
- Fix incorrect .BR usage in SYNOPSIS.
- Some more semantic newlines fixes. [Alejandro Colomar]
- Minor fixes suggested by Alejandro. [Alejandro Colomar]
- open_tree_attr(2): heavily reword everything to be better formatted
  and more explicit about its behaviour.
- open_tree(2): write proper explanatory paragraphs for the EXAMPLES.
- mount_setattr(2): fix stray doublequote in SYNOPSIS. [Askar Safin]
- fsopen(2): rework structure of the DESCRIPTION introduction.
- fsopen(2): explicitly say that read(2) errors in the message retrieval
  interface are actual errors, not return 0. [Askar Safin]
- fsopen(2): add BUGS section to describe the unfortunate -EMSGSIZE
  message dropping behaviour that should be fixed by
  <https://lore.kernel.org/r/20250807-fscontext-log-cleanups-v3-0-8d91d6242dc3@cyphar.com/>.
- fsconfig(2): add a NOTES subsection about generic filesystem
  parameters.
- fsconfig(2): add comment about the weirdness surrounding
  FSCONFIG_SET_PATH.
- {fspick,open_tree}(2): Correct AT_NO_AUTOMOUNT description (copied
  from David, who probably copied it from statx(2)) -- AT_NO_AUTOMOUNT
  applies to all path components, not just the final one. [Christian
  Brauner]
- statx(2): fix AT_NO_AUTOMOUNT documentation.
- open_tree(2): swap open(2) reference for openat(2) when saying that
  the result is identical. [Askar Safin]
- fsmount(2): fix DESCRIPTION introduction, and rework attr_flags
  description to better reference mount_setattr(2).
- {fsopen,fspick,fsmount,open_tree}(2): don't use "attach" when talking
  about the file descriptors we return that reference in-kernel objects,
  to avoid confusing readers with mount object attachment status.
- fsconfig(2): remove pidns argument example, as it was kind of unclear
  and referenced kernel features not yet merged.
- fsconfig(2): remove rambling FSCONFIG_SET_PATH_EMPTY text (which
  mostly describes an academic issue that doesn't apply to any existing
  filesystem), and instead add a CAVEATS section which touches on the
  weird type behaviour of fsconfig(2).
- v2: <https://lore.kernel.org/r/20250807-new-mount-api-v2-0-558a27b8068c@cyphar.com>

Changes in v2:
- `make -R lint-man`. [Alejandro Colomar]
- `sed -i s|Glibc|glibc|g`. [Alejandro Colomar]
- `sed -i s|pathname|path|g` [Alejandro Colomar]
- Clean up macro usage, example code, and synopsis. [Alejandro Colomar]
- Try to use semantic newlines. [Alejandro Colomar]
- Make sure the usage of "filesystem context", "filesystem instance",
  and "mount object" are consistent. [Askar Safin]
- Avoid referring to these syscalls without an "at" suffix as "*at()
  syscalls". [Askar Safin]
- Use \% to avoid hyphenation of constants. [Askar Safin, G. Branden Robinson]
- Add a new subsection to mount_setattr(2) to describe the distinction
  between mount attributes and filesystem parameters.
- (Under protest) double-space-after-period formatted commit messages.
- v1: <https://lore.kernel.org/r/20250806-new-mount-api-v1-0-8678f56c6ee0@cyphar.com>

---
Aleksa Sarai (8):
      man/man2/fsopen.2: document "new" mount API
      man/man2/fspick.2: document "new" mount API
      man/man2/fsconfig.2: document "new" mount API
      man/man2/fsmount.2: document "new" mount API
      man/man2/move_mount.2: document "new" mount API
      man/man2/open_tree.2: document "new" mount API
      man/man2/open_tree{,_attr}.2: document new open_tree_attr() API
      man/man2/{fsconfig,mount_setattr}.2: add note about attribute-parameter distinction

 man/man2/fsconfig.2       | 741 ++++++++++++++++++++++++++++++++++++++++++++++
 man/man2/fsmount.2        | 231 +++++++++++++++
 man/man2/fsopen.2         | 385 ++++++++++++++++++++++++
 man/man2/fspick.2         | 343 +++++++++++++++++++++
 man/man2/mount_setattr.2  |  39 +++
 man/man2/move_mount.2     | 646 ++++++++++++++++++++++++++++++++++++++++
 man/man2/open_tree.2      | 709 ++++++++++++++++++++++++++++++++++++++++++++
 man/man2/open_tree_attr.2 |   1 +
 8 files changed, 3095 insertions(+)
---
base-commit: f17990c243eafc1891ff692f90b6ce42e6449be8
change-id: 20250802-new-mount-api-436db984f432


Kind regards,
-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v4 00/10] man2: document "new" mount API
From: Aleksa Sarai @ 2025-09-24 11:11 UTC (permalink / raw)
  To: Askar Safin
  Cc: alx, brauner, dhowells, g.branden.robinson, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-man, mtk.manpages, safinaskar,
	viro
In-Reply-To: <2025-09-21-eldest-expert-wrists-cuddle-CQWTLx@cyphar.com>

On 2025-09-21, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2025-09-21, Askar Safin <safinaskar@gmail.com> wrote:
> > * open_tree(2) still says:
> > > If flags does not contain OPEN_TREE_CLONE, open_tree() returns a file descriptor
> > > that is exactly equivalent to one produced by openat(2) when called with the same dirfd and path.
> > 
> > This is not true if automounts are involved. I suggest adding "modulo automounts". But you may
> > keep everything, of course.
> 
> Hmmm. As we discussed last time, this sentence is more intended to
> indicate that the file descriptor is just a regular open file (with no
> dissolve_on_fput() + FMODE_NEED_UNMOUNT magic) rather than the exact
> behaviour you get with regards to path lookup.
> 
> I would honestly prefer to remove "when called with the same dirfd and
> path" rather than add caveats, but I think it makes the sentence less
> readable... I'll think about it and try to fix this wording up somehow
> for v5.

I've gone with the following:

   In either case, the resultant file descriptor
   acts the same as one produced by
   .BR open (2)
   with
   .BR O_PATH ,
   meaning it can also be used as a
   .I dirfd
   argument to
   "*at()" system calls.
  +However,
  +unlike
  +.BR open (2)
  +called with
  +.BR O_PATH ,
  +automounts will
  +by default
  +be triggered by
  +.BR open_tree ()
  +unless
  +.B \%AT_NO_AUTOMOUNT
  +is included in
  +.IR flags .

After looking at it a few times, I decided adding it to the preceding
paragraph (as you suggested) didn't really make sense since the O_PATH
equivalence is only mentioned in this following paragraph.

The automount behaviour also applies to OPEN_TREE_CLONE, so it's
best to not mislead a reader into thinking it only applies to one of the
cases.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

* Re: [PATCH v4 04/10] man/man2/fsconfig.2: document "new" mount API
From: Alejandro Colomar @ 2025-09-24  8:52 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-09-24-sterile-elderly-drone-sum-LHA7Fs@cyphar.com>

Hi Aleksa,

On Wed, Sep 24, 2025 at 04:41:16PM +1000, Aleksa Sarai wrote:
> On 2025-09-21, Alejandro Colomar <alx@kernel.org> wrote:
> > On Fri, Sep 19, 2025 at 11:59:45AM +1000, Aleksa Sarai wrote:
> > > +The list of valid
> > > +.I cmd
> > > +values are:
> > 
> > I think I would have this page split into one page per command.
> > 
> > I would keep an overview in this page, of the main system call, and the
> > descriptions of each subcommand would go into each separate page.
> > 
> > You could have a look at fcntl(2), which has been the most recent page
> > split, and let me know what you think.
> 
> To be honest, I think this makes the page less useful to most readers.
> 
> I get that you want to try to improve the "wall of text" problem but as
> a very regular reader of man-pages, I find indirections annoying every
> time I have to deal with them. Maybe there is an argument for
> fcntl(2) to undergo this treatment (as it has a menagerie of disparate
> commands) but this applies even less to fsconfig(2) in my view.
> 
> If you feel strongly that fsconfig(2) needs this treatment, it would
> probably be better for you to do it instead. In particular, I would've
> expected to only have two extra pages if we went that route (one for
> FSCONFIG_SET_* commands and one for FSCONFIG_CMD_* commands) so I'm not
> quite sure what you'd like the copy to look like for 10 man-pages...

Okay, let's keep it as a single page for now.


Cheers,
Alex

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

* Re: [PATCH v4 09/10] man/man2/open_tree{,_attr}.2: document new open_tree_attr() API
From: Alejandro Colomar @ 2025-09-24  8:51 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-09-24-unsafe-movable-perms-actress-zoAIgs@cyphar.com>

Hi Aleksa,

On Wed, Sep 24, 2025 at 04:31:15PM +1000, Aleksa Sarai wrote:
> On 2025-09-21, Alejandro Colomar <alx@kernel.org> wrote:
> > On Fri, Sep 19, 2025 at 11:59:50AM +1000, Aleksa Sarai wrote:
> > > diff --git a/man/man2/open_tree.2 b/man/man2/open_tree.2
> > > index 7f85df08b43c7b48a9d021dbbeb2c60092a2b2d4..60de4313a9d5be4ef3ff1217051f252506a2ade9 100644
> > > --- a/man/man2/open_tree.2
> > > +++ b/man/man2/open_tree.2
> > > @@ -15,7 +15,19 @@ .SH SYNOPSIS
> > >  .B #include <sys/mount.h>
> > >  .P
> > >  .BI "int open_tree(int " dirfd ", const char *" path ", unsigned int " flags );
> > > +.P
> > > +.BR "#include <sys/syscall.h>" "    /* Definition of " SYS_* " constants */"
> > > +.P
> > > +.BI "int syscall(SYS_open_tree_attr, int " dirfd ", const char *" path ,
> > > +.BI "            unsigned int " flags ", struct mount_attr *_Nullable " attr ", \
> > > +size_t " size );
> > 
> > Do we maybe want to move this to its own separate page?
> > 
> > The separate page could perfectly contain the same exact text you're
> > adding here; you don't need to repeat open_tree() descriptions.
> > 
> > In general, I feel that while this improves discoverability of related
> > functions, it produces more complex pages.
> 
> I tried it and I don't think it is a better experience as a reader when
> split into two pages because of the huge overlap between the two
> syscalls.

Okay.  Thanks!


Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

* Re: [PATCH v4 04/10] man/man2/fsconfig.2: document "new" mount API
From: Aleksa Sarai @ 2025-09-24  6:41 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <e4jtqbymqguq64zup5qr6rnppwjyveqdzvqdbnz3c7v55zplbs@6bpdfbv6sh7d>

On 2025-09-21, Alejandro Colomar <alx@kernel.org> wrote:
> On Fri, Sep 19, 2025 at 11:59:45AM +1000, Aleksa Sarai wrote:
> > +The list of valid
> > +.I cmd
> > +values are:
> 
> I think I would have this page split into one page per command.
> 
> I would keep an overview in this page, of the main system call, and the
> descriptions of each subcommand would go into each separate page.
> 
> You could have a look at fcntl(2), which has been the most recent page
> split, and let me know what you think.

To be honest, I think this makes the page less useful to most readers.

I get that you want to try to improve the "wall of text" problem but as
a very regular reader of man-pages, I find indirections annoying every
time I have to deal with them. Maybe there is an argument for
fcntl(2) to undergo this treatment (as it has a menagerie of disparate
commands) but this applies even less to fsconfig(2) in my view.

If you feel strongly that fsconfig(2) needs this treatment, it would
probably be better for you to do it instead. In particular, I would've
expected to only have two extra pages if we went that route (one for
FSCONFIG_SET_* commands and one for FSCONFIG_CMD_* commands) so I'm not
quite sure what you'd like the copy to look like for 10 man-pages...

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v4 09/10] man/man2/open_tree{,_attr}.2: document new open_tree_attr() API
From: Aleksa Sarai @ 2025-09-24  6:31 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <vc2xa2tuqqnkuoyg4hrgt6akt23ap6hxho5qs5hfcbc5nsaosv@idi6hwvyo7r5>

On 2025-09-21, Alejandro Colomar <alx@kernel.org> wrote:
> On Fri, Sep 19, 2025 at 11:59:50AM +1000, Aleksa Sarai wrote:
> > diff --git a/man/man2/open_tree.2 b/man/man2/open_tree.2
> > index 7f85df08b43c7b48a9d021dbbeb2c60092a2b2d4..60de4313a9d5be4ef3ff1217051f252506a2ade9 100644
> > --- a/man/man2/open_tree.2
> > +++ b/man/man2/open_tree.2
> > @@ -15,7 +15,19 @@ .SH SYNOPSIS
> >  .B #include <sys/mount.h>
> >  .P
> >  .BI "int open_tree(int " dirfd ", const char *" path ", unsigned int " flags );
> > +.P
> > +.BR "#include <sys/syscall.h>" "    /* Definition of " SYS_* " constants */"
> > +.P
> > +.BI "int syscall(SYS_open_tree_attr, int " dirfd ", const char *" path ,
> > +.BI "            unsigned int " flags ", struct mount_attr *_Nullable " attr ", \
> > +size_t " size );
> 
> Do we maybe want to move this to its own separate page?
> 
> The separate page could perfectly contain the same exact text you're
> adding here; you don't need to repeat open_tree() descriptions.
> 
> In general, I feel that while this improves discoverability of related
> functions, it produces more complex pages.

I tried it and I don't think it is a better experience as a reader when
split into two pages because of the huge overlap between the two
syscalls.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v4 07/10] man/man2/open_tree.2: document "new" mount API
From: Aleksa Sarai @ 2025-09-24  1:34 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <aqhcwkln4fls44e2o6pwnepex6yec6lg2jnngrtck3g5pc6q5d@7zibx3l2vrjw>

On 2025-09-22, Alejandro Colomar <alx@kernel.org> wrote:
> Hi Aleksa,
> 
> On Mon, Sep 22, 2025 at 08:09:47PM +1000, Aleksa Sarai wrote:
> > > > +is lazy\[em]akin to calling
> > > 
> > > I prefer em dashes in both sides of the parenthetical; it more clearly
> > > denotes where it ends.
> > > 
> > > 	is lazy
> > > 	\[em]akin to calling
> > > 	.BR umount2 (2)
> > > 	with
> > > 	.BR MNT_DETACH \[em];
> > 
> > An \[em] next to a ";"? Let me see if I can rewrite it to avoid this...
> 
> You could use parentheses, maybe.

I tried it a few different ways and I think it reads best with a single
em dash as a parenthetical -- since ";" indicates the end of a clause I
don't think you need to "close" the parenthetical with a corresponding
em dash.

Here is the parentheses version, but I plan to just keep the em dash
version in the patchset. If you really prefer the parentheses version,
feel free to replace it.

  This implicit unmount operation is lazy
  (akin to calling
  .BR umount2 (2)
  with
  .BR MNT_DETACH );
  thus,
  any existing open references to files
  from the mount object
  will continue to work,
  and the mount object will only be completely destroyed
  once it ceases to be busy.
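
For comparison, the explicit operation that the text compares against
could be sketched as below. This is only an illustration and not part of
the patch: lazy_unmount() is a hypothetical helper name, while umount2(2)
and MNT_DETACH are the existing interfaces.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>

/*
 * Hypothetical helper (not from the patch): lazily detach the mount at
 * `path`. With MNT_DETACH the mount becomes unavailable for new accesses
 * right away, but is only really unmounted once it is no longer busy.
 * Returns 0 on success, or -1 with errno set by umount2(2).
 */
int lazy_unmount(const char *path)
{
	int saved_errno;

	assert(path != NULL);

	if (umount2(path, MNT_DETACH) == 0)
		return 0;

	/* Preserve errno across the diagnostic print. */
	saved_errno = errno;
	fprintf(stderr, "umount2(%s, MNT_DETACH): %s\n",
		path, strerror(saved_errno));
	errno = saved_errno;
	return -1;
}
```

Files already opened from the mount should keep working after such a
call succeeds, which is the behaviour the paragraph above describes for
the implicit unmount.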

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v3 17/30] liveupdate: luo_files: luo_ioctl: Unregister all FDs on device close
From: Pratyush Yadav @ 2025-09-23 13:13 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <CA+CK2bD_-xwwUBnF4TBCBuX33uL6+V_1nN=0Q8_NXwhubTc8yA@mail.gmail.com>

On Mon, Sep 22 2025, Pasha Tatashin wrote:

> On Wed, Aug 27, 2025 at 11:34 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>>
>> Hi Pasha,
>>
>> On Thu, Aug 07 2025, Pasha Tatashin wrote:
>>
>> > Currently, a file descriptor registered for preservation via the remains
>> > globally registered with LUO until it is explicitly unregistered. This
>> > creates a potential for resource leaks into the next kernel if the
>> > userspace agent crashes or exits without proper cleanup before a live
>> > update is fully initiated.
>> >
>> > This patch ties the lifetime of FD preservation requests to the lifetime
>> > of the open file descriptor for /dev/liveupdate, creating an implicit
>> > "session".
>> >
>> > When the /dev/liveupdate file descriptor is closed (either explicitly
>> > via close() or implicitly on process exit/crash), the .release
>> > handler, luo_release(), is now called. This handler invokes the new
>> > function luo_unregister_all_files(), which iterates through all FDs
>> > that were preserved through that session and unregisters them.
>>
>> Why special case files here? Shouldn't you undo all the serialization
>> done for all the subsystems?
>
> Good point, subsystems should also be cancelled, and system should be
> brought back to normal state. However, with session support, we will
> be dropping only FDs that belong to a specific session when its FD is
> closed, or all FDs+subsystems when closing /dev/liveupdate.

Yeah, that makes sense.

[...]

-- 
Regards,
Pratyush Yadav


* Re: [PATCH RESEND 00/62] initrd: remove classic initrd support
From: Christophe Leroy @ 2025-09-23 12:04 UTC (permalink / raw)
  To: Askar Safin, linux-fsdevel, linux-kernel
  Cc: Linus Torvalds, Greg Kroah-Hartman, Christian Brauner, Al Viro,
	Jan Kara, Christoph Hellwig, Jens Axboe, Andy Shevchenko,
	Aleksa Sarai, Thomas Weißschuh, Julian Stecklina, Gao Xiang,
	Art Nikpal, Andrew Morton, Eric Curtin, Alexander Graf,
	Rob Landley, Lennart Poettering, linux-arch, linux-alpha,
	linux-snps-arc, linux-arm-kernel, linux-csky, linux-hexagon,
	loongarch, linux-m68k, linux-mips, linux-openrisc, linux-parisc,
	linuxppc-dev, linux-riscv, linux-s390, linux-sh, sparclinux,
	linux-um, x86, Ingo Molnar, linux-block, initramfs, linux-api,
	linux-doc, linux-efi, linux-ext4, Theodore Y . Ts'o,
	linux-acpi, Michal Simek, devicetree, Luis Chamberlain, Kees Cook,
	Thorsten Blum, Heiko Carstens, patches
In-Reply-To: <20250913003842.41944-1-safinaskar@gmail.com>



On 13/09/2025 at 02:37, Askar Safin wrote:
> Intro
> ====
> This patchset removes classic initrd (initial RAM disk) support,
> which was deprecated in 2020.
> Initramfs still stays, and RAM disk itself (brd) still stays, too.
> init/do_mounts* and init/*initramfs* are listed in VFS entry in
> MAINTAINERS, so I think this patchset should go through VFS tree.
> This patchset touches every subdirectory in arch/, so I tested it
> on 8 (!!!) archs in Qemu (see details below).
> Warning: this patchset renames CONFIG_BLK_DEV_INITRD (!!!) to CONFIG_INITRAMFS
> and CONFIG_RD_* to CONFIG_INITRAMFS_DECOMPRESS_* (for example,
> CONFIG_RD_GZIP to CONFIG_INITRAMFS_DECOMPRESS_GZIP).
> If you still use initrd, see below for workaround.

Apologies if my question looks stupid, but I'm using QEMU for various
tests, and the way QEMU is started is something like:

qemu-system-ppc -kernel ./vmlinux -cpu g4 -M mac99 -initrd ./qemu/rootfs.cpio.gz

I was therefore expecting (and fearing) it to fail with your series
applied, but surprisingly it still works.

So is it really initrd you are removing, or just some corner case?
If it is really initrd, then how does QEMU still work with that
-initrd parameter?

Thanks
Christophe

> 
> Details
> ====
> I not only removed initrd, I also removed a lot of code, which
> became dead, including a lot of code in arch/.
> 
> Still, I think the only two architectures I touched in a non-trivial
> way are sh and 32-bit arm.
> 
> Also I renamed some files, functions and variables (which became misnomers) to proper names,
> moved some code around, removed a lot of mentions of initrd
> in code and comments. Also I cleaned up some docs.
> 
> For example, I renamed the following global variables:
> 
> __initramfs_start
> __initramfs_size
> phys_initrd_start
> phys_initrd_size
> initrd_start
> initrd_end
> 
> to:
> 
> __builtin_initramfs_start
> __builtin_initramfs_size
> phys_external_initramfs_start
> phys_external_initramfs_size
> virt_external_initramfs_start
> virt_external_initramfs_end
> 
> The new names precisely capture the meaning of these variables.
> 
> Also I renamed CONFIG_BLK_DEV_INITRD (which became a total misnomer)
> to CONFIG_INITRAMFS, and CONFIG_RD_* to CONFIG_INITRAMFS_DECOMPRESS_*.
> This will break all configs out there (update your configs!).
> Still, I think this is okay,
> because config names were never part of the stable API.
> I don't have a strong opinion here, though, so I can drop these renamings
> if needed.
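
To illustrate the renames described above, here is a minimal sketch of
the symbol mapping. It is not part of the patchset:
rename_config_symbol() is a hypothetical helper, and the list of
compression suffixes is taken from the patch titles in this cover letter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: map an old config symbol to its new name.
 *   CONFIG_BLK_DEV_INITRD -> CONFIG_INITRAMFS
 *   CONFIG_RD_<ALGO>      -> CONFIG_INITRAMFS_DECOMPRESS_<ALGO>
 * Any other symbol is copied unchanged. Returns 0 on success, or -1 if
 * the result does not fit in `out`.
 */
int rename_config_symbol(const char *old, char *out, size_t outsz)
{
	static const char rd_prefix[] = "CONFIG_RD_";
	static const char *const algos[] = {
		"GZIP", "BZIP2", "LZMA", "XZ", "LZO", "LZ4", "ZSTD",
	};
	const size_t prefix_len = sizeof(rd_prefix) - 1;
	int n = -1;
	size_t i;

	assert(old != NULL && out != NULL);

	if (strcmp(old, "CONFIG_BLK_DEV_INITRD") == 0) {
		n = snprintf(out, outsz, "CONFIG_INITRAMFS");
	} else if (strncmp(old, rd_prefix, prefix_len) == 0) {
		/* Only the decompressor symbols named in the series are renamed. */
		for (i = 0; i < sizeof(algos) / sizeof(algos[0]); i++) {
			if (strcmp(old + prefix_len, algos[i]) == 0) {
				n = snprintf(out, outsz,
					     "CONFIG_INITRAMFS_DECOMPRESS_%s",
					     algos[i]);
				break;
			}
		}
	}

	/* Not a renamed symbol: pass it through untouched. */
	if (n < 0)
		n = snprintf(out, outsz, "%s", old);

	return (n < 0 || (size_t)n >= outsz) ? -1 : 0;
}
```

A tool that rewrites an existing .config could apply such a mapping to
the symbol name on each line and leave the value as it is.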
> 
> Other user-visible changes:
> 
> - Removed kernel command line parameters "load_ramdisk" and
>   "prompt_ramdisk", which did nothing and were deprecated
> - Removed kernel command line parameter "ramdisk_start",
>   which was used for initrd only (not for initramfs)
> - Removed kernel command line parameter "noinitrd",
>   which was inconsistent: it controlled initrd only
>   (not initramfs), except for EFI boot, where it
>   controlled both initramfs and initrd. EFI users
>   can still disable initramfs simply by not passing it
> - Removed kernel command line parameter "ramdisk_size",
>   which was used for controlling ramdisk (brd), but only
>   in non-modular mode. Use brd.rd_size instead; it
>   always works
> - Removed /proc/sys/kernel/real-root-dev, which was used
>   for initrd only
> 
> This patchset is based on v6.17-rc5.
> 
> Testing
> ====
> I tested my patchset on many architectures in Qemu using my Rust
> program, heavily based on mkroot [1].
> 
> I used the following cross-compilers:
> 
> aarch64-linux-musleabi
> armv4l-linux-musleabihf
> armv5l-linux-musleabihf
> armv7l-linux-musleabihf
> i486-linux-musl
> i686-linux-musl
> mips-linux-musl
> mips64-linux-musl
> mipsel-linux-musl
> powerpc-linux-musl
> powerpc64-linux-musl
> powerpc64le-linux-musl
> riscv32-linux-musl
> riscv64-linux-musl
> s390x-linux-musl
> sh4-linux-musl
> sh4eb-linux-musl
> x86_64-linux-musl
> 
> taken from this directory [2].
> 
> So, as you can see, there are 18 triplets, which correspond to 8 subdirs in arch/.
> 
> And note that this list contains the two archs (arm and sh) touched in a non-trivial way.
> 
> For every triplet I tested that:
> - Initramfs still works (both builtin and external)
> - Direct boot from disk still works
> 
> Workaround
> ====
> If "retain_initrd" is passed to the kernel, then the initramfs/initrd
> passed by the bootloader is retained and becomes available after boot
> as the read-only magic file /sys/firmware/initrd [3].
> 
> No copies are involved, i.e. /sys/firmware/initrd is simply
> a reference to the original blob passed by the bootloader.
> 
> This works even if the initrd/initramfs is not recognized by the kernel
> in any way, i.e. even if it is neither a valid cpio archive nor
> a fs image supported by classic initrd.
> 
> This works both with my patchset and without it.
> 
> This means that you can emulate classic initrd as follows:
> link a builtin initramfs into the kernel and, in /init in this initramfs,
> copy /sys/firmware/initrd to some file in / and loop-mount it.
> 
> This is even better than classic initrd, because:
> - You can use fs not supported by classic initrd, for example erofs
> - One copy is involved (from /sys/firmware/initrd to some file in /)
> as opposed to two when using classic initrd
> 
> Still, I don't recommend using this workaround, because
> I want everyone to migrate to proper modern initramfs.
> But you can still use this workaround if you want.
> 
> Also: it is not possible to directly loop-mount
> /sys/firmware/initrd. Theoretically the kernel could be changed
> to allow this (and/or to make it writable), but I think nobody needs this,
> and I don't want to implement it.
> 
> P.S. When I sent this patchset the first time, Zoho Mail banned me for
> sending too much email, so I am resending it using Gmail. The only change is
> the email address; there are no other changes.
> 
> [1] https://github.com/landley/toybox/tree/master/mkroot
> [2] https://landley.net/toybox/downloads/binaries/toolchains/latest
> [3] https://lore.kernel.org/all/20231207235654.16622-1-graf@amazon.com/
> 
> Askar Safin (62):
>    init: remove deprecated "load_ramdisk" command line parameter, which
>      does nothing
>    init: remove deprecated "prompt_ramdisk" command line parameter, which
>      does nothing
>    init: sh, sparc, x86: remove unused constants RAMDISK_PROMPT_FLAG and
>      RAMDISK_LOAD_FLAG
>    init: x86, arm, sh, sparc: remove variable rd_image_start, which
>      controls starting block number of initrd
>    init: remove "ramdisk_start" command line parameter, which controls
>      starting block number of initrd
>    arm: init: remove special logic for setting brd.rd_size
>    arm: init: remove ATAG_RAMDISK
>    arm: init: remove FLAG_RDLOAD and FLAG_RDPROMPT
>    arm: init: document rd_start (in param_struct) as obsolete
>    initrd: remove initrd (initial RAM disk) support
>    init, efi: remove "noinitrd" command line parameter
>    init: remove /proc/sys/kernel/real-root-dev
>    ext2: remove ext2_image_size and associated code
>    init: m68k, mips, powerpc, s390, sh: remove Root_RAM0
>    doc: modernize Documentation/admin-guide/blockdev/ramdisk.rst
>    brd: remove "ramdisk_size" command line parameter
>    doc: modernize Documentation/filesystems/ramfs-rootfs-initramfs.rst
>    doc: modernize
>      Documentation/driver-api/early-userspace/early_userspace_support.rst
>    init: remove mentions of "ramdisk=" command line parameter
>    doc: remove Documentation/power/swsusp-dmcrypt.rst
>    init: remove all mentions of root=/dev/ram*
>    doc: remove obsolete mentions of pivot_root
>    init: rename __initramfs_{start,size} to
>      __builtin_initramfs_{start,size}
>    init: remove wrong comment
>    init: rename phys_initrd_{start,size} to
>      phys_external_initramfs_{start,size}
>    init: move phys_external_initramfs_{start,size} to init/initramfs.c
>    init: alpha: remove "extern unsigned long initrd_start, initrd_end"
>    init: alpha, arc, arm, arm64, csky, m68k, microblaze, mips, nios2,
>      openrisc, parisc, powerpc, s390, sh, sparc, um, x86, xtensa: rename
>      initrd_{start,end} to virt_external_initramfs_{start,end}
>    init: move virt_external_initramfs_{start,end} to init/initramfs.c
>    doc: remove documentation for block device 4 0
>    init: rename initrd_below_start_ok to initramfs_below_start_ok
>    init: move initramfs_below_start_ok to init/initramfs.c
>    init: remove init/do_mounts_initrd.c
>    init: inline create_dev into the only caller
>    init: make mount_root_generic static
>    init: make mount_root static
>    init: remove root_mountflags from init/do_mounts.h
>    init: remove most headers from init/do_mounts.h
>    init: make console_on_rootfs static
>    init: rename free_initrd_mem to free_initramfs_mem
>    init: rename reserve_initrd_mem to reserve_initramfs_mem
>    init: rename <linux/initrd.h> to <linux/initramfs.h>
>    setsid: inline ksys_setsid into the only caller
>    doc: kernel-parameters: remove [RAM] from reserve_mem=
>    doc: kernel-parameters: replace [RAM] with [INITRAMFS]
>    init: edit docs for initramfs-related configs
>    init: fix typo: virtul => virtual
>    init: fix comment
>    init: rename ramdisk_execute_command to initramfs_execute_command
>    init: rename ramdisk_command_access to initramfs_command_access
>    init: rename get_boot_config_from_initrd to
>      get_boot_config_from_initramfs
>    init: rename do_retain_initrd to retain_initramfs
>    init: rename kexec_free_initrd to kexec_free_initramfs
>    init: arm, x86: deal with some references to initrd
>    init: rename CONFIG_BLK_DEV_INITRD to CONFIG_INITRAMFS
>    init: rename CONFIG_RD_GZIP to CONFIG_INITRAMFS_DECOMPRESS_GZIP
>    init: rename CONFIG_RD_BZIP2 to CONFIG_INITRAMFS_DECOMPRESS_BZIP2
>    init: rename CONFIG_RD_LZMA to CONFIG_INITRAMFS_DECOMPRESS_LZMA
>    init: rename CONFIG_RD_XZ to CONFIG_INITRAMFS_DECOMPRESS_XZ
>    init: rename CONFIG_RD_LZO to CONFIG_INITRAMFS_DECOMPRESS_LZO
>    init: rename CONFIG_RD_LZ4 to CONFIG_INITRAMFS_DECOMPRESS_LZ4
>    init: rename CONFIG_RD_ZSTD to CONFIG_INITRAMFS_DECOMPRESS_ZSTD
> 
>   .../admin-guide/blockdev/ramdisk.rst          | 104 +----
>   .../admin-guide/device-mapper/dm-init.rst     |   4 +-
>   Documentation/admin-guide/devices.txt         |  12 -
>   Documentation/admin-guide/index.rst           |   1 -
>   Documentation/admin-guide/initrd.rst          | 383 ------------------
>   .../admin-guide/kernel-parameters.rst         |   4 +-
>   .../admin-guide/kernel-parameters.txt         |  38 +-
>   Documentation/admin-guide/nfs/nfsroot.rst     |   4 +-
>   Documentation/admin-guide/sysctl/kernel.rst   |   6 -
>   Documentation/arch/arm/ixp4xx.rst             |   4 +-
>   Documentation/arch/arm/setup.rst              |   6 +-
>   Documentation/arch/m68k/kernel-options.rst    |  29 +-
>   Documentation/arch/x86/boot.rst               |   4 +-
>   .../early_userspace_support.rst               |  18 +-
>   .../filesystems/ramfs-rootfs-initramfs.rst    |  20 +-
>   Documentation/power/index.rst                 |   1 -
>   Documentation/power/swsusp-dmcrypt.rst        | 140 -------
>   Documentation/security/ipe.rst                |   2 +-
>   .../translations/zh_CN/power/index.rst        |   1 -
>   arch/alpha/kernel/core_irongate.c             |  12 +-
>   arch/alpha/kernel/proto.h                     |   2 +-
>   arch/alpha/kernel/setup.c                     |  32 +-
>   arch/arc/configs/axs101_defconfig             |   2 +-
>   arch/arc/configs/axs103_defconfig             |   2 +-
>   arch/arc/configs/axs103_smp_defconfig         |   2 +-
>   arch/arc/configs/haps_hs_defconfig            |   2 +-
>   arch/arc/configs/haps_hs_smp_defconfig        |   2 +-
>   arch/arc/configs/hsdk_defconfig               |   2 +-
>   arch/arc/configs/nsim_700_defconfig           |   2 +-
>   arch/arc/configs/nsimosci_defconfig           |   2 +-
>   arch/arc/configs/nsimosci_hs_defconfig        |   2 +-
>   arch/arc/configs/nsimosci_hs_smp_defconfig    |   2 +-
>   arch/arc/configs/tb10x_defconfig              |   4 +-
>   arch/arc/configs/vdk_hs38_defconfig           |   2 +-
>   arch/arc/configs/vdk_hs38_smp_defconfig       |   2 +-
>   arch/arc/mm/init.c                            |  14 +-
>   arch/arm/Kconfig                              |   2 +-
>   arch/arm/boot/dts/arm/integratorap.dts        |   2 +-
>   arch/arm/boot/dts/arm/integratorcp.dts        |   2 +-
>   .../dts/aspeed/aspeed-bmc-facebook-cmm.dts    |   2 +-
>   .../aspeed/aspeed-bmc-facebook-galaxy100.dts  |   2 +-
>   .../aspeed/aspeed-bmc-facebook-minipack.dts   |   2 +-
>   .../aspeed/aspeed-bmc-facebook-wedge100.dts   |   2 +-
>   .../aspeed/aspeed-bmc-facebook-wedge40.dts    |   2 +-
>   .../dts/aspeed/aspeed-bmc-facebook-yamp.dts   |   2 +-
>   .../ast2600-facebook-netbmc-common.dtsi       |   2 +-
>   arch/arm/boot/dts/hisilicon/hi3620-hi4511.dts |   2 +-
>   .../ixp/intel-ixp42x-welltech-epbx100.dts     |   2 +-
>   arch/arm/boot/dts/nspire/nspire-classic.dtsi  |   2 +-
>   arch/arm/boot/dts/nspire/nspire-cx.dts        |   2 +-
>   .../boot/dts/samsung/exynos4210-origen.dts    |   2 +-
>   .../boot/dts/samsung/exynos4210-smdkv310.dts  |   2 +-
>   .../boot/dts/samsung/exynos4412-smdk4412.dts  |   2 +-
>   .../boot/dts/samsung/exynos5250-smdk5250.dts  |   2 +-
>   arch/arm/boot/dts/st/ste-nomadik-nhk15.dts    |   2 +-
>   arch/arm/boot/dts/st/ste-nomadik-s8815.dts    |   2 +-
>   arch/arm/boot/dts/st/stm32429i-eval.dts       |   2 +-
>   arch/arm/boot/dts/st/stm32746g-eval.dts       |   2 +-
>   arch/arm/boot/dts/st/stm32f429-disco.dts      |   2 +-
>   arch/arm/boot/dts/st/stm32f469-disco.dts      |   2 +-
>   arch/arm/boot/dts/st/stm32f746-disco.dts      |   2 +-
>   arch/arm/boot/dts/st/stm32f769-disco.dts      |   2 +-
>   arch/arm/boot/dts/st/stm32h743i-disco.dts     |   2 +-
>   arch/arm/boot/dts/st/stm32h743i-eval.dts      |   2 +-
>   arch/arm/boot/dts/st/stm32h747i-disco.dts     |   2 +-
>   arch/arm/boot/dts/st/stm32h750i-art-pi.dts    |   2 +-
>   arch/arm/configs/aspeed_g4_defconfig          |   8 +-
>   arch/arm/configs/aspeed_g5_defconfig          |   8 +-
>   arch/arm/configs/assabet_defconfig            |   4 +-
>   arch/arm/configs/at91_dt_defconfig            |   4 +-
>   arch/arm/configs/axm55xx_defconfig            |   2 +-
>   arch/arm/configs/bcm2835_defconfig            |   2 +-
>   arch/arm/configs/clps711x_defconfig           |   4 +-
>   arch/arm/configs/collie_defconfig             |   4 +-
>   arch/arm/configs/davinci_all_defconfig        |   2 +-
>   arch/arm/configs/exynos_defconfig             |   4 +-
>   arch/arm/configs/footbridge_defconfig         |   2 +-
>   arch/arm/configs/gemini_defconfig             |   2 +-
>   arch/arm/configs/h3600_defconfig              |   2 +-
>   arch/arm/configs/hisi_defconfig               |   4 +-
>   arch/arm/configs/imx_v4_v5_defconfig          |   2 +-
>   arch/arm/configs/imx_v6_v7_defconfig          |   4 +-
>   arch/arm/configs/integrator_defconfig         |   2 +-
>   arch/arm/configs/ixp4xx_defconfig             |   2 +-
>   arch/arm/configs/keystone_defconfig           |   2 +-
>   arch/arm/configs/lpc18xx_defconfig            |  12 +-
>   arch/arm/configs/lpc32xx_defconfig            |   4 +-
>   arch/arm/configs/milbeaut_m10v_defconfig      |   2 +-
>   arch/arm/configs/multi_v4t_defconfig          |   2 +-
>   arch/arm/configs/multi_v5_defconfig           |   2 +-
>   arch/arm/configs/multi_v7_defconfig           |   2 +-
>   arch/arm/configs/mvebu_v7_defconfig           |   2 +-
>   arch/arm/configs/mxs_defconfig                |   2 +-
>   arch/arm/configs/neponset_defconfig           |   4 +-
>   arch/arm/configs/nhk8815_defconfig            |   2 +-
>   arch/arm/configs/omap1_defconfig              |   2 +-
>   arch/arm/configs/omap2plus_defconfig          |   2 +-
>   arch/arm/configs/pxa910_defconfig             |   2 +-
>   arch/arm/configs/pxa_defconfig                |   4 +-
>   arch/arm/configs/qcom_defconfig               |   2 +-
>   arch/arm/configs/rpc_defconfig                |   2 +-
>   arch/arm/configs/s3c6400_defconfig            |   4 +-
>   arch/arm/configs/s5pv210_defconfig            |   4 +-
>   arch/arm/configs/sama5_defconfig              |   4 +-
>   arch/arm/configs/sama7_defconfig              |   2 +-
>   arch/arm/configs/shmobile_defconfig           |   2 +-
>   arch/arm/configs/socfpga_defconfig            |   2 +-
>   arch/arm/configs/sp7021_defconfig             |  12 +-
>   arch/arm/configs/spear13xx_defconfig          |   2 +-
>   arch/arm/configs/spear3xx_defconfig           |   2 +-
>   arch/arm/configs/spear6xx_defconfig           |   2 +-
>   arch/arm/configs/spitz_defconfig              |   2 +-
>   arch/arm/configs/stm32_defconfig              |   2 +-
>   arch/arm/configs/sunxi_defconfig              |   2 +-
>   arch/arm/configs/tegra_defconfig              |   2 +-
>   arch/arm/configs/u8500_defconfig              |   4 +-
>   arch/arm/configs/versatile_defconfig          |   2 +-
>   arch/arm/configs/vexpress_defconfig           |   2 +-
>   arch/arm/configs/vf610m4_defconfig            |  10 +-
>   arch/arm/configs/vt8500_v6_v7_defconfig       |   2 +-
>   arch/arm/configs/wpcm450_defconfig            |   2 +-
>   arch/arm/include/uapi/asm/setup.h             |  10 -
>   arch/arm/kernel/atags_compat.c                |  10 -
>   arch/arm/kernel/atags_parse.c                 |  16 +-
>   arch/arm/kernel/setup.c                       |   2 +-
>   arch/arm/mm/init.c                            |  24 +-
>   arch/arm64/configs/defconfig                  |   2 +-
>   arch/arm64/kernel/setup.c                     |   2 +-
>   arch/arm64/mm/init.c                          |  17 +-
>   arch/csky/kernel/setup.c                      |  24 +-
>   arch/csky/mm/init.c                           |   2 +-
>   arch/hexagon/configs/comet_defconfig          |   2 +-
>   arch/loongarch/configs/loongson3_defconfig    |   2 +-
>   arch/loongarch/kernel/mem.c                   |   2 +-
>   arch/loongarch/kernel/setup.c                 |   4 +-
>   arch/m68k/configs/amiga_defconfig             |   2 +-
>   arch/m68k/configs/apollo_defconfig            |   2 +-
>   arch/m68k/configs/atari_defconfig             |   2 +-
>   arch/m68k/configs/bvme6000_defconfig          |   2 +-
>   arch/m68k/configs/hp300_defconfig             |   2 +-
>   arch/m68k/configs/mac_defconfig               |   2 +-
>   arch/m68k/configs/multi_defconfig             |   2 +-
>   arch/m68k/configs/mvme147_defconfig           |   2 +-
>   arch/m68k/configs/mvme16x_defconfig           |   2 +-
>   arch/m68k/configs/q40_defconfig               |   2 +-
>   arch/m68k/configs/stmark2_defconfig           |   2 +-
>   arch/m68k/configs/sun3_defconfig              |   2 +-
>   arch/m68k/configs/sun3x_defconfig             |   2 +-
>   arch/m68k/kernel/setup_mm.c                   |  12 +-
>   arch/m68k/kernel/setup_no.c                   |  12 +-
>   arch/m68k/kernel/uboot.c                      |  17 +-
>   arch/microblaze/kernel/cpu/mb.c               |   2 +-
>   arch/microblaze/kernel/setup.c                |   2 +-
>   arch/microblaze/mm/init.c                     |  12 +-
>   arch/mips/ath79/prom.c                        |  12 +-
>   arch/mips/configs/ath25_defconfig             |  12 +-
>   arch/mips/configs/ath79_defconfig             |   4 +-
>   arch/mips/configs/bcm47xx_defconfig           |   2 +-
>   arch/mips/configs/bigsur_defconfig            |   2 +-
>   arch/mips/configs/bmips_be_defconfig          |   2 +-
>   arch/mips/configs/bmips_stb_defconfig         |  14 +-
>   arch/mips/configs/cavium_octeon_defconfig     |   2 +-
>   arch/mips/configs/eyeq5_defconfig             |   2 +-
>   arch/mips/configs/eyeq6_defconfig             |   2 +-
>   arch/mips/configs/generic_defconfig           |   2 +-
>   arch/mips/configs/gpr_defconfig               |   2 +-
>   arch/mips/configs/lemote2f_defconfig          |   2 +-
>   arch/mips/configs/loongson2k_defconfig        |   2 +-
>   arch/mips/configs/loongson3_defconfig         |   2 +-
>   arch/mips/configs/malta_defconfig             |   2 +-
>   arch/mips/configs/mtx1_defconfig              |   2 +-
>   arch/mips/configs/rb532_defconfig             |   2 +-
>   arch/mips/configs/rbtx49xx_defconfig          |   2 +-
>   arch/mips/configs/rt305x_defconfig            |   4 +-
>   arch/mips/configs/sb1250_swarm_defconfig      |   2 +-
>   arch/mips/configs/xway_defconfig              |   4 +-
>   arch/mips/kernel/setup.c                      |  53 ++-
>   arch/mips/mm/init.c                           |   2 +-
>   arch/mips/sibyte/common/cfe.c                 |  36 +-
>   arch/mips/sibyte/swarm/setup.c                |   2 +-
>   arch/nios2/kernel/setup.c                     |  20 +-
>   arch/openrisc/configs/or1klitex_defconfig     |   2 +-
>   arch/openrisc/configs/or1ksim_defconfig       |   4 +-
>   arch/openrisc/configs/simple_smp_defconfig    |  14 +-
>   arch/openrisc/configs/virt_defconfig          |   2 +-
>   arch/openrisc/kernel/setup.c                  |  24 +-
>   arch/openrisc/kernel/vmlinux.h                |   2 +-
>   arch/parisc/boot/compressed/misc.c            |   2 +-
>   arch/parisc/configs/generic-32bit_defconfig   |   2 +-
>   arch/parisc/configs/generic-64bit_defconfig   |   2 +-
>   arch/parisc/defpalo.conf                      |   2 +-
>   arch/parisc/kernel/pdt.c                      |   6 +-
>   arch/parisc/kernel/setup.c                    |   8 +-
>   arch/parisc/mm/init.c                         |  32 +-
>   arch/powerpc/configs/44x/akebono_defconfig    |   2 +-
>   arch/powerpc/configs/44x/arches_defconfig     |   2 +-
>   arch/powerpc/configs/44x/bamboo_defconfig     |   2 +-
>   arch/powerpc/configs/44x/bluestone_defconfig  |   2 +-
>   .../powerpc/configs/44x/canyonlands_defconfig |   2 +-
>   arch/powerpc/configs/44x/ebony_defconfig      |   2 +-
>   arch/powerpc/configs/44x/eiger_defconfig      |   2 +-
>   arch/powerpc/configs/44x/fsp2_defconfig       |  10 +-
>   arch/powerpc/configs/44x/icon_defconfig       |   2 +-
>   arch/powerpc/configs/44x/iss476-smp_defconfig |   2 +-
>   arch/powerpc/configs/44x/katmai_defconfig     |   2 +-
>   arch/powerpc/configs/44x/rainier_defconfig    |   2 +-
>   arch/powerpc/configs/44x/redwood_defconfig    |   2 +-
>   arch/powerpc/configs/44x/sam440ep_defconfig   |   2 +-
>   arch/powerpc/configs/44x/sequoia_defconfig    |   2 +-
>   arch/powerpc/configs/44x/taishan_defconfig    |   2 +-
>   arch/powerpc/configs/44x/warp_defconfig       |   2 +-
>   arch/powerpc/configs/52xx/cm5200_defconfig    |   2 +-
>   arch/powerpc/configs/52xx/lite5200b_defconfig |   2 +-
>   arch/powerpc/configs/52xx/motionpro_defconfig |   2 +-
>   arch/powerpc/configs/52xx/tqm5200_defconfig   |   2 +-
>   arch/powerpc/configs/83xx/asp8347_defconfig   |   2 +-
>   .../configs/83xx/mpc8313_rdb_defconfig        |   2 +-
>   .../configs/83xx/mpc8315_rdb_defconfig        |   2 +-
>   .../configs/83xx/mpc832x_rdb_defconfig        |   2 +-
>   .../configs/83xx/mpc834x_itx_defconfig        |   2 +-
>   .../configs/83xx/mpc834x_itxgp_defconfig      |   2 +-
>   .../configs/83xx/mpc836x_rdk_defconfig        |   2 +-
>   .../configs/83xx/mpc837x_rdb_defconfig        |   2 +-
>   arch/powerpc/configs/85xx/ge_imp3a_defconfig  |   2 +-
>   arch/powerpc/configs/85xx/ksi8560_defconfig   |   2 +-
>   arch/powerpc/configs/85xx/socrates_defconfig  |   2 +-
>   arch/powerpc/configs/85xx/stx_gp3_defconfig   |   2 +-
>   arch/powerpc/configs/85xx/tqm8540_defconfig   |   2 +-
>   arch/powerpc/configs/85xx/tqm8541_defconfig   |   2 +-
>   arch/powerpc/configs/85xx/tqm8548_defconfig   |   2 +-
>   arch/powerpc/configs/85xx/tqm8555_defconfig   |   2 +-
>   arch/powerpc/configs/85xx/tqm8560_defconfig   |   2 +-
>   .../configs/85xx/xes_mpc85xx_defconfig        |   2 +-
>   arch/powerpc/configs/amigaone_defconfig       |   2 +-
>   arch/powerpc/configs/cell_defconfig           |   2 +-
>   arch/powerpc/configs/chrp32_defconfig         |   2 +-
>   arch/powerpc/configs/fsl-emb-nonhw.config     |   2 +-
>   arch/powerpc/configs/g5_defconfig             |   2 +-
>   arch/powerpc/configs/gamecube_defconfig       |   2 +-
>   arch/powerpc/configs/holly_defconfig          |   2 +-
>   arch/powerpc/configs/linkstation_defconfig    |   2 +-
>   arch/powerpc/configs/mgcoge_defconfig         |   4 +-
>   arch/powerpc/configs/microwatt_defconfig      |   2 +-
>   arch/powerpc/configs/mpc512x_defconfig        |   2 +-
>   arch/powerpc/configs/mpc5200_defconfig        |   2 +-
>   arch/powerpc/configs/mpc83xx_defconfig        |   2 +-
>   arch/powerpc/configs/pasemi_defconfig         |   2 +-
>   arch/powerpc/configs/pmac32_defconfig         |   2 +-
>   arch/powerpc/configs/powernv_defconfig        |   2 +-
>   arch/powerpc/configs/ppc44x_defconfig         |   2 +-
>   arch/powerpc/configs/ppc64_defconfig          |   2 +-
>   arch/powerpc/configs/ppc64e_defconfig         |   2 +-
>   arch/powerpc/configs/ppc6xx_defconfig         |   2 +-
>   arch/powerpc/configs/ps3_defconfig            |   2 +-
>   arch/powerpc/configs/skiroot_defconfig        |  12 +-
>   arch/powerpc/configs/wii_defconfig            |   2 +-
>   arch/powerpc/kernel/prom.c                    |  22 +-
>   arch/powerpc/kernel/prom_init.c               |   6 +-
>   arch/powerpc/kernel/setup-common.c            |  25 +-
>   arch/powerpc/kernel/setup_32.c                |   2 +-
>   arch/powerpc/kernel/setup_64.c                |   2 +-
>   arch/powerpc/mm/init_32.c                     |   2 +-
>   arch/powerpc/platforms/52xx/lite5200.c        |   2 +-
>   arch/powerpc/platforms/83xx/km83xx.c          |   2 +-
>   arch/powerpc/platforms/85xx/mpc85xx_mds.c     |   2 +-
>   arch/powerpc/platforms/chrp/setup.c           |   2 +-
>   .../platforms/embedded6xx/linkstation.c       |   2 +-
>   .../platforms/embedded6xx/storcenter.c        |   2 +-
>   arch/powerpc/platforms/powermac/setup.c       |   8 +-
>   arch/riscv/configs/defconfig                  |   2 +-
>   arch/riscv/configs/nommu_k210_defconfig       |  16 +-
>   arch/riscv/configs/nommu_virt_defconfig       |  12 +-
>   arch/riscv/mm/init.c                          |   4 +-
>   arch/s390/boot/ipl_parm.c                     |   2 +-
>   arch/s390/boot/startup.c                      |   4 +-
>   arch/s390/configs/zfcpdump_defconfig          |   2 +-
>   arch/s390/kernel/setup.c                      |  10 +-
>   arch/s390/mm/init.c                           |   2 +-
>   arch/sh/configs/apsh4a3a_defconfig            |   2 +-
>   arch/sh/configs/apsh4ad0a_defconfig           |   2 +-
>   arch/sh/configs/ecovec24-romimage_defconfig   |   2 +-
>   arch/sh/configs/edosk7760_defconfig           |   2 +-
>   arch/sh/configs/kfr2r09-romimage_defconfig    |   2 +-
>   arch/sh/configs/kfr2r09_defconfig             |   2 +-
>   arch/sh/configs/magicpanelr2_defconfig        |   2 +-
>   arch/sh/configs/migor_defconfig               |   2 +-
>   arch/sh/configs/rsk7201_defconfig             |   2 +-
>   arch/sh/configs/rsk7203_defconfig             |   2 +-
>   arch/sh/configs/sdk7786_defconfig             |   8 +-
>   arch/sh/configs/se7206_defconfig              |   2 +-
>   arch/sh/configs/se7705_defconfig              |   2 +-
>   arch/sh/configs/se7722_defconfig              |   2 +-
>   arch/sh/configs/se7751_defconfig              |   2 +-
>   arch/sh/configs/secureedge5410_defconfig      |   2 +-
>   arch/sh/configs/sh03_defconfig                |   2 +-
>   arch/sh/configs/sh7757lcr_defconfig           |   2 +-
>   arch/sh/configs/titan_defconfig               |   2 +-
>   arch/sh/configs/ul2_defconfig                 |   2 +-
>   arch/sh/configs/urquell_defconfig             |   2 +-
>   arch/sh/include/asm/setup.h                   |   1 -
>   arch/sh/kernel/head_32.S                      |   2 +-
>   arch/sh/kernel/setup.c                        |  27 +-
>   arch/sparc/boot/piggyback.c                   |   4 +-
>   arch/sparc/configs/sparc32_defconfig          |   2 +-
>   arch/sparc/configs/sparc64_defconfig          |   2 +-
>   arch/sparc/kernel/head_32.S                   |   4 +-
>   arch/sparc/kernel/head_64.S                   |   6 +-
>   arch/sparc/kernel/setup_32.c                  |   9 +-
>   arch/sparc/kernel/setup_64.c                  |   9 +-
>   arch/sparc/mm/init_32.c                       |  22 +-
>   arch/sparc/mm/init_64.c                       |  20 +-
>   arch/um/kernel/Makefile                       |   2 +-
>   arch/um/kernel/initrd.c                       |   6 +-
>   arch/x86/Kconfig                              |   2 +-
>   arch/x86/boot/header.S                        |   2 +-
>   arch/x86/boot/startup/sme.c                   |   2 +-
>   arch/x86/configs/i386_defconfig               |   2 +-
>   arch/x86/configs/x86_64_defconfig             |   2 +-
>   arch/x86/include/uapi/asm/bootparam.h         |   7 +-
>   arch/x86/kernel/cpu/microcode/amd.c           |   2 +-
>   arch/x86/kernel/cpu/microcode/core.c          |  12 +-
>   arch/x86/kernel/cpu/microcode/intel.c         |   2 +-
>   arch/x86/kernel/cpu/microcode/internal.h      |   2 +-
>   arch/x86/kernel/devicetree.c                  |   2 +-
>   arch/x86/kernel/setup.c                       |  39 +-
>   arch/x86/mm/init.c                            |   8 +-
>   arch/x86/mm/init_32.c                         |   2 +-
>   arch/x86/mm/init_64.c                         |   2 +-
>   arch/x86/tools/relocs.c                       |   2 +-
>   arch/xtensa/Kconfig                           |   2 +-
>   arch/xtensa/boot/dts/csp.dts                  |   2 +-
>   arch/xtensa/configs/audio_kc705_defconfig     |   2 +-
>   arch/xtensa/configs/cadence_csp_defconfig     |  12 +-
>   arch/xtensa/configs/generic_kc705_defconfig   |   2 +-
>   arch/xtensa/configs/nommu_kc705_defconfig     |  12 +-
>   arch/xtensa/configs/smp_lx200_defconfig       |   2 +-
>   arch/xtensa/configs/virt_defconfig            |   2 +-
>   arch/xtensa/configs/xip_kc705_defconfig       |   2 +-
>   arch/xtensa/kernel/setup.c                    |  26 +-
>   drivers/acpi/Kconfig                          |   2 +-
>   drivers/acpi/tables.c                         |  10 +-
>   drivers/base/firmware_loader/main.c           |   2 +-
>   drivers/block/Kconfig                         |   8 +-
>   drivers/block/brd.c                           |  20 +-
>   drivers/firmware/efi/efi.c                    |  10 +-
>   .../firmware/efi/libstub/efi-stub-helper.c    |   5 +-
>   drivers/gpu/drm/ci/arm.config                 |   2 +-
>   drivers/gpu/drm/ci/arm64.config               |   2 +-
>   drivers/gpu/drm/ci/x86_64.config              |   2 +-
>   drivers/of/fdt.c                              |  18 +-
>   fs/ext2/ext2.h                                |   9 -
>   fs/init.c                                     |  14 -
>   include/asm-generic/vmlinux.lds.h             |   8 +-
>   include/linux/ext2_fs.h                       |  13 -
>   include/linux/init_syscalls.h                 |   1 -
>   include/linux/initramfs.h                     |  26 ++
>   include/linux/initrd.h                        |  37 --
>   include/linux/root_dev.h                      |   1 -
>   include/linux/syscalls.h                      |   1 -
>   include/uapi/linux/sysctl.h                   |   1 -
>   init/.kunitconfig                             |   2 +-
>   init/Kconfig                                  |  28 +-
>   init/Makefile                                 |   6 +-
>   init/do_mounts.c                              |  28 +-
>   init/do_mounts.h                              |  42 --
>   init/do_mounts_initrd.c                       | 154 -------
>   init/do_mounts_rd.c                           | 334 ---------------
>   init/initramfs.c                              | 152 ++++---
>   init/main.c                                   |  66 +--
>   kernel/sys.c                                  |   7 +-
>   kernel/sysctl.c                               |   2 +-
>   kernel/umh.c                                  |   2 +-
>   scripts/package/builddeb                      |   2 +-
>   .../ktest/examples/bootconfigs/tracing.bconf  |   3 -
>   tools/testing/selftests/bpf/config.aarch64    |   2 +-
>   tools/testing/selftests/bpf/config.ppc64el    |   2 +-
>   tools/testing/selftests/bpf/config.riscv64    |   2 +-
>   tools/testing/selftests/bpf/config.s390x      |   2 +-
>   tools/testing/selftests/kho/vmtest.sh         |   2 +-
>   .../testing/selftests/nolibc/Makefile.nolibc  |   4 +-
>   tools/testing/selftests/vsock/config          |   2 +-
>   .../selftests/wireguard/qemu/kernel.config    |   2 +-
>   usr/Kconfig                                   |  70 ++--
>   usr/Makefile                                  |   2 +-
>   usr/initramfs_data.S                          |   4 +-
>   385 files changed, 969 insertions(+), 2346 deletions(-)
>   delete mode 100644 Documentation/admin-guide/initrd.rst
>   delete mode 100644 Documentation/power/swsusp-dmcrypt.rst
>   create mode 100644 include/linux/initramfs.h
>   delete mode 100644 include/linux/initrd.h
>   delete mode 100644 init/do_mounts_initrd.c
>   delete mode 100644 init/do_mounts_rd.c
> 
> 
> base-commit: 76eeb9b8de9880ca38696b2fb56ac45ac0a25c6c
> --
> 2.47.2
> 
> 


^ permalink raw reply

* Re: [PATCH v3 18/30] liveupdate: luo_files: luo_ioctl: Add ioctls for per-file state management
From: Pasha Tatashin @ 2025-09-22 23:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814140252.GF802098@nvidia.com>

On Thu, Aug 14, 2025 at 10:02 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:24AM +0000, Pasha Tatashin wrote:
> > +struct liveupdate_ioctl_get_fd_state {
> > +     __u32           size;
> > +     __u8            incoming;
> > +     __aligned_u64   token;
> > +     __u32           state;
> > +};
>
> Same remark about explicit padding and checking padding for 0

Done

> > + * luo_file_get_state - Get the preservation state of a specific file.
> > + * @token: The token of the file to query.
> > + * @statep: Output pointer to store the file's current live update state.
> > + * @incoming: If true, query the state of a restored file from the incoming
> > + *            (previous kernel's) set. If false, query a file being prepared
> > + *            for preservation in the current set.
> > + *
> > + * Finds the file associated with the given @token in either the incoming
> > + * or outgoing tracking arrays and returns its current LUO state
> > + * (NORMAL, PREPARED, FROZEN, UPDATED).
> > + *
> > + * Return: 0 on success, -ENOENT if the token is not found.
> > + */
> > +int luo_file_get_state(u64 token, enum liveupdate_state *statep, bool incoming)
> > +{
> > +     struct luo_file *luo_file;
> > +     struct xarray *target_xa;
> > +     int ret = 0;
> > +
> > +     luo_state_read_enter();
>
> Less globals, at this point everything should be within memory
> attached to the file descriptor and not in globals. Doing this will
> promote good maintainable structure and not a spaghetti
>
> Also I think a BKL design is not a good idea for new code. We've had
> so many bad experiences with this pattern promoting uncontrolled
> incomprehensible locking.
>
> The xarray already has a lock, why not have reasonable locking inside
> the luo_file? Probably just a refcount?
>
> > +     target_xa = incoming ? &luo_files_xa_in : &luo_files_xa_out;
> > +     luo_file = xa_load(target_xa, token);
> > +
> > +     if (!luo_file) {
> > +             ret = -ENOENT;
> > +             goto out_unlock;
> > +     }
> > +
> > +     scoped_guard(mutex, &luo_file->mutex)
> > +             *statep = luo_file->state;
> > +
> > +out_unlock:
> > +     luo_state_read_exit();
>
> If we are using cleanup.h then use it for this too..
>
> But it seems kind of weird, why not just
>
> xa_lock()
> xa_load()
> *statep = READ_ONCE(luo_file->state);
> xa_unlock()
>
> ?

Yes, we can simplify with xa_lock(), thank you for your suggestion.

>
> > +static int luo_ioctl_set_fd_event(struct luo_ucmd *ucmd)
> > +{
> > +     struct liveupdate_ioctl_set_fd_event *argp = ucmd->cmd;
> > +     int ret;
> > +
> > +     switch (argp->event) {
> > +     case LIVEUPDATE_PREPARE:
> > +             ret = luo_file_prepare(argp->token);
> > +             break;
> > +     case LIVEUPDATE_FREEZE:
> > +             ret = luo_file_freeze(argp->token);
> > +             break;
> > +     case LIVEUPDATE_FINISH:
> > +             ret = luo_file_finish(argp->token);
> > +             break;
> > +     case LIVEUPDATE_CANCEL:
> > +             ret = luo_file_cancel(argp->token);
> > +             break;
>
> The token should be converted to a file here instead of duplicated in
> each function

struct luo_file is private to luo_file.c, and I think it makes sense
to keep it that way; the amount of duplicated code is trivial.

> >  static int luo_open(struct inode *inodep, struct file *filep)
> >  {
> >       if (atomic_cmpxchg(&luo_device_in_use, 0, 1))
> > @@ -149,6 +191,8 @@ union ucmd_buffer {
> >       struct liveupdate_ioctl_fd_restore      restore;
> >       struct liveupdate_ioctl_get_state       state;
> >       struct liveupdate_ioctl_set_event       event;
> > +     struct liveupdate_ioctl_get_fd_state    fd_state;
> > +     struct liveupdate_ioctl_set_fd_event    fd_event;
> >  };
> >
> >  struct luo_ioctl_op {
> > @@ -179,6 +223,10 @@ static const struct luo_ioctl_op luo_ioctl_ops[] = {
> >                struct liveupdate_ioctl_get_state, state),
> >       IOCTL_OP(LIVEUPDATE_IOCTL_SET_EVENT, luo_ioctl_set_event,
> >                struct liveupdate_ioctl_set_event, event),
> > +     IOCTL_OP(LIVEUPDATE_IOCTL_GET_FD_STATE, luo_ioctl_get_fd_state,
> > +              struct liveupdate_ioctl_get_fd_state, token),
> > +     IOCTL_OP(LIVEUPDATE_IOCTL_SET_FD_EVENT, luo_ioctl_set_fd_event,
> > +              struct liveupdate_ioctl_set_fd_event, token),
> >  };
>
> Keep sorted

Done

^ permalink raw reply

* Re: [PATCH v3 17/30] liveupdate: luo_files: luo_ioctl: Unregister all FDs on device close
From: Pasha Tatashin @ 2025-09-22 21:23 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu
In-Reply-To: <mafs07byoye0q.fsf@kernel.org>

On Wed, Aug 27, 2025 at 11:34 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> Hi Pasha,
>
> On Thu, Aug 07 2025, Pasha Tatashin wrote:
>
> > Currently, a file descriptor registered for preservation via the remains
> > globally registered with LUO until it is explicitly unregistered. This
> > creates a potential for resource leaks into the next kernel if the
> > userspace agent crashes or exits without proper cleanup before a live
> > update is fully initiated.
> >
> > This patch ties the lifetime of FD preservation requests to the lifetime
> > of the open file descriptor for /dev/liveupdate, creating an implicit
> > "session".
> >
> > When the /dev/liveupdate file descriptor is closed (either explicitly
> > via close() or implicitly on process exit/crash), the .release
> > handler, luo_release(), is now called. This handler invokes the new
> > function luo_unregister_all_files(), which iterates through all FDs
> > that were preserved through that session and unregisters them.
>
> Why special case files here? Shouldn't you undo all the serialization
> done for all the subsystems?

Good point, subsystems should also be cancelled, and the system should
be brought back to the normal state. However, with session support, we
will drop only the FDs that belong to a specific session when its FD is
closed, or all FDs and subsystems when /dev/liveupdate is closed.

> Anyway, this is buggy. I found this when testing the memfd patches. If
> you preserve a memfd and close the /dev/liveupdate FD before reboot,
> luo_unregister_all_files() calls the cancel callback, which calls
> kho_unpreserve_folio(). But kho_unpreserve_folio() fails because KHO is
> still in finalized state. This doesn't happen when cancelling explicitly
> because luo_cancel() calls kho_abort().

Yes, KHO still has its own states, which break the LUO logic. I think
there are going to be some limitations until the "stateless" KHO
patches land.

> I think you should just make the release go through the cancel flow,
> since the operation is essentially a cancel anyway. There are subtle
> differences here though, since the release might be called before
> prepare, so we need to be careful of that.

Makes sense.

Thank you,
Pasha

^ permalink raw reply

* Re: [PATCH v3 16/30] liveupdate: luo_ioctl: add userpsace interface
From: Pasha Tatashin @ 2025-09-22 21:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814134917.GE802098@nvidia.com>

> > + *  - EINVAL: Everything about the IOCTL was understood, but a field is not
> > + *    correct.
> > + *  - ENOENT: An ID or IOVA provided does not exist.
>                     ^^^^^^^^^
>
> Maybe this should be 'token' ?

Yes, replaced with token. :-)

> > +struct liveupdate_ioctl_fd_unpreserve {
> > +       __u32           size;
> > +       __aligned_u64   token;
> > +};
>
> It is best to explicitly pad, so add a __u32 reserved between size and
> token
>
> Then you need to also check that the reserved is 0 when parsing it,
> return -EOPNOTSUPP otherwise.

Done.
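
For illustration, the explicit-padding remark above could look roughly
like the sketch below. It is a userspace mirror of the struct, not the
code from the series: the struct name, the helper name, and the use of
uint32_t/uint64_t in place of __u32/__aligned_u64 are assumptions made
so that the example builds outside the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the uAPI struct, padding made explicit. */
struct fd_unpreserve_args {
	uint32_t size;
	uint32_t reserved;	/* must be zero; fills the hole before token */
	uint64_t token __attribute__((aligned(8)));
};

/* With the reserved field there is no implicit hole on any ABI. */
static_assert(offsetof(struct fd_unpreserve_args, token) == 8,
	      "token must directly follow the reserved field");
static_assert(sizeof(struct fd_unpreserve_args) == 16,
	      "unexpected struct size");

/* Reject non-zero reserved bits so they can be given a meaning later. */
static int check_fd_unpreserve_args(const struct fd_unpreserve_args *args)
{
	assert(args);
	if (args->reserved != 0)
		return -EOPNOTSUPP;
	return 0;
}
```

Rejecting a non-zero reserved field today is what allows a later
kernel to assign it a meaning without breaking existing callers.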

>
> > +static atomic_t luo_device_in_use = ATOMIC_INIT(0);
>
> I suggest you bundle this together into one struct with the misc_dev
> and the other globals and largely pretend it is not global, eg refer
> to it through container_of, etc
>
> Following practices like this make it harder to abuse the globals.

Done, good suggestion.

> > +struct luo_ucmd {
> > +     void __user *ubuffer;
> > +     u32 user_size;
> > +     void *cmd;
> > +};
> > +
> > +static int luo_ioctl_fd_preserve(struct luo_ucmd *ucmd)
> > +{
> > +     struct liveupdate_ioctl_fd_preserve *argp = ucmd->cmd;
> > +     int ret;
> > +
> > +     ret = luo_register_file(argp->token, argp->fd);
> > +     if (!ret)
> > +             return ret;
> > +
> > +     if (copy_to_user(ucmd->ubuffer, argp, ucmd->user_size))
> > +             return -EFAULT;
>
> This will overflow memory, ucmd->user_size may be > sizeof(*argp)
>
> The respond function is an important part of this scheme:
>
> static inline int iommufd_ucmd_respond(struct iommufd_ucmd *ucmd,
>                                        size_t cmd_len)
> {
>         if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
>                          min_t(size_t, ucmd->user_size, cmd_len)))
>                 return -EFAULT;
>
> The min (sizeof(*argp) in this case) can't be skipped!

Done, thank you for catching this.
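
As a rough userspace sketch of the bounded copy-back described above
(not the code from the series): the struct and function names below
are hypothetical, and memcpy() stands in for copy_to_user() so that
the example can run outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct luo_ucmd; ubuffer is plain memory here. */
struct ucmd_sketch {
	void *ubuffer;		/* caller's buffer */
	size_t user_size;	/* size the caller claimed for that buffer */
	const void *cmd;	/* kernel-side copy of the command */
};

/*
 * Copy the response back, bounded by both the caller's size and the
 * size of the kernel-side struct. Returns the number of bytes copied.
 */
static size_t ucmd_respond_sketch(const struct ucmd_sketch *ucmd,
				  size_t cmd_len)
{
	size_t n = ucmd->user_size < cmd_len ? ucmd->user_size : cmd_len;

	assert(ucmd->ubuffer && ucmd->cmd);
	memcpy(ucmd->ubuffer, ucmd->cmd, n);	/* copy_to_user() in the kernel */
	return n;
}
```

Without the min, a caller passing a user_size larger than the kernel
struct would make the copy read past the end of the kernel-side
buffer.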

> > +static int luo_ioctl_fd_restore(struct luo_ucmd *ucmd)
> > +{
> > +     struct liveupdate_ioctl_fd_restore *argp = ucmd->cmd;
> > +     struct file *file;
> > +     int ret;
> > +
> > +     argp->fd = get_unused_fd_flags(O_CLOEXEC);
> > +     if (argp->fd < 0) {
> > +             pr_err("Failed to allocate new fd: %d\n", argp->fd);
>
> No need

Removed

> > +             return argp->fd;
> > +     }
> > +
> > +     ret = luo_retrieve_file(argp->token, &file);
> > +     if (ret < 0) {
> > +             put_unused_fd(argp->fd);
> > +
> > +             return ret;
> > +     }
> > +
> > +     fd_install(argp->fd, file);
> > +
> > +     if (copy_to_user(ucmd->ubuffer, argp, ucmd->user_size))
> > +             return -EFAULT;
>
> Wrong order, fd_install must be last right before return 0. Failing
> system calls should not leave behind installed FDs.

Fixed.

>
> > +static int luo_ioctl_set_event(struct luo_ucmd *ucmd)
> > +{
> > +     struct liveupdate_ioctl_set_event *argp = ucmd->cmd;
> > +     int ret;
> > +
> > +     switch (argp->event) {
> > +     case LIVEUPDATE_PREPARE:
> > +             ret = luo_prepare();
> > +             break;
> > +     case LIVEUPDATE_FINISH:
> > +             ret = luo_finish();
> > +             break;
> > +     case LIVEUPDATE_CANCEL:
> > +             ret = luo_cancel();
> > +             break;
> > +     default:
> > +             ret = -EINVAL;
>
> EOPNOTSUPP

Ack.

>
> > +union ucmd_buffer {
> > +     struct liveupdate_ioctl_fd_preserve     preserve;
> > +     struct liveupdate_ioctl_fd_unpreserve   unpreserve;
> > +     struct liveupdate_ioctl_fd_restore      restore;
> > +     struct liveupdate_ioctl_get_state       state;
> > +     struct liveupdate_ioctl_set_event       event;
> > +};
>
> I discourage the column alignment. Also sort by name.

Done

>
> > +static const struct luo_ioctl_op luo_ioctl_ops[] = {
> > +     IOCTL_OP(LIVEUPDATE_IOCTL_FD_PRESERVE, luo_ioctl_fd_preserve,
> > +              struct liveupdate_ioctl_fd_preserve, token),
> > +     IOCTL_OP(LIVEUPDATE_IOCTL_FD_UNPRESERVE, luo_ioctl_fd_unpreserve,
> > +              struct liveupdate_ioctl_fd_unpreserve, token),
> > +     IOCTL_OP(LIVEUPDATE_IOCTL_FD_RESTORE, luo_ioctl_fd_restore,
> > +              struct liveupdate_ioctl_fd_restore, token),
> > +     IOCTL_OP(LIVEUPDATE_IOCTL_GET_STATE, luo_ioctl_get_state,
> > +              struct liveupdate_ioctl_get_state, state),
> > +     IOCTL_OP(LIVEUPDATE_IOCTL_SET_EVENT, luo_ioctl_set_event,
> > +              struct liveupdate_ioctl_set_event, event),
>
> Sort by name

Done

^ permalink raw reply

* Re: [PATCH v3 10/30] liveupdate: luo_core: luo_ioctl: Live Update Orchestrator
From: Pasha Tatashin @ 2025-09-22 15:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814133151.GD802098@nvidia.com>

On Thu, Aug 14, 2025 at 9:32 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:16AM +0000, Pasha Tatashin wrote:
> > --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> > +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> > @@ -383,6 +383,8 @@ Code  Seq#    Include File                                             Comments
> >  0xB8  01-02  uapi/misc/mrvl_cn10k_dpi.h                                Marvell CN10K DPI driver
> >  0xB8  all    uapi/linux/mshv.h                                         Microsoft Hyper-V /dev/mshv driver
> >                                                                         <mailto:linux-hyperv@vger.kernel.org>
> > +0xBA  all    uapi/linux/liveupdate.h                                   Pasha Tatashin
> > +                                                                       <mailto:pasha.tatashin@soleen.com>
>
> Let's not be greedy ;) Just take 00-0F for the moment

Done.

Pasha

>
> Jason

^ permalink raw reply

* Re: [PATCH v3 08/30] kho: don't unpreserve memory during abort
From: Pasha Tatashin @ 2025-09-22 14:57 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250814133009.GC802098@nvidia.com>

On Thu, Aug 14, 2025 at 9:30 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:14AM +0000, Pasha Tatashin wrote:
> >  static int __kho_abort(void)
> >  {
> > -     int err = 0;
> > -     unsigned long order;
> > -     struct kho_mem_phys *physxa;
> > -
> > -     xa_for_each(&kho_out.track.orders, order, physxa) {
> > -             struct kho_mem_phys_bits *bits;
> > -             unsigned long phys;
> > -
> > -             xa_for_each(&physxa->phys_bits, phys, bits)
> > -                     kfree(bits);
> > -
> > -             xa_destroy(&physxa->phys_bits);
> > -             kfree(physxa);
> > -     }
> > -     xa_destroy(&kho_out.track.orders);
>
> Now nothing ever cleans this up :\

It is solved with stateless KHO. The current implementation is broken,
dropping everything in abort should never happen for stuff that was
independently preserved.

> Are you sure the issue isn't in the caller that it shouldn't be
> calling kho abort until all the other stuff is cleaned up first?
>
> I feel like this is another case where abusing globals gives an
> unclear lifecycle model.

Yes. But, we have a fix for that.

Pasha

^ permalink raw reply

* Re: [PATCH v3 09/30] liveupdate: kho: move to kernel/liveupdate
From: Pasha Tatashin @ 2025-09-22 14:54 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu
In-Reply-To: <aLK3trXYYYIUaV4Q@kernel.org>

On Sat, Aug 30, 2025 at 4:35 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:15AM +0000, Pasha Tatashin wrote:
> > Move KHO to kernel/liveupdate/ in preparation of placing all Live Update
> > core kernel related files to the same place.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
> >
> > ---
> > diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile
> > new file mode 100644
> > index 000000000000..72cf7a8e6739
> > --- /dev/null
> > +++ b/kernel/liveupdate/Makefile
> > @@ -0,0 +1,7 @@
> > +# SPDX-License-Identifier: GPL-2.0
> > +#
> > +# Makefile for the linux kernel.
>
> Nit: this line does not provide much, let's drop it

Done.

Thank you,
Pasha

^ permalink raw reply

* Re: [PATCH RESEND 00/62] initrd: remove classic initrd support
From: Nicolas Schichan @ 2025-09-22 14:28 UTC (permalink / raw)
  To: Askar Safin
  Cc: akpm, andy.shevchenko, axboe, brauner, cyphar, devicetree,
	ecurtin, email2tema, graf, gregkh, hca, hch, hsiangkao, initramfs,
	jack, julian.stecklina, kees, linux-acpi, linux-alpha, linux-api,
	linux-arch, linux-arm-kernel, linux-block, linux-csky, linux-doc,
	linux-efi, linux-ext4, linux-fsdevel, linux-hexagon, linux-kernel,
	linux-m68k, linux-mips, linux-openrisc, linux-parisc, linux-riscv,
	linux-s390, linux-sh, linux-snps-arc, linux-um, linuxppc-dev,
	loongarch, mcgrof, mingo, monstr, mzxreary, patches, rob,
	sparclinux, thomas.weissschuh, thorsten.blum, torvalds, tytso,
	viro, x86
In-Reply-To: <CAPnZJGDwETQVVURezSRxZB8ZAwBETQ5fwbXyeMpfDLuLW4rVdg@mail.gmail.com>

[resending to the lists and Cc, sorry I initially replied only to Askar]

On Sat, Sep 20, 2025 at 5:55 AM Askar Safin <safinaskar@gmail.com> wrote:
> On Fri, Sep 19, 2025 at 6:25 PM Nicolas Schichan <nschichan@freebox.fr> wrote:
> > Considering that the deprecation message didn't get displayed in some
> > configurations, maybe it's a bit early at the very least.
>
> I changed my opinion.
> Breaking users, who did not see a deprecation message at all,
> is unfair.
> I will send a patchset soon, which will remove initrd codepath,
> which currently contains deprecation notice. And I will put
> deprecation notice to
> other codepath.

Thanks

> Then in September 2026 I will fully remove initrd.

Is there a way to find some kind of middle ground here ?

I'm led to believe that the main issue with the current code is that
it needs to parse the superblocks of the ramdisk image in order to get
the amount of data to copy into /dev/ram0.

It looks like this is partly because of the ramdisk_start= kernel
command-line parameter, which appears to be a remnant of the time when
it was possible to boot from a floppy disk on x86.

This parameter makes it possible to look for a rootfs image at an
offset into the initrd data.

If we now assume that the rootfs image data starts at the beginning of
the initrd image and is the only part of the initrd image, this would
indeed remove a lot of complexity.

Maybe it would be possible to remove the identify_ramdisk_image()
function and just copy the actual size of /initrd.image into
/dev/ram0. This would allow any file system to be used in an initrd
image (not just romfs, cramfs, minixfs, ext2fs and squashfs), and it
would greatly simplify the code in init/do_mounts_rd.c, with just the
functions rd_load_image() and nr_blocks() remaining in this file.
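
As a rough illustration of the "copy the actual size" idea, here is a
userspace sketch in C. It is not the kernel code and not the proposed
patch: copy_whole_image() is a hypothetical helper that takes the
length from fstat(2) instead of parsing a superblock, so the image
format does not matter.

```c
#include <errno.h>
#include <stdint.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical userspace helper (not kernel code): copy every byte of
 * the file behind in_fd to out_fd, using only the size reported by
 * fstat(2). Returns the number of bytes copied, or -1 on error.
 */
int64_t copy_whole_image(int in_fd, int out_fd)
{
    struct stat st;
    char buf[4096];
    int64_t left, copied = 0;

    if (fstat(in_fd, &st) == -1)
        return -1;

    for (left = st.st_size; left > 0; ) {
        size_t want = left < (int64_t)sizeof(buf) ? (size_t)left
                                                  : sizeof(buf);
        ssize_t n = read(in_fd, buf, want);

        if (n == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;    /* The file shrank underneath us; stop early. */

        /* write(2) may be partial, so loop until the chunk is out. */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));

            if (w == -1) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
        copied += n;
        left -= n;
    }
    return copied;
}
```

In the kernel the source would be /initrd.image and the destination
/dev/ram0; here any two open file descriptors will do.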

I can send a patch for that but first I need to sort out my SMTP
issues from the other day.

Regards,

-- 
Nicolas Schichan

^ permalink raw reply

* Re: [PATCH v4 07/10] man/man2/open_tree.2: document "new" mount API
From: Alejandro Colomar @ 2025-09-22 13:22 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-09-22-sneaky-similar-mind-cilantro-u1EJJ2@cyphar.com>

Hi Aleksa,

On Mon, Sep 22, 2025 at 08:09:47PM +1000, Aleksa Sarai wrote:
> > > +is lazy\[em]akin to calling
> > 
> > I prefer em dashes in both sides of the parenthetical; it more clearly
> > denotes where it ends.
> > 
> > 	is lazy
> > 	\[em]akin to calling
> > 	.BR umount2 (2)
> > 	with
> > 	.BR MOUNT_DETACH \[em];
> 
> An \[em] next to a ";"? Let me see if I can rewrite it to avoid this...

You could use parentheses, maybe.

> > > +.IR "mount --bind" )
> > 
> > You need to escape dashes in manual pages.  Otherwise, they're formatted
> > as hyphens, which can't be pasted into the terminal (and another
> > consequence is not being able to search for them in the man(1) reader
> > with literal dashes).
> > 
> > Depending on your system, you might be able to search for them or paste
> > them to the terminal, because some distros patch this in
> > /etc/local/an.tmac, at the expense of generating lower quality pages,
> > but in general don't rely on that.
> > 
> > I've noticed now, but this probably also happens in previous pages in
> > this patch set.
> > 
> > While at it, you should also use a non-breaking space, to keep the
> > entire command in the same line.
> > 
> > 	.IR \%mount\~\-\-bind )
> 
> My bad, I think my terminal font doesn't distinguish between them well
> enough for it to be obvious. I'll go through and fix up all of these
> cases.

I should probably add an automated diagnostic.  At least the case of two
'--' together, which I've never seen useful unescaped, should be
diagnosed.  I'll add a make(1) 'lint-man-dash' target that catches this
with a regex.
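
A minimal sketch of what such a check could look like, written in C
rather than as a regex. The function name has_unescaped_double_dash()
is hypothetical, and the rule it applies is an assumption: correctly
escaped source spells the sequence "\-\-", so two adjacent hyphens on
a non-comment line are treated as a finding.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical lint helper: report whether one line of man-page source
 * contains two adjacent hyphens ("--"). In correctly escaped source the
 * sequence is written "\-\-", which never puts two '-' side by side.
 * roff comment lines (starting with .\") are skipped.
 */
bool has_unescaped_double_dash(const char *line)
{
    if (strncmp(line, ".\\\"", 3) == 0)
        return false;
    return strstr(line, "--") != NULL;
}
```

A lint target would feed every line of every page through this and
fail on the first hit.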


Have a lovely day!
Alex

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

^ permalink raw reply

* Re: [PATCH v4 07/10] man/man2/open_tree.2: document "new" mount API
From: Aleksa Sarai @ 2025-09-22 10:09 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: Michael T. Kerrisk, Alexander Viro, Jan Kara, Askar Safin,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <gyhtwwu7kgkaz5l5h46ll3voypfk74cahpfpmagbngj3va3x7c@pm3pssyst2al>

On 2025-09-21, Alejandro Colomar <alx@kernel.org> wrote:
> Hi Aleksa,
> 
> On Fri, Sep 19, 2025 at 11:59:48AM +1000, Aleksa Sarai wrote:
> > This is loosely based on the original documentation written by David
> > Howells and later maintained by Christian Brauner, but has been
> > rewritten to be more from a user perspective (as well as fixing a few
> > critical mistakes).
> > 
> > Co-authored-by: David Howells <dhowells@redhat.com>
> > Signed-off-by: David Howells <dhowells@redhat.com>
> > Co-authored-by: Christian Brauner <brauner@kernel.org>
> > Signed-off-by: Christian Brauner <brauner@kernel.org>
> > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > ---
> >  man/man2/open_tree.2 | 498 +++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 498 insertions(+)
> > 
> > diff --git a/man/man2/open_tree.2 b/man/man2/open_tree.2
> > new file mode 100644
> > index 0000000000000000000000000000000000000000..7f85df08b43c7b48a9d021dbbeb2c60092a2b2d4
> > --- /dev/null
> > +++ b/man/man2/open_tree.2
> > @@ -0,0 +1,498 @@
> > +.\" Copyright, the authors of the Linux man-pages project
> > +.\"
> > +.\" SPDX-License-Identifier: Linux-man-pages-copyleft
> > +.\"
> > +.TH open_tree 2 (date) "Linux man-pages (unreleased)"
> > +.SH NAME
> > +open_tree \- open path or create detached mount object and attach to fd
> > +.SH LIBRARY
> > +Standard C library
> > +.RI ( libc ,\~ \-lc )
> > +.SH SYNOPSIS
> > +.nf
> > +.BR "#define _GNU_SOURCE         " "/* See feature_test_macros(7) */"
> > +.BR "#include <fcntl.h>" "          /* Definition of " AT_* " constants */"
> > +.B #include <sys/mount.h>
> > +.P
> > +.BI "int open_tree(int " dirfd ", const char *" path ", unsigned int " flags );
> > +.fi
> > +.SH DESCRIPTION
> > +The
> > +.BR open_tree ()
> > +system call is part of
> > +the suite of file descriptor based mount facilities in Linux.
> > +.IP \[bu] 3
> > +If
> > +.I flags
> > +contains
> > +.BR \%OPEN_TREE_CLONE ,
> > +.BR open_tree ()
> > +creates a detached mount object
> > +which consists of a bind-mount of
> > +the path specified by the
> > +.IR path .
> > +A new file descriptor
> > +associated with the detached mount object
> > +is then returned.
> > +The mount object is equivalent to a bind-mount
> > +that would be created by
> > +.BR mount (2)
> > +called with
> > +.BR MS_BIND ,
> > +except that it is tied to a file descriptor
> > +and is not mounted onto the filesystem.
> > +.IP
> > +As with file descriptors returned from
> > +.BR fsmount (2),
> > +the resultant file descriptor can then be used with
> > +.BR move_mount (2),
> > +.BR mount_setattr (2),
> > +or other such system calls to do further mount operations.
> > +This mount object will be unmounted and destroyed
> > +when the file descriptor is closed
> > +if it was not otherwise attached to a mount point
> > +by calling
> > +.BR move_mount (2).
> > +(Note that the unmount operation on
> 
> Maybe I would make this note a paragraph of its own; this would give it
> more visibility, I think.  And I'd remove 'Note that', and start
> directly with the noted contents (everything in a manual page must be
> noteworthy, in general).
> 
> > +.BR close (2)
> 
> I'm a bit confused by the reference to close(2).  The previous text
> mentions closing, but not close(2), so I'm not sure if this refers to
> that or if it is comparing it to close(2).  Would you mind having a look
> at the wording of this entire paragraph?

Well, it's more that these kinds of file descriptors are marked with
FMODE_NEEDS_UMOUNT which will cause dissolve_on_fput() to be called when
they have no more references.

So this could be through close(2) or any other condition that causes a
file descriptor to be closed (dup2(2), process death, execve with
O_CLOEXEC, etc). Maybe it's better to not mention close(2) explicitly...
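
To make the reference-counting point concrete, here is a hedged sketch
in C. The helper name detached_mount_survives_dup() is made up for
illustration. It calls the raw system call via syscall(2) so that it
does not depend on a glibc wrapper, and it assumes kernel headers that
define SYS_open_tree. The fallback value for OPEN_TREE_CLONE and the
use of O_CLOEXEC in place of OPEN_TREE_CLOEXEC follow the kernel UAPI
header.

```c
#define _GNU_SOURCE
#include <fcntl.h>        /* AT_FDCWD, AT_EMPTY_PATH, O_CLOEXEC */
#include <sys/stat.h>
#include <sys/syscall.h>  /* SYS_open_tree (kernel headers >= 5.2) */
#include <unistd.h>

#ifndef OPEN_TREE_CLONE
#define OPEN_TREE_CLONE 1 /* Value from the kernel UAPI <linux/mount.h>. */
#endif

/*
 * Illustrative sketch: a detached mount outlives close(2) on one
 * descriptor as long as a dup(2)'d descriptor still references it.
 * Returns 0 on success, -1 on error. Needs CAP_SYS_ADMIN.
 */
int detached_mount_survives_dup(const char *path)
{
    struct stat st;
    int fd, fd2, ret;

    /* OPEN_TREE_CLOEXEC has the same value as O_CLOEXEC. */
    fd = (int)syscall(SYS_open_tree, AT_FDCWD, path,
                      OPEN_TREE_CLONE | O_CLOEXEC);
    if (fd == -1)
        return -1;

    fd2 = dup(fd);
    if (fd2 == -1) {
        close(fd);
        return -1;
    }
    close(fd);    /* Not the last reference: nothing is unmounted yet. */

    /* The detached mount is still usable through the duplicate. */
    ret = fstatat(fd2, "", &st, AT_EMPTY_PATH);

    close(fd2);   /* Last reference dropped: the mount is dissolved. */
    return ret;
}
```

The same teardown would happen if the last reference went away through
process exit or exec with close-on-exec set, which is why naming
close(2) alone may be misleading.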

> > +is lazy\[em]akin to calling
> 
> I prefer em dashes in both sides of the parenthetical; it more clearly
> denotes where it ends.
> 
> 	is lazy
> 	\[em]akin to calling
> 	.BR umount2 (2)
> 	with
> 	.BR MOUNT_DETACH \[em];

An \[em] next to a ";"? Let me see if I can rewrite it to avoid this...

> (I assume that's where it ends.)
> 
> > +.BR umount2 (2)
> > +with
> > +.BR MOUNT_DETACH ;
> > +any existing open references to files
> > +from the mount object
> > +will continue to work,
> > +and the mount object will only be completely destroyed
> > +once it ceases to be busy.)
> > +.IP \[bu]
> > +If
> > +.I flags
> > +does not contain
> > +.BR \%OPEN_TREE_CLONE ,
> > +.BR open_tree ()
> > +returns a file descriptor
> > +that is exactly equivalent to
> > +one produced by
> > +.BR openat (2)
> > +when called with the same
> > +.I dirfd
> > +and
> > +.IR path .
> > +.P
> > +In either case, the resultant file descriptor
> > +acts the same as one produced by
> > +.BR open (2)
> > +with
> > +.BR O_PATH ,
> > +meaning it can also be used as a
> > +.I dirfd
> > +argument to
> > +"*at()" system calls.
> > +.P
> > +As with "*at()" system calls,
> > +.BR open_tree ()
> > +uses the
> > +.I dirfd
> > +argument in conjunction with the
> > +.I path
> > +argument to determine the path to operate on, as follows:
> > +.IP \[bu] 3
> > +If the pathname given in
> > +.I path
> > +is absolute, then
> > +.I dirfd
> > +is ignored.
> > +.IP \[bu]
> > +If the pathname given in
> > +.I path
> > +is relative and
> > +.I dirfd
> > +is the special value
> > +.BR \%AT_FDCWD ,
> > +then
> > +.I path
> > +is interpreted relative to
> > +the current working directory
> > +of the calling process (like
> > +.BR open (2)).
> > +.IP \[bu]
> > +If the pathname given in
> > +.I path
> > +is relative,
> > +then it is interpreted relative to
> > +the directory referred to by the file descriptor
> > +.I dirfd
> > +(rather than relative to
> > +the current working directory
> > +of the calling process,
> > +as is done by
> > +.BR open (2)
> > +for a relative pathname).
> > +In this case,
> > +.I dirfd
> > +must be a directory
> > +that was opened for reading
> > +.RB ( O_RDONLY )
> > +or using the
> > +.B O_PATH
> > +flag.
> > +.IP \[bu]
> > +If
> > +.I path
> > +is an empty string,
> > +and
> > +.I flags
> > +contains
> > +.BR \%AT_EMPTY_PATH ,
> > +then the file descriptor
> > +.I dirfd
> > +is operated on directly.
> > +In this case,
> > +.I dirfd
> > +may refer to any type of file,
> > +not just a directory.
> > +.P
> > +See
> > +.BR openat (2)
> > +for an explanation of why the
> > +.I dirfd
> > +argument is useful.
> > +.P
> > +.I flags
> > +can be used to control aspects of the path lookup
> > +and properties of the returned file descriptor.
> > +A value for
> > +.I flags
> > +is constructed by bitwise ORing
> > +zero or more of the following constants:
> > +.RS
> > +.TP
> > +.B \%AT_EMPTY_PATH
> > +If
> > +.I path
> > +is an empty string, operate on the file referred to by
> > +.I dirfd
> > +(which may have been obtained from
> > +.BR open (2),
> > +.BR fsmount (2),
> > +or from another
> > +.BR open_tree ()
> > +call).
> > +In this case,
> > +.I dirfd
> > +may refer to any type of file, not just a directory.
> > +If
> > +.I dirfd
> > +is
> > +.BR \%AT_FDCWD ,
> > +.BR open_tree ()
> > +will operate on the current working directory
> > +of the calling process.
> > +This flag is Linux-specific; define
> > +.B \%_GNU_SOURCE
> > +to obtain its definition.
> > +.TP
> > +.B \%AT_NO_AUTOMOUNT
> > +Do not automount the terminal ("basename") component of
> > +.I path
> > +if it is a directory that is an automount point.
> > +This allows you to create a handle to the automount point itself,
> > +rather than the location it would mount.
> > +This flag has no effect if the mount point has already been mounted over.
> > +This flag is Linux-specific; define
> > +.B \%_GNU_SOURCE
> > +to obtain its definition.
> > +.TP
> > +.B \%AT_SYMLINK_NOFOLLOW
> > +If
> > +.I path
> > +is a symbolic link, do not dereference it; instead,
> > +create either a handle to the link itself
> > +or a bind-mount of it.
> > +The resultant file descriptor is indistinguishable from one produced by
> > +.BR openat (2)
> > +with
> > +.BR \%O_PATH | O_NOFOLLOW .
> > +.TP
> > +.B \%OPEN_TREE_CLOEXEC
> > +Set the close-on-exec
> > +.RB ( FD_CLOEXEC )
> > +flag on the new file descriptor.
> > +See the description of the
> > +.B O_CLOEXEC
> > +flag in
> > +.BR open (2)
> > +for reasons why this may be useful.
> > +.TP
> > +.B \%OPEN_TREE_CLONE
> > +Rather than creating an
> > +.BR openat (2)-style
> > +.B O_PATH
> > +file descriptor,
> > +create a bind-mount of
> > +.I path
> > +(akin to
> > +.IR "mount --bind" )
> 
> You need to escape dashes in manual pages.  Otherwise, they're formatted
> as hyphens, which can't be pasted into the terminal (and another
> consequence is not being able to search for them in the man(1) reader
> with literal dashes).
> 
> Depending on your system, you might be able to search for them or paste
> them to the terminal, because some distros patch this in
> /etc/local/an.tmac, at the expense of generating lower quality pages,
> but in general don't rely on that.
> 
> I've noticed now, but this probably also happens in previous pages in
> this patch set.
> 
> While at it, you should also use a non-breaking space, to keep the
> entire command in the same line.
> 
> 	.IR \%mount\~\-\-bind )

My bad, I think my terminal font doesn't distinguish between them well
enough for it to be obvious. I'll go through and fix up all of these
cases.

Thanks.

> Cheers,
> Alex
> 
> > +as a detached mount object.
> > +In order to do this operation,
> > +the calling process must have the
> > +.BR \%CAP_SYS_ADMIN
> > +capability.
> > +.TP
> > +.B \%AT_RECURSIVE
> > +Create a recursive bind-mount of the path
> > +(akin to
> > +.IR "mount --rbind" )
> > +as a detached mount object.
> > +This flag is only permitted in conjunction with
> > +.BR \%OPEN_TREE_CLONE .
> > +.SH RETURN VALUE
> > +On success, a new file descriptor is returned.
> > +On error, \-1 is returned, and
> > +.I errno
> > +is set to indicate the error.
> > +.SH ERRORS
> > +.TP
> > +.B EACCES
> > +Search permission is denied for one of the directories
> > +in the path prefix of
> > +.IR path .
> > +(See also
> > +.BR path_resolution (7).)
> > +.TP
> > +.B EBADF
> > +.I path
> > +is relative but
> > +.I dirfd
> > +is neither
> > +.B \%AT_FDCWD
> > +nor a valid file descriptor.
> > +.TP
> > +.B EFAULT
> > +.I path
> > +is NULL
> > +or a pointer to a location
> > +outside the calling process's accessible address space.
> > +.TP
> > +.B EINVAL
> > +Invalid flag specified in
> > +.IR flags .
> > +.TP
> > +.B ELOOP
> > +Too many symbolic links encountered when resolving
> > +.IR path .
> > +.TP
> > +.B EMFILE
> > +The calling process has too many open files to create more.
> > +.TP
> > +.B ENAMETOOLONG
> > +.I path
> > +is longer than
> > +.BR PATH_MAX .
> > +.TP
> > +.B ENFILE
> > +The system has too many open files to create more.
> > +.TP
> > +.B ENOENT
> > +A component of
> > +.I path
> > +does not exist, or is a dangling symbolic link.
> > +.TP
> > +.B ENOENT
> > +.I path
> > +is an empty string, but
> > +.B AT_EMPTY_PATH
> > +is not specified in
> > +.IR flags .
> > +.TP
> > +.B ENOTDIR
> > +A component of the path prefix of
> > +.I path
> > +is not a directory, or
> > +.I path
> > +is relative and
> > +.I dirfd
> > +is a file descriptor referring to a file other than a directory.
> > +.TP
> > +.B ENOSPC
> > +The "anonymous" mount namespace
> > +necessary to contain the
> > +.B \%OPEN_TREE_CLONE
> > +detached bind-mount mount object
> > +could not be allocated,
> > +as doing so would exceed
> > +the configured per-user limit on
> > +the number of mount namespaces in the current user namespace.
> > +(See also
> > +.BR namespaces (7).)
> > +.TP
> > +.B ENOMEM
> > +The kernel could not allocate sufficient memory to complete the operation.
> > +.TP
> > +.B EPERM
> > +.I flags
> > +contains
> > +.B \%OPEN_TREE_CLONE
> > +but the calling process does not have the required
> > +.B CAP_SYS_ADMIN
> > +capability.
> > +.SH STANDARDS
> > +Linux.
> > +.SH HISTORY
> > +Linux 5.2.
> > +.\" commit a07b20004793d8926f78d63eb5980559f7813404
> > +.\" commit 400913252d09f9cfb8cce33daee43167921fc343
> > +glibc 2.36.
> > +.SH NOTES
> > +.SS Mount propagation
> > +The bind-mount mount objects created by
> > +.BR open_tree ()
> > +with
> > +.B \%OPEN_TREE_CLONE
> > +are not associated with
> > +the mount namespace of the calling process.
> > +Instead, each mount object is placed
> > +in a newly allocated "anonymous" mount namespace
> > +associated with the calling process.
> > +.P
> > +One of the side-effects of this is that
> > +(unlike bind-mounts created with
> > +.BR mount (2)),
> > +mount propagation
> > +(as described in
> > +.BR mount_namespaces (7))
> > +will not be applied to bind-mounts created by
> > +.BR open_tree ()
> > +until the bind-mount is attached with
> > +.BR move_mount (2),
> > +at which point the mount object
> > +will be associated with the mount namespace
> > +where it was attached
> > +and mount propagation will resume.
> > +Note that any mount propagation events that occurred
> > +before the mount object was attached
> > +will
> > +.I not
> > +be propagated to the mount object,
> > +even after it is attached.
> > +.SH EXAMPLES
> > +The following examples show how
> > +.BR open_tree ()
> > +can be used in place of more traditional
> > +.BR mount (2)
> > +calls with
> > +.BR MS_BIND .
> > +.P
> > +.in +4n
> > +.EX
> > +int srcfd = open_tree(AT_FDCWD, "/var", OPEN_TREE_CLONE);
> > +move_mount(srcfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > +.EE
> > +.in
> > +.P
> > +First,
> > +a detached bind-mount mount object of
> > +.I /var
> > +is created
> > +and associated with the file descriptor
> > +.IR srcfd .
> > +Then, the mount object is attached to
> > +.I /mnt
> > +using
> > +.BR move_mount (2)
> > +with
> > +.B \%MOVE_MOUNT_F_EMPTY_PATH
> > +to request that the detached mount object
> > +associated with the file descriptor
> > +.I srcfd
> > +be moved (and thus attached) to
> > +.IR /mnt .
> > +.P
> > +The above procedure is functionally equivalent to
> > +the following mount operation using
> > +.BR mount (2):
> > +.P
> > +.in +4n
> > +.EX
> > +mount("/var", "/mnt", NULL, MS_BIND, NULL);
> > +.EE
> > +.in
> > +.P
> > +.B \%OPEN_TREE_CLONE
> > +can be combined with
> > +.B \%AT_RECURSIVE
> > +to create recursive detached bind-mount mount objects,
> > +which in turn can be attached to mount points
> > +to create recursive bind-mounts.
> > +.P
> > +.in +4n
> > +.EX
> > +int srcfd = open_tree(AT_FDCWD, "/var", OPEN_TREE_CLONE | AT_RECURSIVE);
> > +move_mount(srcfd, "", AT_FDCWD, "/mnt", MOVE_MOUNT_F_EMPTY_PATH);
> > +.EE
> > +.in
> > +.P
> > +The above procedure is functionally equivalent to
> > +the following mount operation using
> > +.BR mount (2):
> > +.P
> > +.in +4n
> > +.EX
> > +mount("/var", "/mnt", NULL, MS_BIND | MS_REC, NULL);
> > +.EE
> > +.in
> > +.P
> > +One of the primary benefits of using
> > +.BR open_tree ()
> > +and
> > +.BR move_mount (2)
> > +over the traditional
> > +.BR mount (2)
> > +is that operating with
> > +.IR dirfd -style
> > +file descriptors is far easier and more intuitive.
> > +.P
> > +.in +4n
> > +.EX
> > +int srcfd = open_tree(100, "", AT_EMPTY_PATH | OPEN_TREE_CLONE);
> > +move_mount(srcfd, "", 200, "foo", MOVE_MOUNT_F_EMPTY_PATH);
> > +.EE
> > +.in
> > +.P
> > +The above procedure is roughly equivalent to
> > +the following mount operation using
> > +.BR mount (2):
> > +.P
> > +.in +4n
> > +.EX
> > +mount("/proc/self/fd/100", "/proc/self/fd/200/foo", NULL, MS_BIND, NULL);
> > +.EE
> > +.in
> > +.P
> > +In addition, you can use the file descriptor returned by
> > +.BR open_tree ()
> > +as the
> > +.I dirfd
> > +argument to any "*at()" system calls:
> > +.P
> > +.in +4n
> > +.EX
> > +int dirfd, fd;
> > +\&
> > +dirfd = open_tree(AT_FDCWD, "/etc", OPEN_TREE_CLONE);
> > +fd = openat(dirfd, "passwd", O_RDONLY);
> > +fchmodat(dirfd, "shadow", 0000, 0);
> > +close(dirfd);
> > +close(fd);
> > +/* The bind-mount is now destroyed. */
> > +.EE
> > +.in
> > +.SH SEE ALSO
> > +.BR fsconfig (2),
> > +.BR fsmount (2),
> > +.BR fsopen (2),
> > +.BR fspick (2),
> > +.BR mount (2),
> > +.BR mount_setattr (2),
> > +.BR move_mount (2),
> > +.BR mount_namespaces (7)
> > 
> > -- 
> > 2.51.0
> > 
> > 
> 
> -- 
> <https://www.alejandro-colomar.es>
> Use port 80 (that is, <...:80/>).



-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

^ permalink raw reply

* Re: [PATCH v4 05/10] man/man2/fsmount.2: document "new" mount API
From: Askar Safin @ 2025-09-22  1:10 UTC (permalink / raw)
  To: safinaskar
  Cc: alx, brauner, cyphar, dhowells, g.branden.robinson, jack,
	linux-api, linux-fsdevel, linux-kernel, linux-man, mtk.manpages,
	safinaskar, viro
In-Reply-To: <20250921230824.92612-1-safinaskar@gmail.com>

> MNT_DETACH, not MOUNT_DETACH

Same for open_tree and open_tree_attr

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH v4 10/10] man/man2/{fsconfig,mount_setattr}.2: add note about attribute-parameter distinction
From: Askar Safin @ 2025-09-22  1:06 UTC (permalink / raw)
  To: cyphar
  Cc: alx, brauner, dhowells, g.branden.robinson, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-man, mtk.manpages, safinaskar,
	viro
In-Reply-To: <20250919-new-mount-api-v4-10-1261201ab562@cyphar.com>

> Some mount attributes (traditionally associated with mount(8)-style options) have a sibling mount attribute with superficially similar user-facing behaviour

"Some mount attributes... have a sibling mount attribute"

Something is wrong here.

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH v4 03/10] man/man2/fspick.2: document "new" mount API
From: Askar Safin @ 2025-09-22  0:25 UTC (permalink / raw)
  To: cyphar
  Cc: alx, brauner, dhowells, g.branden.robinson, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-man, mtk.manpages, safinaskar,
	viro
In-Reply-To: <20250919-new-mount-api-v4-3-1261201ab562@cyphar.com>

> With the notable caveat that in this example, mount(2) will clear all other filesystem parameters (such as MS_NOSUID or MS_NOEXEC); fsconfig(2) will only modify the ro parameter.

MS_NOSUID and MS_NOEXEC are not filesystem parameters. They can be set per-mount, but not
per-filesystem. Here is the list of all filesystem-agnostic per-superblock parameters:

https://elixir.bootlin.com/linux/v6.17-rc6/source/fs/namespace.c#L4103

Note that these SB_* constants are equal to corresponding MS_* constants.

As you can see, there is no NOSUID and NOEXEC in that list.

Also, SB_NOSUID does exist:
https://elixir.bootlin.com/linux/v6.17-rc6/source/include/linux/fs.h#L1240

So, it seems that a "NOSUID superblock" does exist as a concept. But, thanks to
the code in path_mount (linked above), a user cannot (in a filesystem-agnostic way)
make a given superblock NOSUID.

So, from the user's point of view, NOSUID and NOEXEC are not filesystem parameters.
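
The distinction can be sketched in C with the userspace MS_* constants.
The mask below is an assumption based on a reading of path_mount() at
the link above, and is_superblock_flag() is a hypothetical helper;
verify the list against the kernel source before relying on it.

```c
#include <stdbool.h>
#include <sys/mount.h>

/*
 * Assumed mask: the MS_* flags that path_mount() forwards as
 * filesystem-agnostic superblock (SB_*) flags. Check it against the
 * kernel source before relying on it.
 */
#define SB_FLAGS_FROM_MOUNT                                      \
    (MS_RDONLY | MS_SYNCHRONOUS | MS_MANDLOCK | MS_DIRSYNC |     \
     MS_SILENT | MS_POSIXACL | MS_LAZYTIME | MS_I_VERSION)

/* True if ms_flag is one of the per-superblock flags listed above. */
bool is_superblock_flag(unsigned long ms_flag)
{
    return (ms_flag & SB_FLAGS_FROM_MOUNT) != 0;
}
```

With this mask, MS_SYNCHRONOUS counts as a superblock flag while
MS_NOSUID and MS_NOEXEC do not.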

If you need an example of a filesystem parameter, I suggest MS_SYNCHRONOUS,
which I used here:
https://lore.kernel.org/all/198d1f2e189.11dbac16b2998.3847935512688537521@zohomail.com/

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH v4 05/10] man/man2/fsmount.2: document "new" mount API
From: Askar Safin @ 2025-09-21 23:08 UTC (permalink / raw)
  To: cyphar
  Cc: alx, brauner, dhowells, g.branden.robinson, jack, linux-api,
	linux-fsdevel, linux-kernel, linux-man, mtk.manpages, safinaskar,
	viro
In-Reply-To: <20250919-new-mount-api-v4-5-1261201ab562@cyphar.com>

> Note that the unmount operation on close(2) is lazy—akin to calling umount2(2) with MOUNT_DETACH

MNT_DETACH, not MOUNT_DETACH

-- 
Askar Safin

^ permalink raw reply

* Re: [PATCH v3 07/30] kho: add interfaces to unpreserve folios and physical memory ranges
From: Pasha Tatashin @ 2025-09-21 22:20 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Mike Rapoport, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250818135509.GK802098@nvidia.com>

On Mon, Aug 18, 2025 at 9:55 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Aug 15, 2025 at 12:12:10PM +0300, Mike Rapoport wrote:
> > > Which is perhaps another comment, if this __get_free_pages() is going
> > > to be a common pattern (and I guess it will be) then the API should be
> > > streamlined a lot more:
> > >
> > >  void *kho_alloc_preserved_memory(gfp, size);
> > >  void kho_free_preserved_memory(void *);
> >
> > This looks backwards to me. KHO should not deal with memory allocation;
> > its responsibility is to preserve/restore the memory objects it supports.
>
> Then maybe those are luo_ helpers
>
> But having users open code __get_free_pages() and convert to/from
> struct page, phys, etc is not a great idea.

I added:

void *luo_contig_alloc_preserve(size_t size);
void luo_contig_free_unpreserve(void *mem, size_t size);

Allocate contiguous, zeroed, and preserved memory.

Pasha

^ permalink raw reply
