Linux userland API discussions


* Re: [PATCH v3 01/30] kho: init new_physxa->phys_bits to fix lockdep
From: Pasha Tatashin @ 2025-08-08 19:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, tj, yoann.congal, mmaurer, roman.gushchin, chenridong,
	axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250808120616.40842e9a9fdc056c9eb74123@linux-foundation.org>

> > Thanks Pratyush, I will make this simplification change if Andrew does
> > not take this patch in before the next revision.
> >
>
> Yes please on the simplification - the original has an irritating
> amount of kinda duplication of things from other places.  Perhaps a bit
> of a redo of these functions would clean things up.  But later.
>
> Can we please have this as a standalone hotfix patch with a cc:stable?
> As Pratyush helpfully suggested in
> https://lkml.kernel.org/r/mafs0sei2aw80.fsf@kernel.org.

I think we should take the first three patches as hotfixes.

Let me send them as a separate series in the next 15 minutes.

Pasha


* Re: [PATCH v2 05/11] fsconfig.2: document 'new' mount api
From: Aleksa Sarai @ 2025-08-08 19:07 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <2025-08-08.1754666161-creaky-taboo-miso-cuff-mKwsCC@cyphar.com>

On 2025-08-09, Aleksa Sarai <cyphar@cyphar.com> wrote:
> On 2025-08-08, Askar Safin <safinaskar@zohomail.com> wrote:
> > Let's consider this example:
> > 
> >            int fsfd, mntfd, nsfd, nsdirfd;
> > 
> >            nsfd = open("/proc/self/ns/pid", O_PATH);
> >            nsdirfd = open("/proc/1/ns", O_DIRECTORY);
> > 
> >            fsfd = fsopen("proc", FSOPEN_CLOEXEC);
> >            /* "pidns" changes the value each time. */
> >            fsconfig(fsfd, FSCONFIG_SET_PATH, "pidns", "/proc/self/ns/pid", AT_FDCWD);
> >            fsconfig(fsfd, FSCONFIG_SET_PATH, "pidns", "pid", nsdirfd);
> >            fsconfig(fsfd, FSCONFIG_SET_PATH_EMPTY, "pidns", "", nsfd);
> >            fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
> >            fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
> >            mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
> >            move_mount(mntfd, "", AT_FDCWD, "/proc", MOVE_MOUNT_F_EMPTY_PATH);
> > 
> > I don't like it. /proc/self/ns/pid is our own namespace, which is the default anyway.
> > I.e. setting pidns to /proc/self/ns/pid is a no-op (assuming that the "pidns" option is implemented in our kernel, of course).
> > Moreover, if /proc is mounted properly, then /proc/1/ns/pid refers to our namespace, too!

This slightly depends on what you mean by "properly". If you deal with
namespaces a lot, running into a situation where the current process's
pidns doesn't match /proc is quite common (we run into it with container
runtimes all the time).

A proper example with provably different pidns values (such as the
selftests for the pidns parameter) would make for a very lengthy example
program with very little use for readers.

I'm tempted to just delete this example.

> > Thus, *all* these fsconfig(FSCONFIG_SET_...) calls are no-ops.
> > Thus it is a bad example.
> > 
> > I suggest using, say, /proc/2/ns/pid. It has an actual chance of referring to some other namespace.
> > 
> > Also, the sentence '"pidns" changes the value each time' is a lie: as I explained, all these calls are no-ops,
> > they don't really change anything.
> 
> Right, I see your point.
> 
> One other problem with this example is that there is no
> currently-existing parameter which accepts all of FSCONFIG_SET_PATH,
> FSCONFIG_SET_PATH_EMPTY, FSCONFIG_SET_FD, and FSCONFIG_SET_STRING so
> this example is by necessity a little contrived. I suspect that it'd be
> better to remove this and re-add it once we actually have something that
> works this way...
> 
> You've replied to the pidns parameter patchset so I shouldn't repeat
> myself here too much, but supporting this completely is my plan for the
> next version I send. It's just not a thing that exists today (ditto for
> overlayfs).
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> https://www.cyphar.com/



-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v3 01/30] kho: init new_physxa->phys_bits to fix lockdep
From: Andrew Morton @ 2025-08-08 19:06 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, tj, yoann.congal, mmaurer, roman.gushchin, chenridong,
	axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <CA+CK2bBoMNEfyFKgvKR0JvECpZrGKP1mEbC_fo8SqystEBAQUA@mail.gmail.com>

On Fri, 8 Aug 2025 14:00:08 +0000 Pasha Tatashin <pasha.tatashin@soleen.com> wrote:

> > > I suppose this could be simplified a bit to:
> > >
> > >       err = xa_err(physxa);
> > >         if (err || physxa) {
> > >               xa_destroy(&new_physxa->phys_bits);
> > >                 kfree(new_physxa);
> > >
> > >               if (err)
> > >                       return err;
> > >       } else {
> > >               physxa = new_physxa;
> > >       }
> >
> > My email client completely messed the whitespace up so this is a bit
> > unreadable. Here is what I meant:
> >
> >         err = xa_err(physxa);
> >         if (err || physxa) {
> >                 xa_destroy(&new_physxa->phys_bits);
> >                 kfree(new_physxa);
> >
> >                 if (err)
> >                         return err;
> >         } else {
> >                 physxa = new_physxa;
> >         }
> >
> > [...]
> 
> Thanks Pratyush, I will make this simplification change if Andrew does
> not take this patch in before the next revision.
> 

Yes please on the simplification - the original has an irritating
amount of kinda duplication of things from other places.  Perhaps a bit
of a redo of these functions would clean things up.  But later.

Can we please have this as a standalone hotfix patch with a cc:stable? 
As Pratyush helpfully suggested in
https://lkml.kernel.org/r/mafs0sei2aw80.fsf@kernel.org.

Thanks.


* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
From: Aleksa Sarai @ 2025-08-08 15:51 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Askar Safin, amir73il, corbet, jack, linux-api, linux-doc,
	linux-fsdevel, linux-kernel, linux-kselftest, luto, shuah, viro
In-Reply-To: <20250808-kurswechsel-angekauft-ec6bfc2efa79@brauner>

On 2025-08-08, Christian Brauner <brauner@kernel.org> wrote:
> On Thu, Aug 07, 2025 at 05:17:56PM +1000, Aleksa Sarai wrote:
> > On 2025-08-07, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > On 2025-08-06, Askar Safin <safinaskar@zohomail.com> wrote:
> > > > > I just realised that we probably also want to support FSCONFIG_SET_PATH
> > > > 
> > > > I just checked kernel code. Indeed nobody uses FSCONFIG_SET_PATH.
> > > > Moreover, fsparam_path macro is present since 5.1. And for all this
> > > > time nobody used it. So, let's just remove FSCONFIG_SET_PATH. Nobody
> > > > used it, so this will not break anything.
> > > > 
> > > > If you are okay with that, I can submit a patch removing it.
> > > 
> > > I would prefer you didn't -- "*at()" semantics are very useful to a lot
> > > of programs (*especially* AT_EMPTY_PATH). I would like the pidns= stuff
> > > to support it, and probably also overlayfs...
> > > 
> > > I suspect the primary issue is that when migrating to the new mount API,
> > > filesystem devs just went with the easiest thing to use
> > > (FSCONFIG_SET_STRING) even though FSCONFIG_SET_PATH would be better. I
> > > suspect the lack of documentation around fsconfig(2) played a part too.
> > > 
> > > My impression is that interest in the minutiae of fsconfig(2) is quite
> > > low on the list of priorities for most filesystem devs, and so the neat
> > > aspects of fsconfig(2) haven't been fully utilised. (In LPC last year,
> > > we struggled to come to an agreement on how filesystems should use the
> > > read(2)-based error interface.)
> > > 
> > > We can very easily move fsparam_string() or fsparam_file_or_string()
> > > parameters to fsparam_path() and a future fsparam_file_or_path(). I
> > > would much prefer that as a user.
> > 
> > Actually, fsparam_bdev() accepts FSCONFIG_SET_PATH in a very roundabout
> > way (and the checker doesn't verify anything...?). So there is at least
> > one user (ext4's "journal_path"), it's just not well-documented (which
> > I'm trying to fix ;]).
> > 
> > My plan is to update fs_lookup_param() to be more useful for the (fairly
> > common) use-case of wanting to support paths and file descriptors, and
> > going through to clean up some of these unused fsparam_* helpers (or
> > fsparam_* helpers being abused to implement stuff that the fs_parser
> > core already supports).
> > 
> > At the very least, overlayfs, ext4, and this procfs patchset can make
> > use of it.
> 
> I've never bothered with actually implementing FSCONFIG_SET_PATH
> semantics because I think it's really weird to allow *at semantics when
> setting filesystem parameters. I always thought it's better to force
> userspace to provide a file descriptor for the final destination instead
> of doing some arcane lookup variant for mount configuration. But I'm
> happy to be convinced of its usefulness...

I do think it's useful, and here's my thought process...

Most filesystems have to take string path parameters in order to support
mount(2) and work with mount(8). Yes, fsparam_fd() will accept
FSCONFIG_SET_STRING by parsing it as a decimal string, but there are
only two users of fsparam_fd() and honestly I'm not convinced this is a
particularly sane API for anything other than strict backcompat reasons
(the API only makes sense as a file descriptor and you want mount(8) to
be able to use it).

So you end up with most parameters supporting paths set using
FSCONFIG_SET_STRING anyway, meaning in-kernel lookups can't be taken off
the table. And if we accept paths for lookup, then (for the same reason
we have *at(2) syscalls) it is preferable to allow specifying dirfds. So
FSCONFIG_SET_PATH should also be supported.

And as there is no infrastructure to block FSCONFIG_SET_PATH_EMPTY
arguments (yes, you can do it manually, but the *only* user of
fs_lookup_param() doesn't), then anything that accepts FSCONFIG_SET_PATH
currently also accepts FSCONFIG_SET_PATH_EMPTY which is "morally
equivalent" to FSCONFIG_SET_FD. So unless you block
FSCONFIG_SET_PATH_EMPTY then FSCONFIG_SET_FD should probably also be
supported (there is the re-opening distinction, of course, but that is
not relevant if you use filename_lookup() -- which is what filesystems
will do in practice).

So my impression is that most users (if they had an fsconfig(2) man page
to read...) would expect parameters that accept paths to either:

* Work with FSCONFIG_SET_STRING and FSCONFIG_SET_PATH only; or
* Work with FSCONFIG_SET_STRING, FSCONFIG_SET_PATH,
  FSCONFIG_SET_PATH_EMPTY, and FSCONFIG_SET_FD.

Currently, none of our parameters work that way.

 * ext4's journal_path takes FSCONFIG_SET_STRING, FSCONFIG_SET_PATH, and
   FSCONFIG_SET_PATH_EMPTY.
 * overlayfs takes FSCONFIG_SET_FD and FSCONFIG_SET_STRING.

I only fully realised how inconsistent this is while working on the
fsconfig(2) man pages -- at the moment I have a very long paragraph
explaining that there is this distinction in-kernel, but this really
doesn't seem intentional to me. I would be very confused as a user that
FSCONFIG_SET_PATH is useless for most filesystem *path* parameters, even
though the filesystem accepts them as FSCONFIG_SET_STRING.

As for practical uses, it would be nice to not have to open 500 files in
order to create a 500-layer overlayfs.
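
As a rough illustration of the dirfd argument above, here is a minimal
sketch of how userspace would pass a path-valued parameter relative to a
directory fd. It assumes a kernel and headers that provide fsconfig(2)
(Linux 5.2 or later). fsconfig_set_path() is a hypothetical helper name,
and the raw syscall(2) is used so the sketch does not depend on a libc
wrapper being available.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>          /* AT_FDCWD */
#include <linux/mount.h>    /* FSCONFIG_SET_PATH */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SYS_fsconfig
#define SYS_fsconfig 431    /* generic syscall number, used only if headers lack it */
#endif

/*
 * Hypothetical helper: set a path-valued parameter on a filesystem context.
 * `path` is resolved relative to `dirfd`, in the same way as openat(2);
 * pass AT_FDCWD to resolve relative to the current directory.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int fsconfig_set_path(int fsfd, const char *key, int dirfd,
			     const char *path)
{
	assert(key && path);
	return (int)syscall(SYS_fsconfig, fsfd, FSCONFIG_SET_PATH,
			    key, path, dirfd);
}
```

For a parameter that accepts FSCONFIG_SET_PATH (ext4's journal_path is the
one named above), a caller could open the containing directory once and
then call fsconfig_set_path(fsfd, "journal_path", dirfd, "journal"), where
"journal" is just a placeholder file name. Whether a given parameter
accepts this command is still up to the filesystem.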

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v2 05/11] fsconfig.2: document 'new' mount api
From: Aleksa Sarai @ 2025-08-08 15:22 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <19889fbe690.e80d252e42280.4347614991285137048@zohomail.com>

On 2025-08-08, Askar Safin <safinaskar@zohomail.com> wrote:
> Let's consider this example:
> 
>            int fsfd, mntfd, nsfd, nsdirfd;
> 
>            nsfd = open("/proc/self/ns/pid", O_PATH);
>            nsdirfd = open("/proc/1/ns", O_DIRECTORY);
> 
>            fsfd = fsopen("proc", FSOPEN_CLOEXEC);
>            /* "pidns" changes the value each time. */
>            fsconfig(fsfd, FSCONFIG_SET_PATH, "pidns", "/proc/self/ns/pid", AT_FDCWD);
>            fsconfig(fsfd, FSCONFIG_SET_PATH, "pidns", "pid", nsdirfd);
>            fsconfig(fsfd, FSCONFIG_SET_PATH_EMPTY, "pidns", "", nsfd);
>            fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
>            fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
>            mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
>            move_mount(mntfd, "", AT_FDCWD, "/proc", MOVE_MOUNT_F_EMPTY_PATH);
> 
> I don't like it. /proc/self/ns/pid is our own namespace, which is the default anyway.
> I.e. setting pidns to /proc/self/ns/pid is a no-op (assuming that the "pidns" option is implemented in our kernel, of course).
> Moreover, if /proc is mounted properly, then /proc/1/ns/pid refers to our namespace, too!
> Thus, *all* these fsconfig(FSCONFIG_SET_...) calls are no-ops.
> Thus it is a bad example.
> 
> I suggest using, say, /proc/2/ns/pid. It has an actual chance of referring to some other namespace.
> 
> Also, the sentence '"pidns" changes the value each time' is a lie: as I explained, all these calls are no-ops,
> they don't really change anything.

Right, I see your point.

One other problem with this example is that there is no
currently-existing parameter which accepts all of FSCONFIG_SET_PATH,
FSCONFIG_SET_PATH_EMPTY, FSCONFIG_SET_FD, and FSCONFIG_SET_STRING so
this example is by necessity a little contrived. I suspect that it'd be
better to remove this and re-add it once we actually have something that
works this way...

You've replied to the pidns parameter patchset so I shouldn't repeat
myself here too much, but supporting this completely is my plan for the
next version I send. It's just not a thing that exists today (ditto for
overlayfs).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


* Re: [PATCH v4 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
From: Christian Brauner @ 2025-08-08 14:12 UTC (permalink / raw)
  To: Randy Dunlap, Arnd Bergmann
  Cc: Aleksa Sarai, Alexander Viro, Jan Kara, Jonathan Corbet,
	Shuah Khan, Andy Lutomirski, linux-kernel, linux-fsdevel,
	linux-api, linux-doc, linux-kselftest
In-Reply-To: <1ea6f1d9-550d-4b81-bade-1a0ca14c27c6@infradead.org>

On Wed, Aug 06, 2025 at 11:57:42AM -0700, Randy Dunlap wrote:
> 
> 
> On 8/6/25 11:02 AM, Aleksa Sarai wrote:
> > On 2025-08-05, Randy Dunlap <rdunlap@infradead.org> wrote:
> >>
> >>
> >> On 8/4/25 10:45 PM, Aleksa Sarai wrote:
> >>> /proc has historically had very opaque semantics about PID namespaces,
> >>> which is a little unfortunate for container runtimes and other programs
> >>> that deal with switching namespaces very often. One common issue is that
> >>> of converting between PIDs in the process's namespace and PIDs in the
> >>> namespace of /proc.
> >>>
> >>> In principle, it is possible to do this today by opening a pidfd with
> >>> pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> >>> contain a PID value translated to the pid namespace associated with that
> >>> procfs superblock). However, allocating a new file for each PID to be
> >>> converted is less than ideal for programs that may need to scan procfs,
> >>> and it is generally useful for userspace to be able to finally get this
> >>> information from procfs.
> >>>
> >>> So, add a new API to get the pid namespace of a procfs instance, in the
> >>> form of an ioctl(2) you can call on the root directory of said procfs.
> >>> The returned file descriptor will have O_CLOEXEC set. This acts as a
> >>> sister feature to the new "pidns" mount option, finally allowing
> >>> userspace full control of the pid namespaces associated with procfs
> >>> instances.
> >>>
> >>> The permission model for this is a bit looser than that of the "pidns"
> >>> mount option (and also setns(2)) because /proc/1/ns/pid provides the
> >>> same information, so as long as you have access to that magic-link (or
> >>> something equivalently reasonable such as being in an ancestor pid
> >>> namespace) it makes sense to allow userspace to grab a handle. Ideally
> >>> we would check for ptrace-read access against all processes in the pidns
> >>> (which is very likely to be true for at least one process, as
> >>> SUID_DUMP_DISABLE is cleared on exec(2) and is rarely set by most
> >>> programs), but this would obviously not scale.
> >>>
> >>> setns(2) will still have its own permission checks, so being able to
> >>> open a pidns handle doesn't really provide too many other capabilities.
> >>>
> >>> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> >>> ---
> >>>  Documentation/filesystems/proc.rst |  4 +++
> >>>  fs/proc/root.c                     | 68 ++++++++++++++++++++++++++++++++++++--
> >>>  include/uapi/linux/fs.h            |  4 +++
> >>>  3 files changed, 74 insertions(+), 2 deletions(-)
> >>>
> >>
> >>
> >>> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> >>> index 0bd678a4a10e..68e65e6d7d6b 100644
> >>> --- a/include/uapi/linux/fs.h
> >>> +++ b/include/uapi/linux/fs.h
> >>> @@ -435,8 +435,12 @@ typedef int __bitwise __kernel_rwf_t;
> >>>  			 RWF_APPEND | RWF_NOAPPEND | RWF_ATOMIC |\
> >>>  			 RWF_DONTCACHE)
> >>>  
> >>> +/* This matches XSDFEC_MAGIC, so we need to allocate subvalues carefully. */
> >>>  #define PROCFS_IOCTL_MAGIC 'f'
> >>>  
> >>> +/* procfs root ioctls */
> >>> +#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)
> >>
> >> Since the _IO() nr here is 32, Documentation/userspace-api/ioctl/ioctl-number.rst
> >> should be updated like:
> >>
> >> -'f'   00-0F  linux/fs.h                                                conflict!
> >> +'f'   00-1F  linux/fs.h                                                conflict!
> > 
> > Should this be 00-20 (or 00-2F) instead?
> 
> Oops, yes, it should be one of those. Thanks.
> 
> > Also, is there a better value to use for this new ioctl? I'm not quite
> > sure what is the best practice to handle these kinds of conflicts...
> 
> I wouldn't worry about it. We have *many* conflicts.
> (unless Al or Christian are concerned)

We try to minimize conflicts but we unfortunately give no strong
guarantees in any way. I always defer to Arnd in such matters as he's
got a pretty good mental model of what is best to do for ioctls.
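
To make the off-by-one in the suggested range concrete: the nr field of an
ioctl request is its low 8 bits, so _IO('f', 32) has nr 0x20, which a
documented range of 00-1F does not cover. A small sketch using the macros
from <linux/ioctl.h> follows. The two defines are copied from the quoted
hunk; nr_in_documented_range() is a hypothetical helper, not something
from the patch.

```c
#include <assert.h>
#include <linux/ioctl.h>

/* Values copied from the quoted include/uapi/linux/fs.h hunk. */
#define PROCFS_IOCTL_MAGIC		'f'
#define PROCFS_GET_PID_NAMESPACE	_IO(PROCFS_IOCTL_MAGIC, 32)

/*
 * Hypothetical helper: does `req` use magic 'f' with an nr inside the
 * inclusive range [lo, hi], as a row in ioctl-number.rst would list it?
 */
static int nr_in_documented_range(unsigned long req, unsigned int lo,
				  unsigned int hi)
{
	assert(lo <= hi);
	return _IOC_TYPE(req) == PROCFS_IOCTL_MAGIC &&
	       _IOC_NR(req) >= lo && _IOC_NR(req) <= hi;
}
```

Both 00-20 and 00-2F cover the new request; 00-1F stops one short.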

> 
> >> (17 is already used for PROCFS_IOCTL_MAGIC somewhere else, so that probably should
> >> have updated the Doc/rst file.)
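
The fdinfo-based translation described in the quoted commit message can be
sketched roughly as follows. This assumes Linux 5.3 or later for
pidfd_open(2) and a procfs mounted at /proc. Both function names are
hypothetical, and error handling is kept minimal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434  /* generic syscall number, used only if headers lack it */
#endif

/*
 * Find the "Pid:" line in the contents of a pidfd's fdinfo file.
 * Returns the value on that line, or -1 if there is no such line.
 */
static pid_t parse_fdinfo_pid(const char *fdinfo_text)
{
	const char *line = fdinfo_text;

	assert(fdinfo_text != NULL);
	while (line && *line) {
		int pid;

		if (sscanf(line, "Pid: %d", &pid) == 1)
			return (pid_t)pid;
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}

/*
 * Translate `pid` (in the caller's pid namespace) into the namespace of
 * the procfs mounted at /proc. Returns -1 on any failure.
 */
static pid_t pid_as_seen_by_proc(pid_t pid)
{
	char path[64], buf[1024];
	ssize_t n;
	int fd, pidfd;

	pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pidfd < 0)
		return -1;

	snprintf(path, sizeof(path), "/proc/self/fdinfo/%d", pidfd);
	fd = open(path, O_RDONLY | O_CLOEXEC);
	n = fd < 0 ? -1 : read(fd, buf, sizeof(buf) - 1);
	if (fd >= 0)
		close(fd);
	close(pidfd);
	if (n < 0)
		return -1;

	buf[n] = '\0';
	return parse_fdinfo_pid(buf);
}
```

Note that this costs one pidfd plus one fdinfo open per PID, which is the
overhead the commit message is trying to avoid.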


* Re: [PATCH v4 2/4] procfs: add "pidns" mount option
From: Christian Brauner @ 2025-08-08 14:09 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Askar Safin, amir73il, corbet, jack, linux-api, linux-doc,
	linux-fsdevel, linux-kernel, linux-kselftest, luto, shuah, viro
In-Reply-To: <2025-08-07.1754550206-glad-sneeze-upstate-sorts-swank-courts-YKmj7E@cyphar.com>

On Thu, Aug 07, 2025 at 05:17:56PM +1000, Aleksa Sarai wrote:
> On 2025-08-07, Aleksa Sarai <cyphar@cyphar.com> wrote:
> > On 2025-08-06, Askar Safin <safinaskar@zohomail.com> wrote:
> > > > I just realised that we probably also want to support FSCONFIG_SET_PATH
> > > 
> > > I just checked kernel code. Indeed nobody uses FSCONFIG_SET_PATH.
> > > Moreover, fsparam_path macro is present since 5.1. And for all this
> > > time nobody used it. So, let's just remove FSCONFIG_SET_PATH. Nobody
> > > used it, so this will not break anything.
> > > 
> > > If you are okay with that, I can submit a patch removing it.
> > 
> > I would prefer you didn't -- "*at()" semantics are very useful to a lot
> > of programs (*especially* AT_EMPTY_PATH). I would like the pidns= stuff
> > to support it, and probably also overlayfs...
> > 
> > I suspect the primary issue is that when migrating to the new mount API,
> > filesystem devs just went with the easiest thing to use
> > (FSCONFIG_SET_STRING) even though FSCONFIG_SET_PATH would be better. I
> > suspect the lack of documentation around fsconfig(2) played a part too.
> > 
> > My impression is that interest in the minutiae of fsconfig(2) is quite
> > low on the list of priorities for most filesystem devs, and so the neat
> > aspects of fsconfig(2) haven't been fully utilised. (In LPC last year,
> > we struggled to come to an agreement on how filesystems should use the
> > read(2)-based error interface.)
> > 
> > We can very easily move fsparam_string() or fsparam_file_or_string()
> > parameters to fsparam_path() and a future fsparam_file_or_path(). I
> > would much prefer that as a user.
> 
> Actually, fsparam_bdev() accepts FSCONFIG_SET_PATH in a very roundabout
> way (and the checker doesn't verify anything...?). So there is at least
> one user (ext4's "journal_path"), it's just not well-documented (which
> I'm trying to fix ;]).
> 
> My plan is to update fs_lookup_param() to be more useful for the (fairly
> common) use-case of wanting to support paths and file descriptors, and
> going through to clean up some of these unused fsparam_* helpers (or
> fsparam_* helpers being abused to implement stuff that the fs_parser
> core already supports).
> 
> At the very least, overlayfs, ext4, and this procfs patchset can make
> use of it.

I've never bothered with actually implementing FSCONFIG_SET_PATH
semantics because I think it's really weird to allow *at semantics when
setting filesystem parameters. I always thought it's better to force
userspace to provide a file descriptor for the final destination instead
of doing some arcane lookup variant for mount configuration. But I'm
happy to be convinced of its usefulness...


* Re: [PATCH v3 02/30] kho: mm: Don't allow deferred struct page with KHO
From: Pasha Tatashin @ 2025-08-08 14:01 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu
In-Reply-To: <mafs0jz3eavci.fsf@kernel.org>

On Fri, Aug 8, 2025 at 11:47 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Thu, Aug 07 2025, Pasha Tatashin wrote:
>
> > KHO uses struct pages for the preserved memory early in boot, however,
> > with deferred struct page initialization, only a small portion of
> > memory has properly initialized struct pages.
> >
> > This problem was detected where vmemmap is poisoned, and illegal flag
> > combinations are detected.
> >
> > Don't allow them to be enabled together, and later we will have to
> > teach KHO to work properly with deferred struct page init kernel
> > feature.
> >
> > Fixes: 990a950fe8fd ("kexec: add config option for KHO")
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
>
> Nit: Drop the blank line before fixes. git interpret-trailers doesn't

Makes sense.

> seem to recognize the fixes otherwise, so this may break some tooling.
> Try it yourself:
>
>     $ git interpret-trailers --parse commit_message.txt
>
> Other than this,
>
> Acked-by: Pratyush Yadav <pratyush@kernel.org>

Thank you for the review.

Pasha

>
> > Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> > ---
> >  kernel/Kconfig.kexec | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> > index 2ee603a98813..1224dd937df0 100644
> > --- a/kernel/Kconfig.kexec
> > +++ b/kernel/Kconfig.kexec
> > @@ -97,6 +97,7 @@ config KEXEC_JUMP
> >  config KEXEC_HANDOVER
> >       bool "kexec handover"
> >       depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> > +     depends on !DEFERRED_STRUCT_PAGE_INIT
> >       select MEMBLOCK_KHO_SCRATCH
> >       select KEXEC_FILE
> >       select DEBUG_FS
>
> --
> Regards,
> Pratyush Yadav


* Re: [PATCH v2 05/11] fsconfig.2: document 'new' mount api
From: Askar Safin @ 2025-08-08 14:00 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250807-new-mount-api-v2-5-558a27b8068c@cyphar.com>

Let's consider this example:

           int fsfd, mntfd, nsfd, nsdirfd;

           nsfd = open("/proc/self/ns/pid", O_PATH);
           nsdirfd = open("/proc/1/ns", O_DIRECTORY);

           fsfd = fsopen("proc", FSOPEN_CLOEXEC);
           /* "pidns" changes the value each time. */
           fsconfig(fsfd, FSCONFIG_SET_PATH, "pidns", "/proc/self/ns/pid", AT_FDCWD);
           fsconfig(fsfd, FSCONFIG_SET_PATH, "pidns", "pid", nsdirfd);
           fsconfig(fsfd, FSCONFIG_SET_PATH_EMPTY, "pidns", "", nsfd);
           fsconfig(fsfd, FSCONFIG_SET_FD, "pidns", NULL, nsfd);
           fsconfig(fsfd, FSCONFIG_CMD_CREATE, NULL, NULL, 0);
           mntfd = fsmount(fsfd, FSMOUNT_CLOEXEC, 0);
           move_mount(mntfd, "", AT_FDCWD, "/proc", MOVE_MOUNT_F_EMPTY_PATH);

I don't like it. /proc/self/ns/pid is our own namespace, which is the default anyway.
I.e. setting pidns to /proc/self/ns/pid is a no-op (assuming that the "pidns" option is implemented in our kernel, of course).
Moreover, if /proc is mounted properly, then /proc/1/ns/pid refers to our namespace, too!
Thus, *all* these fsconfig(FSCONFIG_SET_...) calls are no-ops.
Thus it is a bad example.

I suggest using, say, /proc/2/ns/pid. It has an actual chance of referring to some other namespace.

Also, the sentence '"pidns" changes the value each time' is a lie: as I explained, all these calls are no-ops,
they don't really change anything.

--
Askar Safin
https://types.pl/@safinaskar


* Re: [PATCH v3 01/30] kho: init new_physxa->phys_bits to fix lockdep
From: Pasha Tatashin @ 2025-08-08 14:00 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu
In-Reply-To: <mafs0bjoqav4j.fsf@kernel.org>

On Fri, Aug 8, 2025 at 11:52 AM Pratyush Yadav <pratyush@kernel.org> wrote:
>
> On Fri, Aug 08 2025, Pratyush Yadav wrote:
> [...]
> >> @@ -144,14 +144,35 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn,
> >>                              unsigned int order)
> >>  {
> >>      struct kho_mem_phys_bits *bits;
> >> -    struct kho_mem_phys *physxa;
> >> +    struct kho_mem_phys *physxa, *new_physxa;
> >>      const unsigned long pfn_high = pfn >> order;
> >>
> >>      might_sleep();
> >>
> >> -    physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
> >> -    if (IS_ERR(physxa))
> >> -            return PTR_ERR(physxa);
> >> +    physxa = xa_load(&track->orders, order);
> >> +    if (!physxa) {
> >> +            new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL);
> >> +            if (!new_physxa)
> >> +                    return -ENOMEM;
> >> +
> >> +            xa_init(&new_physxa->phys_bits);
> >> +            physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa,
> >> +                                GFP_KERNEL);
> >> +            if (xa_is_err(physxa)) {
> >> +                    int err = xa_err(physxa);
> >> +
> >> +                    xa_destroy(&new_physxa->phys_bits);
> >> +                    kfree(new_physxa);
> >> +
> >> +                    return err;
> >> +            }
> >> +            if (physxa) {
> >> +                    xa_destroy(&new_physxa->phys_bits);
> >> +                    kfree(new_physxa);
> >> +            } else {
> >> +                    physxa = new_physxa;
> >> +            }
> >
> > I suppose this could be simplified a bit to:
> >
> >       err = xa_err(physxa);
> >         if (err || physxa) {
> >               xa_destroy(&new_physxa->phys_bits);
> >                 kfree(new_physxa);
> >
> >               if (err)
> >                       return err;
> >       } else {
> >               physxa = new_physxa;
> >       }
>
> My email client completely messed the whitespace up so this is a bit
> unreadable. Here is what I meant:
>
>         err = xa_err(physxa);
>         if (err || physxa) {
>                 xa_destroy(&new_physxa->phys_bits);
>                 kfree(new_physxa);
>
>                 if (err)
>                         return err;
>         } else {
>                 physxa = new_physxa;
>         }
>
> [...]

Thanks Pratyush, I will make this simplification change if Andrew does
not take this patch in before the next revision.

Pasha
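
For readers following along, the control flow of the simplified version
can be modelled in userspace. This is only a sketch: slot_cmpxchg() is a
hypothetical stand-in for xa_cmpxchg() built on C11 atomics, the
inject_err field exists purely to simulate a failed store, and struct
tracker stands in for struct kho_mem_phys.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdlib.h>

struct tracker { int unused; };		/* stand-in for struct kho_mem_phys */

struct slot {
	_Atomic(struct tracker *) cur;
	int inject_err;			/* test hook: negative errno to fail the store */
};

/*
 * Hypothetical stand-in for xa_cmpxchg(): store `new` only if the slot is
 * empty. Returns the previous value (NULL if `new` was installed) and
 * reports a simulated failure through *err.
 */
static struct tracker *slot_cmpxchg(struct slot *s, struct tracker *new,
				    int *err)
{
	struct tracker *old = NULL;

	*err = s->inject_err;
	if (!*err)
		atomic_compare_exchange_strong(&s->cur, &old, new);
	return old;
}

/* Load the tracker, allocating and installing one if the slot is empty. */
static int slot_get_or_create(struct slot *s, struct tracker **out)
{
	struct tracker *cur, *new;
	int err;

	assert(s && out);
	cur = atomic_load(&s->cur);
	if (!cur) {
		new = calloc(1, sizeof(*new));
		if (!new)
			return -ENOMEM;

		cur = slot_cmpxchg(s, new, &err);
		if (err || cur) {
			/* Store failed, or another caller won the race. */
			free(new);
			if (err)
				return err;
		} else {
			cur = new;
		}
	}
	*out = cur;
	return 0;
}
```

The single "err || cur" branch frees the unused allocation in both the
error case and the lost-race case, which is the duplication the suggested
simplification removes.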


* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-08 13:53 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: David Hildenbrand, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, joel.granados, rostedt, anna.schumaker, song,
	zhangguopeng, linux, linux-kernel, linux-doc, linux-mm, gregkh,
	tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu, Hugh Dickins,
	Baolin Wang
In-Reply-To: <mafs07bzeatmf.fsf@kernel.org>

>
> And now that I think about it, I suppose patch 29 should also add
> memfd_luo.c under the SHMEM MAINTAINERS entry.

Right, let's update this in the next revision.

Thanks,
Pasha

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-08 13:52 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, joel.granados, rostedt, anna.schumaker, song,
	zhangguopeng, linux, linux-kernel, linux-doc, linux-mm, gregkh,
	tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	Hugh Dickins
In-Reply-To: <b227482a-31ec-4c92-a856-bd19f72217b7@redhat.com>

On Fri, Aug 8, 2025 at 12:07 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 07.08.25 03:44, Pasha Tatashin wrote:
> > This series introduces the LUO, a kernel subsystem designed to
> > facilitate live kernel updates with minimal downtime,
> > > particularly in cloud deployments aiming to update without fully
> > disrupting running virtual machines.
> >
> > This series builds upon KHO framework by adding programmatic
> > control over KHO's lifecycle and leveraging KHO for persisting LUO's
> > own metadata across the kexec boundary. The git branch for this series
> > can be found at:
> >
> > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> >
> > Changelog from v2:
> > - Addressed comments from Mike Rapoport and Jason Gunthorpe
> > - Only one user agent (LiveupdateD) can open /dev/liveupdate
> > - Release all preserved resources if /dev/liveupdate closes
> >    before reboot.
> > - With the above changes, sessions are not needed, and should be
> >    maintained by the user-agent itself, so removed support for
> >    sessions.
> > - Added support for changing per-FD state (i.e. some FDs can be
> > >    prepared or finished before the global transition).
> > - All IOCTLs now follow iommufd/fwctl extendable design.
> > - Replaced locks with guards
> > - Added a callback for registered subsystems to be notified
> >    during boot: ops->boot().
> > - Removed args from callbacks, instead use container_of() to
> >    carry context specific data (see luo_selftests.c for example).
> > - removed patches for luolib, they are going to be introduced in
> >    a separate repository.
> >
> > What is Live Update?
> > Live Update is a kexec based reboot process where selected kernel
> > resources (memory, file descriptors, and eventually devices) are kept
> > operational or their state preserved across a kernel transition. For
> > certain resources, DMA and interrupt activity might continue with
> > minimal interruption during the kernel reboot.
> >
> > LUO provides a framework for coordinating live updates. It features:
> > State Machine: Manages the live update process through states:
> > NORMAL, PREPARED, FROZEN, UPDATED.
> >
> > KHO Integration:
> >
> > LUO programmatically drives KHO's finalization and abort sequences.
> > > KHO's debugfs interface is now optional, configured via
> > CONFIG_KEXEC_HANDOVER_DEBUG.
> >
> > LUO preserves its own metadata via KHO's kho_add_subtree and
> > kho_preserve_phys() mechanisms.
> >
> > Subsystem Participation: A callback API liveupdate_register_subsystem()
> > allows kernel subsystems (e.g., KVM, IOMMU, VFIO, PCI) to register
> > handlers for LUO events (PREPARE, FREEZE, FINISH, CANCEL) and persist a
> > u64 payload via the LUO FDT.
> >
> > File Descriptor Preservation: Infrastructure
> > liveupdate_register_filesystem, luo_register_file, luo_retrieve_file to
> > allow specific types of file descriptors (e.g., memfd, vfio) to be
> > preserved and restored.
> >
> > Handlers for specific file types can be registered to manage their
> > preservation and restoration, storing a u64 payload in the LUO FDT.
> >
> > User-space Interface:
> >
> > ioctl (/dev/liveupdate): The primary control interface for
> > triggering LUO state transitions (prepare, freeze, finish, cancel)
> > and managing the preservation/restoration of file descriptors.
> > Access requires CAP_SYS_ADMIN.
> >
> > sysfs (/sys/kernel/liveupdate/state): A read-only interface for
> > monitoring the current LUO state. This allows userspace services to
> > track progress and coordinate actions.
> >
> > Selftests: Includes kernel-side hooks and userspace selftests to
> > verify core LUO functionality, particularly subsystem registration and
> > basic state transitions.
> >
> > LUO State Machine and Events:
> >
> > NORMAL:   Default operational state.
> > PREPARED: Initial preparation complete after LIVEUPDATE_PREPARE
> >            event. Subsystems have saved initial state.
> > FROZEN:   Final "blackout window" state after LIVEUPDATE_FREEZE
> >            event, just before kexec. Workloads must be suspended.
> > UPDATED:  Next kernel has booted via live update. Awaiting restoration
> >            and LIVEUPDATE_FINISH.
> >
> > Events:
> > LIVEUPDATE_PREPARE: Prepare for reboot, serialize state.
> > LIVEUPDATE_FREEZE:  Final opportunity to save state before kexec.
> > LIVEUPDATE_FINISH:  Post-reboot cleanup in the next kernel.
> > LIVEUPDATE_CANCEL:  Abort prepare or freeze, revert changes.
> >
> > v2: https://lore.kernel.org/all/20250723144649.1696299-1-pasha.tatashin@soleen.com
> > v1: https://lore.kernel.org/all/20250625231838.1897085-1-pasha.tatashin@soleen.com
> > RFC v2: https://lore.kernel.org/all/20250515182322.117840-1-pasha.tatashin@soleen.com
> > RFC v1: https://lore.kernel.org/all/20250320024011.2995837-1-pasha.tatashin@soleen.com
> >
> > Changyuan Lyu (1):
> >    kho: add interfaces to unpreserve folios and physical memory ranges
> >
> > Mike Rapoport (Microsoft) (1):
> >    kho: drop notifiers
> >
> > Pasha Tatashin (23):
> >    kho: init new_physxa->phys_bits to fix lockdep
> >    kho: mm: Don't allow deferred struct page with KHO
> >    kho: warn if KHO is disabled due to an error
> >    kho: allow to drive kho from within kernel
> >    kho: make debugfs interface optional
> >    kho: don't unpreserve memory during abort
> >    liveupdate: kho: move to kernel/liveupdate
> >    liveupdate: luo_core: luo_ioctl: Live Update Orchestrator
> >    liveupdate: luo_core: integrate with KHO
> >    liveupdate: luo_subsystems: add subsystem registration
> >    liveupdate: luo_subsystems: implement subsystem callbacks
> >    liveupdate: luo_files: add infrastructure for FDs
> >    liveupdate: luo_files: implement file systems callbacks
> >    liveupdate: luo_ioctl: add userpsace interface
> >    liveupdate: luo_files: luo_ioctl: Unregister all FDs on device close
> >    liveupdate: luo_files: luo_ioctl: Add ioctls for per-file state
> >      management
> >    liveupdate: luo_sysfs: add sysfs state monitoring
> >    reboot: call liveupdate_reboot() before kexec
> >    kho: move kho debugfs directory to liveupdate
> >    liveupdate: add selftests for subsystems un/registration
> >    selftests/liveupdate: add subsystem/state tests
> >    docs: add luo documentation
> >    MAINTAINERS: add liveupdate entry
> >
> > Pratyush Yadav (5):
> >    mm: shmem: use SHMEM_F_* flags instead of VM_* flags
> >    mm: shmem: allow freezing inode mapping
> >    mm: shmem: export some functions to internal.h
> >    luo: allow preserving memfd
> >    docs: add documentation for memfd preservation via LUO
>
> It's not clear from the description why these mm shmem changes are
> buried in this patch set. It's not even described above in the patch
> description.

Hi David,

Yes, I should update the cover letter to include memfd preservation work.

> I suggest sending that part out separately, so Hugh actually spots this.
> (is he even CC'ed?)

+cc hughd@google.com

While MM list is CCed, you are right, I have not specifically CCed
shmem maintainers. This will be fixed in the next revision.

Thank you,
Pasha
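
The state machine described in the quoted cover letter can be sketched as a small transition function. This is a sketch under assumptions, not the code from the series: the `sketch_*` names are hypothetical, FINISH is assumed to return the system to NORMAL, CANCEL is assumed to be legal only from PREPARED or FROZEN, and the FROZEN to UPDATED step is not modeled because it happens across the kexec into the next kernel.

```c
/* Hypothetical mirror of the four states named in the cover letter. */
enum sketch_state {
	SKETCH_NORMAL,
	SKETCH_PREPARED,
	SKETCH_FROZEN,
	SKETCH_UPDATED,
};

/* Hypothetical mirror of the four events named in the cover letter. */
enum sketch_event {
	SKETCH_EV_PREPARE,
	SKETCH_EV_FREEZE,
	SKETCH_EV_FINISH,
	SKETCH_EV_CANCEL,
};

/* Returns 0 and updates *state on a legal transition, -1 otherwise. */
static int sketch_transition(enum sketch_state *state, enum sketch_event ev)
{
	enum sketch_state next;

	switch (ev) {
	case SKETCH_EV_PREPARE:
		if (*state != SKETCH_NORMAL)
			return -1;
		next = SKETCH_PREPARED;
		break;
	case SKETCH_EV_FREEZE:
		if (*state != SKETCH_PREPARED)
			return -1;
		next = SKETCH_FROZEN;
		break;
	case SKETCH_EV_FINISH:
		/* Assumption: only meaningful in the new kernel. */
		if (*state != SKETCH_UPDATED)
			return -1;
		next = SKETCH_NORMAL;
		break;
	case SKETCH_EV_CANCEL:
		/* "Abort prepare or freeze, revert changes." */
		if (*state != SKETCH_PREPARED && *state != SKETCH_FROZEN)
			return -1;
		next = SKETCH_NORMAL;
		break;
	default:
		return -1;
	}

	*state = next;
	return 0;
}
```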

^ permalink raw reply

* Re: [PATCH v2 0/2] vfs: output mount_too_revealing() errors to fscontext
From: Christian Brauner @ 2025-08-08 13:27 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, David Howells, linux-api, linux-kernel,
	linux-fsdevel, Alexander Viro, Jan Kara
In-Reply-To: <20250806-errorfc-mount-too-revealing-v2-0-534b9b4d45bb@cyphar.com>

On Wed, 06 Aug 2025 16:07:04 +1000, Aleksa Sarai wrote:
> It makes little sense for fsmount() to output the warning message when
> mount_too_revealing() is violated to kmsg. Instead, the warning should
> be output (with a "VFS" prefix) to the fscontext log. In addition,
> include the same log message for mount_too_revealing() when doing a
> regular mount for consistency.
> 
> With the newest fsopen()-based mount(8) from util-linux, the error
> messages now look like
> 
> [...]

Nice, thank you!

---

Applied to the vfs-6.18.mount branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.mount branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.mount

[1/2] fscontext: add custom-prefix log helpers
      https://git.kernel.org/vfs/vfs/c/49e998eb0154
[2/2] vfs: output mount_too_revealing() errors to fscontext
      https://git.kernel.org/vfs/vfs/c/3441e1534e67

^ permalink raw reply

* Re: [PATCH v2 08/11] open_tree.2: document 'new' mount api
From: Aleksa Sarai @ 2025-08-08 13:26 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <19889ab0576.e4d2f37341528.6111844101094013469@zohomail.com>

On 2025-08-08, Askar Safin <safinaskar@zohomail.com> wrote:
> In "man open_tree":
> 
> > As with "*at()" system calls, fspick() uses the dirfd argument in conjunction
> 
> You meant "open_tree"
> 
> > If flags does not contain OPEN_TREE_CLONE, open_tree() returns
> > a file descriptor that is exactly equivalent to one produced by open(2).
> 
> Please, change "by open(2)" to "by openat(2) with O_PATH" (and other similar places).

I think the more common pattern in man-pages is to prefer to refer to
open(2) unless you are explicitly talking about openat(2) features (like
passing a dirfd). If it's just "a file descriptor with O_PATH" then most
man-pages I've seen reference open(2) even if they were written
post-openat(2).

Though in this case, since we are talking about open_tree(2) as an open
operation that takes a dirfd, you're right that openat(2) might be
better.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


^ permalink raw reply

* Re: [PATCH v4 0/2] man/man2/mremap.2: describe multiple mapping move, shrink
From: Lorenzo Stoakes @ 2025-08-08 13:15 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <cover.1754414738.git.lorenzo.stoakes@oracle.com>

On Tue, Aug 05, 2025 at 06:31:54PM +0100, Lorenzo Stoakes wrote:
> We have added new functionality to mremap() in Linux 6.17, permitting the
> move of multiple VMAs when performing a move alone (that is - providing
> MREMAP_MAYMOVE | MREMAP_FIXED flags and specifying old_size == new_size).
>
> We document this new feature.
>
> Additionally, we document previously undocumented behaviour around
> shrinking of input VMA ranges which permits the input range to span
> multiple VMAs.
>
> v4:
> * Update description of newly discovered mremap() behaviour to highlight the
>   fact that, if in-place, [old_address, old_address + new_length) may span
>   multiple VMAs also.
> * Fix up commit message for 2/2 to correct typo on specified range.
> * Added code sample to 1/2 as per Alejandro.
>
> v3:
> * Use more precise language around mremap() move description as per Jann.
> * Fix some typos in commit messages.
> https://lore.kernel.org/all/cover.1753795807.git.lorenzo.stoakes@oracle.com/
>
> v2:
> * Split out the two man page changes as requested by Alejandro.
> https://lore.kernel.org/all/cover.1753711160.git.lorenzo.stoakes@oracle.com/
>
> v1:
> https://lore.kernel.org/all/20250723174634.75054-1-lorenzo.stoakes@oracle.com/
>
> Lorenzo Stoakes (2):
>   man/man2/mremap.2: describe multiple mapping move
>   man/man2/mremap.2: describe previously undocumented shrink behaviour
>
>  man/man2/mremap.2 | 111 +++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 100 insertions(+), 11 deletions(-)
>
> --
> 2.50.1

Hey Alejandro,

Just wondering if this has everything you need, let me know if there's
anything I need to do here!

Cheers, Lorenzo
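
The "move alone" call form described in the quoted cover letter (MREMAP_MAYMOVE | MREMAP_FIXED with old_size == new_size) can be sketched as below. This is not the code sample from patch 1/2: `move_mapping()` is a hypothetical helper, and the sketch moves a single anonymous mapping, which works on kernels older than 6.17. Per the cover letter, a source range spanning several VMAs would need Linux 6.17 or later.

```c
#define _GNU_SOURCE	/* mremap() and MREMAP_* need _GNU_SOURCE in glibc */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Move the mapping at [src, src + len) to a freshly reserved address.
 * Returns the new address, or MAP_FAILED on error.
 */
static void *move_mapping(void *src, size_t len)
{
	void *dst, *moved;

	/* Reserve a destination range; MREMAP_FIXED will replace it. */
	dst = mmap(NULL, len, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED)
		return MAP_FAILED;

	/* old_size == new_size: a pure move, no resize. */
	moved = mremap(src, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, dst);
	if (moved == MAP_FAILED)
		munmap(dst, len);

	return moved;
}
```

After a successful call the contents are reachable at the returned address and the old range is unmapped.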

^ permalink raw reply

* Re: [PATCH v2 00/11] man2: add man pages for 'new' mount API
From: Christian Brauner @ 2025-08-08 12:53 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	Askar Safin, G. Branden Robinson, linux-man, linux-api,
	linux-fsdevel, linux-kernel, David Howells
In-Reply-To: <20250807-new-mount-api-v2-0-558a27b8068c@cyphar.com>

On Thu, Aug 07, 2025 at 03:44:34AM +1000, Aleksa Sarai wrote:
> Back in 2019, the new mount API was merged into mainline[1]. David Howells
> then set about writing man pages for these new APIs, and sent some
> patches back in 2020[2]. Unfortunately, these patches were never merged,
> which meant that these APIs were practically undocumented for many
> years -- arguably this may have been a contributing factor to the
> relatively slow adoption of these new (far better) APIs. I have often
> discovered that many folks are unaware of the read(2)-based message
> retrieval interface provided by filesystem context file descriptors.
> 
> In 2024, Christian Brauner set aside some time to provide some
> > documentation of these new APIs and so adapted David Howells's original
> man pages into the easier-to-edit Markdown format and published them on
> GitHub[3]. These have been maintained since, including updated
> information on new features added since David Howells's 2020 draft pages
> (such as MOVE_MOUNT_BENEATH).
> 
> While this was a welcome improvement to the previous status quo (that
> had lasted over 6 years), speaking personally my experience is that not
> having access to these man pages from the terminal has been a fairly
> common painpoint.
> 
> So, this is a modern version of the man pages for these APIs, in the hopes
> that we can finally (7 years later) get proper documentation for these
> APIs in the man-pages project.
> 
> One important thing to note is that most of these were re-written by me,
> > with very minimal copying from the versions available from Christian[3].
> The reasons for this are two-fold:
> 
>  * Both Howells's original version and Christian's maintained versions
>    contain crucial mistakes that I have been bitten by in the past (the

"Lies, damned lies, and statistics."

>    most obvious being that all of these APIs were merged in Linux 5.2,
>    but the man pages all claim they were merged in different versions.)
> 
>  * As the man pages appear to have been written from Howells's
>    perspective while implementing them, some of the wording is a little
>    too tied to the implementation (or appears to describe features that
>    don't really exist in the merged versions of these APIs).
> 
> I decided that the best way to resolve these issues is to rewrite them
> from the perspective of an actual user of these APIs (me), and check
> that we do not repeat the mistakes I found in the originals.
> 
> I have also done my best to resolve the issues raised by Michael Kerrisk
> > on the original patchset sent by Howells[2].
> 
> In addition, I have also included a man page for open_tree_attr(2) (as a
> subsection of the new open_tree(2) man page), which was merged in Linux
> 6.15.
> 
> [1]: https://lore.kernel.org/all/20190507204921.GL23075@ZenIV.linux.org.uk/
> [2]: https://lore.kernel.org/linux-man/159680892602.29015.6551860260436544999.stgit@warthog.procyon.org.uk/
> [3]: https://github.com/brauner/man-pages-md
> 
> Co-developed-by: David Howells <dhowells@redhat.com>
> Co-developed-by: Christian Brauner <brauner@kernel.org>
> Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> ---

Thanks for doing this! Just a point of order. If you add CdB you also
need to add SoB for all of them.

^ permalink raw reply

* Re: [PATCH v2 08/11] open_tree.2: document 'new' mount api
From: Askar Safin @ 2025-08-08 12:32 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <20250807-new-mount-api-v2-8-558a27b8068c@cyphar.com>

In "man open_tree":

> As with "*at()" system calls, fspick() uses the dirfd argument in conjunction

You meant "open_tree"

> If flags does not contain OPEN_TREE_CLONE, open_tree() returns
> a file descriptor that is exactly equivalent to one produced by open(2).

Please, change "by open(2)" to "by openat(2) with O_PATH" (and other similar places).

--
Askar Safin
https://types.pl/@safinaskar


^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pratyush Yadav @ 2025-08-08 12:24 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, joel.granados, rostedt,
	anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu,
	Hugh Dickins, Baolin Wang
In-Reply-To: <b227482a-31ec-4c92-a856-bd19f72217b7@redhat.com>

On Fri, Aug 08 2025, David Hildenbrand wrote:

> On 07.08.25 03:44, Pasha Tatashin wrote:
>> This series introduces the LUO, a kernel subsystem designed to
>> facilitate live kernel updates with minimal downtime,
>> particularly in cloud deployments aiming to update without fully
>> disrupting running virtual machines.
>> This series builds upon KHO framework by adding programmatic
>> control over KHO's lifecycle and leveraging KHO for persisting LUO's
>> own metadata across the kexec boundary. The git branch for this series
>> can be found at:
>> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
>> Changelog from v2:
>> - Addressed comments from Mike Rapoport and Jason Gunthorpe
>> - Only one user agent (LiveupdateD) can open /dev/liveupdate
>> - Release all preserved resources if /dev/liveupdate closes
>>    before reboot.
>> - With the above changes, sessions are not needed, and should be
>>    maintained by the user-agent itself, so removed support for
>>    sessions.
>> - Added support for changing per-FD state (i.e. some FDs can be
>>    prepared or finished before the global transition).
>> - All IOCTLs now follow iommufd/fwctl extendable design.
>> - Replaced locks with guards
>> - Added a callback for registered subsystems to be notified
>>    during boot: ops->boot().
>> - Removed args from callbacks, instead use container_of() to
>>    carry context specific data (see luo_selftests.c for example).
>> - removed patches for luolib, they are going to be introduced in
>>    a separate repository.
>> What is Live Update?
>> Live Update is a kexec based reboot process where selected kernel
>> resources (memory, file descriptors, and eventually devices) are kept
>> operational or their state preserved across a kernel transition. For
>> certain resources, DMA and interrupt activity might continue with
>> minimal interruption during the kernel reboot.
>> LUO provides a framework for coordinating live updates. It features:
>> State Machine: Manages the live update process through states:
>> NORMAL, PREPARED, FROZEN, UPDATED.
>> KHO Integration:
>> LUO programmatically drives KHO's finalization and abort sequences.
>> KHO's debugfs interface is now optional, configured via
>> CONFIG_KEXEC_HANDOVER_DEBUG.
>> LUO preserves its own metadata via KHO's kho_add_subtree and
>> kho_preserve_phys() mechanisms.
>> Subsystem Participation: A callback API liveupdate_register_subsystem()
>> allows kernel subsystems (e.g., KVM, IOMMU, VFIO, PCI) to register
>> handlers for LUO events (PREPARE, FREEZE, FINISH, CANCEL) and persist a
>> u64 payload via the LUO FDT.
>> File Descriptor Preservation: Infrastructure
>> liveupdate_register_filesystem, luo_register_file, luo_retrieve_file to
>> allow specific types of file descriptors (e.g., memfd, vfio) to be
>> preserved and restored.
>> Handlers for specific file types can be registered to manage their
>> preservation and restoration, storing a u64 payload in the LUO FDT.
>> User-space Interface:
>> ioctl (/dev/liveupdate): The primary control interface for
>> triggering LUO state transitions (prepare, freeze, finish, cancel)
>> and managing the preservation/restoration of file descriptors.
>> Access requires CAP_SYS_ADMIN.
>> sysfs (/sys/kernel/liveupdate/state): A read-only interface for
>> monitoring the current LUO state. This allows userspace services to
>> track progress and coordinate actions.
>> Selftests: Includes kernel-side hooks and userspace selftests to
>> verify core LUO functionality, particularly subsystem registration and
>> basic state transitions.
>> LUO State Machine and Events:
>> NORMAL:   Default operational state.
>> PREPARED: Initial preparation complete after LIVEUPDATE_PREPARE
>>            event. Subsystems have saved initial state.
>> FROZEN:   Final "blackout window" state after LIVEUPDATE_FREEZE
>>            event, just before kexec. Workloads must be suspended.
>> UPDATED:  Next kernel has booted via live update. Awaiting restoration
>>            and LIVEUPDATE_FINISH.
>> Events:
>> LIVEUPDATE_PREPARE: Prepare for reboot, serialize state.
>> LIVEUPDATE_FREEZE:  Final opportunity to save state before kexec.
>> LIVEUPDATE_FINISH:  Post-reboot cleanup in the next kernel.
>> LIVEUPDATE_CANCEL:  Abort prepare or freeze, revert changes.
>> v2:
>> https://lore.kernel.org/all/20250723144649.1696299-1-pasha.tatashin@soleen.com
>> v1: https://lore.kernel.org/all/20250625231838.1897085-1-pasha.tatashin@soleen.com
>> RFC v2: https://lore.kernel.org/all/20250515182322.117840-1-pasha.tatashin@soleen.com
>> RFC v1: https://lore.kernel.org/all/20250320024011.2995837-1-pasha.tatashin@soleen.com
>> Changyuan Lyu (1):
>>    kho: add interfaces to unpreserve folios and physical memory ranges
>> Mike Rapoport (Microsoft) (1):
>>    kho: drop notifiers
>> Pasha Tatashin (23):
>>    kho: init new_physxa->phys_bits to fix lockdep
>>    kho: mm: Don't allow deferred struct page with KHO
>>    kho: warn if KHO is disabled due to an error
>>    kho: allow to drive kho from within kernel
>>    kho: make debugfs interface optional
>>    kho: don't unpreserve memory during abort
>>    liveupdate: kho: move to kernel/liveupdate
>>    liveupdate: luo_core: luo_ioctl: Live Update Orchestrator
>>    liveupdate: luo_core: integrate with KHO
>>    liveupdate: luo_subsystems: add subsystem registration
>>    liveupdate: luo_subsystems: implement subsystem callbacks
>>    liveupdate: luo_files: add infrastructure for FDs
>>    liveupdate: luo_files: implement file systems callbacks
>>    liveupdate: luo_ioctl: add userpsace interface
>>    liveupdate: luo_files: luo_ioctl: Unregister all FDs on device close
>>    liveupdate: luo_files: luo_ioctl: Add ioctls for per-file state
>>      management
>>    liveupdate: luo_sysfs: add sysfs state monitoring
>>    reboot: call liveupdate_reboot() before kexec
>>    kho: move kho debugfs directory to liveupdate
>>    liveupdate: add selftests for subsystems un/registration
>>    selftests/liveupdate: add subsystem/state tests
>>    docs: add luo documentation
>>    MAINTAINERS: add liveupdate entry
>> Pratyush Yadav (5):
>>    mm: shmem: use SHMEM_F_* flags instead of VM_* flags
>>    mm: shmem: allow freezing inode mapping
>>    mm: shmem: export some functions to internal.h
>>    luo: allow preserving memfd
>>    docs: add documentation for memfd preservation via LUO
>
> It's not clear from the description why these mm shmem changes are buried in
> this patch set. It's not even described above in the patch description.

Patches 26-30 describe the shmem changes in more detail, but you're
right, it should be mentioned in the cover as well.

The idea is, LUO is used to preserve kernel resources across kexec. One
of the most fundamental resources the kernel has is memory. Since LUO
does preservation based on file descriptors, memfd is the way to attach
a FD to memory. So we went with memfd as the first user of LUO. memfd
can be backed by shmem or hugetlb, but currently only shmem is
supported. We do plan to support hugetlb as well in the future.

The idea is to keep the serialization/live update logic out of the way
of the main subsystem. So we decided to keep the logic out in a separate
file.
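
To make "memfd is the way to attach a FD to memory" concrete, here is a minimal userspace sketch that creates a shmem-backed memfd and fills it. `make_memfd()` is a hypothetical helper; the LUO ioctls that would register the descriptor for preservation are deliberately not shown, since their exact interface is defined by the series and not reproduced here.

```c
#define _GNU_SOURCE	/* memfd_create() needs _GNU_SOURCE in glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create a memfd (shmem-backed, since MFD_HUGETLB is not set) and copy
 * 'len' bytes from 'data' into it. Returns the fd, or -1 on error.
 */
static int make_memfd(const char *name, const void *data, size_t len)
{
	int fd = memfd_create(name, MFD_CLOEXEC);

	if (fd < 0)
		return -1;

	if (write(fd, data, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	return fd;
}
```

The returned descriptor is the kind of object a user agent would hand over for preservation, and it can be read back or mapped with mmap(2) like any other file.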

>
> I suggest sending that part out separately, so Hugh actually spots this.
> (is he even CC'ed?)

Hmm, none of the shmem maintainers are included. I wonder why. The
patches do touch shmem.c and shmem_fs.h so the MAINTAINERS entry for
"TMPFS (SHMEM FILESYSTEM)" should have been hit. My guess is that the
shmem changes weren't part of the original RFC so perhaps Pasha forgot
to update the To/Cc list since then?

Either way, I've added Hugh and Baolin to this email. Hugh, Baolin, you
can find the shmem related patches at [0][1][2][3][4].

Pasha, can you please add them for later versions as well?

And now that I think about it, I suppose patch 29 should also add
memfd_luo.c under the SHMEM MAINTAINERS entry.

[0] https://lore.kernel.org/lkml/20250807014442.3829950-27-pasha.tatashin@soleen.com/
[1] https://lore.kernel.org/lkml/20250807014442.3829950-28-pasha.tatashin@soleen.com/
[2] https://lore.kernel.org/lkml/20250807014442.3829950-29-pasha.tatashin@soleen.com/
[3] https://lore.kernel.org/lkml/20250807014442.3829950-30-pasha.tatashin@soleen.com/
[4] https://lore.kernel.org/lkml/20250807014442.3829950-31-pasha.tatashin@soleen.com/

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: David Hildenbrand @ 2025-08-08 12:07 UTC (permalink / raw)
  To: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, joel.granados, rostedt,
	anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-1-pasha.tatashin@soleen.com>

On 07.08.25 03:44, Pasha Tatashin wrote:
> This series introduces the LUO, a kernel subsystem designed to
> facilitate live kernel updates with minimal downtime,
> particularly in cloud deployments aiming to update without fully
> disrupting running virtual machines.
> 
> This series builds upon KHO framework by adding programmatic
> control over KHO's lifecycle and leveraging KHO for persisting LUO's
> own metadata across the kexec boundary. The git branch for this series
> can be found at:
> 
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> 
> Changelog from v2:
> - Addressed comments from Mike Rapoport and Jason Gunthorpe
> - Only one user agent (LiveupdateD) can open /dev/liveupdate
> - Release all preserved resources if /dev/liveupdate closes
>    before reboot.
> - With the above changes, sessions are not needed, and should be
>    maintained by the user-agent itself, so removed support for
>    sessions.
> - Added support for changing per-FD state (i.e. some FDs can be
>    prepared or finished before the global transition).
> - All IOCTLs now follow iommufd/fwctl extendable design.
> - Replaced locks with guards
> - Added a callback for registered subsystems to be notified
>    during boot: ops->boot().
> - Removed args from callbacks, instead use container_of() to
>    carry context specific data (see luo_selftests.c for example).
> - removed patches for luolib, they are going to be introduced in
>    a separate repository.
> 
> What is Live Update?
> Live Update is a kexec based reboot process where selected kernel
> resources (memory, file descriptors, and eventually devices) are kept
> operational or their state preserved across a kernel transition. For
> certain resources, DMA and interrupt activity might continue with
> minimal interruption during the kernel reboot.
> 
> LUO provides a framework for coordinating live updates. It features:
> State Machine: Manages the live update process through states:
> NORMAL, PREPARED, FROZEN, UPDATED.
> 
> KHO Integration:
> 
> LUO programmatically drives KHO's finalization and abort sequences.
> KHO's debugfs interface is now optional, configured via
> CONFIG_KEXEC_HANDOVER_DEBUG.
> 
> LUO preserves its own metadata via KHO's kho_add_subtree and
> kho_preserve_phys() mechanisms.
> 
> Subsystem Participation: A callback API liveupdate_register_subsystem()
> allows kernel subsystems (e.g., KVM, IOMMU, VFIO, PCI) to register
> handlers for LUO events (PREPARE, FREEZE, FINISH, CANCEL) and persist a
> u64 payload via the LUO FDT.
> 
> File Descriptor Preservation: Infrastructure
> (liveupdate_register_filesystem, luo_register_file, luo_retrieve_file)
> to allow specific types of file descriptors (e.g., memfd, vfio) to be
> preserved and restored.
> 
> Handlers for specific file types can be registered to manage their
> preservation and restoration, storing a u64 payload in the LUO FDT.
> 
> User-space Interface:
> 
> ioctl (/dev/liveupdate): The primary control interface for
> triggering LUO state transitions (prepare, freeze, finish, cancel)
> and managing the preservation/restoration of file descriptors.
> Access requires CAP_SYS_ADMIN.
> 
> sysfs (/sys/kernel/liveupdate/state): A read-only interface for
> monitoring the current LUO state. This allows userspace services to
> track progress and coordinate actions.
> 
> Selftests: Includes kernel-side hooks and userspace selftests to
> verify core LUO functionality, particularly subsystem registration and
> basic state transitions.
> 
> LUO State Machine and Events:
> 
> NORMAL:   Default operational state.
> PREPARED: Initial preparation complete after LIVEUPDATE_PREPARE
>            event. Subsystems have saved initial state.
> FROZEN:   Final "blackout window" state after LIVEUPDATE_FREEZE
>            event, just before kexec. Workloads must be suspended.
> UPDATED:  Next kernel has booted via live update. Awaiting restoration
>            and LIVEUPDATE_FINISH.
> 
> Events:
> LIVEUPDATE_PREPARE: Prepare for reboot, serialize state.
> LIVEUPDATE_FREEZE:  Final opportunity to save state before kexec.
> LIVEUPDATE_FINISH:  Post-reboot cleanup in the next kernel.
> LIVEUPDATE_CANCEL:  Abort prepare or freeze, revert changes.
> 
> v2: https://lore.kernel.org/all/20250723144649.1696299-1-pasha.tatashin@soleen.com
> v1: https://lore.kernel.org/all/20250625231838.1897085-1-pasha.tatashin@soleen.com
> RFC v2: https://lore.kernel.org/all/20250515182322.117840-1-pasha.tatashin@soleen.com
> RFC v1: https://lore.kernel.org/all/20250320024011.2995837-1-pasha.tatashin@soleen.com
> 
> Changyuan Lyu (1):
>    kho: add interfaces to unpreserve folios and physical memory ranges
> 
> Mike Rapoport (Microsoft) (1):
>    kho: drop notifiers
> 
> Pasha Tatashin (23):
>    kho: init new_physxa->phys_bits to fix lockdep
>    kho: mm: Don't allow deferred struct page with KHO
>    kho: warn if KHO is disabled due to an error
>    kho: allow to drive kho from within kernel
>    kho: make debugfs interface optional
>    kho: don't unpreserve memory during abort
>    liveupdate: kho: move to kernel/liveupdate
>    liveupdate: luo_core: luo_ioctl: Live Update Orchestrator
>    liveupdate: luo_core: integrate with KHO
>    liveupdate: luo_subsystems: add subsystem registration
>    liveupdate: luo_subsystems: implement subsystem callbacks
>    liveupdate: luo_files: add infrastructure for FDs
>    liveupdate: luo_files: implement file systems callbacks
>    liveupdate: luo_ioctl: add userpsace interface
>    liveupdate: luo_files: luo_ioctl: Unregister all FDs on device close
>    liveupdate: luo_files: luo_ioctl: Add ioctls for per-file state
>      management
>    liveupdate: luo_sysfs: add sysfs state monitoring
>    reboot: call liveupdate_reboot() before kexec
>    kho: move kho debugfs directory to liveupdate
>    liveupdate: add selftests for subsystems un/registration
>    selftests/liveupdate: add subsystem/state tests
>    docs: add luo documentation
>    MAINTAINERS: add liveupdate entry
> 
> Pratyush Yadav (5):
>    mm: shmem: use SHMEM_F_* flags instead of VM_* flags
>    mm: shmem: allow freezing inode mapping
>    mm: shmem: export some functions to internal.h
>    luo: allow preserving memfd
>    docs: add documentation for memfd preservation via LUO

It's not clear from the description why these mm shmem changes are 
buried in this patch set. It's not even described above in the patch 
description.

I suggest sending that part out separately, so Hugh actually spots this.
(is he even CC'ed?)
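
For readers following the quoted cover letter, the LUO state machine it describes can be sketched roughly as below. This is only an illustration: the sketch_* names are hypothetical and not taken from the series, and the transition table (in particular that CANCEL returns to NORMAL from both PREPARED and FROZEN, and that FROZEN to UPDATED happens through kexec rather than through an event) is an assumption based on the state and event descriptions.

```c
#include <errno.h>

/* States and events as named in the cover letter; identifiers are hypothetical. */
enum sketch_state { SKETCH_NORMAL, SKETCH_PREPARED, SKETCH_FROZEN, SKETCH_UPDATED };
enum sketch_event { SKETCH_PREPARE, SKETCH_FREEZE, SKETCH_FINISH, SKETCH_CANCEL };

/*
 * Compute the state that follows 'cur' when 'ev' is delivered.
 * Returns 0 and stores the result in *next, or -EINVAL if the event
 * is not valid in the current state. The table is an assumption.
 */
int sketch_next_state(enum sketch_state cur, enum sketch_event ev,
		      enum sketch_state *next)
{
	switch (ev) {
	case SKETCH_PREPARE:
		if (cur != SKETCH_NORMAL)
			return -EINVAL;
		*next = SKETCH_PREPARED;
		return 0;
	case SKETCH_FREEZE:
		if (cur != SKETCH_PREPARED)
			return -EINVAL;
		*next = SKETCH_FROZEN;
		return 0;
	case SKETCH_FINISH:
		/* Only meaningful in the next kernel, after the kexec. */
		if (cur != SKETCH_UPDATED)
			return -EINVAL;
		*next = SKETCH_NORMAL;
		return 0;
	case SKETCH_CANCEL:
		/* "Abort prepare or freeze, revert changes." */
		if (cur != SKETCH_PREPARED && cur != SKETCH_FROZEN)
			return -EINVAL;
		*next = SKETCH_NORMAL;
		return 0;
	}
	return -EINVAL;
}
```

The transition from FROZEN to UPDATED is deliberately absent from the function, since per the description it is the result of booting the next kernel and not of an event.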

-- 
Cheers,

David / dhildenb


^ permalink raw reply

* Re: [PATCH v2 03/11] fsopen.2: document 'new' mount api
From: Aleksa Sarai @ 2025-08-08 11:57 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <19888ef84eb.11525d76e40004.7721042298577985399@zohomail.com>

On 2025-08-08, Askar Safin <safinaskar@zohomail.com> wrote:
>  > If there are no messages in the message queue,
>  > read(2) will return no data and errno will be set to ENODATA.
>  > If the buf argument to read(2) is not large enough to contain the message,
>  > read(2) will return no data and errno will be set to EMSGSIZE.
> 
> read(2) will return -1 in these cases? If yes, then, please, write this.

Yes (well, the syscall returns -EMSGSIZE). I'll try to add a note
without making the paragraph too wordy...
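
From the caller's side, the behaviour discussed above (read(2) returning -1 with errno set to ENODATA or EMSGSIZE) could be handled along these lines. This is a minimal sketch: classify_log_read() and read_log_message() are hypothetical helpers, and fsfd is assumed to be a file descriptor obtained from fsopen(2).

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical classification of one read(2) on an fsopen(2) fd. */
enum log_status {
	LOG_MSG,	/* a message was read */
	LOG_EMPTY,	/* ENODATA: the message queue is empty */
	LOG_TOO_SMALL,	/* EMSGSIZE: buf cannot hold the next message */
	LOG_ERROR,	/* any other failure */
};

/* Map a read(2) return value and the errno it left behind to a status. */
enum log_status classify_log_read(ssize_t ret, int err)
{
	if (ret >= 0)
		return LOG_MSG;
	if (err == ENODATA)
		return LOG_EMPTY;
	if (err == EMSGSIZE)
		return LOG_TOO_SMALL;
	return LOG_ERROR;
}

/*
 * Read one message from fsfd (assumed to come from fsopen(2)).
 * On LOG_MSG, *len holds the number of bytes stored in buf.
 */
enum log_status read_log_message(int fsfd, char *buf, size_t size,
				 ssize_t *len)
{
	ssize_t ret = read(fsfd, buf, size);
	int err = (ret < 0) ? errno : 0;	/* errno is only meaningful on failure */

	*len = ret;
	return classify_log_read(ret, err);
}
```

A caller would typically loop on read_log_message() until it returns LOG_EMPTY, and retry with a larger buffer on LOG_TOO_SMALL.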

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


^ permalink raw reply

* Re: [PATCH v2 01/11] mount_setattr.2: document glibc >= 2.36 syscall wrappers
From: Aleksa Sarai @ 2025-08-08 11:55 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Michael T. Kerrisk, Alexander Viro, Jan Kara,
	G. Branden Robinson, linux-man, linux-api, linux-fsdevel,
	linux-kernel, David Howells, Christian Brauner
In-Reply-To: <19888fe1066.fcb132d640137.7051727418921685299@zohomail.com>

On 2025-08-08, Askar Safin <safinaskar@zohomail.com> wrote:
> When I render "mount_setattr" from this (v2) patchset, I see a weird quote mark, i.e.:
> 
> $ MANWIDTH=10000 man /path/to/mount_setattr.2
> ...
> SYNOPSIS
>        #include <fcntl.h>       /* Definition of AT_* constants */
>        #include <sys/mount.h>
> 
>        int mount_setattr(int dirfd, const char *path, unsigned int flags,
>                          struct mount_attr *attr, size_t size);"
> ...

Ah, my bad. "make -R lint-man" told me to put end quotes on the synopsis
lines, but I missed that there was a separate quote missing. This should
fix it:

diff --git a/man/man2/mount_setattr.2 b/man/man2/mount_setattr.2
index d44fafc93a20..46fcba927dd8 100644
--- a/man/man2/mount_setattr.2
+++ b/man/man2/mount_setattr.2
@@ -14,7 +14,7 @@ .SH SYNOPSIS
 .B #include <sys/mount.h>
 .P
 .BI "int mount_setattr(int " dirfd ", const char *" path ", unsigned int " flags ","
-.BI "                  struct mount_attr *" attr ", size_t " size );"
+.BI "                  struct mount_attr *" attr ", size_t " size ");"
 .fi
 .SH DESCRIPTION
 The


-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/


^ permalink raw reply related

* Re: [PATCH v3 01/30] kho: init new_physxa->phys_bits to fix lockdep
From: Pratyush Yadav @ 2025-08-08 11:52 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Pasha Tatashin, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <mafs0o6sqavkx.fsf@kernel.org>

On Fri, Aug 08 2025, Pratyush Yadav wrote:
[...]
>> @@ -144,14 +144,35 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn,
>>  				unsigned int order)
>>  {
>>  	struct kho_mem_phys_bits *bits;
>> -	struct kho_mem_phys *physxa;
>> +	struct kho_mem_phys *physxa, *new_physxa;
>>  	const unsigned long pfn_high = pfn >> order;
>>  
>>  	might_sleep();
>>  
>> -	physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
>> -	if (IS_ERR(physxa))
>> -		return PTR_ERR(physxa);
>> +	physxa = xa_load(&track->orders, order);
>> +	if (!physxa) {
>> +		new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL);
>> +		if (!new_physxa)
>> +			return -ENOMEM;
>> +
>> +		xa_init(&new_physxa->phys_bits);
>> +		physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa,
>> +				    GFP_KERNEL);
>> +		if (xa_is_err(physxa)) {
>> +			int err = xa_err(physxa);
>> +
>> +			xa_destroy(&new_physxa->phys_bits);
>> +			kfree(new_physxa);
>> +
>> +			return err;
>> +		}
>> +		if (physxa) {
>> +			xa_destroy(&new_physxa->phys_bits);
>> +			kfree(new_physxa);
>> +		} else {
>> +			physxa = new_physxa;
>> +		}
>
> I suppose this could be simplified a bit to:
>
> 	err = xa_err(physxa);
>         if (err || physxa) {
>         	xa_destroy(&new_physxa->phys_bits);
>                 kfree(new_physxa);
>
> 		if (err)
>                 	return err;
> 	} else {
>         	physxa = new_physxa;
> 	}

My email client completely messed the whitespace up so this is a bit
unreadable. Here is what I meant:

	err = xa_err(physxa);
	if (err || physxa) {
		xa_destroy(&new_physxa->phys_bits);
		kfree(new_physxa);

		if (err)
			return err;
	} else {
		physxa = new_physxa;
	}
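
For anyone who wants to try the shape of this outside the kernel, here is a rough userspace analogue of the same "load, else allocate and publish with compare-and-exchange" pattern, using C11 atomics in place of the XArray. It is only a sketch: struct tracker, its initialized field (standing in for xa_init()), and load_or_alloc() are hypothetical, and a single atomic pointer slot replaces the track->orders entry.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct kho_mem_phys. */
struct tracker {
	int initialized;	/* set where the kernel code would call xa_init() */
};

/*
 * Return the object stored in *slot, allocating and publishing a new
 * one if the slot is empty. Returns NULL only on allocation failure.
 */
struct tracker *load_or_alloc(_Atomic(struct tracker *) *slot)
{
	struct tracker *cur = atomic_load(slot);
	struct tracker *new_obj;

	if (cur)
		return cur;

	new_obj = calloc(1, sizeof(*new_obj));
	if (!new_obj)
		return NULL;

	/* Fully initialize the object before it becomes visible to others. */
	new_obj->initialized = 1;

	cur = NULL;
	if (atomic_compare_exchange_strong(slot, &cur, new_obj))
		return new_obj;

	/* Lost the race: 'cur' now holds the winner, so drop our copy. */
	free(new_obj);
	return cur;
}
```

Unlike xa_cmpxchg(), the atomic exchange itself cannot fail with an error code, so the err branch of the kernel version has no counterpart here; only the "someone else installed an entry first" cleanup path carries over.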

[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 03/30] kho: warn if KHO is disabled due to an error
From: Pratyush Yadav @ 2025-08-08 11:48 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-4-pasha.tatashin@soleen.com>

On Thu, Aug 07 2025, Pasha Tatashin wrote:

> During boot scratch area is allocated based on command line
> parameters or auto calculated. However, scratch area may fail
> to allocate, and in that case KHO is disabled. Currently,
> no warning is printed that KHO is disabled, which makes it
> confusing for the end user to figure out why KHO is not
> available. Add the missing warning message.
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Acked-by: Pratyush Yadav <pratyush@kernel.org>

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 02/30] kho: mm: Don't allow deferred struct page with KHO
From: Pratyush Yadav @ 2025-08-08 11:47 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-3-pasha.tatashin@soleen.com>

On Thu, Aug 07 2025, Pasha Tatashin wrote:

> KHO uses struct pages for the preserved memory early in boot, however,
> with deferred struct page initialization, only a small portion of
> memory has properly initialized struct pages.
>
> This problem was detected where vmemmap is poisoned, and illegal flag
> combinations are detected.
>
> Don't allow them to be enabled together, and later we will have to
> teach KHO to work properly with deferred struct page init kernel
> feature.
>
> Fixes: 990a950fe8fd ("kexec: add config option for KHO")
>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>

Nit: Drop the blank line before fixes. git interpret-trailers doesn't
seem to recognize the fixes otherwise, so this may break some tooling.
Try it yourself:

    $ git interpret-trailers --parse commit_message.txt

Other than this,

Acked-by: Pratyush Yadav <pratyush@kernel.org>

> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  kernel/Kconfig.kexec | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
> index 2ee603a98813..1224dd937df0 100644
> --- a/kernel/Kconfig.kexec
> +++ b/kernel/Kconfig.kexec
> @@ -97,6 +97,7 @@ config KEXEC_JUMP
>  config KEXEC_HANDOVER
>  	bool "kexec handover"
>  	depends on ARCH_SUPPORTS_KEXEC_HANDOVER && ARCH_SUPPORTS_KEXEC_FILE
> +	depends on !DEFERRED_STRUCT_PAGE_INIT
>  	select MEMBLOCK_KHO_SCRATCH
>  	select KEXEC_FILE
>  	select DEBUG_FS

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 01/30] kho: init new_physxa->phys_bits to fix lockdep
From: Pratyush Yadav @ 2025-08-08 11:42 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-2-pasha.tatashin@soleen.com>

Hi Pasha,

On Thu, Aug 07 2025, Pasha Tatashin wrote:

> Lockdep shows the following warning:
>
> INFO: trying to register non-static key.
> The code is fine but needs lockdep annotation, or maybe
> you didn't initialize this object before use?
> turning off the locking correctness validator.
>
> [<ffffffff810133a6>] dump_stack_lvl+0x66/0xa0
> [<ffffffff8136012c>] assign_lock_key+0x10c/0x120
> [<ffffffff81358bb4>] register_lock_class+0xf4/0x2f0
> [<ffffffff813597ff>] __lock_acquire+0x7f/0x2c40
> [<ffffffff81360cb0>] ? __pfx_hlock_conflict+0x10/0x10
> [<ffffffff811707be>] ? native_flush_tlb_global+0x8e/0xa0
> [<ffffffff8117096e>] ? __flush_tlb_all+0x4e/0xa0
> [<ffffffff81172fc2>] ? __kernel_map_pages+0x112/0x140
> [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> [<ffffffff81359556>] lock_acquire+0xe6/0x280
> [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> [<ffffffff8100b9e0>] _raw_spin_lock+0x30/0x40
> [<ffffffff813ec327>] ? xa_load_or_alloc+0x67/0xe0
> [<ffffffff813ec327>] xa_load_or_alloc+0x67/0xe0
> [<ffffffff813eb4c0>] kho_preserve_folio+0x90/0x100
> [<ffffffff813ebb7f>] __kho_finalize+0xcf/0x400
> [<ffffffff813ebef4>] kho_finalize+0x34/0x70
>
> This is because xa has its own lock, which is not initialized in
> xa_load_or_alloc.
>
> Modify __kho_preserve_order() to properly call
> xa_init(&new_physxa->phys_bits);
>
> Fixes: fc33e4b44b27 ("kexec: enable KHO support for memory preservation")
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
> ---
>  kernel/kexec_handover.c | 29 +++++++++++++++++++++++++----
>  1 file changed, 25 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/kexec_handover.c b/kernel/kexec_handover.c
> index e49743ae52c5..6240bc38305b 100644
> --- a/kernel/kexec_handover.c
> +++ b/kernel/kexec_handover.c
> @@ -144,14 +144,35 @@ static int __kho_preserve_order(struct kho_mem_track *track, unsigned long pfn,
>  				unsigned int order)
>  {
>  	struct kho_mem_phys_bits *bits;
> -	struct kho_mem_phys *physxa;
> +	struct kho_mem_phys *physxa, *new_physxa;
>  	const unsigned long pfn_high = pfn >> order;
>  
>  	might_sleep();
>  
> -	physxa = xa_load_or_alloc(&track->orders, order, sizeof(*physxa));
> -	if (IS_ERR(physxa))
> -		return PTR_ERR(physxa);
> +	physxa = xa_load(&track->orders, order);
> +	if (!physxa) {
> +		new_physxa = kzalloc(sizeof(*physxa), GFP_KERNEL);
> +		if (!new_physxa)
> +			return -ENOMEM;
> +
> +		xa_init(&new_physxa->phys_bits);
> +		physxa = xa_cmpxchg(&track->orders, order, NULL, new_physxa,
> +				    GFP_KERNEL);
> +		if (xa_is_err(physxa)) {
> +			int err = xa_err(physxa);
> +
> +			xa_destroy(&new_physxa->phys_bits);
> +			kfree(new_physxa);
> +
> +			return err;
> +		}
> +		if (physxa) {
> +			xa_destroy(&new_physxa->phys_bits);
> +			kfree(new_physxa);
> +		} else {
> +			physxa = new_physxa;
> +		}

I suppose this could be simplified a bit to:

	err = xa_err(physxa);
        if (err || physxa) {
        	xa_destroy(&new_physxa->phys_bits);
                kfree(new_physxa);

		if (err)
                	return err;
	} else {
        	physxa = new_physxa;
	}

No strong preference though, so fine either way. Up to you.

Reviewed-by: Pratyush Yadav <pratyush@kernel.org>

> +	}
>  
>  	bits = xa_load_or_alloc(&physxa->phys_bits, pfn_high / PRESERVE_BITS,
>  				sizeof(*bits));

-- 
Regards,
Pratyush Yadav

^ permalink raw reply
