Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* copy_file_range return value on FUSE
From: Florian Weimer @ 2025-08-04  9:41 UTC (permalink / raw)
  To: fuse-devel, linux-api, linux-fsdevel, linux-kernel

The FUSE protocol uses struct fuse_write_out to convey the return value
of copy_file_range, which is restricted to uint32_t.  But the
copy_file_range interface supports a 64-bit copy operation.  Given that
copy_file_range is expected to clone huge files, large copies are not
unexpected, so this appears to be a real limitation.

There is another wrinkle: we'd need to check if the process runs in
32-bit compat mode, and reject size_t arguments larger than INT_MAX in
this case (with EOVERFLOW presumably).  But perhaps this should be
handled on the kernel side?  Currently, this doesn't seem to happen, and
we can get copy_file_range results in the in-band error range.
Applications have no way to disambiguate this.

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH v2 10/32] liveupdate: luo_core: Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-04  1:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250729172812.GP36037@nvidia.com>

> > +enum liveupdate_event {
> > +     LIVEUPDATE_PREPARE,
> > +     LIVEUPDATE_FREEZE,
> > +     LIVEUPDATE_FINISH,
> > +     LIVEUPDATE_CANCEL,
> > +};
>
> I saw a later patch moves these hunks, that is poor patch planning.

Yes, you're right. I have since moved this to uapi/linux/liveupdate.h
in the introductory patch to improve the structure of the patch
series.

> Ideally an ioctl subsystem should start out with the first patch
> introducing the basic cdev, file open, ioctl dispatch, ioctl uapi
> header and related simple infrastructure.

I have modified the patch series as follows: The rudimentary parts of
the cdev, including the uapi/liveupdate.h header, are now in this
introductory patch. The rest of the ioctl interface is added in the
old patch that introduced luo_ioctl.c.

> Then you'd go basically ioctl by ioctl adding the new ioctls and
> explaining what they do in the patch commit messages.
>
> > +/**
> > + * liveupdate_state_updated - Check if the system is in the live update
> > + * 'updated' state.
> > + *
> > + * This function checks if the live update orchestrator is in the
> > + * ``LIVEUPDATE_STATE_UPDATED`` state. This state indicates that the system has
> > + * successfully rebooted into a new kernel as part of a live update, and the
> > + * preserved devices are expected to be in the process of being reclaimed.
> > + *
> > + * This is typically used by subsystems during early boot of the new kernel
> > + * to determine if they need to attempt to restore state from a previous
> > + * live update.
> > + *
> > + * @return true if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state,
> > + * false otherwise.
> > + */
> > +bool liveupdate_state_updated(void)
> > +{
> > +     return is_current_luo_state(LIVEUPDATE_STATE_UPDATED);
> > +}
> > +EXPORT_SYMBOL_GPL(liveupdate_state_updated);
>
> Unless there are existing in tree users there should not be exports.

Thank you, I have removed the exports from this patch and all others
in the series.

> I'm also not really sure why there is global state, I would expect the
> fd and session objects to record what kind of things they are, not
> having weird globals.

Having a global state is necessary for performance optimizations. This
is similar to why we export the state to userspace via sysfs: it
allows other subsystems to behave differently during a
performance-optimized live update versus a normal boot.

For example, in our code base we have a driver that doesn't
participate in the live update itself (it has no state to preserve).
However, during boot, it checks this global state. If it's a live
update boot, the driver skips certain steps, like loading firmware, to
accelerate the overall boot time.

In other words, even before userspace starts, this global awareness
enables optimizations that aren't necessary during a cold boot or a
regular kexec.

> Like liveupdate_register_subsystem() stuff, it already has a lock,
> &luo_subsystem_list_mutex, if you want to block mutation of the list
> then, IMHO, it makes more sense to stick a specific variable
> 'luo_subsystems_list_immutable' under that lock and make it very
> obvious.
>
> Stuff like luo_files_startup() feels clunky to me:
>
> +       ret = liveupdate_register_subsystem(&luo_file_subsys);
> +       if (ret) {
> +               pr_warn("Failed to register luo_file subsystem [%d]\n", ret);
> +               return ret;
> +       }
> +
> +       if (liveupdate_state_updated()) {
>
> Thats going to be a standard pattern - I would expect that
> liveupdate_register_subsystem() would do the check for updated and
> then arrange to call back something like
> liveupdate_subsystem.ops.post_update()
>
> And then post_update() would get the info that is currently under
> liveupdate_get_subsystem_data() as arguments instead of having to make
> more functions calls.
>
> Maybe even the fdt_node_check_compatible() can be hoisted.
>
> That would remove a bunch more liveupdate_state_updated() calls.

That's a good suggestion for a potential refactor. For now, the
state-check call is inexpensive and is not in a performance-critical
path. We can certainly implement this optimization later if it becomes
necessary.

Thank you,
Pasha

^ permalink raw reply

* Re: [PATCH v3 1/2] man/man2/mremap.2: describe multiple mapping move
From: Alejandro Colomar @ 2025-08-03 14:17 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <0e3d0dd8-994f-4665-969c-6daf332c5b94@lucifer.local>

[-- Attachment #1: Type: text/plain, Size: 729 bytes --]

Hi Lorenzo,

On Sun, Aug 03, 2025 at 12:15:15PM +0100, Lorenzo Stoakes wrote:
> On Sun, Aug 03, 2025 at 08:47:28AM +0200, Alejandro Colomar wrote:
> > Would it be possible to write a small C program that uses this new
> > feature?
> 
> I could do, but it's an unusual use of mremap() and we don't currently have
> example C code for the _general_ usage so I wonder if it might be somewhat
> misleading to have example code only for this?

I didn't mean for the manual page.  It was more for helping review the
changes.  If it's very small, it would be useful to include it in the
commit message.  What do you think?


Have a lovely day!
Alex

> 
> Cheers, Lorenzo

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v3 1/2] man/man2/mremap.2: describe multiple mapping move
From: Lorenzo Stoakes @ 2025-08-03 11:15 UTC (permalink / raw)
  To: Alejandro Colomar
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <ngytuoex4uvu5epsdqhvhypnhqidkr7cpwmmcxrml6kpftgusb@jo5ql6eko2ir>

On Sun, Aug 03, 2025 at 08:47:28AM +0200, Alejandro Colomar wrote:
> Would it be possible to write a small C program that uses this new
> feature?

I could do, but it's an unusual use of mremap() and we don't currently have
example C code for the _general_ usage so I wonder if it might be somewhat
misleading to have example code only for this?

Cheers, Lorenzo

^ permalink raw reply

* Re: [PATCH v3 1/2] man/man2/mremap.2: describe multiple mapping move
From: Alejandro Colomar @ 2025-08-03  6:47 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Jan Kara, Liam R . Howlett, Vlastimil Babka,
	Jann Horn, Pedro Falcato, Rik van Riel, linux-mm, linux-kernel,
	linux-api
In-Reply-To: <1fd0223a6f903ffdd8ba644d0b67820b1921671f.1753795807.git.lorenzo.stoakes@oracle.com>

Hi Lorenzo,

On Tue, Jul 29, 2025 at 02:47:35PM +0100, Lorenzo Stoakes wrote:
> Document the new behaviour introduced in Linux 6.17 whereby it is now
> possible to move multiple mappings in a single operation, as long as the
> operation is simply an MREMAP_FIXED move - that is old_size is equal to
> new_size and MREMAP_FIXED is specified.
> 
> To make things clearer, also describe this kind of move operation, before
> expanding upon it to describe the newly introduced behaviour.
> 
> This change also explains the limitations of of this method and the
> possibility of partial failure.
> 
> Finally, we pluralise language where it makes sense to do so such that the
> documentation does not contradict either this new capability nor the
> pre-existing edge case.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Would it be possible to write a small C program that uses this new
feature?


Cheers,
Alex

> ---
>  man/man2/mremap.2 | 84 ++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 73 insertions(+), 11 deletions(-)
> 
> diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> index 2168ca728..6ba51310c 100644
> --- a/man/man2/mremap.2
> +++ b/man/man2/mremap.2
> @@ -25,18 +25,47 @@ moving it at the same time (controlled by the
>  argument and
>  the available virtual address space).
>  .P
> +Mappings can also simply be moved
> +(without any resizing)
> +by specifying equal
> +.I old_size
> +and
> +.I new_size
> +and using the
> +.B MREMAP_FIXED
> +flag
> +(see below).
> +Since Linux 6.17,
> +while
> +.I old_address
> +must reside within a mapping,
> +.I old_size
> +may span multiple mappings
> +which do not have to be
> +adjacent to one another when
> +performing a move like this.
> +The
> +.B MREMAP_DONTUNMAP
> +flag may be specified.
> +.P
> +If the operation is not
> +simply moving mappings,
> +then
> +.I old_size
> +must span only a single mapping.
> +.P
>  .I old_address
> -is the old address of the virtual memory block that you
> -want to expand (or shrink).
> +is the old address of the first virtual memory block that you
> +want to expand, shrink, and/or move.
>  Note that
>  .I old_address
>  has to be page aligned.
>  .I old_size
> -is the old size of the
> -virtual memory block.
> +is the size of the range containing
> +virtual memory blocks to be manipulated.
>  .I new_size
>  is the requested size of the
> -virtual memory block after the resize.
> +virtual memory blocks after the resize.
>  An optional fifth argument,
>  .IR new_address ,
>  may be provided; see the description of
> @@ -105,13 +134,43 @@ If
>  is specified, then
>  .B MREMAP_MAYMOVE
>  must also be specified.
> +.IP
> +Since Linux 6.17,
> +if
> +.I old_size
> +is equal to
> +.I new_size
> +and
> +.B MREMAP_FIXED
> +is specified, then
> +.I old_size
> +may span beyond the mapping in which
> +.I old_address
> +resides.
> +In this case,
> +gaps between mappings in the original range
> +are maintained in the new range.
> +The whole operation is performed atomically
> +unless an error arises,
> +in which case the operation may be partially
> +completed,
> +that is,
> +some mappings may be moved and others not.
> +.IP
> +
> +Moving multiple mappings is not permitted if
> +any of those mappings have either
> +been registered with
> +.BR userfaultfd (2) ,
> +or map drivers that
> +specify their own custom address mapping logic.
>  .TP
>  .BR MREMAP_DONTUNMAP " (since Linux 5.7)"
>  .\" commit e346b3813067d4b17383f975f197a9aa28a3b077
>  This flag, which must be used in conjunction with
>  .BR MREMAP_MAYMOVE ,
> -remaps a mapping to a new address but does not unmap the mapping at
> -.IR old_address .
> +remaps mappings to a new address but does not unmap them
> +from their original address.
>  .IP
>  The
>  .B MREMAP_DONTUNMAP
> @@ -149,13 +208,13 @@ mapped.
>  See NOTES for some possible applications of
>  .BR MREMAP_DONTUNMAP .
>  .P
> -If the memory segment specified by
> +If the memory segments specified by
>  .I old_address
>  and
>  .I old_size
> -is locked (using
> +are locked (using
>  .BR mlock (2)
> -or similar), then this lock is maintained when the segment is
> +or similar), then this lock is maintained when the segments are
>  resized and/or relocated.
>  As a consequence, the amount of memory locked by the process may change.
>  .SH RETURN VALUE
> @@ -188,7 +247,10 @@ virtual memory address for this process.
>  You can also get
>  .B EFAULT
>  even if there exist mappings that cover the
> -whole address space requested, but those mappings are of different types.
> +whole address space requested, but those mappings are of different types,
> +and the
> +.BR mremap ()
> +operation being performed does not support this.
>  .TP
>  .B EINVAL
>  An invalid argument was given.
> --
> 2.50.1

-- 
<https://www.alejandro-colomar.es/>

^ permalink raw reply

* Re: [PATCH v2 09/32] liveupdate: kho: move to kernel/liveupdate
From: Pasha Tatashin @ 2025-08-02 23:46 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250729171454.GO36037@nvidia.com>

On Tue, Jul 29, 2025 at 1:14 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Jul 23, 2025 at 02:46:22PM +0000, Pasha Tatashin wrote:
> > Move KHO to kernel/liveupdate/ in preparation of placing all Live Update
> > core kernel related files to the same place.
> >
> > Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> > ---
> >  Documentation/core-api/kho/concepts.rst       |  2 +-
> >  MAINTAINERS                                   |  2 +-
> >  init/Kconfig                                  |  2 ++
> >  kernel/Kconfig.kexec                          | 25 ----------------
> >  kernel/Makefile                               |  3 +-
> >  kernel/liveupdate/Kconfig                     | 30 +++++++++++++++++++
> >  kernel/liveupdate/Makefile                    |  7 +++++
> >  kernel/{ => liveupdate}/kexec_handover.c      |  6 ++--
> >  .../{ => liveupdate}/kexec_handover_debug.c   |  0
> >  .../kexec_handover_internal.h                 |  0
> >  10 files changed, 45 insertions(+), 32 deletions(-)
> >  create mode 100644 kernel/liveupdate/Kconfig
> >  create mode 100644 kernel/liveupdate/Makefile
> >  rename kernel/{ => liveupdate}/kexec_handover.c (99%)
> >  rename kernel/{ => liveupdate}/kexec_handover_debug.c (100%)
> >  rename kernel/{ => liveupdate}/kexec_handover_internal.h (100%)
>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thank you,
Pasha

^ permalink raw reply

* Re: [PATCH v2 04/32] kho: allow to drive kho from within kernel
From: Pasha Tatashin @ 2025-08-02 23:40 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu
In-Reply-To: <aIdOcmTl-zrxXKAB@kernel.org>

Hi Mike,

Thank you for your review comments.

> > +     mutex_unlock(&kho_out.lock);
> > +     return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(kho_abort);
>
> I don't think a module should be able to drive KHO. Please drop
> EXPORT_SYMBOL_GPL here and for kho_finalize().

Agreed, removed these exports.

Pasha

^ permalink raw reply

* Re: [PATCH v2 01/32] kho: init new_physxa->phys_bits to fix lockdep
From: Pasha Tatashin @ 2025-08-02 23:33 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu
In-Reply-To: <aIdNTN1qd0dTvsQm@kernel.org>

> Just int err should be fine here, otherwise

Done

> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>

Thanks.

Pasha

^ permalink raw reply

* Re: [PATCH v3 1/2] man/man2/mremap.2: describe multiple mapping move
From: Alejandro Colomar @ 2025-08-02 16:14 UTC (permalink / raw)
  To: Lorenzo Stoakes, Jan Kara
  Cc: linux-man, Andrew Morton, Peter Xu, Alexander Viro,
	Christian Brauner, Liam R . Howlett, Vlastimil Babka, Jann Horn,
	Pedro Falcato, Rik van Riel, linux-mm, linux-kernel, linux-api
In-Reply-To: <1fd0223a6f903ffdd8ba644d0b67820b1921671f.1753795807.git.lorenzo.stoakes@oracle.com>

[-- Attachment #1: Type: text/plain, Size: 5053 bytes --]

Hi Lorenzo, Jann,

On Tue, Jul 29, 2025 at 02:47:35PM +0100, Lorenzo Stoakes wrote:
> Document the new behaviour introduced in Linux 6.17 whereby it is now
> possible to move multiple mappings in a single operation, as long as the
> operation is simply an MREMAP_FIXED move - that is old_size is equal to
> new_size and MREMAP_FIXED is specified.
> 
> To make things clearer, also describe this kind of move operation, before
> expanding upon it to describe the newly introduced behaviour.
> 
> This change also explains the limitations of of this method and the
> possibility of partial failure.
> 
> Finally, we pluralise language where it makes sense to do so such that the
> documentation does not contradict either this new capability nor the
> pre-existing edge case.
> 
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

I see some very minor formatting or punctuation issues, but I'll fix
them while applying.  Jann, do you plan to review this?


Have a lovely day!
Alex

> ---
>  man/man2/mremap.2 | 84 ++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 73 insertions(+), 11 deletions(-)
> 
> diff --git a/man/man2/mremap.2 b/man/man2/mremap.2
> index 2168ca728..6ba51310c 100644
> --- a/man/man2/mremap.2
> +++ b/man/man2/mremap.2
> @@ -25,18 +25,47 @@ moving it at the same time (controlled by the
>  argument and
>  the available virtual address space).
>  .P
> +Mappings can also simply be moved
> +(without any resizing)
> +by specifying equal
> +.I old_size
> +and
> +.I new_size
> +and using the
> +.B MREMAP_FIXED
> +flag
> +(see below).
> +Since Linux 6.17,
> +while
> +.I old_address
> +must reside within a mapping,
> +.I old_size
> +may span multiple mappings
> +which do not have to be
> +adjacent to one another when
> +performing a move like this.
> +The
> +.B MREMAP_DONTUNMAP
> +flag may be specified.
> +.P
> +If the operation is not
> +simply moving mappings,
> +then
> +.I old_size
> +must span only a single mapping.
> +.P
>  .I old_address
> -is the old address of the virtual memory block that you
> -want to expand (or shrink).
> +is the old address of the first virtual memory block that you
> +want to expand, shrink, and/or move.
>  Note that
>  .I old_address
>  has to be page aligned.
>  .I old_size
> -is the old size of the
> -virtual memory block.
> +is the size of the range containing
> +virtual memory blocks to be manipulated.
>  .I new_size
>  is the requested size of the
> -virtual memory block after the resize.
> +virtual memory blocks after the resize.
>  An optional fifth argument,
>  .IR new_address ,
>  may be provided; see the description of
> @@ -105,13 +134,43 @@ If
>  is specified, then
>  .B MREMAP_MAYMOVE
>  must also be specified.
> +.IP
> +Since Linux 6.17,
> +if
> +.I old_size
> +is equal to
> +.I new_size
> +and
> +.B MREMAP_FIXED
> +is specified, then
> +.I old_size
> +may span beyond the mapping in which
> +.I old_address
> +resides.
> +In this case,
> +gaps between mappings in the original range
> +are maintained in the new range.
> +The whole operation is performed atomically
> +unless an error arises,
> +in which case the operation may be partially
> +completed,
> +that is,
> +some mappings may be moved and others not.
> +.IP
> +
> +Moving multiple mappings is not permitted if
> +any of those mappings have either
> +been registered with
> +.BR userfaultfd (2) ,
> +or map drivers that
> +specify their own custom address mapping logic.
>  .TP
>  .BR MREMAP_DONTUNMAP " (since Linux 5.7)"
>  .\" commit e346b3813067d4b17383f975f197a9aa28a3b077
>  This flag, which must be used in conjunction with
>  .BR MREMAP_MAYMOVE ,
> -remaps a mapping to a new address but does not unmap the mapping at
> -.IR old_address .
> +remaps mappings to a new address but does not unmap them
> +from their original address.
>  .IP
>  The
>  .B MREMAP_DONTUNMAP
> @@ -149,13 +208,13 @@ mapped.
>  See NOTES for some possible applications of
>  .BR MREMAP_DONTUNMAP .
>  .P
> -If the memory segment specified by
> +If the memory segments specified by
>  .I old_address
>  and
>  .I old_size
> -is locked (using
> +are locked (using
>  .BR mlock (2)
> -or similar), then this lock is maintained when the segment is
> +or similar), then this lock is maintained when the segments are
>  resized and/or relocated.
>  As a consequence, the amount of memory locked by the process may change.
>  .SH RETURN VALUE
> @@ -188,7 +247,10 @@ virtual memory address for this process.
>  You can also get
>  .B EFAULT
>  even if there exist mappings that cover the
> -whole address space requested, but those mappings are of different types.
> +whole address space requested, but those mappings are of different types,
> +and the
> +.BR mremap ()
> +operation being performed does not support this.
>  .TP
>  .B EINVAL
>  An invalid argument was given.
> --
> 2.50.1

-- 
<https://www.alejandro-colomar.es/>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [RFC v3 1/4] kernel/api: introduce kernel API specification framework
From: Sasha Levin @ 2025-08-01 13:53 UTC (permalink / raw)
  To: Askar Safin; +Cc: linux-api, linux-doc, linux-kernel, tools
In-Reply-To: <20250716072141.12-1-safinaskar@zohomail.com>

On Wed, Jul 16, 2025 at 10:21:41AM +0300, Askar Safin wrote:
>> +   KAPI_PARAM_IN       = (1 << 0),
>> +   KAPI_PARAM_OUT      = (1 << 1),
>> +   KAPI_PARAM_INOUT    = (1 << 2),
>
>There is no need for KAPI_PARAM_INOUT. It could be replaced by KAPI_PARAM_IN | KAPI_PARAM_OUT

It could, but it's easier to write _INOUT :)

-- 
Thanks,
Sasha

^ permalink raw reply

* Re: [PATCH] sched/deadline: sched_getattr(...flags=1) returns the runtime left and abs deadline for DEADLINE tasks
From: Juri Lelli @ 2025-08-01 12:53 UTC (permalink / raw)
  To: Tommaso Cucinotta
  Cc: Ingo Molnar, Peter Zijlstra, Vincent Guittot, Dietmar Eggemann,
	Steven Rostedt, Ben Segall, Mel Gorman, Valentin Schneider,
	linux-kernel, linux-api, Tommaso Cucinotta
In-Reply-To: <20250715164148.1151620-1-tommaso.cucinotta@santannapisa.it>

Hi Tommaso,

On 15/07/25 18:39, Tommaso Cucinotta wrote:
> The SCHED_DEADLINE scheduler allows reading the statically configured
> run-time, deadline, and period parameters through the sched_getattr()
> system call. However, there is no immediate way to access, from user space,
> the current parameters used within the scheduler: the instantaneous runtime
> left in the current cycle, as well as the current absolute deadline.
> 
> The `flags' sched_getattr() parameter, so far mandated to contain zero,
> now supports the SCHED_GETATTR_FLAG_DL_DYNAMIC=1 flag, to request
> retrieval of the leftover runtime and absolute deadline, converted to a
> CLOCK_MONOTONIC reference, instead of the statically configured parameters.
> 
> This feature is useful for adaptive SCHED_DEADLINE tasks that need to
> modify their behavior depending on whether or not there is enough runtime
> left in the current period, and/or what is the current absolute deadline.
> 
> Notes:
> - before returning the instantaneous parameters, the runtime is updated;
> - the abs deadline is returned shifted from rq_clock() to ktime_get_ns(),
>   in CLOCK_MONOTONIC reference; this causes multiple invocations from the
>   same period to return values that may differ for a few ns (showing some
>   small drift), albeit the deadline doesn't move, in rq_clock() reference;
> - the abs deadline value returned to user-space, as unsigned 64-bit value,
>   can represent nearly 585 years since boot time;
> - setting flags=0 provides the old behavior (retrieve static parameters).

As we discussed this offline before your submission you know that I was
already on board with the idea, so I would really like to hear what
Peter and others think about this.

Still a few comments below.

$SUBJECT can maybe simply be "sched/deadline: Add reporting of remaining
time/abs deadline to sched_getattr".

> See also the notes from discussion held at OSPM 2025 on the topic
> "Making user space aware of current deadline-scheduler parameters":
> https://lwn.net/Articles/1022054/

I would probably remove the link from the changelog as it might
disappear/change in the future.

> Signed-off-by: Tommaso Cucinotta <tommaso.cucinotta@santannapisa.it>
> ---
>  include/uapi/linux/sched.h |  3 +++
>  kernel/sched/deadline.c    | 18 +++++++++++++++---
>  kernel/sched/sched.h       |  2 +-
>  kernel/sched/syscalls.c    | 16 +++++++++++-----
>  4 files changed, 30 insertions(+), 9 deletions(-)
> 
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 359a14cc..52b69ce8 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -146,4 +146,7 @@ struct clone_args {
>  			 SCHED_FLAG_KEEP_ALL		| \
>  			 SCHED_FLAG_UTIL_CLAMP)
>  
> +/* Only for sched_getattr() own flag param, if task is SCHED_DEADLINE */
> +#define SCHED_GETATTR_FLAG_DL_DYNAMIC	0x01
> +
>  #endif /* _UAPI_LINUX_SCHED_H */
> diff --git a/kernel/sched/deadline.c b/kernel/sched/deadline.c
> index 9c7d9528..1b5cd7fa 100644
> --- a/kernel/sched/deadline.c
> +++ b/kernel/sched/deadline.c
> @@ -3290,13 +3290,25 @@ void __setparam_dl(struct task_struct *p, const struct sched_attr *attr)
>  	dl_se->dl_density = to_ratio(dl_se->dl_deadline, dl_se->dl_runtime);
>  }
>  
> -void __getparam_dl(struct task_struct *p, struct sched_attr *attr)
> +void __getparam_dl(struct task_struct *p, struct sched_attr *attr, unsigned int flags)
>  {
>  	struct sched_dl_entity *dl_se = &p->dl;
> +	struct rq *rq = task_rq(p);
> +	u64 adj_deadline;
>  
>  	attr->sched_priority = p->rt_priority;
> -	attr->sched_runtime = dl_se->dl_runtime;
> -	attr->sched_deadline = dl_se->dl_deadline;
> +	if (flags & SCHED_GETATTR_FLAG_DL_DYNAMIC) {
> +		guard(raw_spinlock_irq)(&rq->__lock);
> +		update_rq_clock(rq);
> +		update_curr_dl(rq);

I p is not current maybe we don't need to call update_curr_dl() as p
will still have sensible dynamic parameters updated last time it
blocked?

Also, even though this is superuser stuff and all, mildly fear this
could be used as a DOS attack vector? Do we want to be super defensive
and add some kind of rate limiting?

> +
> +		attr->sched_runtime = dl_se->runtime;
> +		adj_deadline = dl_se->deadline - rq_clock(rq) + ktime_get_ns();
> +		attr->sched_deadline = adj_deadline;
> +	} else {
> +		attr->sched_runtime = dl_se->dl_runtime;
> +		attr->sched_deadline = dl_se->dl_deadline;
> +	}
>  	attr->sched_period = dl_se->dl_period;
>  	attr->sched_flags &= ~SCHED_DL_FLAGS;
>  	attr->sched_flags |= dl_se->flags;
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3058fb62..f69bf019 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -353,7 +353,7 @@ extern int  sched_dl_global_validate(void);
>  extern void sched_dl_do_global(void);
>  extern int  sched_dl_overflow(struct task_struct *p, int policy, const struct sched_attr *attr);
>  extern void __setparam_dl(struct task_struct *p, const struct sched_attr *attr);
> -extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr);
> +extern void __getparam_dl(struct task_struct *p, struct sched_attr *attr, unsigned int flags);
>  extern bool __checkparam_dl(const struct sched_attr *attr);
>  extern bool dl_param_changed(struct task_struct *p, const struct sched_attr *attr);
>  extern int  dl_cpuset_cpumask_can_shrink(const struct cpumask *cur, const struct cpumask *trial);
> diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
> index 77ae87f3..c80b3568 100644
> --- a/kernel/sched/syscalls.c
> +++ b/kernel/sched/syscalls.c
> @@ -928,10 +928,10 @@ static int sched_copy_attr(struct sched_attr __user *uattr, struct sched_attr *a
>  	return -E2BIG;
>  }
>  
> -static void get_params(struct task_struct *p, struct sched_attr *attr)
> +static void get_params(struct task_struct *p, struct sched_attr *attr, unsigned int flags)
>  {
>  	if (task_has_dl_policy(p)) {
> -		__getparam_dl(p, attr);
> +		__getparam_dl(p, attr, flags);
>  	} else if (task_has_rt_policy(p)) {
>  		attr->sched_priority = p->rt_priority;
>  	} else {
> @@ -997,7 +997,7 @@ SYSCALL_DEFINE3(sched_setattr, pid_t, pid, struct sched_attr __user *, uattr,
>  		return -ESRCH;
>  
>  	if (attr.sched_flags & SCHED_FLAG_KEEP_PARAMS)
> -		get_params(p, &attr);
> +		get_params(p, &attr, 0);
>  
>  	return sched_setattr(p, &attr);
>  }
> @@ -1082,7 +1082,7 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
>  	int retval;
>  
>  	if (unlikely(!uattr || pid < 0 || usize > PAGE_SIZE ||
> -		      usize < SCHED_ATTR_SIZE_VER0 || flags))
> +		     usize < SCHED_ATTR_SIZE_VER0))
>  		return -EINVAL;
>  
>  	scoped_guard (rcu) {
> @@ -1090,6 +1090,12 @@ SYSCALL_DEFINE4(sched_getattr, pid_t, pid, struct sched_attr __user *, uattr,
>  		if (!p)
>  			return -ESRCH;
>  
> +		if (flags) {
> +			if (!task_has_dl_policy(p)
> +			    || flags != SCHED_GETATTR_FLAG_DL_DYNAMIC)

Nit pick. Formatting usually has line break after '||'.

> +				return -EINVAL;
> +		}
> +

Thanks!
Juri


^ permalink raw reply

* Re: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
From: Aleksa Sarai @ 2025-07-31 14:21 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest
In-Reply-To: <20250731-angliederung-mahlt-9e5811969817@brauner>

[-- Attachment #1: Type: text/plain, Size: 7820 bytes --]

On 2025-07-31, Christian Brauner <brauner@kernel.org> wrote:
> On Fri, Jul 25, 2025 at 12:24:28PM +1000, Aleksa Sarai wrote:
> > On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> > > On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > > > /proc has historically had very opaque semantics about PID namespaces,
> > > > which is a little unfortunate for container runtimes and other programs
> > > > that deal with switching namespaces very often. One common issue is that
> > > > of converting between PIDs in the process's namespace and PIDs in the
> > > > namespace of /proc.
> > > > 
> > > > In principle, it is possible to do this today by opening a pidfd with
> > > > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > > > contain a PID value translated to the pid namespace associated with that
> > > > procfs superblock). However, allocating a new file for each PID to be
> > > > converted is less than ideal for programs that may need to scan procfs,
> > > > and it is generally useful for userspace to be able to finally get this
> > > > information from procfs.
> > > > 
> > > > So, add a new API for this in the form of an ioctl(2) you can call on
> > > > the root directory of procfs. The returned file descriptor will have
> > > > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> > > > option, finally allowing userspace full control of the pid namespaces
> > > > associated with procfs instances.
> > > > 
> > > > The permission model for this is a bit looser than that of the "pidns"
> > > > mount option, but this is mainly because /proc/1/ns/pid provides the
> > > > same information, so as long as you have access to that magic-link (or
> > > > something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> > > > or being in an ancestor pid namespace) it makes sense to allow userspace
> > > > to grab a handle. setns(2) will still have their own permission checks,
> > > > so being able to open a pidns handle doesn't really provide too many
> > > > other capabilities.
> > > > 
> > > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > > > ---
> > > >  Documentation/filesystems/proc.rst |  4 +++
> > > >  fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
> > > >  include/uapi/linux/fs.h            |  3 +++
> > > >  3 files changed, 59 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > > index c520b9f8a3fd..506383273c9d 100644
> > > > --- a/Documentation/filesystems/proc.rst
> > > > +++ b/Documentation/filesystems/proc.rst
> > > > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
> > > >  will be used by the procfs instance when translating pids. By default, procfs
> > > >  will use the calling process's active pid namespace.
> > > >  
> > > > +Processes can check which pid namespace is used by a procfs instance by using
> > > > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> > > > +instance.
> > > > +
> > > >  Chapter 5: Filesystem behavior
> > > >  ==============================
> > > >  
> > > > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > > > index 057c8a125c6e..548a57ec2152 100644
> > > > --- a/fs/proc/root.c
> > > > +++ b/fs/proc/root.c
> > > > @@ -23,8 +23,10 @@
> > > >  #include <linux/cred.h>
> > > >  #include <linux/magic.h>
> > > >  #include <linux/slab.h>
> > > > +#include <linux/ptrace.h>
> > > >  
> > > >  #include "internal.h"
> > > > +#include "../internal.h"
> > > >  
> > > >  struct proc_fs_context {
> > > >  	struct pid_namespace	*pid_ns;
> > > > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> > > >  	return proc_pid_readdir(file, ctx);
> > > >  }
> > > >  
> > > > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > > > +{
> > > > +	switch (cmd) {
> > > > +#ifdef CONFIG_PID_NS
> > > > +	case PROCFS_GET_PID_NAMESPACE: {
> > > > +		struct pid_namespace *active = task_active_pid_ns(current);
> > > > +		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> > > > +		bool can_access_pidns = false;
> > > > +
> > > > +		/*
> > > > +		 * If we are in an ancestors of the pidns, or have join
> > > > +		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
> > > > +		 * would be able to grab a handle to the pidns.
> > > > +		 *
> > > > +		 * Otherwise, if there is a root process, then being able to
> > > > +		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> > > > +		 * we should probably match the permission model. For empty
> > > > +		 * namespaces it seems unlikely for there to be a downside to
> > > > +		 * allowing unprivileged users to open a handle to it (setns
> > > > +		 * will fail for unprivileged users anyway).
> > > > +		 */
> > > > +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> > > > +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
> > > 
> > > This seems to imply that if @ns is a descendant of @active that the
> > > caller holds privileges over it. Is that actually always true?
> > > 
> > > IOW, why is the check different from the previous pidns= mount option
> > > check. I would've expected:
> > > 
> > > ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
> > > 
> > > and then the ptrace check as a fallback.
> > 
> > That would mirror pidns_install(), and I did think about it. The primary
> > (mostly handwave-y) reasoning I had for making it less strict was that:
> > 
> >  * If you are in an ancestor pidns, then you can already see those
> >    processes in your own /proc. In theory that means that you will be
> >    able to access /proc/$pid/ns/pid for at least some subprocess there
> >    (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
> >    cleared on ).
> > 
> >    Though hypothetically if they are all running as a different user,
> >    this does not apply (and you could create scenarios where a child
> >    pidns is owned by a userns that you do not have privileges over -- if
> >    you deal with setuid binaries). Maybe that risk means we should just
> >    combine them, I'm not sure.
> > 
> >  * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
> >    strange to disallow access even if it is not in an ancestor
> >    namespace. This is distinct to pidns_install(), where you want to
> >    ensure you cannot escape to a parent pid namespace, this is about
> >    getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS).
> > 
> > Maybe they should be combined to match pidns_install(), but then I would
> > expect the ptrace_may_access() check to apply to all processes in the
> > pidns to make it less restrictive, which is not something you can
> > practically do (and there is a higher chance that pid1 will have
> > SUID_DUMP_DISABLE than some random subprocess, which almost certainly
> > will not be SUID_DUMP_DISABLE).
> > 
> > Fundamentally, I guess I'm still trying to see what the risk is of
> > allowing a process to get a handle to a pidns that they have some kind
> > of privilege over (whether it's CAP_SYS_ADMIN, or by the virtue of being
> 
> There shouldn't be. For example, you kinda implicitly do that with a
> pidfd, no? Because you can pass the pidfd to setns() instead of a
> namespace fd itself. Maybe that's the argument you're lookin for?

That argument works for me! I'll rewrite the commit message to make sure
it sounds like I came up with it. ;)

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH RFC v2 3/4] procfs: add PROCFS_GET_PID_NAMESPACE ioctl
From: Christian Brauner @ 2025-07-31 10:31 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Alexander Viro, Jan Kara, Jonathan Corbet, Shuah Khan,
	linux-kernel, linux-fsdevel, linux-api, linux-doc,
	linux-kselftest
In-Reply-To: <2025-07-25.1753409614-vile-track-icky-epidemic-frail-antidote-d7NYuu@cyphar.com>

On Fri, Jul 25, 2025 at 12:24:28PM +1000, Aleksa Sarai wrote:
> On 2025-07-24, Christian Brauner <brauner@kernel.org> wrote:
> > On Wed, Jul 23, 2025 at 09:18:53AM +1000, Aleksa Sarai wrote:
> > > /proc has historically had very opaque semantics about PID namespaces,
> > > which is a little unfortunate for container runtimes and other programs
> > > that deal with switching namespaces very often. One common issue is that
> > > of converting between PIDs in the process's namespace and PIDs in the
> > > namespace of /proc.
> > > 
> > > In principle, it is possible to do this today by opening a pidfd with
> > > pidfd_open(2) and then looking at /proc/self/fdinfo/$n (which will
> > > contain a PID value translated to the pid namespace associated with that
> > > procfs superblock). However, allocating a new file for each PID to be
> > > converted is less than ideal for programs that may need to scan procfs,
> > > and it is generally useful for userspace to be able to finally get this
> > > information from procfs.
> > > 
> > > So, add a new API for this in the form of an ioctl(2) you can call on
> > > the root directory of procfs. The returned file descriptor will have
> > > O_CLOEXEC set. This acts as a sister feature to the new "pidns" mount
> > > option, finally allowing userspace full control of the pid namespaces
> > > associated with procfs instances.
> > > 
> > > The permission model for this is a bit looser than that of the "pidns"
> > > mount option, but this is mainly because /proc/1/ns/pid provides the
> > > same information, so as long as you have access to that magic-link (or
> > > something equivalently reasonable such as privileges with CAP_SYS_ADMIN
> > > or being in an ancestor pid namespace) it makes sense to allow userspace
> > > to grab a handle. setns(2) will still have their own permission checks,
> > > so being able to open a pidns handle doesn't really provide too many
> > > other capabilities.
> > > 
> > > Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
> > > ---
> > >  Documentation/filesystems/proc.rst |  4 +++
> > >  fs/proc/root.c                     | 54 ++++++++++++++++++++++++++++++++++++--
> > >  include/uapi/linux/fs.h            |  3 +++
> > >  3 files changed, 59 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
> > > index c520b9f8a3fd..506383273c9d 100644
> > > --- a/Documentation/filesystems/proc.rst
> > > +++ b/Documentation/filesystems/proc.rst
> > > @@ -2398,6 +2398,10 @@ pidns= specifies a pid namespace (either as a string path to something like
> > >  will be used by the procfs instance when translating pids. By default, procfs
> > >  will use the calling process's active pid namespace.
> > >  
> > > +Processes can check which pid namespace is used by a procfs instance by using
> > > +the `PROCFS_GET_PID_NAMESPACE` ioctl() on the root directory of the procfs
> > > +instance.
> > > +
> > >  Chapter 5: Filesystem behavior
> > >  ==============================
> > >  
> > > diff --git a/fs/proc/root.c b/fs/proc/root.c
> > > index 057c8a125c6e..548a57ec2152 100644
> > > --- a/fs/proc/root.c
> > > +++ b/fs/proc/root.c
> > > @@ -23,8 +23,10 @@
> > >  #include <linux/cred.h>
> > >  #include <linux/magic.h>
> > >  #include <linux/slab.h>
> > > +#include <linux/ptrace.h>
> > >  
> > >  #include "internal.h"
> > > +#include "../internal.h"
> > >  
> > >  struct proc_fs_context {
> > >  	struct pid_namespace	*pid_ns;
> > > @@ -418,15 +420,63 @@ static int proc_root_readdir(struct file *file, struct dir_context *ctx)
> > >  	return proc_pid_readdir(file, ctx);
> > >  }
> > >  
> > > +static long int proc_root_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
> > > +{
> > > +	switch (cmd) {
> > > +#ifdef CONFIG_PID_NS
> > > +	case PROCFS_GET_PID_NAMESPACE: {
> > > +		struct pid_namespace *active = task_active_pid_ns(current);
> > > +		struct pid_namespace *ns = proc_pid_ns(file_inode(filp)->i_sb);
> > > +		bool can_access_pidns = false;
> > > +
> > > +		/*
> > > +		 * If we are in an ancestors of the pidns, or have join
> > > +		 * privileges (CAP_SYS_ADMIN), then it makes sense that we
> > > +		 * would be able to grab a handle to the pidns.
> > > +		 *
> > > +		 * Otherwise, if there is a root process, then being able to
> > > +		 * access /proc/$pid/ns/pid is equivalent to this ioctl and so
> > > +		 * we should probably match the permission model. For empty
> > > +		 * namespaces it seems unlikely for there to be a downside to
> > > +		 * allowing unprivileged users to open a handle to it (setns
> > > +		 * will fail for unprivileged users anyway).
> > > +		 */
> > > +		can_access_pidns = pidns_is_ancestor(ns, active) ||
> > > +				   ns_capable(ns->user_ns, CAP_SYS_ADMIN);
> > 
> > This seems to imply that if @ns is a descendant of @active that the
> > caller holds privileges over it. Is that actually always true?
> > 
> > IOW, why is the check different from the previous pidns= mount option
> > check. I would've expected:
> > 
> > ns_capable(_no_audit)(ns->user_ns) && pidns_is_ancestor(ns, active)
> > 
> > and then the ptrace check as a fallback.
> 
> That would mirror pidns_install(), and I did think about it. The primary
> (mostly handwave-y) reasoning I had for making it less strict was that:
> 
>  * If you are in an ancestor pidns, then you can already see those
>    processes in your own /proc. In theory that means that you will be
>    able to access /proc/$pid/ns/pid for at least some subprocess there
>    (even if some subprocesses have SUID_DUMP_DISABLE, that flag is
>    cleared on ).
> 
>    Though hypothetically if they are all running as a different user,
>    this does not apply (and you could create scenarios where a child
>    pidns is owned by a userns that you do not have privileges over -- if
>    you deal with setuid binaries). Maybe that risk means we should just
>    combine them, I'm not sure.
> 
>  * If you have CAP_SYS_ADMIN permissions over the pidns, it seems
>    strange to disallow access even if it is not in an ancestor
>    namespace. This is distinct to pidns_install(), where you want to
>    ensure you cannot escape to a parent pid namespace, this is about
>    getting a handle to do other operations (i.e. NS_GET_{P,TG}ID_*_PIDNS).
> 
> Maybe they should be combined to match pidns_install(), but then I would
> expect the ptrace_may_access() check to apply to all processes in the
> pidns to make it less restrictive, which is not something you can
> practically do (and there is a higher chance that pid1 will have
> SUID_DUMP_DISABLE than some random subprocess, which almost certainly
> will not be SUID_DUMP_DISABLE).
> 
> Fundamentally, I guess I'm still trying to see what the risk is of
> allowing a process to get a handle to a pidns that they have some kind
> of privilege over (whether it's CAP_SYS_ADMIN, or by the virtue of being

There shouldn't be. For example, you kinda implicitly do that with a
pidfd, no? Because you can pass the pidfd to setns() instead of a
namespace fd itself. Maybe that's the argument you're lookin for?

^ permalink raw reply

* Re: do_change_type(): refuse to operate on unmounted/not ours mounts
From: Pavel Tikhomirov @ 2025-07-31  8:11 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Andrei Vagin, Al Viro, Andrei Vagin, linux-fsdevel, LKML, criu,
	Linux API, stable
In-Reply-To: <20250731-masten-resolut-89aca1e3454f@brauner>

On Thu, Jul 31, 2025 at 3:53 PM Christian Brauner <brauner@kernel.org> wrote:
>
> On Thu, Jul 31, 2025 at 10:40:40AM +0800, Pavel Tikhomirov wrote:
> > If detached mounts are our only concern, it looks like the check instead of:
> >
> > if (!check_mnt(mnt)) {
> >         err = -EINVAL;
> >         goto out_unlock;
> > }
> >
> > could've been a more relaxed one:
> >
> > if (mnt_detached(mnt)) {
> >         err = -EINVAL;
> >         goto out_unlock;
> > }
> >
> > bool mnt_detached(struct mount *mnt)
> > {
> >         return !mnt->mnt_ns;
> > }
> >
> > not to allow propagation change only on detached mounts. (As
> > umount_tree sets mnt_ns to NULL.)
>
> Changing propagation settings on detached mounts is fine and shoud work?
> Changing propagation settings on unmounted mounts not so much...

Sorry, it's my confused terminology, here by "detached" mounts I mean
mounts which were unmounted with umount2(MNT_DETACH), maybe I should
call them "unmounted" (e.g. s/mnt_detached/mnt_unmounted/).

And yes, I understand why we need to allow changing propagation on
mounts in anonymous namespace without being inside it, because one
can't just enter anonymous namespace.

I don't think that we need to change anything, just sharing my
thoughts that it could be more relaxed and will still protect us from
propagation setting on unmounted mounts.

>
> >
> > Also in do_mount_setattr we have a more relaxed check too:
> >
> > if ((mnt_has_parent(mnt) || !is_anon_ns(mnt->mnt_ns)) && !check_mnt(mnt))
> >         goto out;
> >
> > Best Regards, Tikhomirov Pavel.
> >
> > On Sun, Jul 27, 2025 at 5:01 AM Andrei Vagin <avagin@google.com> wrote:
> > >
> > > On Sat, Jul 26, 2025 at 10:53 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
> > > > > On Thu, Jul 24, 2025 at 4:00 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > > >
> > > > > > On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
> > > > > > > Hi Al and Christian,
> > > > > > >
> > > > > > > The commit 12f147ddd6de ("do_change_type(): refuse to operate on
> > > > > > > unmounted/not ours mounts") introduced an ABI backward compatibility
> > > > > > > break. CRIU depends on the previous behavior, and users are now
> > > > > > > reporting criu restore failures following the kernel update. This change
> > > > > > > has been propagated to stable kernels. Is this check strictly required?
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > > Would it be possible to check only if the current process has
> > > > > > > CAP_SYS_ADMIN within the mount user namespace?
> > > > > >
> > > > > > Not enough, both in terms of permissions *and* in terms of "thou
> > > > > > shalt not bugger the kernel data structures - nobody's priveleged
> > > > > > enough for that".
> > > > >
> > > > > Al,
> > > > >
> > > > > I am still thinking in terms of "Thou shalt not break userspace"...
> > > > >
> > > > > Seriously though, this original behavior has been in the kernel for 20
> > > > > years, and it hasn't triggered any corruptions in all that time.
> > > >
> > > > For a very mild example of fun to be had there:
> > > >         mount("none", "/mnt", "tmpfs", 0, "");
> > > >         chdir("/mnt");
> > > >         umount2(".", MNT_DETACH);
> > > >         mount(NULL, ".", NULL, MS_SHARED, NULL);
> > > > Repeat in a loop, watch mount group id leak.  That's a trivial example
> > > > of violating the assertion ("a mount that had been through umount_tree()
> > > > is out of propagation graph and related data structures for good").
> > >
> > > I wasn't referring to detached mounts. CRIU modifies mounts from
> > > non-current namespaces.
> > >
> > > >
> > > > As for the "CAP_SYS_ADMIN within the mount user namespace" - which
> > > > userns do you have in mind?
> > > >
> > >
> > > The user namespace of the target mount:
> > > ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)
> > >

^ permalink raw reply

* Re: do_change_type(): refuse to operate on unmounted/not ours mounts
From: Christian Brauner @ 2025-07-31  7:53 UTC (permalink / raw)
  To: Pavel Tikhomirov
  Cc: Andrei Vagin, Al Viro, Andrei Vagin, linux-fsdevel, LKML, criu,
	Linux API, stable
In-Reply-To: <CAE1zp74Myaab_U5ZswjCE=ND66bT907Y=vmsk14hV89R_ugbtg@mail.gmail.com>

On Thu, Jul 31, 2025 at 10:40:40AM +0800, Pavel Tikhomirov wrote:
> If detached mounts are our only concern, it looks like the check instead of:
> 
> if (!check_mnt(mnt)) {
>         err = -EINVAL;
>         goto out_unlock;
> }
> 
> could've been a more relaxed one:
> 
> if (mnt_detached(mnt)) {
>         err = -EINVAL;
>         goto out_unlock;
> }
> 
> bool mnt_detached(struct mount *mnt)
> {
>         return !mnt->mnt_ns;
> }
> 
> not to allow propagation change only on detached mounts. (As
> umount_tree sets mnt_ns to NULL.)

Changing propagation settings on detached mounts is fine and shoud work?
Changing propagation settings on unmounted mounts not so much...

> 
> Also in do_mount_setattr we have a more relaxed check too:
> 
> if ((mnt_has_parent(mnt) || !is_anon_ns(mnt->mnt_ns)) && !check_mnt(mnt))
>         goto out;
> 
> Best Regards, Tikhomirov Pavel.
> 
> On Sun, Jul 27, 2025 at 5:01 AM Andrei Vagin <avagin@google.com> wrote:
> >
> > On Sat, Jul 26, 2025 at 10:53 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
> > > > On Thu, Jul 24, 2025 at 4:00 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > >
> > > > > On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
> > > > > > Hi Al and Christian,
> > > > > >
> > > > > > The commit 12f147ddd6de ("do_change_type(): refuse to operate on
> > > > > > unmounted/not ours mounts") introduced an ABI backward compatibility
> > > > > > break. CRIU depends on the previous behavior, and users are now
> > > > > > reporting criu restore failures following the kernel update. This change
> > > > > > has been propagated to stable kernels. Is this check strictly required?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > Would it be possible to check only if the current process has
> > > > > > CAP_SYS_ADMIN within the mount user namespace?
> > > > >
> > > > > Not enough, both in terms of permissions *and* in terms of "thou
> > > > > shalt not bugger the kernel data structures - nobody's priveleged
> > > > > enough for that".
> > > >
> > > > Al,
> > > >
> > > > I am still thinking in terms of "Thou shalt not break userspace"...
> > > >
> > > > Seriously though, this original behavior has been in the kernel for 20
> > > > years, and it hasn't triggered any corruptions in all that time.
> > >
> > > For a very mild example of fun to be had there:
> > >         mount("none", "/mnt", "tmpfs", 0, "");
> > >         chdir("/mnt");
> > >         umount2(".", MNT_DETACH);
> > >         mount(NULL, ".", NULL, MS_SHARED, NULL);
> > > Repeat in a loop, watch mount group id leak.  That's a trivial example
> > > of violating the assertion ("a mount that had been through umount_tree()
> > > is out of propagation graph and related data structures for good").
> >
> > I wasn't referring to detached mounts. CRIU modifies mounts from
> > non-current namespaces.
> >
> > >
> > > As for the "CAP_SYS_ADMIN within the mount user namespace" - which
> > > userns do you have in mind?
> > >
> >
> > The user namespace of the target mount:
> > ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)
> >

^ permalink raw reply

* Re: do_change_type(): refuse to operate on unmounted/not ours mounts
From: Pavel Tikhomirov @ 2025-07-31  2:40 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: Al Viro, Andrei Vagin, Christian Brauner, linux-fsdevel, LKML,
	criu, Linux API, stable
In-Reply-To: <CAEWA0a6jgj8vQhrijSJXUHBnCTtz0HEV66tmaVKPe83ng=3feQ@mail.gmail.com>

If detached mounts are our only concern, it looks like the check instead of:

if (!check_mnt(mnt)) {
        err = -EINVAL;
        goto out_unlock;
}

could've been a more relaxed one:

if (mnt_detached(mnt)) {
        err = -EINVAL;
        goto out_unlock;
}

bool mnt_detached(struct mount *mnt)
{
        return !mnt->mnt_ns;
}

not to allow propagation change only on detached mounts. (As
umount_tree sets mnt_ns to NULL.)

Also in do_mount_setattr we have a more relaxed check too:

if ((mnt_has_parent(mnt) || !is_anon_ns(mnt->mnt_ns)) && !check_mnt(mnt))
        goto out;

Best Regards, Tikhomirov Pavel.

On Sun, Jul 27, 2025 at 5:01 AM Andrei Vagin <avagin@google.com> wrote:
>
> On Sat, Jul 26, 2025 at 10:53 AM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > On Sat, Jul 26, 2025 at 10:12:34AM -0700, Andrei Vagin wrote:
> > > On Thu, Jul 24, 2025 at 4:00 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > On Thu, Jul 24, 2025 at 01:02:48PM -0700, Andrei Vagin wrote:
> > > > > Hi Al and Christian,
> > > > >
> > > > > The commit 12f147ddd6de ("do_change_type(): refuse to operate on
> > > > > unmounted/not ours mounts") introduced an ABI backward compatibility
> > > > > break. CRIU depends on the previous behavior, and users are now
> > > > > reporting criu restore failures following the kernel update. This change
> > > > > has been propagated to stable kernels. Is this check strictly required?
> > > >
> > > > Yes.
> > > >
> > > > > Would it be possible to check only if the current process has
> > > > > CAP_SYS_ADMIN within the mount user namespace?
> > > >
> > > > Not enough, both in terms of permissions *and* in terms of "thou
> > > > shalt not bugger the kernel data structures - nobody's priveleged
> > > > enough for that".
> > >
> > > Al,
> > >
> > > I am still thinking in terms of "Thou shalt not break userspace"...
> > >
> > > Seriously though, this original behavior has been in the kernel for 20
> > > years, and it hasn't triggered any corruptions in all that time.
> >
> > For a very mild example of fun to be had there:
> >         mount("none", "/mnt", "tmpfs", 0, "");
> >         chdir("/mnt");
> >         umount2(".", MNT_DETACH);
> >         mount(NULL, ".", NULL, MS_SHARED, NULL);
> > Repeat in a loop, watch mount group id leak.  That's a trivial example
> > of violating the assertion ("a mount that had been through umount_tree()
> > is out of propagation graph and related data structures for good").
>
> I wasn't referring to detached mounts. CRIU modifies mounts from
> non-current namespaces.
>
> >
> > As for the "CAP_SYS_ADMIN within the mount user namespace" - which
> > userns do you have in mind?
> >
>
> The user namespace of the target mount:
> ns_capable(mnt->mnt_ns->user_ns, CAP_SYS_ADMIN)
>

^ permalink raw reply

* Re: [PATCH v2 31/32] libluo: introduce luoctl
From: Pratyush Yadav @ 2025-07-29 23:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Jason Gunthorpe, Thomas Gleixner, Pasha Tatashin, pratyush,
	jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, anna.schumaker, song, zhangguopeng, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, mingo, bp, dave.hansen,
	x86, hpa, rafael, dakr, bartosz.golaszewski, cw00.choi,
	myungjoo.ham, yesanishhere, Jonathan.Cameron, quic_zijuhu,
	aleksander.lobakin, ira.weiny, andriy.shevchenko, leon, lukas,
	bhelgaas, wagi, djeffery, stuart.w.hayes, lennart, brauner,
	linux-api, linux-fsdevel, saeedm, ajayachandra, parav, leonro,
	witu
In-Reply-To: <20250729183548.49d6c2dc@gandalf.local.home>

On Tue, Jul 29 2025, Steven Rostedt wrote:

> On Tue, 29 Jul 2025 19:21:57 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> > As this is an evolving mechanism, having the corresponding library in
>> > the kernel similar to what we do with perf and other things makes a lot
>> > of sense.  
>> 
>> If we did this everywhere we'd have hundreds of libraries in the
>> kernel tree and I would feel bad for all the distros that have to deal
>> with packaging such a thing :(
>> 
>> It is great for development but I'm not sure mono-repo directions are
>> so good for the overall ecosystem.
>
> I have to agree here. When libtraceevent was in the kernel, it was a pain
> to orchestrate releases with distros. When it was moved out of the kernel,
> it made it much easier to manage.
>
> The main issue was versioning numbers. I know the kernel versioning is
> simply just "hey we added more stuff" and the numbers are meaningless.
>
> But a library usually has a different cycle than the kernel. If it doesn't
> have any changes from one kernel release to the next, the distros will make
> a new version anyway, as each kernel release means a new library release.
>
> This luoctl.c isn't even a library, as it has a "main()" and looks to me
> like an application. So my question is, why is it in tools/lib?

luoctl isn't the library, it is a user of it. See previous patches for
the main library. Don't get too excited though, it is only a thin
wrapper around the ioctls. The more interesting stuff is in patch 32
which shows the API in action.

To add some context: one of the reasons to include it in the series as
an RFC at the end was to showcase the userspace side of the API and have
a way for people to see how it can be used. Seeing an API in action
provides useful context for reviewing patches.

I think Pasha forgot to add the RFC tags when he created v2, since it is
only meant to be RFC right now and not proper patches.

The point of moving out of tree was also brought up in the live update
call and I agree with Jason's feedback on it. The plan is to drop it
from the series in the next revision, and just leave a reference to it
in the cover letter instead.

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v2 31/32] libluo: introduce luoctl
From: Steven Rostedt @ 2025-07-29 22:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Gleixner, Pasha Tatashin, pratyush, jasonmiu, graf,
	changyuanl, rppt, dmatlack, rientjes, corbet, rdunlap,
	ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm, tj,
	yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, anna.schumaker, song, zhangguopeng, linux,
	linux-kernel, linux-doc, linux-mm, gregkh, mingo, bp, dave.hansen,
	x86, hpa, rafael, dakr, bartosz.golaszewski, cw00.choi,
	myungjoo.ham, yesanishhere, Jonathan.Cameron, quic_zijuhu,
	aleksander.lobakin, ira.weiny, andriy.shevchenko, leon, lukas,
	bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, parav,
	leonro, witu
In-Reply-To: <20250729222157.GT36037@nvidia.com>

On Tue, 29 Jul 2025 19:21:57 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:

> > As this is an evolving mechanism, having the corresponding library in
> > the kernel similar to what we do with perf and other things makes a lot
> > of sense.  
> 
> If we did this everywhere we'd have hundreds of libraries in the
> kernel tree and I would feel bad for all the distros that have to deal
> with packaging such a thing :(
> 
> It is great for development but I'm not sure mono-repo directions are
> so good for the overall ecosystem.

I have to agree here. When libtraceevent was in the kernel, it was a pain
to orchestrate releases with distros. When it was moved out of the kernel,
it made it much easier to manage.

The main issue was versioning numbers. I know the kernel versioning is
simply just "hey we added more stuff" and the numbers are meaningless.

But a library usually has a different cycle than the kernel. If it doesn't
have any changes from one kernel release to the next, the distros will make
a new version anyway, as each kernel release means a new library release.

This luoctl.c isn't even a library, as it has a "main()" and looks to me
like an application. So my question is, why is it in tools/lib?

-- Steve

^ permalink raw reply

* Re: [PATCH v2 31/32] libluo: introduce luoctl
From: Jason Gunthorpe @ 2025-07-29 22:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, mingo, bp, dave.hansen, x86, hpa,
	rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <877bzqkc38.ffs@tglx>

On Tue, Jul 29, 2025 at 09:53:47PM +0200, Thomas Gleixner wrote:
> On Tue, Jul 29 2025 at 13:14, Jason Gunthorpe wrote:
> > On Wed, Jul 23, 2025 at 02:46:44PM +0000, Pasha Tatashin wrote:
> >> From: Pratyush Yadav <ptyadav@amazon.de>
> >>  tools/lib/luo/Makefile       |   6 +-
> >>  tools/lib/luo/cli/.gitignore |   1 +
> >>  tools/lib/luo/cli/Makefile   |  18 ++++
> >>  tools/lib/luo/cli/luoctl.c   | 178 +++++++++++++++++++++++++++++++++++
> >>  4 files changed, 202 insertions(+), 1 deletion(-)
> >>  create mode 100644 tools/lib/luo/cli/.gitignore
> >>  create mode 100644 tools/lib/luo/cli/Makefile
> >>  create mode 100644 tools/lib/luo/cli/luoctl.c
> >
> > In the calls I thought the plan had changed to put libluo in its own
> > repository?
> >
> > There is nothing tightly linked to the kernel here, I think it would
> > be easier on everyone to not add ordinary libraries to the kernel
> > tree.
> 
> As this is an evolving mechanism, having the corresponding library in
> the kernel similar to what we do with perf and other things makes a lot
> of sense.

If we did this everywhere we'd have hundreds of libraries in the
kernel tree and I would feel bad for all the distros that have to deal
with packaging such a thing :(

It is great for development but I'm not sure mono-repo directions are
so good for the overall ecosystem.

I understood perf had a special reason to be in the kernel tree? I
don't think there is any special here beyond it is new.

Jason

^ permalink raw reply

* Re: [PATCH v2 31/32] libluo: introduce luoctl
From: Thomas Gleixner @ 2025-07-29 19:53 UTC (permalink / raw)
  To: Jason Gunthorpe, Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250729161450.GM36037@nvidia.com>

On Tue, Jul 29 2025 at 13:14, Jason Gunthorpe wrote:
> On Wed, Jul 23, 2025 at 02:46:44PM +0000, Pasha Tatashin wrote:
>> From: Pratyush Yadav <ptyadav@amazon.de>
>>  tools/lib/luo/Makefile       |   6 +-
>>  tools/lib/luo/cli/.gitignore |   1 +
>>  tools/lib/luo/cli/Makefile   |  18 ++++
>>  tools/lib/luo/cli/luoctl.c   | 178 +++++++++++++++++++++++++++++++++++
>>  4 files changed, 202 insertions(+), 1 deletion(-)
>>  create mode 100644 tools/lib/luo/cli/.gitignore
>>  create mode 100644 tools/lib/luo/cli/Makefile
>>  create mode 100644 tools/lib/luo/cli/luoctl.c
>
> In the calls I thought the plan had changed to put libluo in its own
> repository?
>
> There is nothing tightly linked to the kernel here, I think it would
> be easier on everyone to not add ordinary libraries to the kernel
> tree.

As this is an evolving mechanism, having the corresponding library in
the kernel similar to what we do with perf and other things makes a lot
of sense.

Thanks,

        tglx

^ permalink raw reply

* Re: [PATCH v2 14/32] liveupdate: luo_files: add infrastructure for FDs
From: Jason Gunthorpe @ 2025-07-29 17:33 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-15-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:27PM +0000, Pasha Tatashin wrote:
> +/**
> + * struct liveupdate_file_ops - Callbacks for for live-updatable files.
> + * @prepare:       Optional. Saves state for a specific file instance (@file,
> + *                 @arg) before update, potentially returning value via @data.
> + *                 Returns 0 on success, negative errno on failure.
> + * @freeze:        Optional. Performs final actions just before kernel
> + *                 transition, potentially reading/updating the handle via
> + *                 @data.
> + *                 Returns 0 on success, negative errno on failure.
> + * @cancel:        Optional. Cleans up state/resources if update is aborted
> + *                 after prepare/freeze succeeded, using the @data handle (by
> + *                 value) from the successful prepare. Returns void.
> + * @finish:        Optional. Performs final cleanup in the new kernel using the
> + *                 preserved @data handle (by value). Returns void.
> + * @retrieve:      Retrieve the preserved file. Must be called before finish.
> + * @can_preserve:  callback to determine if @file with associated context (@arg)
> + *                 can be preserved by this handler.
> + *                 Return bool (true if preservable, false otherwise).
> + */
> +struct liveupdate_file_ops {
> +	int (*prepare)(struct file *file, void *arg, u64 *data);
> +	int (*freeze)(struct file *file, void *arg, u64 *data);
> +	void (*cancel)(struct file *file, void *arg, u64 data);
> +	void (*finish)(struct file *file, void *arg, u64 data, bool reclaimed);
> +	int (*retrieve)(void *arg, u64 data, struct file **file);
> +	bool (*can_preserve)(struct file *file, void *arg);
> +};

ops structures often have an owner = THIS_MODULE

It wouldn't hurt to add it here too, and some appropriate module_get's
though I didn't try to figure what happens if userspace races a module
unload with other luo operations.

> +
> +/**
> + * struct liveupdate_file_handler - Represents a handler for a live-updatable
> + * file type.
> + * @ops:           Callback functions
> + * @compatible:    The compatibility string (e.g., "memfd-v1", "vfiofd-v1")
> + *                 that uniquely identifies the file type this handler supports.
> + *                 This is matched against the compatible string associated with
> + *                 individual &struct liveupdate_file instances.
> + * @arg:           An opaque pointer to implementation-specific context data
> + *                 associated with this file handler registration.

Why? This is not the normal way, if you want context data then
allocate a struct driver_liveupdate_file_handler and embed a normal
struct liveupdate_file_handler inside it, then use container_of.

> +	fdt_for_each_subnode(file_node_offset, luo_file_fdt_in, 0) {
> +		bool handler_found = false;
> +		u64 token;
> +
> +		node_name = fdt_get_name(luo_file_fdt_in, file_node_offset,
> +					 NULL);
> +		if (!node_name) {
> +			panic("FDT subnode at offset %d: Cannot get name\n",
> +			      file_node_offset);

I think this approach will raise lots of questions..

I'd introduce a new function "luo_deserialize_failure" that does panic
internally.

Only called by places that are parsing the FDT & related but run into
trouble that cannot be savely recovered from.

Jason

^ permalink raw reply

* Re: [PATCH v2 10/32] liveupdate: luo_core: Live Update Orchestrator
From: Jason Gunthorpe @ 2025-07-29 17:28 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-11-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:23PM +0000, Pasha Tatashin wrote:
> Introduce LUO, a mechanism intended to facilitate kernel updates while
> keeping designated devices operational across the transition (e.g., via
> kexec). The primary use case is updating hypervisors with minimal
> disruption to running virtual machines. For userspace side of hypervisor
> update we have copyless migration. LUO is for updating the kernel.
> 
> This initial patch lays the groundwork for the LUO subsystem.
> 
> Further functionality, including the implementation of state transition
> logic, integration with KHO, and hooks for subsystems and file
> descriptors, will be added in subsequent patches.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  include/linux/liveupdate.h       | 140 ++++++++++++++
>  kernel/liveupdate/Kconfig        |  27 +++
>  kernel/liveupdate/Makefile       |   1 +
>  kernel/liveupdate/luo_core.c     | 301 +++++++++++++++++++++++++++++++
>  kernel/liveupdate/luo_internal.h |  21 +++
>  5 files changed, 490 insertions(+)
>  create mode 100644 include/linux/liveupdate.h
>  create mode 100644 kernel/liveupdate/luo_core.c
>  create mode 100644 kernel/liveupdate/luo_internal.h
> 
> diff --git a/include/linux/liveupdate.h b/include/linux/liveupdate.h
> new file mode 100644
> index 000000000000..da8f05c81e51
> --- /dev/null
> +++ b/include/linux/liveupdate.h
> @@ -0,0 +1,140 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +
> +/*
> + * Copyright (c) 2025, Google LLC.
> + * Pasha Tatashin <pasha.tatashin@soleen.com>
> + */
> +#ifndef _LINUX_LIVEUPDATE_H
> +#define _LINUX_LIVEUPDATE_H
> +
> +#include <linux/bug.h>
> +#include <linux/types.h>
> +#include <linux/list.h>
> +
> +/**
> + * enum liveupdate_event - Events that trigger live update callbacks.
> + * @LIVEUPDATE_PREPARE: PREPARE should happen *before* the blackout window.
> + *                      Subsystems should prepare for an upcoming reboot by
> + *                      serializing their states. However, it must be considered
> + *                      that user applications, e.g. virtual machines are still
> + *                      running during this phase.
> + * @LIVEUPDATE_FREEZE:  FREEZE sent from the reboot() syscall, when the current
> + *                      kernel is on its way out. This is the final opportunity
> + *                      for subsystems to save any state that must persist
> + *                      across the reboot. Callbacks for this event should be as
> + *                      fast as possible since they are on the critical path of
> + *                      rebooting into the next kernel.
> + * @LIVEUPDATE_FINISH:  FINISH is sent in the newly booted kernel after a
> + *                      successful live update and normally *after* the blackout
> + *                      window. Subsystems should perform any final cleanup
> + *                      during this phase. This phase also provides an
> + *                      opportunity to clean up devices that were preserved but
> + *                      never explicitly reclaimed during the live update
> + *                      process. State restoration should have already occurred
> + *                      before this event. Callbacks for this event must not
> + *                      fail. The completion of this call transitions the
> + *                      machine from ``updated`` to ``normal`` state.
> + * @LIVEUPDATE_CANCEL:  CANCEL the live update and go back to normal state. This
> + *                      event is user initiated, or is done automatically when
> + *                      LIVEUPDATE_PREPARE or LIVEUPDATE_FREEZE stage fails.
> + *                      Subsystems should revert any actions taken during the
> + *                      corresponding prepare event. Callbacks for this event
> + *                      must not fail.
> + *
> + * These events represent the different stages and actions within the live
> + * update process that subsystems (like device drivers and bus drivers)
> + * need to be aware of to correctly serialize and restore their state.
> + *
> + */
> +enum liveupdate_event {
> +	LIVEUPDATE_PREPARE,
> +	LIVEUPDATE_FREEZE,
> +	LIVEUPDATE_FINISH,
> +	LIVEUPDATE_CANCEL,
> +};

I saw a later patch moves these hunks, that is poor patch planning.

Ideally an ioctl subsystem should start out with the first patch
introducing the basic cdev, file open, ioctl dispatch, ioctl uapi
header and related simple infrastructure.

Then you'd go basically ioctl by ioctl adding the new ioctls and
explaining what they do in the patch commit messages.

> +/**
> + * liveupdate_state_updated - Check if the system is in the live update
> + * 'updated' state.
> + *
> + * This function checks if the live update orchestrator is in the
> + * ``LIVEUPDATE_STATE_UPDATED`` state. This state indicates that the system has
> + * successfully rebooted into a new kernel as part of a live update, and the
> + * preserved devices are expected to be in the process of being reclaimed.
> + *
> + * This is typically used by subsystems during early boot of the new kernel
> + * to determine if they need to attempt to restore state from a previous
> + * live update.
> + *
> + * @return true if the system is in the ``LIVEUPDATE_STATE_UPDATED`` state,
> + * false otherwise.
> + */
> +bool liveupdate_state_updated(void)
> +{
> +	return is_current_luo_state(LIVEUPDATE_STATE_UPDATED);
> +}
> +EXPORT_SYMBOL_GPL(liveupdate_state_updated);

Unless there are existing in tree users there should not be exports.

I'm also not really sure why there is global state, I would expect the
fd and session objects to record what kind of things they are, not
having weird globals.

Like liveupdate_register_subsystem() stuff, it already has a lock,
&luo_subsystem_list_mutex, if you want to block mutation of the list
then, IMHO, it makes more sense to stick a specific variable
'luo_subsystems_list_immutable' under that lock and make it very
obvious.

Stuff like luo_files_startup() feels clunky to me:

+       ret = liveupdate_register_subsystem(&luo_file_subsys);
+       if (ret) {
+               pr_warn("Failed to register luo_file subsystem [%d]\n", ret);
+               return ret;
+       }
+
+       if (liveupdate_state_updated()) {

Thats going to be a standard pattern - I would expect that
liveupdate_register_subsystem() would do the check for updated and
then arrange to call back something like
liveupdate_subsystem.ops.post_update()

And then post_update() would get the info that is currently under
liveupdate_get_subsystem_data() as arguments instead of having to make
more functions calls.

Maybe even the fdt_node_check_compatible() can be hoisted.

That would remove a bunch more liveupdate_state_updated() calls.

etc.

Jason

^ permalink raw reply

* Re: [PATCH v2 09/32] liveupdate: kho: move to kernel/liveupdate
From: Jason Gunthorpe @ 2025-07-29 17:14 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-10-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:22PM +0000, Pasha Tatashin wrote:
> Move KHO to kernel/liveupdate/ in preparation of placing all Live Update
> core kernel related files to the same place.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  Documentation/core-api/kho/concepts.rst       |  2 +-
>  MAINTAINERS                                   |  2 +-
>  init/Kconfig                                  |  2 ++
>  kernel/Kconfig.kexec                          | 25 ----------------
>  kernel/Makefile                               |  3 +-
>  kernel/liveupdate/Kconfig                     | 30 +++++++++++++++++++
>  kernel/liveupdate/Makefile                    |  7 +++++
>  kernel/{ => liveupdate}/kexec_handover.c      |  6 ++--
>  .../{ => liveupdate}/kexec_handover_debug.c   |  0
>  .../kexec_handover_internal.h                 |  0
>  10 files changed, 45 insertions(+), 32 deletions(-)
>  create mode 100644 kernel/liveupdate/Kconfig
>  create mode 100644 kernel/liveupdate/Makefile
>  rename kernel/{ => liveupdate}/kexec_handover.c (99%)
>  rename kernel/{ => liveupdate}/kexec_handover_debug.c (100%)
>  rename kernel/{ => liveupdate}/kexec_handover_internal.h (100%)

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason

^ permalink raw reply

* Re: [PATCH v2 16/32] liveupdate: luo_ioctl: add ioctl interface
From: Jason Gunthorpe @ 2025-07-29 16:35 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-17-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:29PM +0000, Pasha Tatashin wrote:
> Introduce the user-space interface for the Live Update Orchestrator
> via ioctl commands, enabling external control over the live update
> process and management of preserved resources.

I strongly recommend copying something like fwctl (which is copying
iommufd, which is copying some other best practices). I will try to
outline the main points below.

The design of the fwctl scheme allows alot of options for ABI
compatible future extensions and I very strongly recommend that
complex ioctl style APIs be built with that in mind. I have so many
scars from trying to undo fixed ABI design :)

> +/**
> + * struct liveupdate_fd - Holds parameters for preserving and restoring file
> + * descriptors across live update.
> + * @fd:    Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: The user-space file
> + *         descriptor to be preserved.
> + *         Output for %LIVEUPDATE_IOCTL_FD_RESTORE: The new file descriptor
> + *         representing the fully restored kernel resource.
> + * @flags: Unused, reserved for future expansion, must be set to 0.
> + * @token: Input for %LIVEUPDATE_IOCTL_FD_PRESERVE: An opaque, unique token
> + *         preserved for preserved resource.
> + *         Input for %LIVEUPDATE_IOCTL_FD_RESTORE: The token previously
> + *         provided to the preserve ioctl for the resource to be restored.
> + *
> + * This structure is used as the argument for the %LIVEUPDATE_IOCTL_FD_PRESERVE
> + * and %LIVEUPDATE_IOCTL_FD_RESTORE ioctls. These ioctls allow specific types
> + * of file descriptors (for example memfd, kvm, iommufd, and VFIO) to have their
> + * underlying kernel state preserved across a live update cycle.
> + *
> + * To preserve an FD, user space passes this struct to
> + * %LIVEUPDATE_IOCTL_FD_PRESERVE with the @fd field set. On success, the
> + * kernel uses the @token field to uniquly associate the preserved FD.
> + *
> + * After the live update transition, user space passes the struct populated with
> + * the *same* @token to %LIVEUPDATE_IOCTL_FD_RESTORE. The kernel uses the @token
> + * to find the preserved state and, on success, populates the @fd field with a
> + * new file descriptor referring to the restored resource.
> + */
> +struct liveupdate_fd {
> +	int		fd;

'int' should not appear in uapi structs. Fds are __s32

> +	__u32		flags;
> +	__aligned_u64	token;
> +};
> +
> +/* The ioctl type, documented in ioctl-number.rst */
> +#define LIVEUPDATE_IOCTL_TYPE		0xBA

I have found it very helpful to organize the ioctl numbering like this:

#define IOMMUFD_TYPE (';')

enum {
	IOMMUFD_CMD_BASE = 0x80,
	IOMMUFD_CMD_DESTROY = IOMMUFD_CMD_BASE,
	IOMMUFD_CMD_IOAS_ALLOC = 0x81,
	IOMMUFD_CMD_IOAS_ALLOW_IOVAS = 0x82,
[..]

#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)

The numbers should be tightly packed and non-overlapping. It becomes
difficult to manage this if the numbers are sprinkled all over the
file. The above structuring will enforce git am conflicts if things
get muddled up. Saved me a few times already in iommufd.

> +/**
> + * LIVEUPDATE_IOCTL_FD_PRESERVE - Validate and initiate preservation for a file
> + * descriptor.
> + *
> + * Argument: Pointer to &struct liveupdate_fd.
> + *
> + * User sets the @fd field identifying the file descriptor to preserve
> + * (e.g., memfd, kvm, iommufd, VFIO). The kernel validates if this FD type
> + * and its dependencies are supported for preservation. If validation passes,
> + * the kernel marks the FD internally and *initiates the process* of preparing
> + * its state for saving. The actual snapshotting of the state typically occurs
> + * during the subsequent %LIVEUPDATE_IOCTL_PREPARE execution phase, though
> + * some finalization might occur during freeze.
> + * On successful validation and initiation, the kernel uses the @token
> + * field with an opaque identifier representing the resource being preserved.
> + * This token confirms the FD is targeted for preservation and is required for
> + * the subsequent %LIVEUPDATE_IOCTL_FD_RESTORE call after the live update.
> + *
> + * Return: 0 on success (validation passed, preservation initiated), negative
> + * error code on failure (e.g., unsupported FD type, dependency issue,
> + * validation failed).
> + */
> +#define LIVEUPDATE_IOCTL_FD_PRESERVE					\
> +	_IOW(LIVEUPDATE_IOCTL_TYPE, 0x00, struct liveupdate_fd)

From a kdoc perspective I find it works much better to attach the kdoc
to the struct, not the ioctl:

/**
 * struct iommu_destroy - ioctl(IOMMU_DESTROY)
 * @size: sizeof(struct iommu_destroy)
 * @id: iommufd object ID to destroy. Can be any destroyable object type.
 *
 * Destroy any object held within iommufd.
 */
struct iommu_destroy {
	__u32 size;
	__u32 id;
};
#define IOMMU_DESTROY _IO(IOMMUFD_TYPE, IOMMUFD_CMD_DESTROY)

Generates this kdoc:

https://docs.kernel.org/userspace-api/iommufd.html#c.iommu_destroy

You should also make sure to link the uapi header into the kdoc build
under the "userspace API" chaper.

The structs should also be self-describing. I am fairly strongly
against using the size mechanism in the _IOW macro, it is instantly
ABI incompatible and basically impossible to deal with from userspace.

Hence why the IOMMFD version is _IO().

This means stick a size member in the first 4 bytes of every
struct. More on this later..

> +/**
> + * LIVEUPDATE_IOCTL_FD_UNPRESERVE - Remove a file descriptor from the
> + * preservation list.
> + *
> + * Argument: Pointer to __u64 token.

Every ioctl should have a struct, with the size header. If you want to
do more down the road you can not using this structure.

> +#define LIVEUPDATE_IOCTL_FD_RESTORE					\
> +	_IOWR(LIVEUPDATE_IOCTL_TYPE, 0x02, struct liveupdate_fd)

Strongly recommend that every ioctl have a unique struct. Sharing
structs makes future extend-ability harder.

> +/**
> + * LIVEUPDATE_IOCTL_PREPARE - Initiate preparation phase and trigger state
> + * saving.

Perhaps these just want to be a single 'set state' ioctl with an enum
input argument?

> @@ -7,4 +7,5 @@ obj-$(CONFIG_KEXEC_HANDOVER)		+= kexec_handover.o
>  obj-$(CONFIG_KEXEC_HANDOVER_DEBUG)	+= kexec_handover_debug.o
>  obj-$(CONFIG_LIVEUPDATE)		+= luo_core.o
>  obj-$(CONFIG_LIVEUPDATE)		+= luo_files.o
> +obj-$(CONFIG_LIVEUPDATE)		+= luo_ioctl.o
>  obj-$(CONFIG_LIVEUPDATE)		+= luo_subsystems.o

I don't think luo is modular, but I think it is generally better to
write the kbuilds as though it was anyhow if it has a lot of files:

iommufd-y := \
	device.o \
	eventq.o \
	hw_pagetable.o \
	io_pagetable.o \
	ioas.o \
	main.o \
	pages.o \
	vfio_compat.o \
	viommu.o
obj-$(CONFIG_IOMMUFD) += iommufd.o

Basically don't repeat obj-$(CONFIG_LIVEUPDATE), every one of those
lines creates a new module (if it was modular)

> +static int luo_open(struct inode *inodep, struct file *filep)
> +{
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EACCES;

IMHO file system permissions should control permission to open. No
capable check.

> +	if (filep->f_flags & O_EXCL)
> +		return -EINVAL;

O_EXCL doesn't really do anything for cdev, I'd drop this.

The open should have an atomic to check for single open though.

> +static long luo_ioctl(struct file *filep, unsigned int cmd, unsigned long arg)
> +{
> +	void __user *argp = (void __user *)arg;
> +	struct liveupdate_fd luo_fd;
> +	enum liveupdate_state state;
> +	int ret = 0;
> +	u64 token;
> +
> +	if (_IOC_TYPE(cmd) != LIVEUPDATE_IOCTL_TYPE)
> +		return -ENOTTY;

The generic parse/disptach from fwctl is a really good idea here, you
can cut and paste it, change the names. It makes it really easy to manage future extensibility:

List the ops and their structs:

static const struct fwctl_ioctl_op fwctl_ioctl_ops[] = {
	IOCTL_OP(FWCTL_INFO, fwctl_cmd_info, struct fwctl_info, out_device_data),
	IOCTL_OP(FWCTL_RPC, fwctl_cmd_rpc, struct fwctl_rpc, out),
};

Index the list and copy_from_user the struct desribing the opt:

static long fwctl_fops_ioctl(struct file *filp, unsigned int cmd,
			       unsigned long arg)
{
	struct fwctl_uctx *uctx = filp->private_data;
	const struct fwctl_ioctl_op *op;
	struct fwctl_ucmd ucmd = {};
	union fwctl_ucmd_buffer buf;
	unsigned int nr;
	int ret;

	nr = _IOC_NR(cmd);
	if ((nr - FWCTL_CMD_BASE) >= ARRAY_SIZE(fwctl_ioctl_ops))
		return -ENOIOCTLCMD;

	op = &fwctl_ioctl_ops[nr - FWCTL_CMD_BASE];
	if (op->ioctl_num != cmd)
		return -ENOIOCTLCMD;

	ucmd.uctx = uctx;
	ucmd.cmd = &buf;
	ucmd.ubuffer = (void __user *)arg;
        // This is reading/checking the standard 4 byte size header:
	ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
	if (ret)
		return ret;

	if (ucmd.user_size < op->min_size)
		return -EINVAL;

	ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
				    ucmd.user_size);


Removes a bunch of boiler plate and easy to make wrong copy_from_users
in the ioctls. Centralizes size validation, zero padding checking/etc.

> +		ret = luo_register_file(luo_fd.token, luo_fd.fd);
> +		if (!ret && copy_to_user(argp, &luo_fd, sizeof(luo_fd))) {
> +			WARN_ON_ONCE(luo_unregister_file(luo_fd.token));
> +			ret = -EFAULT;

Then for extensibility you'd copy back the struct:

static int ucmd_respond(struct fwctl_ucmd *ucmd, size_t cmd_len)
{
	if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
			 min_t(size_t, ucmd->user_size, cmd_len)))
		return -EFAULT;
	return 0;
}

Which truncates it/etc according to some ABI extensibility rules.

> +static int __init liveupdate_init(void)
> +{
> +	int err;
> +
> +	if (!liveupdate_enabled())
> +		return 0;
> +
> +	err = misc_register(&liveupdate_miscdev);
> +	if (err < 0) {
> +		pr_err("Failed to register misc device '%s': %d\n",
> +		       liveupdate_miscdev.name, err);

Should remove most of the pr_err's, here too IMHO..

Jason

^ permalink raw reply

* Re: [PATCH v2 31/32] libluo: introduce luoctl
From: Jason Gunthorpe @ 2025-07-29 16:14 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250723144649.1696299-32-pasha.tatashin@soleen.com>

On Wed, Jul 23, 2025 at 02:46:44PM +0000, Pasha Tatashin wrote:
> From: Pratyush Yadav <ptyadav@amazon.de>
> 
> luoctl is a utility to interact with the LUO state machine. It currently
> supports viewing and change the current state of LUO. This can be used
> by scripts, tools, or developers to control LUO state during the live
> update process.
>
> Example usage:
> 
>     $ luoctl state
>     normal
>     $ luoctl prepare
>     $ luoctl state
>     prepared
>     $ luoctl cancel
>     $ luoctl state
>     normal
> 
> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  tools/lib/luo/Makefile       |   6 +-
>  tools/lib/luo/cli/.gitignore |   1 +
>  tools/lib/luo/cli/Makefile   |  18 ++++
>  tools/lib/luo/cli/luoctl.c   | 178 +++++++++++++++++++++++++++++++++++
>  4 files changed, 202 insertions(+), 1 deletion(-)
>  create mode 100644 tools/lib/luo/cli/.gitignore
>  create mode 100644 tools/lib/luo/cli/Makefile
>  create mode 100644 tools/lib/luo/cli/luoctl.c

In the calls I thought the plan had changed to put libluo in its own
repository?

There is nothing tightly linked to the kernel here, I think it would
be easier on everyone to not add ordinary libraries to the kernel
tree.

Jason

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox