Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-27  8:19 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826205057.GC1603531@mit.edu>

On Tue, Aug 26, 2025 at 04:50:57PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 26, 2025 at 07:47:30PM +0200, Mickaël Salaün wrote:
> > 
> >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> >   on a regular file and returns 0 if execution of this file would be
> >   allowed, ignoring the file format and then the related interpreter
> >   dependencies (e.g. ELF libraries, script’s shebang).
> 
> But if that's it, why can't the script interpreter (python, bash,
> etc.) before executing the script, checks for executability via
> faccessat(2) or fstat(2)?

From commit a5874fde3c08 ("exec: Add a new AT_EXECVE_CHECK flag to
execveat(2)"):

    This is different from faccessat(2) + X_OK which only checks a subset of
    access rights (i.e. inode permission and mount options for regular
    files), but not the full context (e.g. all LSM access checks).  The main
    use case for access(2) is for SUID processes to (partially) check access
    on behalf of their caller.  The main use case for execveat(2) +
    AT_EXECVE_CHECK is to check if a script execution would be allowed,
    according to all the different restrictions in place.  Because the use
    of AT_EXECVE_CHECK follows the exact kernel semantic as for a real
    execution, user space gets the same error codes.


> 
> The whole O_DONY_WRITE dicsussion seemed to imply that AT_EXECVE_CHECK
> was doing more than just the executability check?

I would say that that AT_EXECVE_CHECK does a full executability check
(with the full caller's credentials checked against the currently
enforced security policy).

The rationale to add O_DENY_WRITE (which is now abandoned) was to avoid a race
condition between the check and the full read.  Indeed, with a full
execveat(2), the kernel write-lock the file to avoid such issue (which can lead
to other issues).

> 
> > There is no other way for user space to reliably check executability of
> > files (taking into account all enforced security
> > policies/configurations).
> 
> Why doesn't faccessat(2) or fstat(2) suffice?  This is why having a
> more substantive requirements and design doc might be helpful.  It
> appears you have some assumptions that perhaps other kernel developers
> are not aware.  I certainly seem to be missing something.....

My reasoning was to explain the rationale for a kernel feature in the commit
message, and the user doc (why and how to use it) in the user-facing
documentation.  Documentation improvements are welcome!

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Theodore Ts'o @ 2025-08-26 20:50 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826.iewie7Et5aiw@digikod.net>

On Tue, Aug 26, 2025 at 07:47:30PM +0200, Mickaël Salaün wrote:
> 
>   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
>   on a regular file and returns 0 if execution of this file would be
>   allowed, ignoring the file format and then the related interpreter
>   dependencies (e.g. ELF libraries, script’s shebang).

But if that's it, why can't the script interpreter (python, bash,
etc.) before executing the script, checks for executability via
faccessat(2) or fstat(2)?

The whole O_DONY_WRITE dicsussion seemed to imply that AT_EXECVE_CHECK
was doing more than just the executability check?

> There is no other way for user space to reliably check executability of
> files (taking into account all enforced security
> policies/configurations).

Why doesn't faccessat(2) or fstat(2) suffice?  This is why having a
more substantive requirements and design doc might be helpful.  It
appears you have some assumptions that perhaps other kernel developers
are not aware.  I certainly seem to be missing something.....

    		  	    	    - Ted

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Jeff Xu @ 2025-08-26 20:29 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Jeff Xu, Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner,
	Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <20250826.eWi6chuayae4@digikod.net>

Hi Mickaël

On Tue, Aug 26, 2025 at 5:39 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote:
> > Hi Mickaël
> >
> > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
> > >
> > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > > >
> > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > > > > passed file descriptors).  This changes the state of the opened file by
> > > > > > > making it read-only until it is closed.  The main use case is for script
> > > > > > > interpreters to get the guarantee that script' content cannot be altered
> > > > > > > while being read and interpreted.  This is useful for generic distros
> > > > > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > > > > >
> > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > > > > property on files with deny_write_access().  This new O_DENY_WRITE make
> > > > > >
> > > > > > The kernel actually tried to get rid of this behavior on execve() in
> > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > > > > because it broke userspace assumptions.
> > > > >
> > > > > Oh, good to know.
> > > > >
> > > > > >
> > > > > > > it widely available.  This is similar to what other OSs may provide
> > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > > > > >
> > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > > > > removed for security reasons; as
> > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > > > > >
> > > > > > |        MAP_DENYWRITE
> > > > > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > > > > |               signaled that attempts to write to the underlying file
> > > > > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > > > > |               of-service attacks.)"
> > > > > >
> > > > > > It seems to me that the same issue applies to your patch - it would
> > > > > > allow unprivileged processes to essentially lock files such that other
> > > > > > processes can't write to them anymore. This might allow unprivileged
> > > > > > users to prevent root from updating config files or stuff like that if
> > > > > > they're updated in-place.
> > > > >
> > > > > Yes, I agree, but since it is the case for executed files I though it
> > > > > was worth starting a discussion on this topic.  This new flag could be
> > > > > restricted to executable files, but we should avoid system-wide locks
> > > > > like this.  I'm not sure how Windows handle these issues though.
> > > > >
> > > > > Anyway, we should rely on the access control policy to control write and
> > > > > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > > > > the references and the background!
> > > >
> > > > I'm confused.  I understand that there are many contexts in which one
> > > > would want to prevent execution of unapproved content, which might
> > > > include preventing a given process from modifying some code and then
> > > > executing it.
> > > >
> > > > I don't understand what these deny-write features have to do with it.
> > > > These features merely prevent someone from modifying code *that is
> > > > currently in use*, which is not at all the same thing as preventing
> > > > modifying code that might get executed -- one can often modify
> > > > contents *before* executing those contents.
> > >
> > > The order of checks would be:
> > > 1. open script with O_DENY_WRITE
> > > 2. check executability with AT_EXECVE_CHECK
> > > 3. read the content and interpret it
> > >
> > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving.
> >
> > AT_EXECVE_CHECK is not just for scripting languages. It could also
> > work with bytecodes like Java, for example. If we let the Java runtime
> > call AT_EXECVE_CHECK before loading the bytecode, the LSM could
> > develop a policy based on that.
>
> Sure, I'm using "script" to make it simple, but this applies to other
> use cases.
>
That makes sense.

> >
> > > The deny-write feature was to guarantee that there is no race condition
> > > between step 2 and 3.  All these checks are supposed to be done by a
> > > trusted interpreter (which is allowed to be executed).  The
> > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > > associated security policies) allowed the *current* content of the file
> > > to be executed.  Whatever happen before or after that (wrt.
> > > O_DENY_WRITE) should be covered by the security policy.
> > >
> > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK.
> >
> > Enforcing non-write for the path that stores scripts or bytecodes can
> > be challenging due to historical or backward compatibility reasons.
> > Since AT_EXECVE_CHECK provides a mechanism to check the file right
> > before it is used, we can assume it will detect any "problem" that
> > happened before that, (e.g. the file was overwritten). However, that
> > also imposes two additional requirements:
> > 1> the file doesn't change while AT_EXECVE_CHECK does the check.
>
> This is already the case, so any kind of LSM checks are good.
>
May I ask how this is done? some code in do_open_execat() does this ?
Apologies if this is a basic question.

> > 2>The file content kept by the process remains unchanged after passing
> > the AT_EXECVE_CHECK.
>
> The goal of this patch was to avoid such race condition in the case
> where executable files can be updated.  But in most cases it should not
> be a security issue (because processes allowed to write to executable
> files should be trusted), but this could still lead to bugs (because of
> inconsistent file content, half-updated).
>
There is also a time gap between:
a> the time of AT_EXECVE_CHECK
b> the time that the app opens the file for execution.
right ? another potential attack path (though this is not the case I
mentioned previously).

For the case I mentioned previously, I have to think more if the race
condition is a bug or security issue.
IIUC, two solutions are discussed so far:
1> the process could write to fs to update the script.  However, for
execution, the process still uses the copy that passed the
AT_EXECVE_CHECK. (snapshot solution by Andy Lutomirski)
or 2> the process blocks the write while opening the file as read only
and executing the script. (this seems to be the approach of this
patch).

I wonder if there are other ideas.

Thanks and regards,
-Jeff

^ permalink raw reply

* Re: [PATCH v3 19/30] liveupdate: luo_sysfs: add sysfs state monitoring
From: Pasha Tatashin @ 2025-08-26 18:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826160307.GC2130239@nvidia.com>

On Tue, Aug 26, 2025 at 4:03 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:25AM +0000, Pasha Tatashin wrote:
> > Introduce a sysfs interface for the Live Update Orchestrator
> > under /sys/kernel/liveupdate/. This interface provides a way for
> > userspace tools and scripts to monitor the current state of the LUO
> > state machine.
>
> Now that you have a cdev these files may be more logically placed
> under the cdev's sysfs and not under kernel? This can be done easially
> using the attribute mechanisms in the struct device.
>
> Again sort of back to my earlier point that everything should be
> logically linked to the cdev as though there could be many cdevs, even
> though there are not. It just keeps the code design more properly
> layered and understanble rather than doing something unique..

I am going to drop this patch entirely, and only rely on "luoctl
state" (see https://tinyurl.com/luoddesign) to query the state from
"/dev/liveupdate"

Pasha

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-26 17:47 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826123041.GB1603531@mit.edu>

On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> Is there a single, unified design and requirements document that
> describes the threat model, and what you are trying to achieve with
> AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> that has landed for AT_EXECVE_CHECK and it really doesn't describe
> what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> 
>    "The AT_EXECVE_CHECK execveat(2) flag, and the
>    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
>    securebits are intended for script interpreters and dynamic linkers
>    to enforce a consistent execution security policy handled by the
>    kernel."

From the documentation:

  Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
  on a regular file and returns 0 if execution of this file would be
  allowed, ignoring the file format and then the related interpreter
  dependencies (e.g. ELF libraries, script’s shebang).

> 
> Um, what security policy?

Whether the file is allowed to be executed.  This includes file
permission, mount point option, ACL, LSM policies...

> What checks?

Executability checks?

> What is a sample exploit
> which is blocked by AT_EXECVE_CHECK?

Executing/interpreting any data: sh script.txt

> 
> And then on top of it, why can't you do these checks by modifying the
> script interpreters?

The script interpreter requires modification to use AT_EXECVE_CHECK.

There is no other way for user space to reliably check executability of
files (taking into account all enforced security
policies/configurations).

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-26 17:08 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bB9r_pMzd0VbLsAGTwh8kvV_o3rFM_W--drutewomr1ZQ@mail.gmail.com>

On Tue, Aug 26, 2025 at 05:03:59PM +0000, Pasha Tatashin wrote:

> Perhaps, we should add session extensions to the kernel as follow-up
> after this series lands, we would also need to rewrite luod design
> accordingly to move some of the sessions logic into the kernel.

This is what I imagined at least..

I wouldn't even try to do anything with pid if it can't solve the
whole problem.

Jason

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-26 17:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162203.GE2130239@nvidia.com>

> > The existing interface, with the addition of passing a pidfd, provides
> > the necessary flexibility without being invasive. The change would be
> > localized to the new code that performs the FD retrieval and wouldn't
> > involve spoofing current or making widespread changes.
> > For example, to handle cgroup charging for a memfd, the flow inside
> > memfd_luo_retrieve() would look something like this:
> >
> > task = get_pid_task(target_pid, PIDTYPE_PID);
> > mm = get_task_mm(task);
> >     // ...
> >     folio = kho_restore_folio(phys);
> >     // Charge to the target mm, not 'current->mm'
> >     mem_cgroup_charge(folio, mm, ...);
> > mmput(mm);
> > put_task_struct(task);
>
> Execpt it doesn't work like that in all places, iommufd for example
> uses GFP_KERNEL_ACCOUNT which relies on current.

That's a good point. For kernel allocations, I don't see a clean way
to account for a different process.

We should not be doing major allocations during the retrieval process
itself. Ideally, the kernel would restore an FD using only the
preserved folio data (that we can cleanly charge), and then let the
user process perform any subsequent actions that might cause new
kernel memory allocations. However, I can see how that might not be
practical for all handlers.

Perhaps, we should add session extensions to the kernel as follow-up
after this series lands, we would also need to rewrite luod design
accordingly to move some of the sessions logic into the kernel.

Thank you,
Pasha

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-26 16:22 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bAbqMb0ZYvsC9tsf6w5myfUyqo3N4fUP3CwVA_kUDQteg@mail.gmail.com>

On Tue, Aug 26, 2025 at 04:10:31PM +0000, Pasha Tatashin wrote:
> 
> > I think in the calls the idea was it was reasonable to start without
> > sessions fds at all, but in this case we shouldn't be mucking with
> > pids or current.
> 
> The existing interface, with the addition of passing a pidfd, provides
> the necessary flexibility without being invasive. The change would be
> localized to the new code that performs the FD retrieval and wouldn't
> involve spoofing current or making widespread changes.
> For example, to handle cgroup charging for a memfd, the flow inside
> memfd_luo_retrieve() would look something like this:
> 
> task = get_pid_task(target_pid, PIDTYPE_PID);
> mm = get_task_mm(task);
>     // ...
>     folio = kho_restore_folio(phys);
>     // Charge to the target mm, not 'current->mm'
>     mem_cgroup_charge(folio, mm, ...);
> mmput(mm);
> put_task_struct(task);

Execpt it doesn't work like that in all places, iommufd for example
uses GFP_KERNEL_ACCOUNT which relies on current.

How you fix that when current is the wrong cgroup, I have no idea if
it is even possible.

Jason

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-08-26 16:20 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-30-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:

> +	/*
> +	 * Most of the space should be taken by preserved folios. So take its
> +	 * size, plus a page for other properties.
> +	 */
> +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> +	if (!fdt) {
> +		err = -ENOMEM;
> +		goto err_unpin;
> +	}

This doesn't seem to have any versioning scheme, it really should..

> +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
> +				       (void **)&preserved_folios);
> +	if (err) {
> +		pr_err("Failed to reserve folios property in FDT: %s\n",
> +		       fdt_strerror(err));
> +		err = -ENOMEM;
> +		goto err_free_fdt;
> +	}

Yuk.

This really wants some luo helper

'luo alloc array'
'luo restore array'
'luo free array'

Which would get a linearized list of pages in the vmap to hold the
array and then allocate some structure to record the page list and
return back the u64 of the phys_addr of the top of the structure to
store in whatever.

Getting fdt to allocate the array inside the fds is just not going to
work for anything of size.

> +	for (; i < nr_pfolios; i++) {
> +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> +		phys_addr_t phys;
> +		u64 index;
> +		int flags;
> +
> +		if (!pfolio->foliodesc)
> +			continue;
> +
> +		phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> +		folio = kho_restore_folio(phys);
> +		if (!folio) {
> +			pr_err("Unable to restore folio at physical address: %llx\n",
> +			       phys);
> +			goto put_file;
> +		}
> +		index = pfolio->index;
> +		flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
> +
> +		/* Set up the folio for insertion. */
> +		/*
> +		 * TODO: Should find a way to unify this and
> +		 * shmem_alloc_and_add_folio().
> +		 */
> +		__folio_set_locked(folio);
> +		__folio_set_swapbacked(folio);
> 
> +		ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
> +		if (ret) {
> +			pr_err("shmem: failed to charge folio index %d: %d\n",
> +			       i, ret);
> +			goto unlock_folio;
> +		}

[..]

> +		folio_add_lru(folio);
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}

Probably some consolidation will be needed to make this less
duplicated..

But overall I think just using the memfd_luo_preserved_folio as the
serialization is entirely file, I don't think this needs anything more
complicated.

What it does need is an alternative to the FDT with versioning.

Which seems to me to be entirely fine as:

 struct memfd_luo_v0 {
    __aligned_u64 size;
    __aligned_u64 pos;
    __aligned_u64 folios;
 };

 struct memfd_luo_v0 memfd_luo_v0 = {.size = size, pos = file->f_pos, folios = folios};
 luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);

Which also shows the actual data needing to be serialized comes from
more than one struct and has to be marshaled in code, somehow, to a
single struct.

Then I imagine a fairly simple forwards/backwards story. If something
new is needed that is non-optional, lets say you compress the folios
list to optimize holes:

 struct memfd_luo_v1 {
    __aligned_u64 size;
    __aligned_u64 pos;
    __aligned_u64 folios_list_with_holes;
 };

Obviously a v0 kernel cannot parse this, but in this case a v1 aware
kernel could optionally duplicate and write out the v0 format as well:

 luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
 luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);

Then the rule is fairly simple, when the sucessor kernel goes to
deserialize it asks luo for the versions it supports:

 if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
    restore_v1(&memfd_luo_v1)
 else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
    restore_v0(&memfd_luo_v0)
 else
    luo_failure("Do not understand this");

luo core just manages this list of versioned data per serialized
object. There is only one version per object.

Jason

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-26 16:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826151327.GA2130239@nvidia.com>

On Tue, Aug 26, 2025 at 3:13 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Aug 26, 2025 at 03:02:13PM +0000, Pasha Tatashin wrote:
> > I'm trying to understand the drawbacks of the PID-based approach.
> > Could you elaborate on why passing a PID in the RESTORE_FD ioctl is
> > not a good idea?
>
> It will be a major invasive change all over the place in the kernel
> to change things that assume current to do something else. We should
> try to avoid this.
>
> > In this flow, the client isn't providing an arbitrary PID; the trusted
> > luod agent is providing the PID of a process it has an active
> > connection with.
>
> PIDs are wobbly thing, you can never really trust them unless they are
> in a pidfd.

Makes, sense, using a PID by value is fragile due to reuse. Luod would
acquire a pidfd for the client process from its socket connection and
pass that pidfd to the kernel in the RESTORE_FD ioctl. The kernel
would then be operating on a stable, secure handle to the target
process.

> > The idea was to let luod handle the session/security story, and the
> > kernel handle the core preservation mechanism. Adding sessions to the
> > kernel, delegates the management and part of the security model into
> > the kernel. I am not sure if it is necessary, what can be cleanly
> > managed in userspace should stay in userspace.
>
> session fds were an update imagined to allow the kernel to partition
> things the session FD it self could be shared with other processes.

I understand the model you're proposing: luod acts as a factory,
issuing session FDs that are then passed to clients, allowing them to
perform restore operations within their own context. While we can
certainly extend the design to support that, I am still trying to
determine if it's strictly necessary, especially if the same outcome
(correct resource attribution) can be achieved with less kernel
complexity. My primary concern is that functionality that can be
cleanly managed in userspace should remain there.

> I think in the calls the idea was it was reasonable to start without
> sessions fds at all, but in this case we shouldn't be mucking with
> pids or current.

The existing interface, with the addition of passing a pidfd, provides
the necessary flexibility without being invasive. The change would be
localized to the new code that performs the FD retrieval and wouldn't
involve spoofing current or making widespread changes.
For example, to handle cgroup charging for a memfd, the flow inside
memfd_luo_retrieve() would look something like this:

task = get_pid_task(target_pid, PIDTYPE_PID);
mm = get_task_mm(task);
    // ...
    folio = kho_restore_folio(phys);
    // Charge to the target mm, not 'current->mm'
    mem_cgroup_charge(folio, mm, ...);
mmput(mm);
put_task_struct(task);

This approach seems quite contained, and does not modify the existing
interfaces. It avoids the need for the kernel to manage the entire
session state and its associated security model.

Pasha

^ permalink raw reply

* Re: [PATCH v3 19/30] liveupdate: luo_sysfs: add sysfs state monitoring
From: Jason Gunthorpe @ 2025-08-26 16:03 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-20-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:25AM +0000, Pasha Tatashin wrote:
> Introduce a sysfs interface for the Live Update Orchestrator
> under /sys/kernel/liveupdate/. This interface provides a way for
> userspace tools and scripts to monitor the current state of the LUO
> state machine.

Now that you have a cdev these files may be more logically placed
under the cdev's sysfs and not under kernel? This can be done easially
using the attribute mechanisms in the struct device.

Again sort of back to my earlier point that everything should be
logically linked to the cdev as though there could be many cdevs, even
though there are not. It just keeps the code design more properly
layered and understanble rather than doing something unique..

Jason

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-26 15:13 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bBrCd8t_BUeE-sVPGjsJwmtk3mCSVhTMGbseTi_Wk+4yQ@mail.gmail.com>

On Tue, Aug 26, 2025 at 03:02:13PM +0000, Pasha Tatashin wrote:
> I'm trying to understand the drawbacks of the PID-based approach.
> Could you elaborate on why passing a PID in the RESTORE_FD ioctl is
> not a good idea?

It will be a major invasive change all over the place in the kernel
to change things that assume current to do something else. We should
try to avoid this.

> In this flow, the client isn't providing an arbitrary PID; the trusted
> luod agent is providing the PID of a process it has an active
> connection with.

PIDs are wobbly thing, you can never really trust them unless they are
in a pidfd.

> The idea was to let luod handle the session/security story, and the
> kernel handle the core preservation mechanism. Adding sessions to the
> kernel, delegates the management and part of the security model into
> the kernel. I am not sure if it is necessary, what can be cleanly
> managed in userspace should stay in userspace.

session fds were an update imagined to allow the kernel to partition
things the session FD it self could be shared with other processes.

I think in the calls the idea was it was reasonable to start without
sessions fds at all, but in this case we shouldn't be mucking with
pids or current.

Since it seems that is important it should be addressed by issuing the
restore ioctl inside the correct process context, that is a much
easier thing to delegate to the kernel than trying to deal with
spoofing current/etc.

Jason

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-26 15:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826142406.GE1970008@nvidia.com>

On Tue, Aug 26, 2025 at 2:24 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Aug 26, 2025 at 01:54:31PM +0000, Pasha Tatashin wrote:
> > > > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> > > >
> > > > Changelog from v2:
> > > > - Addressed comments from Mike Rapoport and Jason Gunthorpe
> > > > - Only one user agent (LiveupdateD) can open /dev/liveupdate
> > > > - With the above changes, sessions are not needed, and should be
> > > >   maintained by the user-agent itself, so removed support for
> > > >   sessions.
> > >
> > > If all the FDs are restored in the agent's context, this assigns all the
> > > resources to the agent. For example, if the agent restores a memfd, all
> > > the memory gets charged to the agent's cgroup, and the client gets none
> > > of it. This makes it impossible to do any kind of resource limits.
> > >
> > > This was one of the advantages of being able to pass around sessions
> > > instead of FDs. The agent can pass on the right session to the right
> > > client, and then the client does the restore, getting all the resources
> > > charged to it.
> > >
> > > If we don't allow this, I think we will make LUO/LiveupdateD unsuitable
> > > for many kinds of workloads. Do you have any ideas on how to do proper
> > > resource attribution with the current patches? If not, then perhaps we
> > > should reconsider this change?
> >
> > Hi Pratyush,
> >
> > That's an excellent point, and you're right that we must have a
> > solution for correct resource charging.
> >
> > I'd prefer to keep the session logic in the userspace agent (luod
> > https://tinyurl.com/luoddesign).
> >
> > For the charging problem, I believe there's a clear path forward with
> > the current ioctl-based API. The design of the ioctl commands (with a
> > size field in each struct) is intentionally extensible. In a follow-up
> > patch, we can extend the liveupdate_ioctl_fd_restore struct to include
> > a target pid field. The luod agent, would then be able to restore an
> > FD on behalf of a client and instruct the kernel to charge the
> > associated resources to that client's PID.
>
> This wasn't quite the idea though..
>
> The sessions sub FD were intended to be passed directly to other
> processes though unix sockets and fd passing so they could run their
> own ioctls in their own context for both save and restore. The ioctls
> available on the sessions should be specifically narrowed to be safe
> for this.
>
> I can understand not implementing session FDs in the first version,
> but when sessions FD are available they should work like this and
> solve the namespace/cgroup/etc issues.
>
> Passing some PID in an ioctl is not a great idea...

Hi Jason,

I'm trying to understand the drawbacks of the PID-based approach.
Could you elaborate on why passing a PID in the RESTORE_FD ioctl is
not a good idea?

From my perspective, luod would have a live, open socket to the client
process requesting the restore. It can use SO_PEERCRED to securely
identify the client's PID at that moment. The flow would be:

1. Client connects and resumes its session with luod.
2. Client requests to restore TOKEN_X.
3. luod verifies the client owns TOKEN_X for its session.
4. luod calls the RESTORE_FD ioctl, telling the kernel: "Please
restore TOKEN_X and charge the resources to PID Y (which I just
verified is on the other end of this socket)."
5. The kernel performs the action.
6. luod receives the new FD from the kernel and passes it back to the
client over the socket.

In this flow, the client isn't providing an arbitrary PID; the trusted
luod agent is providing the PID of a process it has an active
connection with.

The idea was to let luod handle the session/security story, and the
kernel handle the core preservation mechanism. Adding sessions to the
kernel, delegates the management and part of the security model into
the kernel. I am not sure if it is necessary, what can be cleanly
managed in userspace should stay in userspace.

Thanks,
Pasha


>
> Jason

^ permalink raw reply

* [PATCH 5.4 313/403] move_mount: allow to add a mount into an existing group
From: Greg Kroah-Hartman @ 2025-08-26 11:10 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Eric W. Biederman, Alexander Viro,
	Christian Brauner, Mattias Nissler, Aleksa Sarai, Andrei Vagin,
	linux-fsdevel, linux-api, lkml, Pavel Tikhomirov, Sasha Levin
In-Reply-To: <20250826110905.607690791@linuxfoundation.org>

5.4-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

[ Upstream commit 9ffb14ef61bab83fa818736bf3e7e6b6e182e8e2 ]

Previously a sharing group (shared and master ids pair) can be only
inherited when mount is created via bindmount. This patch adds an
ability to add an existing private mount into an existing sharing group.

With this functionality one can first create the desired mount tree from
only private mounts (without the need to care about undesired mount
propagation or mount creation order implied by sharing group
dependencies), and next then setup any desired mount sharing between
those mounts in tree as needed.

This allows CRIU to restore any set of mount namespaces, mount trees and
sharing group trees for a container.

We have many issues with restoring mounts in CRIU related to sharing
groups and propagation:
- reverse sharing groups vs mount tree order requires complex mounts
  reordering which mostly implies also using some temporary mounts
(please see https://lkml.org/lkml/2021/3/23/569 for more info)

- mount() syscall creates tons of mounts due to propagation
- mount re-parenting due to propagation
- "Mount Trap" due to propagation
- "Non Uniform" propagation, meaning that with different tricks with
  mount order and temporary children-"lock" mounts one can create mount
  trees which can't be restored without those tricks
(see https://www.linuxplumbersconf.org/event/7/contributions/640/)

With this new functionality we can resolve all the problems with
propagation at once.

Link: https://lore.kernel.org/r/20210715100714.120228-1-ptikhomirov@virtuozzo.com
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Mattias Nissler <mnissler@chromium.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: lkml <linux-kernel@vger.kernel.org>
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Stable-dep-of: cffd0441872e ("use uniform permission checks for all mount propagation changes")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/namespace.c             | 77 +++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/mount.h |  3 +-
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ee5a87061f20..3c1afe60d438 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2624,6 +2624,78 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
 	return ret;
 }
 
+static int do_set_group(struct path *from_path, struct path *to_path)
+{
+	struct mount *from, *to;
+	int err;
+
+	from = real_mount(from_path->mnt);
+	to = real_mount(to_path->mnt);
+
+	namespace_lock();
+
+	err = -EINVAL;
+	/* To and From must be mounted */
+	if (!is_mounted(&from->mnt))
+		goto out;
+	if (!is_mounted(&to->mnt))
+		goto out;
+
+	err = -EPERM;
+	/* We should be allowed to modify mount namespaces of both mounts */
+	if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+	if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+
+	err = -EINVAL;
+	/* To and From paths should be mount roots */
+	if (from_path->dentry != from_path->mnt->mnt_root)
+		goto out;
+	if (to_path->dentry != to_path->mnt->mnt_root)
+		goto out;
+
+	/* Setting sharing groups is only allowed across same superblock */
+	if (from->mnt.mnt_sb != to->mnt.mnt_sb)
+		goto out;
+
+	/* From mount root should be wider than To mount root */
+	if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
+		goto out;
+
+	/* From mount should not have locked children in place of To's root */
+	if (has_locked_children(from, to->mnt.mnt_root))
+		goto out;
+
+	/* Setting sharing groups is only allowed on private mounts */
+	if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
+		goto out;
+
+	/* From should not be private */
+	if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
+		goto out;
+
+	if (IS_MNT_SLAVE(from)) {
+		struct mount *m = from->mnt_master;
+
+		list_add(&to->mnt_slave, &m->mnt_slave_list);
+		to->mnt_master = m;
+	}
+
+	if (IS_MNT_SHARED(from)) {
+		to->mnt_group_id = from->mnt_group_id;
+		list_add(&to->mnt_share, &from->mnt_share);
+		lock_mount_hash();
+		set_mnt_shared(to);
+		unlock_mount_hash();
+	}
+
+	err = 0;
+out:
+	namespace_unlock();
+	return err;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct mnt_namespace *ns;
@@ -3583,7 +3655,10 @@ SYSCALL_DEFINE5(move_mount,
 	if (ret < 0)
 		goto out_to;
 
-	ret = do_move_mount(&from_path, &to_path);
+	if (flags & MOVE_MOUNT_SET_GROUP)
+		ret = do_set_group(&from_path, &to_path);
+	else
+		ret = do_move_mount(&from_path, &to_path);
 
 out_to:
 	path_put(&to_path);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index 96a0240f23fe..535ca707dfd7 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -70,7 +70,8 @@
 #define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
 #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
-#define MOVE_MOUNT__MASK		0x00000077
+#define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
+#define MOVE_MOUNT__MASK		0x00000177
 
 /*
  * fsopen() flags.
-- 
2.50.1




^ permalink raw reply related

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-26 14:24 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bBoLi9tYWHSFyDEHWd_cwvS_hR4q2HMmg-C+SJpQDNs=g@mail.gmail.com>

On Tue, Aug 26, 2025 at 01:54:31PM +0000, Pasha Tatashin wrote:
> > > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> > >
> > > Changelog from v2:
> > > - Addressed comments from Mike Rapoport and Jason Gunthorpe
> > > - Only one user agent (LiveupdateD) can open /dev/liveupdate
> > > - With the above changes, sessions are not needed, and should be
> > >   maintained by the user-agent itself, so removed support for
> > >   sessions.
> >
> > If all the FDs are restored in the agent's context, this assigns all the
> > resources to the agent. For example, if the agent restores a memfd, all
> > the memory gets charged to the agent's cgroup, and the client gets none
> > of it. This makes it impossible to do any kind of resource limits.
> >
> > This was one of the advantages of being able to pass around sessions
> > instead of FDs. The agent can pass on the right session to the right
> > client, and then the client does the restore, getting all the resources
> > charged to it.
> >
> > If we don't allow this, I think we will make LUO/LiveupdateD unsuitable
> > for many kinds of workloads. Do you have any ideas on how to do proper
> > resource attribution with the current patches? If not, then perhaps we
> > should reconsider this change?
> 
> Hi Pratyush,
> 
> That's an excellent point, and you're right that we must have a
> solution for correct resource charging.
> 
> I'd prefer to keep the session logic in the userspace agent (luod
> https://tinyurl.com/luoddesign).
> 
> For the charging problem, I believe there's a clear path forward with
> the current ioctl-based API. The design of the ioctl commands (with a
> size field in each struct) is intentionally extensible. In a follow-up
> patch, we can extend the liveupdate_ioctl_fd_restore struct to include
> a target pid field. The luod agent, would then be able to restore an
> FD on behalf of a client and instruct the kernel to charge the
> associated resources to that client's PID.

This wasn't quite the idea though..

The sessions sub FD were intended to be passed directly to other
processes though unix sockets and fd passing so they could run their
own ioctls in their own context for both save and restore. The ioctls
available on the sessions should be specifically narrowed to be safe
for this.

I can understand not implementing session FDs in the first version,
but when sessions FD are available they should work like this and
solve the namespace/cgroup/etc issues.

Passing some PID in an ioctl is not a great idea...

Jason

^ permalink raw reply

* [PATCH 5.10 402/523] move_mount: allow to add a mount into an existing group
From: Greg Kroah-Hartman @ 2025-08-26 11:10 UTC (permalink / raw)
  To: stable
  Cc: Greg Kroah-Hartman, patches, Eric W. Biederman, Alexander Viro,
	Christian Brauner, Mattias Nissler, Aleksa Sarai, Andrei Vagin,
	linux-fsdevel, linux-api, lkml, Pavel Tikhomirov, Sasha Levin
In-Reply-To: <20250826110924.562212281@linuxfoundation.org>

5.10-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>

[ Upstream commit 9ffb14ef61bab83fa818736bf3e7e6b6e182e8e2 ]

Previously a sharing group (shared and master ids pair) can be only
inherited when mount is created via bindmount. This patch adds an
ability to add an existing private mount into an existing sharing group.

With this functionality one can first create the desired mount tree from
only private mounts (without the need to care about undesired mount
propagation or mount creation order implied by sharing group
dependencies), and next then setup any desired mount sharing between
those mounts in tree as needed.

This allows CRIU to restore any set of mount namespaces, mount trees and
sharing group trees for a container.

We have many issues with restoring mounts in CRIU related to sharing
groups and propagation:
- reverse sharing groups vs mount tree order requires complex mounts
  reordering which mostly implies also using some temporary mounts
(please see https://lkml.org/lkml/2021/3/23/569 for more info)

- mount() syscall creates tons of mounts due to propagation
- mount re-parenting due to propagation
- "Mount Trap" due to propagation
- "Non Uniform" propagation, meaning that with different tricks with
  mount order and temporary children-"lock" mounts one can create mount
  trees which can't be restored without those tricks
(see https://www.linuxplumbersconf.org/event/7/contributions/640/)

With this new functionality we can resolve all the problems with
propagation at once.

Link: https://lore.kernel.org/r/20210715100714.120228-1-ptikhomirov@virtuozzo.com
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Mattias Nissler <mnissler@chromium.org>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: linux-fsdevel@vger.kernel.org
Cc: linux-api@vger.kernel.org
Cc: lkml <linux-kernel@vger.kernel.org>
Co-developed-by: Andrei Vagin <avagin@gmail.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Stable-dep-of: cffd0441872e ("use uniform permission checks for all mount propagation changes")
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 fs/namespace.c             | 77 +++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/mount.h |  3 +-
 2 files changed, 78 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index ee6d139f7529..7f7ccc9e53b8 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -2692,6 +2692,78 @@ static bool check_for_nsfs_mounts(struct mount *subtree)
 	return ret;
 }
 
+static int do_set_group(struct path *from_path, struct path *to_path)
+{
+	struct mount *from, *to;
+	int err;
+
+	from = real_mount(from_path->mnt);
+	to = real_mount(to_path->mnt);
+
+	namespace_lock();
+
+	err = -EINVAL;
+	/* To and From must be mounted */
+	if (!is_mounted(&from->mnt))
+		goto out;
+	if (!is_mounted(&to->mnt))
+		goto out;
+
+	err = -EPERM;
+	/* We should be allowed to modify mount namespaces of both mounts */
+	if (!ns_capable(from->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+	if (!ns_capable(to->mnt_ns->user_ns, CAP_SYS_ADMIN))
+		goto out;
+
+	err = -EINVAL;
+	/* To and From paths should be mount roots */
+	if (from_path->dentry != from_path->mnt->mnt_root)
+		goto out;
+	if (to_path->dentry != to_path->mnt->mnt_root)
+		goto out;
+
+	/* Setting sharing groups is only allowed across same superblock */
+	if (from->mnt.mnt_sb != to->mnt.mnt_sb)
+		goto out;
+
+	/* From mount root should be wider than To mount root */
+	if (!is_subdir(to->mnt.mnt_root, from->mnt.mnt_root))
+		goto out;
+
+	/* From mount should not have locked children in place of To's root */
+	if (has_locked_children(from, to->mnt.mnt_root))
+		goto out;
+
+	/* Setting sharing groups is only allowed on private mounts */
+	if (IS_MNT_SHARED(to) || IS_MNT_SLAVE(to))
+		goto out;
+
+	/* From should not be private */
+	if (!IS_MNT_SHARED(from) && !IS_MNT_SLAVE(from))
+		goto out;
+
+	if (IS_MNT_SLAVE(from)) {
+		struct mount *m = from->mnt_master;
+
+		list_add(&to->mnt_slave, &m->mnt_slave_list);
+		to->mnt_master = m;
+	}
+
+	if (IS_MNT_SHARED(from)) {
+		to->mnt_group_id = from->mnt_group_id;
+		list_add(&to->mnt_share, &from->mnt_share);
+		lock_mount_hash();
+		set_mnt_shared(to);
+		unlock_mount_hash();
+	}
+
+	err = 0;
+out:
+	namespace_unlock();
+	return err;
+}
+
 static int do_move_mount(struct path *old_path, struct path *new_path)
 {
 	struct mnt_namespace *ns;
@@ -3667,7 +3739,10 @@ SYSCALL_DEFINE5(move_mount,
 	if (ret < 0)
 		goto out_to;
 
-	ret = do_move_mount(&from_path, &to_path);
+	if (flags & MOVE_MOUNT_SET_GROUP)
+		ret = do_set_group(&from_path, &to_path);
+	else
+		ret = do_move_mount(&from_path, &to_path);
 
 out_to:
 	path_put(&to_path);
diff --git a/include/uapi/linux/mount.h b/include/uapi/linux/mount.h
index dd8306ea336c..fc6a2e63130b 100644
--- a/include/uapi/linux/mount.h
+++ b/include/uapi/linux/mount.h
@@ -71,7 +71,8 @@
 #define MOVE_MOUNT_T_SYMLINKS		0x00000010 /* Follow symlinks on to path */
 #define MOVE_MOUNT_T_AUTOMOUNTS		0x00000020 /* Follow automounts on to path */
 #define MOVE_MOUNT_T_EMPTY_PATH		0x00000040 /* Empty to path permitted */
-#define MOVE_MOUNT__MASK		0x00000077
+#define MOVE_MOUNT_SET_GROUP		0x00000100 /* Set sharing group instead */
+#define MOVE_MOUNT__MASK		0x00000177
 
 /*
  * fsopen() flags.
-- 
2.50.1




^ permalink raw reply related

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-26 13:54 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, lennart,
	brauner, linux-api, linux-fsdevel, saeedm, ajayachandra, jgg,
	parav, leonro, witu
In-Reply-To: <mafs0ms7mxly1.fsf@kernel.org>

> > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> >
> > Changelog from v2:
> > - Addressed comments from Mike Rapoport and Jason Gunthorpe
> > - Only one user agent (LiveupdateD) can open /dev/liveupdate
> > - With the above changes, sessions are not needed, and should be
> >   maintained by the user-agent itself, so removed support for
> >   sessions.
>
> If all the FDs are restored in the agent's context, this assigns all the
> resources to the agent. For example, if the agent restores a memfd, all
> the memory gets charged to the agent's cgroup, and the client gets none
> of it. This makes it impossible to do any kind of resource limits.
>
> This was one of the advantages of being able to pass around sessions
> instead of FDs. The agent can pass on the right session to the right
> client, and then the client does the restore, getting all the resources
> charged to it.
>
> If we don't allow this, I think we will make LUO/LiveupdateD unsuitable
> for many kinds of workloads. Do you have any ideas on how to do proper
> resource attribution with the current patches? If not, then perhaps we
> should reconsider this change?

Hi Pratyush,

That's an excellent point, and you're right that we must have a
solution for correct resource charging.

I'd prefer to keep the session logic in the userspace agent (luod
https://tinyurl.com/luoddesign).

For the charging problem, I believe there's a clear path forward with
the current ioctl-based API. The design of the ioctl commands (with a
size field in each struct) is intentionally extensible. In a follow-up
patch, we can extend the liveupdate_ioctl_fd_restore struct to include
a target pid field. The luod agent, would then be able to restore an
FD on behalf of a client and instruct the kernel to charge the
associated resources to that client's PID.

This keeps the responsibilities clean: luod manages sessions and
authorization, while the kernel provides the specific mechanism for
resource attribution. I agree this is a must-have feature, but I think
it can be cleanly added on top of the current foundation.

Pasha

>
> [...]
>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply

* Re: [RFC] Extension to POSIX API for zero-copy data transfers
From: Mihai-Drosi Câju @ 2025-08-26 13:17 UTC (permalink / raw)
  To: Serge E. Hallyn; +Cc: linux-api, alx
In-Reply-To: <aK2rRYT+FXe6BvwC@mail.hallyn.com>

On Tue, Aug 26, 2025 at 3:40 PM Serge E. Hallyn <serge@hallyn.com> wrote:
>
> On Sun, Jan 19, 2025 at 10:21:45PM +0200, mcaju95@gmail.com wrote:
> > Greetings,
> >
> > I've been thinking about a POSIX-like API that would allow
> > read/write/send/recv to be zero-copy instead of being buffered. As such,
> > storage devices and network sockets can have data transferred to and from
> > them directly to a user-space application's buffers.
>
> Hi Mihai,
>
> You're proposing a particular API.  Do you have a kernel side
> implementation of something along these lines?  Do you have a particular
> user space use case of your own in mind, or have you spoken to any
> potential users?
>
I have a user-space implementation based on DPDK, it has a different and
less hackish API than the one presented here. I thought it best to seek
feedback from the Linux community before actually writing code for the kernel.

The use-case is the same as the normal Berkley sockets API.
Except that it's faster because you don't copy buffers between kernel and
user-space on each send and recv. You can even receive a buffer that
will be written to disk or vice-versa. Thereby making obsolete KTLS, etc.

I have not spoken to potential users, but I am aware of several attempts
at a zero copy TCP/IP stack F-Stack, mTCP, io_uring.

> > My focus was initially on network stacks and I drew inspiration from DPDK.
> > I'm also aware of some work underway on extending io_uring to support zero
> > copy.
>
> I've not really been following io_uring work.  Can you summarize the
> status of their zero copy support and the advantages that this new
> API would bring?
>

So far, io_uring only supports zero copy reception of TCP segments
https://docs.kernel.org/networking/iou-zcrx.html
it's rather cluttered...

> thanks,
> -serge
>
> > A draft API would work as follows:
> > * The application fills-out a series of iovec's with buffers in its own
> > memory that can store data from protocols such as TCP or UDP. These iovec's
> > will serve as hints that will tell the network stack that it can definitely
> > map a part of a frame's contents into the described buffers. For example, an
> > iovec may contain { .iov_base = 0x4000, .iov_len = 0xa000 }. In this case,
> > the data payload may end-up anywhere between 0x4000 and 0xe000 and after the
> > syscall, its fields will be overwritten to something like { .iov_base =
> > 0x4036, .iov_len = 1460 }
> > * In order to receive packets, the application calls readv or a readv-like
> > syscall and its array of iovec will be modified to point to data payloads.
> > Given that their pages will be mapped directly to user-space, some header
> > fields or tail-room may have to be zero-ed out before being mapped, in order
> > to prevent information leaks. Anny array of iovec's passed to one such readv
> > syscall should be checked for sanity such as being able to hold data
> > payloads in corner cases, not overlap with each-other and hold values that
> > would prove to map pages to.
> > * The return value would be the number of data payloads that have been
> > populated. Only the first such elements in the provided array would end-up
> > containing data payloads.
> > * The syscall's prototype would be quite identical to that of readv, except
> > that iov would not be a const struct iovec *, but just a struct iovec * and
> > the return type would be modified. Like so:
> >  int zc_readv(int fd, struct iovec *iov, int iovcnt);
> >
> > * In the case of write's a struct iovec may not suffice as the provided
> > buffers should not only provide the location and size of data to be sent,
> > but also the guarantee that the buffers have sufficient head and tail room.
> > A hackish syscall would look like so:
> >  int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt);
> > * While the first iovec should describe the entire memory area available to
> > a packet, including enough head and tail room for headers and CRC's or other
> > fields specific to the NIC, the second should describe a sub-buffer that
> > holds the data to be written.
> > * Again, sanity checks should be performed on the entire array, for things
> > like having enough room for other fields, not overlapping, proper alignment,
> > ability to DMA to a device, etc.
> > * After calling zc_writev the pages associated with the provided iovec's are
> > immediately swapped for zero-pages to avoid data-leaks.
> > * For writes, arbitrary physical pages may not work for every NIC as some
> > are bound by 32bit addressing constrains on the PCIe bus, etc. As such the
> > application would have to manage a memory pool associated with each
> > file-descriptor(possibly NIC) that would contain memory that is physically
> > mapped to areas that can be DMA'ed to the proper devices. For example one
> > may mmap the file-descriptor to obtain a pool of a certain size for this
> > purpose.
> >
> > This concept can be extended to storage devices, unfortunately I am
> > unfamiliar with NVMe and SCSI so I can only guess that they work in a
> > similar manner to NIC rings, in that data can be written and read to
> > arbitrary physical RAM(as allowed by the IOMMU). Syscalls similar to zc_read
> > and zc_write can be used on file descriptors pointing to storage devices to
> > fetch or write sectors that contain data belonging to files. Some data
> > should be zeroed-out in this case as well, as sectors more often that not
> > will contain data that does not belong to the intended files.
> >
> > For example one can mix such syscalls to read directly from storage into NIC
> > buffers, providing in-place encryption on the way(via TLS) and send them to
> > a client in a similar way that Netflix does with in-kernel TLS and sendfile.
> >
> > All the best,
> > Mihai
> >
> >
> >

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pratyush Yadav @ 2025-08-26 13:16 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-1-pasha.tatashin@soleen.com>

Hi Pasha,

On Thu, Aug 07 2025, Pasha Tatashin wrote:

> This series introduces the LUO, a kernel subsystem designed to
> facilitate live kernel updates with minimal downtime,
> particularly in cloud delplyoments aiming to update without fully
> disrupting running virtual machines.
>
> This series builds upon KHO framework by adding programmatic
> control over KHO's lifecycle and leveraging KHO for persisting LUO's
> own metadata across the kexec boundary. The git branch for this series
> can be found at:
>
> https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
>
> Changelog from v2:
> - Addressed comments from Mike Rapoport and Jason Gunthorpe
> - Only one user agent (LiveupdateD) can open /dev/liveupdate
> - With the above changes, sessions are not needed, and should be
>   maintained by the user-agent itself, so removed support for
>   sessions.

If all the FDs are restored in the agent's context, this assigns all the
resources to the agent. For example, if the agent restores a memfd, all
the memory gets charged to the agent's cgroup, and the client gets none
of it. This makes it impossible to do any kind of resource limits.

This was one of the advantages of being able to pass around sessions
instead of FDs. The agent can pass on the right session to the right
client, and then the client does the restore, getting all the resources
charged to it.

If we don't allow this, I think we will make LUO/LiveupdateD unsuitable
for many kinds of workloads. Do you have any ideas on how to do proper
resource attribution with the current patches? If not, then perhaps we
should reconsider this change?

[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [RFC] Extension to POSIX API for zero-copy data transfers
From: Serge E. Hallyn @ 2025-08-26 12:40 UTC (permalink / raw)
  To: mcaju95; +Cc: linux-api, alx, serge
In-Reply-To: <98RCQS.25Q70IQZ9KFA1@gmail.com>

On Sun, Jan 19, 2025 at 10:21:45PM +0200, mcaju95@gmail.com wrote:
> Greetings,
> 
> I've been thinking about a POSIX-like API that would allow
> read/write/send/recv to be zero-copy instead of being buffered. As such,
> storage devices and network sockets can have data transferred to and from
> them directly to a user-space application's buffers.

Hi Mihai,

You're proposing a particular API.  Do you have a kernel side
implementation of something along these lines?  Do you have a particular
user space use case of your own in mind, or have you spoken to any
potential users?

> My focus was initially on network stacks and I drew inspiration from DPDK.
> I'm also aware of some work underway on extending io_uring to support zero
> copy.

I've not really been following io_uring work.  Can you summarize the
status of their zero copy support and the advantages that this new
API would bring?

thanks,
-serge

> A draft API would work as follows:
> * The application fills-out a series of iovec's with buffers in its own
> memory that can store data from protocols such as TCP or UDP. These iovec's
> will serve as hints that will tell the network stack that it can definitely
> map a part of a frame's contents into the described buffers. For example, an
> iovec may contain { .iov_base = 0x4000, .iov_len = 0xa000 }. In this case,
> the data payload may end-up anywhere between 0x4000 and 0xe000 and after the
> syscall, its fields will be overwritten to something like { .iov_base =
> 0x4036, .iov_len = 1460 }
> * In order to receive packets, the application calls readv or a readv-like
> syscall and its array of iovec will be modified to point to data payloads.
> Given that their pages will be mapped directly to user-space, some header
> fields or tail-room may have to be zero-ed out before being mapped, in order
> to prevent information leaks. Anny array of iovec's passed to one such readv
> syscall should be checked for sanity such as being able to hold data
> payloads in corner cases, not overlap with each-other and hold values that
> would prove to map pages to.
> * The return value would be the number of data payloads that have been
> populated. Only the first such elements in the provided array would end-up
> containing data payloads.
> * The syscall's prototype would be quite identical to that of readv, except
> that iov would not be a const struct iovec *, but just a struct iovec * and
> the return type would be modified. Like so:
>  int zc_readv(int fd, struct iovec *iov, int iovcnt);
> 
> * In the case of write's a struct iovec may not suffice as the provided
> buffers should not only provide the location and size of data to be sent,
> but also the guarantee that the buffers have sufficient head and tail room.
> A hackish syscall would look like so:
>  int zc_writev(int fd, const struct iovec (*iov)[2], int iovcnt);
> * While the first iovec should describe the entire memory area available to
> a packet, including enough head and tail room for headers and CRC's or other
> fields specific to the NIC, the second should describe a sub-buffer that
> holds the data to be written.
> * Again, sanity checks should be performed on the entire array, for things
> like having enough room for other fields, not overlapping, proper alignment,
> ability to DMA to a device, etc.
> * After calling zc_writev the pages associated with the provided iovec's are
> immediately swapped for zero-pages to avoid data-leaks.
> * For writes, arbitrary physical pages may not work for every NIC as some
> are bound by 32bit addressing constrains on the PCIe bus, etc. As such the
> application would have to manage a memory pool associated with each
> file-descriptor(possibly NIC) that would contain memory that is physically
> mapped to areas that can be DMA'ed to the proper devices. For example one
> may mmap the file-descriptor to obtain a pool of a certain size for this
> purpose.
> 
> This concept can be extended to storage devices, unfortunately I am
> unfamiliar with NVMe and SCSI so I can only guess that they work in a
> similar manner to NIC rings, in that data can be written and read to
> arbitrary physical RAM(as allowed by the IOMMU). Syscalls similar to zc_read
> and zc_write can be used on file descriptors pointing to storage devices to
> fetch or write sectors that contain data belonging to files. Some data
> should be zeroed-out in this case as well, as sectors more often that not
> will contain data that does not belong to the intended files.
> 
> For example one can mix such syscalls to read directly from storage into NIC
> buffers, providing in-place encryption on the way(via TLS) and send them to
> a client in a similar way that Netflix does with in-kernel TLS and sendfile.
> 
> All the best,
> Mihai
> 
> 
> 

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-26 12:39 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module, Jeff Xu
In-Reply-To: <CALmYWFv90uzq0J76+xtUFjZxDzR2rYvrFbrr5Jva5zdy_dvoHA@mail.gmail.com>

On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote:
> Hi Mickaël
> 
> On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
> >
> > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > >
> > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > > > passed file descriptors).  This changes the state of the opened file by
> > > > > > making it read-only until it is closed.  The main use case is for script
> > > > > > interpreters to get the guarantee that script' content cannot be altered
> > > > > > while being read and interpreted.  This is useful for generic distros
> > > > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > > > >
> > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > > > property on files with deny_write_access().  This new O_DENY_WRITE make
> > > > >
> > > > > The kernel actually tried to get rid of this behavior on execve() in
> > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > > > because it broke userspace assumptions.
> > > >
> > > > Oh, good to know.
> > > >
> > > > >
> > > > > > it widely available.  This is similar to what other OSs may provide
> > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > > > >
> > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > > > removed for security reasons; as
> > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > > > >
> > > > > |        MAP_DENYWRITE
> > > > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > > > |               signaled that attempts to write to the underlying file
> > > > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > > > |               of-service attacks.)"
> > > > >
> > > > > It seems to me that the same issue applies to your patch - it would
> > > > > allow unprivileged processes to essentially lock files such that other
> > > > > processes can't write to them anymore. This might allow unprivileged
> > > > > users to prevent root from updating config files or stuff like that if
> > > > > they're updated in-place.
> > > >
> > > > Yes, I agree, but since it is the case for executed files I though it
> > > > was worth starting a discussion on this topic.  This new flag could be
> > > > restricted to executable files, but we should avoid system-wide locks
> > > > like this.  I'm not sure how Windows handle these issues though.
> > > >
> > > > Anyway, we should rely on the access control policy to control write and
> > > > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > > > the references and the background!
> > >
> > > I'm confused.  I understand that there are many contexts in which one
> > > would want to prevent execution of unapproved content, which might
> > > include preventing a given process from modifying some code and then
> > > executing it.
> > >
> > > I don't understand what these deny-write features have to do with it.
> > > These features merely prevent someone from modifying code *that is
> > > currently in use*, which is not at all the same thing as preventing
> > > modifying code that might get executed -- one can often modify
> > > contents *before* executing those contents.
> >
> > The order of checks would be:
> > 1. open script with O_DENY_WRITE
> > 2. check executability with AT_EXECVE_CHECK
> > 3. read the content and interpret it
> >
> I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving.
> 
> AT_EXECVE_CHECK is not just for scripting languages. It could also
> work with bytecodes like Java, for example. If we let the Java runtime
> call AT_EXECVE_CHECK before loading the bytecode, the LSM could
> develop a policy based on that.

Sure, I'm using "script" to make it simple, but this applies to other
use cases.

> 
> > The deny-write feature was to guarantee that there is no race condition
> > between step 2 and 3.  All these checks are supposed to be done by a
> > trusted interpreter (which is allowed to be executed).  The
> > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > associated security policies) allowed the *current* content of the file
> > to be executed.  Whatever happen before or after that (wrt.
> > O_DENY_WRITE) should be covered by the security policy.
> >
> Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK.
> 
> Enforcing non-write for the path that stores scripts or bytecodes can
> be challenging due to historical or backward compatibility reasons.
> Since AT_EXECVE_CHECK provides a mechanism to check the file right
> before it is used, we can assume it will detect any "problem" that
> happened before that, (e.g. the file was overwritten). However, that
> also imposes two additional requirements:
> 1> the file doesn't change while AT_EXECVE_CHECK does the check.

This is already the case, so any kind of LSM checks are good.

> 2>The file content kept by the process remains unchanged after passing
> the AT_EXECVE_CHECK.

The goal of this patch was to avoid such race condition in the case
where executable files can be updated.  But in most cases it should not
be a security issue (because processes allowed to write to executable
files should be trusted), but this could still lead to bugs (because of
inconsistent file content, half-updated).

> 
> I imagine, the complete solution for AT_EXECVE_CHECK would include
> those two grantees.

There is no issue directly with AT_EXECVE_CHECK, but according to the
system configuration, interpreters could read a file that was updated
after the AT_EXECVE_CHECK.  This should not be an issue for secure
systems where executable files are only updated with trusted code,
except if the update mechanism is not atomic.  The main use case for
this patch series was for generic distros that may not have the
write-xor-execute guarantees e.g., for developers.

The only viable solution I see would be to have some kind of snapshot of
files, requested by interpreters, but I'm not sure if it is worth it.

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-26 12:35 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, Jeff Xu
In-Reply-To: <lhuikibbv0g.fsf@oldenburg.str.redhat.com>

On Mon, Aug 25, 2025 at 11:39:11AM +0200, Florian Weimer wrote:
> * Mickaël Salaün:
> 
> > The order of checks would be:
> > 1. open script with O_DENY_WRITE
> > 2. check executability with AT_EXECVE_CHECK
> > 3. read the content and interpret it
> >
> > The deny-write feature was to guarantee that there is no race condition
> > between step 2 and 3.  All these checks are supposed to be done by a
> > trusted interpreter (which is allowed to be executed).  The
> > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > associated security policies) allowed the *current* content of the file
> > to be executed.  Whatever happen before or after that (wrt.
> > O_DENY_WRITE) should be covered by the security policy.
> 
> Why isn't it an improper system configuration if the script file is
> writable?

It is, except if the system only wants to track executions (e.g. record
checksum of scripts) without restricting file modifications.

> 
> In the past, the argument was that making a file (writable and)
> executable was an auditable even, and that provided enough coverage for
> those people who are interested in this.

Yes, but in this case there is a race condition that this patch tried to
fix.

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Theodore Ts'o @ 2025-08-26 12:30 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826.aig5aiShunga@digikod.net>

Is there a single, unified design and requirements document that
describes the threat model, and what you are trying to achieve with
AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
that has landed for AT_EXECVE_CHECK and it really doesn't describe
what *are* the checks that AT_EXECVE_CHECK is trying to achieve:

   "The AT_EXECVE_CHECK execveat(2) flag, and the
   SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
   securebits are intended for script interpreters and dynamic linkers
   to enforce a consistent execution security policy handled by the
   kernel."

Um, what security policy?  What checks?  What is a sample exploit
which is blocked by AT_EXECVE_CHECK?

And then on top of it, why can't you do these checks by modifying the
script interpreters?

Confused,

						- Ted

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-26 11:23 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <20250826-skorpion-magma-141496988fdc@brauner>

On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> On Fri, Aug 22, 2025 at 07:07:58PM +0200, Mickaël Salaün wrote:
> > Hi,
> > 
> > Script interpreters can check if a file would be allowed to be executed
> > by the kernel using the new AT_EXECVE_CHECK flag. This approach works
> > well on systems with write-xor-execute policies, where scripts cannot
> > be modified by malicious processes. However, this protection may not be
> > available on more generic distributions.
> > 
> > The key difference between `./script.sh` and `sh script.sh` (when using
> > AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened
> > for writing while it's being executed. To achieve parity, the kernel
> > should provide a mechanism for script interpreters to deny write access
> > during script interpretation. While interpreters can copy script content
> > into a buffer, a race condition remains possible after AT_EXECVE_CHECK.
> > 
> > This patch series introduces a new O_DENY_WRITE flag for use with
> > open*(2) and fcntl(2). Both interfaces are necessary since script
> > interpreters may receive either a file path or file descriptor. For
> > backward compatibility, open(2) with O_DENY_WRITE will not fail on
> > unsupported systems, while users requiring explicit support guarantees
> > can use openat2(2).
> 
> We've said no to abusing the O_* flag space for that AT_EXECVE_* stuff
> before and you've been told by Linus as well that this is a nogo.

Oh, please, don't mix up everything.  First, this is an RFC, and as I
explained, the goal is to start a discussion with something concrete.
Second, doing a one-time check on a file and providing guarantees for
the whole lifetime of an opened file requires different approaches,
hence this O_ *proposal*.

> 
> Nothing has changed in that regard and I'm not interested in stuffing
> the VFS APIs full of special-purpose behavior to work around the fact
> that this is work that needs to be done in userspace. Change the apps,
> stop pushing more and more cruft into the VFS that has no business
> there.

It would be interesting to know how to patch user space to get the same
guarantees...  Do you think I would propose a kernel patch otherwise?

> 
> That's before we get into all the issues that are introduced by this
> mechanism that magically makes arbitrary files unwritable. It's not just
> a DoS it's likely to cause breakage in userspace as well. I removed the
> deny-write from execve because it already breaks various use-cases or
> leads to spurious failures in e.g., go. We're not spreading this disease
> as a first-class VFS API.

Jann explained it very well, and the deny-write for execve is still
there, but let's keep it civil.  I already agreed that this is not a
good approach, but we could get interesting proposals.

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Christian Brauner @ 2025-08-26  9:07 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <20250822170800.2116980-1-mic@digikod.net>

On Fri, Aug 22, 2025 at 07:07:58PM +0200, Mickaël Salaün wrote:
> Hi,
> 
> Script interpreters can check if a file would be allowed to be executed
> by the kernel using the new AT_EXECVE_CHECK flag. This approach works
> well on systems with write-xor-execute policies, where scripts cannot
> be modified by malicious processes. However, this protection may not be
> available on more generic distributions.
> 
> The key difference between `./script.sh` and `sh script.sh` (when using
> AT_EXECVE_CHECK) is that execve(2) prevents the script from being opened
> for writing while it's being executed. To achieve parity, the kernel
> should provide a mechanism for script interpreters to deny write access
> during script interpretation. While interpreters can copy script content
> into a buffer, a race condition remains possible after AT_EXECVE_CHECK.
> 
> This patch series introduces a new O_DENY_WRITE flag for use with
> open*(2) and fcntl(2). Both interfaces are necessary since script
> interpreters may receive either a file path or file descriptor. For
> backward compatibility, open(2) with O_DENY_WRITE will not fail on
> unsupported systems, while users requiring explicit support guarantees
> can use openat2(2).

We've said no to abusing the O_* flag space for that AT_EXECVE_* stuff
before and you've been told by Linus as well that this is a nogo.

Nothing has changed in that regard and I'm not interested in stuffing
the VFS APIs full of special-purpose behavior to work around the fact
that this is work that needs to be done in userspace. Change the apps,
stop pushing more and more cruft into the VFS that has no business
there.

That's before we get into all the issues that are introduced by this
mechanism that magically makes arbitrary files unwritable. It's not just
a DoS it's likely to cause breakage in userspace as well. I removed the
deny-write from execve because it already breaks various use-cases or
leads to spurious failures in e.g., go. We're not spreading this disease
as a first-class VFS API.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox