Linux userland API discussions
 help / color / mirror / Atom feed
* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Aleksa Sarai @ 2025-08-27 22:48 UTC (permalink / raw)
  To: Paul Eggert
  Cc: Adhemerval Zanella Netto, Arjun Shankar, libc-alpha, linux-api
In-Reply-To: <5c9fa556-da00-4b76-8a70-8e2d1dddd92d@cs.ucla.edu>

[-- Attachment #1: Type: text/plain, Size: 4172 bytes --]

On 2025-08-27, Paul Eggert <eggert@cs.ucla.edu> wrote:
> On 2025-08-26 22:58, Aleksa Sarai wrote:
> > Personally I think both approaches are less than ideal, and having rich
> > feature flags for the entire system would be better but I don't think
> > this is something that would be feasible to apply to everything in the
> > entire kernel.
> 
> Agreed. But I'm not seeing how a hypothetical "give me the supported flags"
> flag would be useful enough to justify the flag.
> 
> I'm looking at this from the user point of view, and it is not ringing a
> bell for me. Granted, the current "try the flag combination you want and see
> whether it works" is not ideal, but it's accurate (which is not always true
> for "give me the supported flags" flag) and you need to do it anyway
> (because the "give me the supported flags" flag is inherently inaccurate),
> so why bother with a "give me the supported flags" flag?
> 
> Here's an example. Suppose we want to extend openat2 so that it also does
> the equivalent of statx atomically with the open, to avoid some races with
> the current openat/fstat pair of system calls. Under the approach you're
> proposing, I suppose we could extend struct open_how so that it has a new
> struct statx member, add new flags to be put into struct open_how's flags
> member, and programs would be able to query the new flags via a "give me the
> supported flags" call.
> 
> But in this scenario, the "give me the supported flags" flag is useless. If
> I'm an old program I can't use the new flags even if I detect them because
> my struct open_how is too small. And if I'm a new program I can simply use
> the new flags - and even if I tested for the new flags (with the "give me
> the supported flags" flag) I'd have to test the result anyway because
> perhaps the new flags are not supported for this particular flag combination
> or file.
> 
> What specific scenario would make the "give me supported flags" flag worth
> the hassle of supporting and documenting and testing such a flag?

"Just try it" leads to programs that have to test dozens of flag
combinations for syscalls at startup, and for many syscalls you cannot
"just try" whether the new flag works (think of a new shutdown(2) flag,
or most clone3(2) flags). What you end up having to do is create an
elaborate setup where if the flag works you get an error (but not
-EINVAL!) so that you can be fairly confident that you didn't modify the
system when doing the check. As someone who has to write this
boilerplate whenever I need to use most system calls, this really
**really** sucks. In some cases you can just try it and then fallback
(caching whether it was supported), but in a lot of programs it is
preferable to know well in advance whether a feature is supported.

A simple example would be mounts -- if MOUNT_BENEATH is not supported
then you need to structure how you construct your mount tree differently
to try to emulate the same behaviour. This means that not knowing if
MOUNT_BENEATH is supported upfront causes you to redo a lot of work in
the fallback case. If changing id-mappings for mounts hadn't required
adding a new syscall, this would've also been an issue for programs that
needed to change the ID-mappings of mounts.

Some kind of "just tell me what flags are supported" mechanism avoids
this problem by telling you in one shot what features are supported (so
newer programs can take advantage of them). Most systems that expect to
be extended over time have something like this, but it's usually in the
form of string-based feature names (/sys/kernel/cgroup/features, for
instance). I wouldn't be against such an idea (if we could actually
guarantee that everyone actually used it), but something similar was
proposed back in 2020 and never happened -- CHECK_FIELDS is a very
simple solution to the problem that works for the most common case and
can be implemented per-syscall.

I've added linux-api to Cc, as I'm sure there are plenty of other ideas
on how to solve this.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-08-27 20:35 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Andy Lutomirski, Theodore Ts'o, Christian Brauner, Al Viro,
	Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <20250827.Fuo1Iel1pa7i@digikod.net>

On Wed, Aug 27, 2025 at 12:07 PM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Wed, Aug 27, 2025 at 10:35:28AM -0700, Andy Lutomirski wrote:
> > On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote:
> > >
> > > On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> > > > Is there a single, unified design and requirements document that
> > > > describes the threat model, and what you are trying to achieve with
> > > > AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> > > > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> > > > that has landed for AT_EXECVE_CHECK and it really doesn't describe
> > > > what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> > > >
> > > >    "The AT_EXECVE_CHECK execveat(2) flag, and the
> > > >    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
> > > >    securebits are intended for script interpreters and dynamic linkers
> > > >    to enforce a consistent execution security policy handled by the
> > > >    kernel."
> > >
> > > From the documentation:
> > >
> > >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> > >   on a regular file and returns 0 if execution of this file would be
> > >   allowed, ignoring the file format and then the related interpreter
> > >   dependencies (e.g. ELF libraries, script’s shebang).
> > >
> > > >
> > > > Um, what security policy?
> > >
> > > Whether the file is allowed to be executed.  This includes file
> > > permission, mount point option, ACL, LSM policies...
> >
> > This needs *waaaaay* more detail for any sort of useful evaluation.
> > Is an actual credible security policy rolling dice?  Asking ChatGPT?
> > Looking at security labels?  Does it care who can write to the file,
> > or who owns the file, or what the file's hash is, or what filesystem
> > it's on, or where it came from?  Does it dynamically inspect the
> > contents?  Is it controlled by an unprivileged process?
>
> AT_EXECVE_CHECK only does the same checks as done by other execveat(2)
> calls, but without actually executing the file/fd.
>

okay... but see below.

> >
> > I can easily come up with security policies for which DENYWRITE is
> > completely useless.  I can come up with convoluted and
> > not-really-credible policies where DENYWRITE is important, but I'm
> > honestly not sure that those policies are actually useful.  I'm
> > honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted
> > because it should have been parametrized by *what format is expected*
> > -- it might be possible to bypass a policy by executing a perfectly
> > fine Python script using bash, for example.
>
> There have been a lot of bikesheding for the AT_EXECVE_CHECK patch
> series, and a lot of discussions too (you where part of them).  We ended
> up with this design, which is simple and follows the kernel semantic
> (requested by Linus).

I recall this.  That doesn't mean I totally love AT_EXECVE_CHECK.  And
it especially doesn't mean that I believe that it usefully does
something that justifies anything like DENYWRITE.

>
> >
> > I genuinely have not come up with a security policy that I believe
> > makes sense that needs AT_EXECVE_CHECK and DENYWRITE.  I'm not saying
> > that such a policy does not exist -- I'm saying that I have not
> > thought of such a thing after a few minutes of thought and reading
> > these threads.
>
> A simple use case is for systems that wants to enforce a
> write-xor-execute policy e.g., thanks to mount point options.

Sure, but I'm contemplating DENYWRITE, and this thread is about
DENYWRITE.  If the kernel is enforcing W^X, then there are really two
almost unrelated things going on:

1. LSM policy that enforces W^X for memory mappings.  This is to
enforce that applications don't do nasty things like having executable
stacks, and it's a mess because no one has really figured out how JITs
are supposed to work in this world.  It has almost nothing to do with
execve except incidentally.

2. LSM policy that enforces that someone doesn't execve (or similar)
something that *that user* can write.  Or that non-root can write.  Or
that anyone at all can write, etc.

I think, but I'm not sure, that you're talking about #2.  So maybe
there's a policy that says that one may only exec things that are on
an fs with the 'exec' mount option.  Or maybe there's a policy that
says that one may only exec things that are on a readonly fs.  In
these specific cases, I believe in AT_EXECVE_CHECK.  *But* I don't
believe in DENYWRITE: in the 'exec' case, if an fs has the exec option
set, that doesn't change if the file is subsequently modified.  And if
an fs is readonly, then the file is quite unlikely to be modified at
all and will certainly not be modified via the mount through which
it's being executed.  And you don't need DENYWRITE.

So I think my question still stands: is there a credible security
policy *that actually benefits from DENYWRITE*?  If so, can you give
an example?

> >
> > Seriously, consider all the unending recent attacks on LLMs an
> > inspiration.  The implications of viewing an image, downscaling the
> > image, possibly interpreting the image as something containing text,
> > possibly following instructions in a given language contained in the
> > image, etc are all wildly different.  A mechanism for asking for
> > general permission to "consume this image" is COMPLETELY MISSING THE
> > POINT.  (Never mind that the current crop of LLMs seem entirely
> > incapable of constraining their own use of some piece of input, but
> > that's a different issue and is besides the point here.)
>
> You're asking about what should we consider executable.  This is a good
> question, but AT_EXECVE_CHECK is there to answer another question: would
> the kernel execute it or not?
>

That's a sort of odd way of putting it.  The kernel won't execute it
because the kernel doesn't know how to :)  But I think I understand
what you're saying.

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-27 19:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Serge Hallyn, Arnd Bergmann, Christian Heimes,
	Dmitry Vyukov, Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn,
	Jeff Xu, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <CALCETrW=V9vst_ho2Q4sQUJ5uZECY5h7TnF==sG4JWq8PsWb8Q@mail.gmail.com>

On Wed, Aug 27, 2025 at 10:35:28AM -0700, Andy Lutomirski wrote:
> On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote:
> >
> > On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> > > Is there a single, unified design and requirements document that
> > > describes the threat model, and what you are trying to achieve with
> > > AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> > > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> > > that has landed for AT_EXECVE_CHECK and it really doesn't describe
> > > what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> > >
> > >    "The AT_EXECVE_CHECK execveat(2) flag, and the
> > >    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
> > >    securebits are intended for script interpreters and dynamic linkers
> > >    to enforce a consistent execution security policy handled by the
> > >    kernel."
> >
> > From the documentation:
> >
> >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> >   on a regular file and returns 0 if execution of this file would be
> >   allowed, ignoring the file format and then the related interpreter
> >   dependencies (e.g. ELF libraries, script’s shebang).
> >
> > >
> > > Um, what security policy?
> >
> > Whether the file is allowed to be executed.  This includes file
> > permission, mount point option, ACL, LSM policies...
> 
> This needs *waaaaay* more detail for any sort of useful evaluation.
> Is an actual credible security policy rolling dice?  Asking ChatGPT?
> Looking at security labels?  Does it care who can write to the file,
> or who owns the file, or what the file's hash is, or what filesystem
> it's on, or where it came from?  Does it dynamically inspect the
> contents?  Is it controlled by an unprivileged process?

AT_EXECVE_CHECK only does the same checks as done by other execveat(2)
calls, but without actually executing the file/fd.

> 
> I can easily come up with security policies for which DENYWRITE is
> completely useless.  I can come up with convoluted and
> not-really-credible policies where DENYWRITE is important, but I'm
> honestly not sure that those policies are actually useful.  I'm
> honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted
> because it should have been parametrized by *what format is expected*
> -- it might be possible to bypass a policy by executing a perfectly
> fine Python script using bash, for example.

There have been a lot of bikesheding for the AT_EXECVE_CHECK patch
series, and a lot of discussions too (you where part of them).  We ended
up with this design, which is simple and follows the kernel semantic
(requested by Linus).

> 
> I genuinely have not come up with a security policy that I believe
> makes sense that needs AT_EXECVE_CHECK and DENYWRITE.  I'm not saying
> that such a policy does not exist -- I'm saying that I have not
> thought of such a thing after a few minutes of thought and reading
> these threads.

A simple use case is for systems that wants to enforce a
write-xor-execute policy e.g., thanks to mount point options.

> 
> 
> > > And then on top of it, why can't you do these checks by modifying the
> > > script interpreters?
> >
> > The script interpreter requires modification to use AT_EXECVE_CHECK.
> >
> > There is no other way for user space to reliably check executability of
> > files (taking into account all enforced security
> > policies/configurations).
> >
> 
> As mentioned above, even AT_EXECVE_CHECK does not obviously accomplish
> this goal.  If it were genuinely useful, I would much, much prefer a
> totally different API: a *syscall* that takes, as input, a file
> descriptor of something that an interpreter wants to execute and a
> whole lot of context as to what that interpreter wants to do with it.
> And I admit I'm *still* not convinced.

As mentioned above, AT_EXECVE_CHECK follows the kernel semantic. Nothing
fancy.

> 
> Seriously, consider all the unending recent attacks on LLMs an
> inspiration.  The implications of viewing an image, downscaling the
> image, possibly interpreting the image as something containing text,
> possibly following instructions in a given language contained in the
> image, etc are all wildly different.  A mechanism for asking for
> general permission to "consume this image" is COMPLETELY MISSING THE
> POINT.  (Never mind that the current crop of LLMs seem entirely
> incapable of constraining their own use of some piece of input, but
> that's a different issue and is besides the point here.)

You're asking about what should we consider executable.  This is a good
question, but AT_EXECVE_CHECK is there to answer another question: would
the kernel execute it or not?

^ permalink raw reply

* Re: [PATCH v19 4/8] fork: Add shadow stack support to clone3()
From: Edgecombe, Rick P @ 2025-08-27 17:58 UTC (permalink / raw)
  To: dietmar.eggemann@arm.com, broonie@kernel.org,
	Szabolcs.Nagy@arm.com, brauner@kernel.org,
	dave.hansen@linux.intel.com, debug@rivosinc.com, mgorman@suse.de,
	vincent.guittot@linaro.org, fweimer@redhat.com, mingo@redhat.com,
	rostedt@goodmis.org, hjl.tools@gmail.com, tglx@linutronix.de,
	vschneid@redhat.com, shuah@kernel.org, hpa@zytor.com,
	peterz@infradead.org, bp@alien8.de, bsegall@google.com,
	x86@kernel.org, juri.lelli@redhat.com
  Cc: yury.khrustalev@arm.com, linux-kselftest@vger.kernel.org,
	akpm@linux-foundation.org, jannh@google.com,
	linux-kernel@vger.kernel.org, catalin.marinas@arm.com,
	will@kernel.org, wilco.dijkstra@arm.com, kees@kernel.org,
	linux-api@vger.kernel.org
In-Reply-To: <20250819-clone3-shadow-stack-v19-4-bc957075479b@kernel.org>

On Tue, 2025-08-19 at 17:21 +0100, Mark Brown wrote:
> +int arch_shstk_validate_clone(struct task_struct *t,
> +			      struct vm_area_struct *vma,
> +			      struct page *page,
> +			      struct kernel_clone_args *args)
> +{
> +	/*
> +	 * SSP is aligned, so reserved bits and mode bit are a zero, just mark
> +	 * the token 64-bit.
> +	 */

What is this comment doing here? It doesn't make sense. It looks copied from
create_rstor_token()?

> +	void *maddr = page_address(page);
> +	unsigned long token;
> +	int offset;
> +	u64 expected;
> +
> +	token = args->shadow_stack_token;
> +	expected = (token + SS_FRAME_SIZE) | BIT(0);

Instead of the above comment, I think the important thing to say is that args-
>shadow_stack_token is 8 byte aligned, so offset can't overflow out of the page.

Maybe?

/* kernel_clone_args verification assures token address is 8 byte aligned */

> +	offset = offset_in_page(token);
> +
> +	if (!cmpxchg_to_user_page(vma, page, token, (unsigned long *)(maddr + offset),
> +				  expected, 0))
> +		return -EINVAL;
> +	set_page_dirty_lock(page);
> +
> +	return 0;
> +}
> +

With those changes, for the series:

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-08-27 17:35 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Theodore Ts'o, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <20250826.iewie7Et5aiw@digikod.net>

On Tue, Aug 26, 2025 at 10:47 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> > Is there a single, unified design and requirements document that
> > describes the threat model, and what you are trying to achieve with
> > AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> > letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> > that has landed for AT_EXECVE_CHECK and it really doesn't describe
> > what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> >
> >    "The AT_EXECVE_CHECK execveat(2) flag, and the
> >    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
> >    securebits are intended for script interpreters and dynamic linkers
> >    to enforce a consistent execution security policy handled by the
> >    kernel."
>
> From the documentation:
>
>   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
>   on a regular file and returns 0 if execution of this file would be
>   allowed, ignoring the file format and then the related interpreter
>   dependencies (e.g. ELF libraries, script’s shebang).
>
> >
> > Um, what security policy?
>
> Whether the file is allowed to be executed.  This includes file
> permission, mount point option, ACL, LSM policies...

This needs *waaaaay* more detail for any sort of useful evaluation.
Is an actual credible security policy rolling dice?  Asking ChatGPT?
Looking at security labels?  Does it care who can write to the file,
or who owns the file, or what the file's hash is, or what filesystem
it's on, or where it came from?  Does it dynamically inspect the
contents?  Is it controlled by an unprivileged process?

I can easily come up with security policies for which DENYWRITE is
completely useless.  I can come up with convoluted and
not-really-credible policies where DENYWRITE is important, but I'm
honestly not sure that those policies are actually useful.  I'm
honestly a bit concerned that AT_EXECVE_CHECK is fundamentally busted
because it should have been parametrized by *what format is expected*
-- it might be possible to bypass a policy by executing a perfectly
fine Python script using bash, for example.

I genuinely have not come up with a security policy that I believe
makes sense that needs AT_EXECVE_CHECK and DENYWRITE.  I'm not saying
that such a policy does not exist -- I'm saying that I have not
thought of such a thing after a few minutes of thought and reading
these threads.


> > And then on top of it, why can't you do these checks by modifying the
> > script interpreters?
>
> The script interpreter requires modification to use AT_EXECVE_CHECK.
>
> There is no other way for user space to reliably check executability of
> files (taking into account all enforced security
> policies/configurations).
>

As mentioned above, even AT_EXECVE_CHECK does not obviously accomplish
this goal.  If it were genuinely useful, I would much, much prefer a
totally different API: a *syscall* that takes, as input, a file
descriptor of something that an interpreter wants to execute and a
whole lot of context as to what that interpreter wants to do with it.
And I admit I'm *still* not convinced.

Seriously, consider all the unending recent attacks on LLMs an
inspiration.  The implications of viewing an image, downscaling the
image, possibly interpreting the image as something containing text,
possibly following instructions in a given language contained in the
image, etc are all wildly different.  A mechanism for asking for
general permission to "consume this image" is COMPLETELY MISSING THE
POINT.  (Never mind that the current crop of LLMs seem entirely
incapable of constraining their own use of some piece of input, but
that's a different issue and is besides the point here.)

^ permalink raw reply

* Re: [PATCH v3 17/30] liveupdate: luo_files: luo_ioctl: Unregister all FDs on device close
From: Pratyush Yadav @ 2025-08-27 15:34 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-18-pasha.tatashin@soleen.com>

Hi Pasha,

On Thu, Aug 07 2025, Pasha Tatashin wrote:

> Currently, a file descriptor registered for preservation via the remains
> globally registered with LUO until it is explicitly unregistered. This
> creates a potential for resource leaks into the next kernel if the
> userspace agent crashes or exits without proper cleanup before a live
> update is fully initiated.
>
> This patch ties the lifetime of FD preservation requests to the lifetime
> of the open file descriptor for /dev/liveupdate, creating an implicit
> "session".
>
> When the /dev/liveupdate file descriptor is closed (either explicitly
> via close() or implicitly on process exit/crash), the .release
> handler, luo_release(), is now called. This handler invokes the new
> function luo_unregister_all_files(), which iterates through all FDs
> that were preserved through that session and unregisters them.

Why special case files here? Shouldn't you undo all the serialization
done for all the subsystems?

Anyway, this is buggy. I found this when testing the memfd patches. If
you preserve a memfd and close the /dev/liveupdate FD before reboot,
luo_unregister_all_files() calls the cancel callback, which calls
kho_unpreserve_folio(). But kho_unpreserve_folio() fails because KHO is
still in finalized state. This doesn't happen when cancelling explicitly
because luo_cancel() calls kho_abort().

I think you should just make the release go through the cancel flow,
since the operation is essentially a cancel anyway. There are subtle
differences here though, since the release might be called before
prepare, so we need to be careful of that.


>
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> ---
>  kernel/liveupdate/luo_files.c    | 19 +++++++++++++++++++
>  kernel/liveupdate/luo_internal.h |  1 +
>  kernel/liveupdate/luo_ioctl.c    |  1 +
>  3 files changed, 21 insertions(+)
>
> diff --git a/kernel/liveupdate/luo_files.c b/kernel/liveupdate/luo_files.c
> index 33577c9e9a64..63f8b086b785 100644
> --- a/kernel/liveupdate/luo_files.c
> +++ b/kernel/liveupdate/luo_files.c
> @@ -721,6 +721,25 @@ int luo_unregister_file(u64 token)
>  	return ret;
>  }
>  
> +/**
> + * luo_unregister_all_files - Unpreserve all currently registered files.
> + *
> + * Iterates through all file descriptors currently registered for preservation
> + * and unregisters them, freeing all associated resources. This is typically
> + * called when LUO agent exits.
> + */
> +void luo_unregister_all_files(void)
> +{
> +	struct luo_file *luo_file;
> +	unsigned long token;
> +
> +	luo_state_read_enter();
> +	xa_for_each(&luo_files_xa_out, token, luo_file)
> +		__luo_unregister_file(token);
> +	luo_state_read_exit();
> +	WARN_ON_ONCE(atomic64_read(&luo_files_count) != 0);
> +}
> +
>  /**
>   * luo_retrieve_file - Find a registered file instance by its token.
>   * @token: The unique token of the file instance to retrieve.
> diff --git a/kernel/liveupdate/luo_internal.h b/kernel/liveupdate/luo_internal.h
> index 5692196fd425..189e032d7738 100644
> --- a/kernel/liveupdate/luo_internal.h
> +++ b/kernel/liveupdate/luo_internal.h
> @@ -37,5 +37,6 @@ void luo_do_subsystems_cancel_calls(void);
>  int luo_retrieve_file(u64 token, struct file **filep);
>  int luo_register_file(u64 token, int fd);
>  int luo_unregister_file(u64 token);
> +void luo_unregister_all_files(void);
>  
>  #endif /* _LINUX_LUO_INTERNAL_H */
> diff --git a/kernel/liveupdate/luo_ioctl.c b/kernel/liveupdate/luo_ioctl.c
> index 6f61569c94e8..7ca33d1c868f 100644
> --- a/kernel/liveupdate/luo_ioctl.c
> +++ b/kernel/liveupdate/luo_ioctl.c
> @@ -137,6 +137,7 @@ static int luo_open(struct inode *inodep, struct file *filep)
>  
>  static int luo_release(struct inode *inodep, struct file *filep)
>  {
> +	luo_unregister_all_files();
>  	atomic_set(&luo_device_in_use, 0);
>  
>  	return 0;

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-08-27 15:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162019.GD2130239@nvidia.com>

Hi Jason,

Thanks for the review.

On Tue, Aug 26 2025, Jason Gunthorpe wrote:

> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>
>> +	/*
>> +	 * Most of the space should be taken by preserved folios. So take its
>> +	 * size, plus a page for other properties.
>> +	 */
>> +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
>> +	if (!fdt) {
>> +		err = -ENOMEM;
>> +		goto err_unpin;
>> +	}
>
> This doesn't seem to have any versioning scheme, it really should..

It does. See the "compatible" property.

    static const char memfd_luo_compatible[] = "memfd-v1";

static struct liveupdate_file_handler memfd_luo_handler = {
	.ops = &memfd_luo_file_ops,
	.compatible = memfd_luo_compatible,
};

This goes into the LUO FDT:

	static int luo_files_to_fdt(struct xarray *files_xa_out)
	[...]
	xa_for_each(files_xa_out, token, h) {
		[...]
		ret = fdt_property_string(luo_file_fdt_out, "compatible",
					  h->fh->compatible);

So this function only gets called for the version 1.

>
>> +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
>> +				       (void **)&preserved_folios);
>> +	if (err) {
>> +		pr_err("Failed to reserve folios property in FDT: %s\n",
>> +		       fdt_strerror(err));
>> +		err = -ENOMEM;
>> +		goto err_free_fdt;
>> +	}
>
> Yuk.
>
> This really wants some luo helper
>
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'
>
> Which would get a linearized list of pages in the vmap to hold the
> array and then allocate some structure to record the page list and
> return back the u64 of the phys_addr of the top of the structure to
> store in whatever.
>
> Getting fdt to allocate the array inside the fds is just not going to
> work for anything of size.

Yep, I agree. This version already runs into size limits of around 1 GiB
due to the FDT being limited to MAX_PAGE_ORDER, since that is the
largest contiguous piece of memory folio_alloc() can give us. On top,
FDT is only limited to 32 bits. While very large, it isn't unreasonable
to expect metadata exceeding that for some use cases (4 GiB is only 0.4%
of 1 TiB and there are systems a lot larger than that around).

I think we need something a luo_xarray data structure that users like
memfd (and later hugetlb and guest_memfd and maybe others) can build to
make serialization easier. It will cover both contiguous arrays and
arrays with some holes in them.

I did it this way mainly to keep things simple and get things out. But
Pasha already mentioned he is running into this limit for some tests, so
I think I will experiment around with a serialized xarray design.

>
>> +	for (; i < nr_pfolios; i++) {
>> +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
>> +		phys_addr_t phys;
>> +		u64 index;
>> +		int flags;
>> +
>> +		if (!pfolio->foliodesc)
>> +			continue;
>> +
>> +		phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
>> +		folio = kho_restore_folio(phys);
>> +		if (!folio) {
>> +			pr_err("Unable to restore folio at physical address: %llx\n",
>> +			       phys);
>> +			goto put_file;
>> +		}
>> +		index = pfolio->index;
>> +		flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
>> +
>> +		/* Set up the folio for insertion. */
>> +		/*
>> +		 * TODO: Should find a way to unify this and
>> +		 * shmem_alloc_and_add_folio().
>> +		 */
>> +		__folio_set_locked(folio);
>> +		__folio_set_swapbacked(folio);
>> 
>> +		ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
>> +		if (ret) {
>> +			pr_err("shmem: failed to charge folio index %d: %d\n",
>> +			       i, ret);
>> +			goto unlock_folio;
>> +		}
>
> [..]
>
>> +		folio_add_lru(folio);
>> +		folio_unlock(folio);
>> +		folio_put(folio);
>> +	}
>
> Probably some consolidation will be needed to make this less
> duplicated..

Maybe. I do have that as a TODO item, but I took a quick look today and
I am not sure if it will make things simple enough. There are a few
places that add a folio to the shmem page cache, and all of them have
subtle differences and consolidating them all might be tricky. Let me
give it a shot...

>
> But overall I think just using the memfd_luo_preserved_folio as the
> serialization is entirely file, I don't think this needs anything more
> complicated.
>
> What it does need is an alternative to the FDT with versioning.

As I explained above, the versioning is already there. Beyond that, why
do you think a raw C struct is better than FDT? It is just another way
of expressing the same information. FDT is a bit more cumbersome to
write and read, but comes at the benefit of more introspect-ability.

>
> Which seems to me to be entirely fine as:
>
>  struct memfd_luo_v0 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios;
>  };
>
>  struct memfd_luo_v0 memfd_luo_v0 = {.size = size, pos = file->f_pos, folios = folios};
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>
> Which also shows the actual data needing to be serialized comes from
> more than one struct and has to be marshaled in code, somehow, to a
> single struct.
>
> Then I imagine a fairly simple forwards/backwards story. If something
> new is needed that is non-optional, lets say you compress the folios
> list to optimize holes:
>
>  struct memfd_luo_v1 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios_list_with_holes;
>  };
>
> Obviously a v0 kernel cannot parse this, but in this case a v1 aware
> kernel could optionally duplicate and write out the v0 format as well:
>
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>  luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);

I think what you describe here is essentially how LUO works currently,
just that the mechanisms are a bit different.

For example, instead of the subsystem calling luo_store_object(), the
LUO core calls back into the subsystem at the appropriate time to let it
populate the object. See memfd_luo_prepare() and the data argument. The
version is decided by the compatible string with which the handler was
registered.

Since LUO knows when to start serializing what, I think this flow of
calling into the subsystem and letting it fill in an object that LUO
tracks and hands over makes a lot of sense.

>
> Then the rule is fairly simple, when the sucessor kernel goes to
> deserialize it asks luo for the versions it supports:
>
>  if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
>     restore_v1(&memfd_luo_v1)
>  else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
>     restore_v0(&memfd_luo_v0)
>  else
>     luo_failure("Do not understand this");

Similarly, on restore side, the new kernel can register handlers of all
the versions it can deal with, and LUO core takes care of calling into
the right callback. See  memfd_luo_retrieve() for example. If we now have
a v2, the new kernel can simply define a new handler for v2 and add a
new memfd_luo_retrieve_v2().

>
> luo core just manages this list of versioned data per serialized
> object. There is only one version per object.

This also holds true.

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pratyush Yadav @ 2025-08-27 14:01 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Jason Gunthorpe, Pratyush Yadav, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bB9r_pMzd0VbLsAGTwh8kvV_o3rFM_W--drutewomr1ZQ@mail.gmail.com>

On Tue, Aug 26 2025, Pasha Tatashin wrote:

>> > The existing interface, with the addition of passing a pidfd, provides
>> > the necessary flexibility without being invasive. The change would be
>> > localized to the new code that performs the FD retrieval and wouldn't
>> > involve spoofing current or making widespread changes.
>> > For example, to handle cgroup charging for a memfd, the flow inside
>> > memfd_luo_retrieve() would look something like this:
>> >
>> > task = get_pid_task(target_pid, PIDTYPE_PID);
>> > mm = get_task_mm(task);
>> >     // ...
>> >     folio = kho_restore_folio(phys);
>> >     // Charge to the target mm, not 'current->mm'
>> >     mem_cgroup_charge(folio, mm, ...);
>> > mmput(mm);
>> > put_task_struct(task);
>> >
>> > This approach seems quite contained, and does not modify the existing
>> > interfaces. It avoids the need for the kernel to manage the entire
>> > session state and its associated security model.

Even with sessions, I don't think the kernel has to deal with the
security model. /dev/liveupdate can still be single-open only, with only
luod getting access to it. The the kernel just hands over sessions to
luod (maybe with a new ioctl LIVEUPDATE_IOCTL_CREATE_SESSION), and luod
takes care of the security model and lifecycle. If luod crashes and
loses its handle to /dev/liveupdate, all the sessions associated with it
go away too.

Essentially, the sessions from kernel perspective would just be a
container to group different resources together. I think this adds a
small bit of complexity on the session management and serialization
side, but I think will save complexity on participating subsystems.

>>
>> Execpt it doesn't work like that in all places, iommufd for example
>> uses GFP_KERNEL_ACCOUNT which relies on current.
>
> That's a good point. For kernel allocations, I don't see a clean way
> to account for a different process.
>
> We should not be doing major allocations during the retrieval process
> itself. Ideally, the kernel would restore an FD using only the
> preserved folio data (that we can cleanly charge), and then let the
> user process perform any subsequent actions that might cause new
> kernel memory allocations. However, I can see how that might not be
> practical for all handlers.
>
> Perhaps, we should add session extensions to the kernel as follow-up
> after this series lands, we would also need to rewrite luod design
> accordingly to move some of the sessions logic into the kernel.

I know the KHO is supposed to not be backwards compatible yet. What is
the goal for the LUO APIs? Are they also not backwards compatible? If
not, I think we should also consider how sessions will play into
backwards compatibility. For example, once we add sessions, what happens
to the older versions of luod that directly call preserve or unpreserve?

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Aleksa Sarai @ 2025-08-27 10:29 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Al Viro, Christian Brauner, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module, Andy Lutomirski, Jeff Xu
In-Reply-To: <20250822170800.2116980-2-mic@digikod.net>

[-- Attachment #1: Type: text/plain, Size: 6799 bytes --]

On 2025-08-22, Mickaël Salaün <mic@digikod.net> wrote:
> Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> passed file descriptors).  This changes the state of the opened file by
> making it read-only until it is closed.  The main use case is for script
> interpreters to get the guarantee that script' content cannot be altered
> while being read and interpreted.  This is useful for generic distros
> that may not have a write-xor-execute policy.  See commit a5874fde3c08
> ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> 
> Both execve(2) and the IOCTL to enable fsverity can already set this
> property on files with deny_write_access().  This new O_DENY_WRITE make
> it widely available.  This is similar to what other OSs may provide
> e.g., opening a file with only FILE_SHARE_READ on Windows.
> 
> Cc: Al Viro <viro@zeniv.linux.org.uk>
> Cc: Andy Lutomirski <luto@amacapital.net>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Jeff Xu <jeffxu@chromium.org>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Paul Moore <paul@paul-moore.com>
> Cc: Serge Hallyn <serge@hallyn.com>
> Reported-by: Robert Waite <rowait@microsoft.com>
> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> Link: https://lore.kernel.org/r/20250822170800.2116980-2-mic@digikod.net
> ---
>  fs/fcntl.c                       | 26 ++++++++++++++++++++++++--
>  fs/file_table.c                  |  2 ++
>  fs/namei.c                       |  6 ++++++
>  include/linux/fcntl.h            |  2 +-
>  include/uapi/asm-generic/fcntl.h |  4 ++++
>  5 files changed, 37 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 5598e4d57422..0c80c0fbc706 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -34,7 +34,8 @@
>  
>  #include "internal.h"
>  
> -#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
> +#define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME | \
> +	O_DENY_WRITE)
>  
>  static int setfl(int fd, struct file * filp, unsigned int arg)
>  {
> @@ -80,8 +81,29 @@ static int setfl(int fd, struct file * filp, unsigned int arg)
>  			error = 0;
>  	}
>  	spin_lock(&filp->f_lock);
> +
> +	if (arg & O_DENY_WRITE) {
> +		/* Only regular files. */
> +		if (!S_ISREG(inode->i_mode)) {
> +			error = -EINVAL;
> +			goto unlock;
> +		}
> +
> +		/* Only sets once. */
> +		if (!(filp->f_flags & O_DENY_WRITE)) {
> +			error = exe_file_deny_write_access(filp);
> +			if (error)
> +				goto unlock;
> +		}
> +	} else {
> +		if (filp->f_flags & O_DENY_WRITE)
> +			exe_file_allow_write_access(filp);
> +	}

I appreciate the goal of making this something that can be cleared
(presumably for interpreters that mmap(MAP_PRIVATE) their scripts), but
making a security-related flag this easy to clear seems like a footgun
(any library function could mask O_DENY_WRITE or forget to copy the old
flag values).

> +
>  	filp->f_flags = (arg & SETFL_MASK) | (filp->f_flags & ~SETFL_MASK);
>  	filp->f_iocb_flags = iocb_flags(filp);
> +
> +unlock:
>  	spin_unlock(&filp->f_lock);
>  
>   out:
> @@ -1158,7 +1180,7 @@ static int __init fcntl_init(void)
>  	 * Exceptions: O_NONBLOCK is a two bit define on parisc; O_NDELAY
>  	 * is defined as O_NONBLOCK on some platforms and not on others.
>  	 */
> -	BUILD_BUG_ON(20 - 1 /* for O_RDONLY being 0 */ !=
> +	BUILD_BUG_ON(21 - 1 /* for O_RDONLY being 0 */ !=
>  		HWEIGHT32(
>  			(VALID_OPEN_FLAGS & ~(O_NONBLOCK | O_NDELAY)) |
>  			__FMODE_EXEC));
> diff --git a/fs/file_table.c b/fs/file_table.c
> index 81c72576e548..6ba896b6a53f 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -460,6 +460,8 @@ static void __fput(struct file *file)
>  	locks_remove_file(file);
>  
>  	security_file_release(file);
> +	if (unlikely(file->f_flags & O_DENY_WRITE))
> +		exe_file_allow_write_access(file);
>  	if (unlikely(file->f_flags & FASYNC)) {
>  		if (file->f_op->fasync)
>  			file->f_op->fasync(-1, file, 0);
> diff --git a/fs/namei.c b/fs/namei.c
> index cd43ff89fbaa..366530bf937d 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -3885,6 +3885,12 @@ static int do_open(struct nameidata *nd,
>  	error = may_open(idmap, &nd->path, acc_mode, open_flag);
>  	if (!error && !(file->f_mode & FMODE_OPENED))
>  		error = vfs_open(&nd->path, file);
> +	if (!error && (open_flag & O_DENY_WRITE)) {
> +		if (S_ISREG(file_inode(file)->i_mode))
> +			error = exe_file_deny_write_access(file);
> +		else
> +			error = -EINVAL;
> +	}
>  	if (!error)
>  		error = security_file_post_open(file, op->acc_mode);
>  	if (!error && do_truncate)
> diff --git a/include/linux/fcntl.h b/include/linux/fcntl.h
> index a332e79b3207..dad14101686f 100644
> --- a/include/linux/fcntl.h
> +++ b/include/linux/fcntl.h
> @@ -10,7 +10,7 @@
>  	(O_RDONLY | O_WRONLY | O_RDWR | O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC | \
>  	 O_APPEND | O_NDELAY | O_NONBLOCK | __O_SYNC | O_DSYNC | \
>  	 FASYNC	| O_DIRECT | O_LARGEFILE | O_DIRECTORY | O_NOFOLLOW | \
> -	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE)
> +	 O_NOATIME | O_CLOEXEC | O_PATH | __O_TMPFILE | O_DENY_WRITE)

I don't like this patch for the same reasons Christian has already said,
but in addition -- you cannot just add new open(2) flags like this.

Unlike openat2(2), classic open(2) does not verify invalid flag bits, so
any new flag must be designed so that old kernels will return an error
for that flag combination, which ensures that:

 * No old programs set those bits inadvertently, which lets us avoid
   breaking userspace in some really fun and hard-to-debug ways.
 * For security-related bits, that new programs running on old kernels
   do not think they are getting a security property that they aren't
   actually getting.

O_TMPFILE's bitflag soup is an example of how you can resolve this issue
for open(2), but I would suggest that authors of new O_* flags seriously
consider making their flags openat2(2)-only unless it's trivial to get
the above behaviour.

>  /* List of all valid flags for the how->resolve argument: */
>  #define VALID_RESOLVE_FLAGS \
> diff --git a/include/uapi/asm-generic/fcntl.h b/include/uapi/asm-generic/fcntl.h
> index 613475285643..facd9136f5af 100644
> --- a/include/uapi/asm-generic/fcntl.h
> +++ b/include/uapi/asm-generic/fcntl.h
> @@ -91,6 +91,10 @@
>  /* a horrid kludge trying to make sure that this will fail on old kernels */
>  #define O_TMPFILE (__O_TMPFILE | O_DIRECTORY)
>  
> +#ifndef O_DENY_WRITE
> +#define O_DENY_WRITE	040000000
> +#endif
> +
>  #ifndef O_NDELAY
>  #define O_NDELAY	O_NONBLOCK
>  #endif

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Aleksa Sarai @ 2025-08-27 10:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Mickaël Salaün, Al Viro, Christian Brauner, Kees Cook,
	Paul Moore, Serge Hallyn, Andy Lutomirski, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module, Andy Lutomirski, Jeff Xu
In-Reply-To: <CAG48ez1XjUdcFztc_pF2qcoLi7xvfpJ224Ypc=FoGi-Px-qyZw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 2558 bytes --]

On 2025-08-22, Jann Horn <jannh@google.com> wrote:
> On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > passed file descriptors).  This changes the state of the opened file by
> > making it read-only until it is closed.  The main use case is for script
> > interpreters to get the guarantee that script' content cannot be altered
> > while being read and interpreted.  This is useful for generic distros
> > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> >
> > Both execve(2) and the IOCTL to enable fsverity can already set this
> > property on files with deny_write_access().  This new O_DENY_WRITE make
> 
> The kernel actually tried to get rid of this behavior on execve() in
> commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> because it broke userspace assumptions.

Also the ETXTBSY behaviour for binaries is not always guaranteed to
block writes to the file. When we were discussing this back in 2021 and
when we initially removed it, I remember there being some fairly trivial
ways to get around it anyway (but because process mm is mapped with
MAP_PRIVATE, writes aren't seen by the actual process).

> > it widely available.  This is similar to what other OSs may provide
> > e.g., opening a file with only FILE_SHARE_READ on Windows.
> 
> We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> removed for security reasons; as
> https://man7.org/linux/man-pages/man2/mmap.2.html says:
> 
> |        MAP_DENYWRITE
> |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> |               signaled that attempts to write to the underlying file
> |               should fail with ETXTBSY.  But this was a source of denial-
> |               of-service attacks.)"
> 
> It seems to me that the same issue applies to your patch - it would
> allow unprivileged processes to essentially lock files such that other
> processes can't write to them anymore. This might allow unprivileged
> users to prevent root from updating config files or stuff like that if
> they're updated in-place.

Agreed, and this was one of the major issues with the also-now-removed
mandatory locking as well.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [PATCH v3 1/2] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Aleksa Sarai @ 2025-08-27  9:42 UTC (permalink / raw)
  To: Askar Safin
  Cc: Alejandro Colomar, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250826083227.2611457-2-safinaskar@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 2382 bytes --]

On 2025-08-26, Askar Safin <safinaskar@zohomail.com> wrote:
> My edit is based on experiments and reading Linux code
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>
> ---
>  man/man2/mount.2 | 27 ++++++++++++++++++++++++---
>  1 file changed, 24 insertions(+), 3 deletions(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 5d83231f9..599c2d6fa 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -405,7 +405,25 @@ flag can be used with
>  to modify only the per-mount-point flags.
>  .\" See https://lwn.net/Articles/281157/
>  This is particularly useful for setting or clearing the "read-only"
> -flag on a mount without changing the underlying filesystem.
> +flag on a mount without changing the underlying filesystem parameters.

When reading the whole sentence, this feels a bit incomplete
("filesystem parameters ... of what?"). Maybe

  This is particularly useful for setting or clearing the "read-only"
  flag on a mount without changing the underlying filesystem's
  filesystem parameters.

or

  This is particularly useful for setting or clearing the "read-only"
  flag on a mount without changing the filesystem parameters of the
  underlying filesystem.

would be better?

That one nit aside, feel free to take my

Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>

> +The
> +.I data
> +argument is ignored if
> +.B MS_REMOUNT
> +and
> +.B MS_BIND
> +are specified.
> +The mount point will
> +have its existing per-mount-point flags
> +cleared and replaced with those in
> +.IR mountflags .
> +This means that
> +if you wish to preserve
> +any existing per-mount-point flags,
> +you need to include them in
> +.IR mountflags ,
> +along with the per-mount-point flags you wish to set
> +(or with the flags you wish to clear missing).
>  Specifying
>  .I mountflags
>  as:
> @@ -416,8 +434,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
>  .EE
>  .in
>  .P
> -will make access through this mountpoint read-only, without affecting
> -other mounts.
> +will make access through this mount point read-only
> +(clearing all other per-mount-point flags),
> +without affecting
> +other mounts
> +of this filesystem.
>  .\"
>  .SS Creating a bind mount
>  If
> -- 
> 2.47.2
> 

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 265 bytes --]

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Mickaël Salaün @ 2025-08-27  8:19 UTC (permalink / raw)
  To: Jeff Xu
  Cc: Jeff Xu, Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner,
	Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <CABi2SkUJ1PDm_uri=4o+C13o5wFQD=xA7zVKU-we+unsEDm3dw@mail.gmail.com>

On Tue, Aug 26, 2025 at 01:29:55PM -0700, Jeff Xu wrote:
> Hi Mickaël
> 
> On Tue, Aug 26, 2025 at 5:39 AM Mickaël Salaün <mic@digikod.net> wrote:
> >
> > On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote:
> > > Hi Mickaël
> > >
> > > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > >
> > > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> > > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > >
> > > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > > > > > passed file descriptors).  This changes the state of the opened file by
> > > > > > > > making it read-only until it is closed.  The main use case is for script
> > > > > > > > interpreters to get the guarantee that script' content cannot be altered
> > > > > > > > while being read and interpreted.  This is useful for generic distros
> > > > > > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > > > > > >
> > > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > > > > > property on files with deny_write_access().  This new O_DENY_WRITE make
> > > > > > >
> > > > > > > The kernel actually tried to get rid of this behavior on execve() in
> > > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> > > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > > > > > because it broke userspace assumptions.
> > > > > >
> > > > > > Oh, good to know.
> > > > > >
> > > > > > >
> > > > > > > > it widely available.  This is similar to what other OSs may provide
> > > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > > > > > >
> > > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > > > > > removed for security reasons; as
> > > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > > > > > >
> > > > > > > |        MAP_DENYWRITE
> > > > > > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > > > > > |               signaled that attempts to write to the underlying file
> > > > > > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > > > > > |               of-service attacks.)"
> > > > > > >
> > > > > > > It seems to me that the same issue applies to your patch - it would
> > > > > > > allow unprivileged processes to essentially lock files such that other
> > > > > > > processes can't write to them anymore. This might allow unprivileged
> > > > > > > users to prevent root from updating config files or stuff like that if
> > > > > > > they're updated in-place.
> > > > > >
> > > > > > Yes, I agree, but since it is the case for executed files I though it
> > > > > > was worth starting a discussion on this topic.  This new flag could be
> > > > > > restricted to executable files, but we should avoid system-wide locks
> > > > > > like this.  I'm not sure how Windows handle these issues though.
> > > > > >
> > > > > > Anyway, we should rely on the access control policy to control write and
> > > > > > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > > > > > the references and the background!
> > > > >
> > > > > I'm confused.  I understand that there are many contexts in which one
> > > > > would want to prevent execution of unapproved content, which might
> > > > > include preventing a given process from modifying some code and then
> > > > > executing it.
> > > > >
> > > > > I don't understand what these deny-write features have to do with it.
> > > > > These features merely prevent someone from modifying code *that is
> > > > > currently in use*, which is not at all the same thing as preventing
> > > > > modifying code that might get executed -- one can often modify
> > > > > contents *before* executing those contents.
> > > >
> > > > The order of checks would be:
> > > > 1. open script with O_DENY_WRITE
> > > > 2. check executability with AT_EXECVE_CHECK
> > > > 3. read the content and interpret it
> > > >
> > > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving.
> > >
> > > AT_EXECVE_CHECK is not just for scripting languages. It could also
> > > work with bytecodes like Java, for example. If we let the Java runtime
> > > call AT_EXECVE_CHECK before loading the bytecode, the LSM could
> > > develop a policy based on that.
> >
> > Sure, I'm using "script" to make it simple, but this applies to other
> > use cases.
> >
> That makes sense.
> 
> > >
> > > > The deny-write feature was to guarantee that there is no race condition
> > > > between step 2 and 3.  All these checks are supposed to be done by a
> > > > trusted interpreter (which is allowed to be executed).  The
> > > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > > > associated security policies) allowed the *current* content of the file
> > > > to be executed.  Whatever happen before or after that (wrt.
> > > > O_DENY_WRITE) should be covered by the security policy.
> > > >
> > > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK.
> > >
> > > Enforcing non-write for the path that stores scripts or bytecodes can
> > > be challenging due to historical or backward compatibility reasons.
> > > Since AT_EXECVE_CHECK provides a mechanism to check the file right
> > > before it is used, we can assume it will detect any "problem" that
> > > happened before that, (e.g. the file was overwritten). However, that
> > > also imposes two additional requirements:
> > > 1> the file doesn't change while AT_EXECVE_CHECK does the check.
> >
> > This is already the case, so any kind of LSM checks are good.
> >
> May I ask how this is done? some code in do_open_execat() does this ?
> Apologies if this is a basic question.

do_open_execat() calls exe_file_deny_write_access()

> 
> > > 2>The file content kept by the process remains unchanged after passing
> > > the AT_EXECVE_CHECK.
> >
> > The goal of this patch was to avoid such race condition in the case
> > where executable files can be updated.  But in most cases it should not
> > be a security issue (because processes allowed to write to executable
> > files should be trusted), but this could still lead to bugs (because of
> > inconsistent file content, half-updated).
> >
> There is also a time gap between:
> a> the time of AT_EXECVE_CHECK
> b> the time that the app opens the file for execution.
> right ? another potential attack path (though this is not the case I
> mentioned previously).

As explained in the documentation, to avoid this specific race
condition, interpreters should open the script once, check the FD with
AT_EXECVE_CHECK, and then read the content with the same FD.

> 
> For the case I mentioned previously, I have to think more if the race
> condition is a bug or security issue.
> IIUC, two solutions are discussed so far:
> 1> the process could write to fs to update the script.  However, for
> execution, the process still uses the copy that passed the
> AT_EXECVE_CHECK. (snapshot solution by Andy Lutomirski)

Yes, the snapshot solution would be the best, but I guess it would rely
on filesystems to support this feature.

> or 2> the process blocks the write while opening the file as read only
> and executing the script. (this seems to be the approach of this
> patch).

Yes, and this is not something we want anymore.

> 
> I wonder if there are other ideas.

I don't see other efficient ways do give the same guarantees.

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-27  8:19 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826205057.GC1603531@mit.edu>

On Tue, Aug 26, 2025 at 04:50:57PM -0400, Theodore Ts'o wrote:
> On Tue, Aug 26, 2025 at 07:47:30PM +0200, Mickaël Salaün wrote:
> > 
> >   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
> >   on a regular file and returns 0 if execution of this file would be
> >   allowed, ignoring the file format and then the related interpreter
> >   dependencies (e.g. ELF libraries, script’s shebang).
> 
> But if that's it, why can't the script interpreter (python, bash,
> etc.) before executing the script, checks for executability via
> faccessat(2) or fstat(2)?

From commit a5874fde3c08 ("exec: Add a new AT_EXECVE_CHECK flag to
execveat(2)"):

    This is different from faccessat(2) + X_OK which only checks a subset of
    access rights (i.e. inode permission and mount options for regular
    files), but not the full context (e.g. all LSM access checks).  The main
    use case for access(2) is for SUID processes to (partially) check access
    on behalf of their caller.  The main use case for execveat(2) +
    AT_EXECVE_CHECK is to check if a script execution would be allowed,
    according to all the different restrictions in place.  Because the use
    of AT_EXECVE_CHECK follows the exact kernel semantic as for a real
    execution, user space gets the same error codes.


> 
> The whole O_DONY_WRITE dicsussion seemed to imply that AT_EXECVE_CHECK
> was doing more than just the executability check?

I would say that that AT_EXECVE_CHECK does a full executability check
(with the full caller's credentials checked against the currently
enforced security policy).

The rationale to add O_DENY_WRITE (which is now abandoned) was to avoid a race
condition between the check and the full read.  Indeed, with a full
execveat(2), the kernel write-lock the file to avoid such issue (which can lead
to other issues).

> 
> > There is no other way for user space to reliably check executability of
> > files (taking into account all enforced security
> > policies/configurations).
> 
> Why doesn't faccessat(2) or fstat(2) suffice?  This is why having a
> more substantive requirements and design doc might be helpful.  It
> appears you have some assumptions that perhaps other kernel developers
> are not aware.  I certainly seem to be missing something.....

My reasoning was to explain the rationale for a kernel feature in the commit
message, and the user doc (why and how to use it) in the user-facing
documentation.  Documentation improvements are welcome!

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Theodore Ts'o @ 2025-08-26 20:50 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826.iewie7Et5aiw@digikod.net>

On Tue, Aug 26, 2025 at 07:47:30PM +0200, Mickaël Salaün wrote:
> 
>   Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
>   on a regular file and returns 0 if execution of this file would be
>   allowed, ignoring the file format and then the related interpreter
>   dependencies (e.g. ELF libraries, script’s shebang).

But if that's it, why can't the script interpreter (python, bash,
etc.) before executing the script, checks for executability via
faccessat(2) or fstat(2)?

The whole O_DONY_WRITE dicsussion seemed to imply that AT_EXECVE_CHECK
was doing more than just the executability check?

> There is no other way for user space to reliably check executability of
> files (taking into account all enforced security
> policies/configurations).

Why doesn't faccessat(2) or fstat(2) suffice?  This is why having a
more substantive requirements and design doc might be helpful.  It
appears you have some assumptions that perhaps other kernel developers
are not aware.  I certainly seem to be missing something.....

    		  	    	    - Ted

^ permalink raw reply

* Re: [RFC PATCH v1 1/2] fs: Add O_DENY_WRITE
From: Jeff Xu @ 2025-08-26 20:29 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Jeff Xu, Andy Lutomirski, Jann Horn, Al Viro, Christian Brauner,
	Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <20250826.eWi6chuayae4@digikod.net>

Hi Mickaël

On Tue, Aug 26, 2025 at 5:39 AM Mickaël Salaün <mic@digikod.net> wrote:
>
> On Mon, Aug 25, 2025 at 10:57:57AM -0700, Jeff Xu wrote:
> > Hi Mickaël
> >
> > On Mon, Aug 25, 2025 at 2:31 AM Mickaël Salaün <mic@digikod.net> wrote:
> > >
> > > On Sun, Aug 24, 2025 at 11:04:03AM -0700, Andy Lutomirski wrote:
> > > > On Sun, Aug 24, 2025 at 4:03 AM Mickaël Salaün <mic@digikod.net> wrote:
> > > > >
> > > > > On Fri, Aug 22, 2025 at 09:45:32PM +0200, Jann Horn wrote:
> > > > > > On Fri, Aug 22, 2025 at 7:08 PM Mickaël Salaün <mic@digikod.net> wrote:
> > > > > > > Add a new O_DENY_WRITE flag usable at open time and on opened file (e.g.
> > > > > > > passed file descriptors).  This changes the state of the opened file by
> > > > > > > making it read-only until it is closed.  The main use case is for script
> > > > > > > interpreters to get the guarantee that script' content cannot be altered
> > > > > > > while being read and interpreted.  This is useful for generic distros
> > > > > > > that may not have a write-xor-execute policy.  See commit a5874fde3c08
> > > > > > > ("exec: Add a new AT_EXECVE_CHECK flag to execveat(2)")
> > > > > > >
> > > > > > > Both execve(2) and the IOCTL to enable fsverity can already set this
> > > > > > > property on files with deny_write_access().  This new O_DENY_WRITE make
> > > > > >
> > > > > > The kernel actually tried to get rid of this behavior on execve() in
> > > > > > commit 2a010c41285345da60cece35575b4e0af7e7bf44.; but sadly that had
> > > > > > to be reverted in commit 3b832035387ff508fdcf0fba66701afc78f79e3d
> > > > > > because it broke userspace assumptions.
> > > > >
> > > > > Oh, good to know.
> > > > >
> > > > > >
> > > > > > > it widely available.  This is similar to what other OSs may provide
> > > > > > > e.g., opening a file with only FILE_SHARE_READ on Windows.
> > > > > >
> > > > > > We used to have the analogous mmap() flag MAP_DENYWRITE, and that was
> > > > > > removed for security reasons; as
> > > > > > https://man7.org/linux/man-pages/man2/mmap.2.html says:
> > > > > >
> > > > > > |        MAP_DENYWRITE
> > > > > > |               This flag is ignored.  (Long ago—Linux 2.0 and earlier—it
> > > > > > |               signaled that attempts to write to the underlying file
> > > > > > |               should fail with ETXTBSY.  But this was a source of denial-
> > > > > > |               of-service attacks.)"
> > > > > >
> > > > > > It seems to me that the same issue applies to your patch - it would
> > > > > > allow unprivileged processes to essentially lock files such that other
> > > > > > processes can't write to them anymore. This might allow unprivileged
> > > > > > users to prevent root from updating config files or stuff like that if
> > > > > > they're updated in-place.
> > > > >
> > > > > Yes, I agree, but since it is the case for executed files I though it
> > > > > was worth starting a discussion on this topic.  This new flag could be
> > > > > restricted to executable files, but we should avoid system-wide locks
> > > > > like this.  I'm not sure how Windows handle these issues though.
> > > > >
> > > > > Anyway, we should rely on the access control policy to control write and
> > > > > execute access in a consistent way (e.g. write-xor-execute).  Thanks for
> > > > > the references and the background!
> > > >
> > > > I'm confused.  I understand that there are many contexts in which one
> > > > would want to prevent execution of unapproved content, which might
> > > > include preventing a given process from modifying some code and then
> > > > executing it.
> > > >
> > > > I don't understand what these deny-write features have to do with it.
> > > > These features merely prevent someone from modifying code *that is
> > > > currently in use*, which is not at all the same thing as preventing
> > > > modifying code that might get executed -- one can often modify
> > > > contents *before* executing those contents.
> > >
> > > The order of checks would be:
> > > 1. open script with O_DENY_WRITE
> > > 2. check executability with AT_EXECVE_CHECK
> > > 3. read the content and interpret it
> > >
> > I'm not sure about the O_DENY_WRITE approach, but the problem is worth solving.
> >
> > AT_EXECVE_CHECK is not just for scripting languages. It could also
> > work with bytecodes like Java, for example. If we let the Java runtime
> > call AT_EXECVE_CHECK before loading the bytecode, the LSM could
> > develop a policy based on that.
>
> Sure, I'm using "script" to make it simple, but this applies to other
> use cases.
>
That makes sense.

> >
> > > The deny-write feature was to guarantee that there is no race condition
> > > between step 2 and 3.  All these checks are supposed to be done by a
> > > trusted interpreter (which is allowed to be executed).  The
> > > AT_EXECVE_CHECK call enables the caller to know if the kernel (and
> > > associated security policies) allowed the *current* content of the file
> > > to be executed.  Whatever happen before or after that (wrt.
> > > O_DENY_WRITE) should be covered by the security policy.
> > >
> > Agree, the race problem needs to be solved in order for AT_EXECVE_CHECK.
> >
> > Enforcing non-write for the path that stores scripts or bytecodes can
> > be challenging due to historical or backward compatibility reasons.
> > Since AT_EXECVE_CHECK provides a mechanism to check the file right
> > before it is used, we can assume it will detect any "problem" that
> > happened before that, (e.g. the file was overwritten). However, that
> > also imposes two additional requirements:
> > 1> the file doesn't change while AT_EXECVE_CHECK does the check.
>
> This is already the case, so any kind of LSM checks are good.
>
May I ask how this is done? some code in do_open_execat() does this ?
Apologies if this is a basic question.

> > 2>The file content kept by the process remains unchanged after passing
> > the AT_EXECVE_CHECK.
>
> The goal of this patch was to avoid such race condition in the case
> where executable files can be updated.  But in most cases it should not
> be a security issue (because processes allowed to write to executable
> files should be trusted), but this could still lead to bugs (because of
> inconsistent file content, half-updated).
>
There is also a time gap between:
a> the time of AT_EXECVE_CHECK
b> the time that the app opens the file for execution.
right ? another potential attack path (though this is not the case I
mentioned previously).

For the case I mentioned previously, I have to think more if the race
condition is a bug or security issue.
IIUC, two solutions are discussed so far:
1> the process could write to fs to update the script.  However, for
execution, the process still uses the copy that passed the
AT_EXECVE_CHECK. (snapshot solution by Andy Lutomirski)
or 2> the process blocks the write while opening the file as read only
and executing the script. (this seems to be the approach of this
patch).

I wonder if there are other ideas.

Thanks and regards,
-Jeff

^ permalink raw reply

* Re: [PATCH v3 19/30] liveupdate: luo_sysfs: add sysfs state monitoring
From: Pasha Tatashin @ 2025-08-26 18:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826160307.GC2130239@nvidia.com>

On Tue, Aug 26, 2025 at 4:03 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:25AM +0000, Pasha Tatashin wrote:
> > Introduce a sysfs interface for the Live Update Orchestrator
> > under /sys/kernel/liveupdate/. This interface provides a way for
> > userspace tools and scripts to monitor the current state of the LUO
> > state machine.
>
> Now that you have a cdev these files may be more logically placed
> under the cdev's sysfs and not under kernel? This can be done easially
> using the attribute mechanisms in the struct device.
>
> Again sort of back to my earlier point that everything should be
> logically linked to the cdev as though there could be many cdevs, even
> though there are not. It just keeps the code design more properly
> layered and understanble rather than doing something unique..

I am going to drop this patch entirely, and only rely on "luoctl
state" (see https://tinyurl.com/luoddesign) to query the state from
"/dev/liveupdate"

Pasha

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Mickaël Salaün @ 2025-08-26 17:47 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <20250826123041.GB1603531@mit.edu>

On Tue, Aug 26, 2025 at 08:30:41AM -0400, Theodore Ts'o wrote:
> Is there a single, unified design and requirements document that
> describes the threat model, and what you are trying to achieve with
> AT_EXECVE_CHECK and O_DENY_WRITE?  I've been looking at the cover
> letters for AT_EXECVE_CHECK and O_DENY_WRITE, and the documentation
> that has landed for AT_EXECVE_CHECK and it really doesn't describe
> what *are* the checks that AT_EXECVE_CHECK is trying to achieve:
> 
>    "The AT_EXECVE_CHECK execveat(2) flag, and the
>    SECBIT_EXEC_RESTRICT_FILE and SECBIT_EXEC_DENY_INTERACTIVE
>    securebits are intended for script interpreters and dynamic linkers
>    to enforce a consistent execution security policy handled by the
>    kernel."

From the documentation:

  Passing the AT_EXECVE_CHECK flag to execveat(2) only performs a check
  on a regular file and returns 0 if execution of this file would be
  allowed, ignoring the file format and then the related interpreter
  dependencies (e.g. ELF libraries, script’s shebang).

> 
> Um, what security policy?

Whether the file is allowed to be executed.  This includes file
permission, mount point option, ACL, LSM policies...

> What checks?

Executability checks?

> What is a sample exploit
> which is blocked by AT_EXECVE_CHECK?

Executing/interpreting any data: sh script.txt

> 
> And then on top of it, why can't you do these checks by modifying the
> script interpreters?

The script interpreter requires modification to use AT_EXECVE_CHECK.

There is no other way for user space to reliably check executability of
files (taking into account all enforced security
policies/configurations).

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-26 17:08 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bB9r_pMzd0VbLsAGTwh8kvV_o3rFM_W--drutewomr1ZQ@mail.gmail.com>

On Tue, Aug 26, 2025 at 05:03:59PM +0000, Pasha Tatashin wrote:

> Perhaps, we should add session extensions to the kernel as follow-up
> after this series lands, we would also need to rewrite luod design
> accordingly to move some of the sessions logic into the kernel.

This is what I imagined at least..

I wouldn't even try to do anything with pid if it can't solve the
whole problem.

Jason

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-26 17:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162203.GE2130239@nvidia.com>

> > The existing interface, with the addition of passing a pidfd, provides
> > the necessary flexibility without being invasive. The change would be
> > localized to the new code that performs the FD retrieval and wouldn't
> > involve spoofing current or making widespread changes.
> > For example, to handle cgroup charging for a memfd, the flow inside
> > memfd_luo_retrieve() would look something like this:
> >
> > task = get_pid_task(target_pid, PIDTYPE_PID);
> > mm = get_task_mm(task);
> >     // ...
> >     folio = kho_restore_folio(phys);
> >     // Charge to the target mm, not 'current->mm'
> >     mem_cgroup_charge(folio, mm, ...);
> > mmput(mm);
> > put_task_struct(task);
>
> Execpt it doesn't work like that in all places, iommufd for example
> uses GFP_KERNEL_ACCOUNT which relies on current.

That's a good point. For kernel allocations, I don't see a clean way
to account for a different process.

We should not be doing major allocations during the retrieval process
itself. Ideally, the kernel would restore an FD using only the
preserved folio data (that we can cleanly charge), and then let the
user process perform any subsequent actions that might cause new
kernel memory allocations. However, I can see how that might not be
practical for all handlers.

Perhaps, we should add session extensions to the kernel as follow-up
after this series lands, we would also need to rewrite luod design
accordingly to move some of the sessions logic into the kernel.

Thank you,
Pasha

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-26 16:22 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bAbqMb0ZYvsC9tsf6w5myfUyqo3N4fUP3CwVA_kUDQteg@mail.gmail.com>

On Tue, Aug 26, 2025 at 04:10:31PM +0000, Pasha Tatashin wrote:
> 
> > I think in the calls the idea was it was reasonable to start without
> > sessions fds at all, but in this case we shouldn't be mucking with
> > pids or current.
> 
> The existing interface, with the addition of passing a pidfd, provides
> the necessary flexibility without being invasive. The change would be
> localized to the new code that performs the FD retrieval and wouldn't
> involve spoofing current or making widespread changes.
> For example, to handle cgroup charging for a memfd, the flow inside
> memfd_luo_retrieve() would look something like this:
> 
> task = get_pid_task(target_pid, PIDTYPE_PID);
> mm = get_task_mm(task);
>     // ...
>     folio = kho_restore_folio(phys);
>     // Charge to the target mm, not 'current->mm'
>     mem_cgroup_charge(folio, mm, ...);
> mmput(mm);
> put_task_struct(task);

Execpt it doesn't work like that in all places, iommufd for example
uses GFP_KERNEL_ACCOUNT which relies on current.

How you fix that when current is the wrong cgroup, I have no idea if
it is even possible.

Jason

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Jason Gunthorpe @ 2025-08-26 16:20 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-30-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:

> +	/*
> +	 * Most of the space should be taken by preserved folios. So take its
> +	 * size, plus a page for other properties.
> +	 */
> +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> +	if (!fdt) {
> +		err = -ENOMEM;
> +		goto err_unpin;
> +	}

This doesn't seem to have any versioning scheme, it really should..

> +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
> +				       (void **)&preserved_folios);
> +	if (err) {
> +		pr_err("Failed to reserve folios property in FDT: %s\n",
> +		       fdt_strerror(err));
> +		err = -ENOMEM;
> +		goto err_free_fdt;
> +	}

Yuk.

This really wants some luo helper

'luo alloc array'
'luo restore array'
'luo free array'

Which would get a linearized list of pages in the vmap to hold the
array and then allocate some structure to record the page list and
return back the u64 of the phys_addr of the top of the structure to
store in whatever.

Getting fdt to allocate the array inside the fds is just not going to
work for anything of size.

> +	for (; i < nr_pfolios; i++) {
> +		const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> +		phys_addr_t phys;
> +		u64 index;
> +		int flags;
> +
> +		if (!pfolio->foliodesc)
> +			continue;
> +
> +		phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> +		folio = kho_restore_folio(phys);
> +		if (!folio) {
> +			pr_err("Unable to restore folio at physical address: %llx\n",
> +			       phys);
> +			goto put_file;
> +		}
> +		index = pfolio->index;
> +		flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
> +
> +		/* Set up the folio for insertion. */
> +		/*
> +		 * TODO: Should find a way to unify this and
> +		 * shmem_alloc_and_add_folio().
> +		 */
> +		__folio_set_locked(folio);
> +		__folio_set_swapbacked(folio);
> 
> +		ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
> +		if (ret) {
> +			pr_err("shmem: failed to charge folio index %d: %d\n",
> +			       i, ret);
> +			goto unlock_folio;
> +		}

[..]

> +		folio_add_lru(folio);
> +		folio_unlock(folio);
> +		folio_put(folio);
> +	}

Probably some consolidation will be needed to make this less
duplicated..

But overall I think just using the memfd_luo_preserved_folio as the
serialization is entirely file, I don't think this needs anything more
complicated.

What it does need is an alternative to the FDT with versioning.

Which seems to me to be entirely fine as:

 struct memfd_luo_v0 {
    __aligned_u64 size;
    __aligned_u64 pos;
    __aligned_u64 folios;
 };

 struct memfd_luo_v0 memfd_luo_v0 = {.size = size, pos = file->f_pos, folios = folios};
 luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);

Which also shows the actual data needing to be serialized comes from
more than one struct and has to be marshaled in code, somehow, to a
single struct.

Then I imagine a fairly simple forwards/backwards story. If something
new is needed that is non-optional, lets say you compress the folios
list to optimize holes:

 struct memfd_luo_v1 {
    __aligned_u64 size;
    __aligned_u64 pos;
    __aligned_u64 folios_list_with_holes;
 };

Obviously a v0 kernel cannot parse this, but in this case a v1 aware
kernel could optionally duplicate and write out the v0 format as well:

 luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
 luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);

Then the rule is fairly simple, when the sucessor kernel goes to
deserialize it asks luo for the versions it supports:

 if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
    restore_v1(&memfd_luo_v1)
 else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
    restore_v0(&memfd_luo_v0)
 else
    luo_failure("Do not understand this");

luo core just manages this list of versioned data per serialized
object. There is only one version per object.

Jason

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-26 16:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826151327.GA2130239@nvidia.com>

On Tue, Aug 26, 2025 at 3:13 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Aug 26, 2025 at 03:02:13PM +0000, Pasha Tatashin wrote:
> > I'm trying to understand the drawbacks of the PID-based approach.
> > Could you elaborate on why passing a PID in the RESTORE_FD ioctl is
> > not a good idea?
>
> It will be a major invasive change all over the place in the kernel
> to change things that assume current to do something else. We should
> try to avoid this.
>
> > In this flow, the client isn't providing an arbitrary PID; the trusted
> > luod agent is providing the PID of a process it has an active
> > connection with.
>
> PIDs are wobbly thing, you can never really trust them unless they are
> in a pidfd.

Makes, sense, using a PID by value is fragile due to reuse. Luod would
acquire a pidfd for the client process from its socket connection and
pass that pidfd to the kernel in the RESTORE_FD ioctl. The kernel
would then be operating on a stable, secure handle to the target
process.

> > The idea was to let luod handle the session/security story, and the
> > kernel handle the core preservation mechanism. Adding sessions to the
> > kernel, delegates the management and part of the security model into
> > the kernel. I am not sure if it is necessary, what can be cleanly
> > managed in userspace should stay in userspace.
>
> session fds were an update imagined to allow the kernel to partition
> things the session FD it self could be shared with other processes.

I understand the model you're proposing: luod acts as a factory,
issuing session FDs that are then passed to clients, allowing them to
perform restore operations within their own context. While we can
certainly extend the design to support that, I am still trying to
determine if it's strictly necessary, especially if the same outcome
(correct resource attribution) can be achieved with less kernel
complexity. My primary concern is that functionality that can be
cleanly managed in userspace should remain there.

> I think in the calls the idea was it was reasonable to start without
> sessions fds at all, but in this case we shouldn't be mucking with
> pids or current.

The existing interface, with the addition of passing a pidfd, provides
the necessary flexibility without being invasive. The change would be
localized to the new code that performs the FD retrieval and wouldn't
involve spoofing current or making widespread changes.
For example, to handle cgroup charging for a memfd, the flow inside
memfd_luo_retrieve() would look something like this:

task = get_pid_task(target_pid, PIDTYPE_PID);
mm = get_task_mm(task);
    // ...
    folio = kho_restore_folio(phys);
    // Charge to the target mm, not 'current->mm'
    mem_cgroup_charge(folio, mm, ...);
mmput(mm);
put_task_struct(task);

This approach seems quite contained, and does not modify the existing
interfaces. It avoids the need for the kernel to manage the entire
session state and its associated security model.

Pasha

^ permalink raw reply

* Re: [PATCH v3 19/30] liveupdate: luo_sysfs: add sysfs state monitoring
From: Jason Gunthorpe @ 2025-08-26 16:03 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, rppt, dmatlack, rientjes,
	corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-20-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:25AM +0000, Pasha Tatashin wrote:
> Introduce a sysfs interface for the Live Update Orchestrator
> under /sys/kernel/liveupdate/. This interface provides a way for
> userspace tools and scripts to monitor the current state of the LUO
> state machine.

Now that you have a cdev these files may be more logically placed
under the cdev's sysfs and not under kernel? This can be done easially
using the attribute mechanisms in the struct device.

Again sort of back to my earlier point that everything should be
logically linked to the cdev as though there could be many cdevs, even
though there are not. It just keeps the code design more properly
layered and understanble rather than doing something unique..

Jason

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Jason Gunthorpe @ 2025-08-26 15:13 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bBrCd8t_BUeE-sVPGjsJwmtk3mCSVhTMGbseTi_Wk+4yQ@mail.gmail.com>

On Tue, Aug 26, 2025 at 03:02:13PM +0000, Pasha Tatashin wrote:
> I'm trying to understand the drawbacks of the PID-based approach.
> Could you elaborate on why passing a PID in the RESTORE_FD ioctl is
> not a good idea?

It will be a major invasive change all over the place in the kernel
to change things that assume current to do something else. We should
try to avoid this.

> In this flow, the client isn't providing an arbitrary PID; the trusted
> luod agent is providing the PID of a process it has an active
> connection with.

PIDs are wobbly thing, you can never really trust them unless they are
in a pidfd.

> The idea was to let luod handle the session/security story, and the
> kernel handle the core preservation mechanism. Adding sessions to the
> kernel, delegates the management and part of the security model into
> the kernel. I am not sure if it is necessary, what can be cleanly
> managed in userspace should stay in userspace.

session fds were an update imagined to allow the kernel to partition
things the session FD it self could be shared with other processes.

I think in the calls the idea was it was reasonable to start without
sessions fds at all, but in this case we shouldn't be mucking with
pids or current.

Since it seems that is important it should be addressed by issuing the
restore ioctl inside the correct process context, that is a much
easier thing to delegate to the kernel than trying to deal with
spoofing current/etc.

Jason

^ permalink raw reply

* Re: [PATCH v3 00/30] Live Update Orchestrator
From: Pasha Tatashin @ 2025-08-26 15:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, jasonmiu, graf, changyuanl, rppt, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, lennart, brauner, linux-api, linux-fsdevel,
	saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826142406.GE1970008@nvidia.com>

On Tue, Aug 26, 2025 at 2:24 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Aug 26, 2025 at 01:54:31PM +0000, Pasha Tatashin wrote:
> > > > https://github.com/googleprodkernel/linux-liveupdate/tree/luo/v3
> > > >
> > > > Changelog from v2:
> > > > - Addressed comments from Mike Rapoport and Jason Gunthorpe
> > > > - Only one user agent (LiveupdateD) can open /dev/liveupdate
> > > > - With the above changes, sessions are not needed, and should be
> > > >   maintained by the user-agent itself, so removed support for
> > > >   sessions.
> > >
> > > If all the FDs are restored in the agent's context, this assigns all the
> > > resources to the agent. For example, if the agent restores a memfd, all
> > > the memory gets charged to the agent's cgroup, and the client gets none
> > > of it. This makes it impossible to do any kind of resource limits.
> > >
> > > This was one of the advantages of being able to pass around sessions
> > > instead of FDs. The agent can pass on the right session to the right
> > > client, and then the client does the restore, getting all the resources
> > > charged to it.
> > >
> > > If we don't allow this, I think we will make LUO/LiveupdateD unsuitable
> > > for many kinds of workloads. Do you have any ideas on how to do proper
> > > resource attribution with the current patches? If not, then perhaps we
> > > should reconsider this change?
> >
> > Hi Pratyush,
> >
> > That's an excellent point, and you're right that we must have a
> > solution for correct resource charging.
> >
> > I'd prefer to keep the session logic in the userspace agent (luod
> > https://tinyurl.com/luoddesign).
> >
> > For the charging problem, I believe there's a clear path forward with
> > the current ioctl-based API. The design of the ioctl commands (with a
> > size field in each struct) is intentionally extensible. In a follow-up
> > patch, we can extend the liveupdate_ioctl_fd_restore struct to include
> > a target pid field. The luod agent, would then be able to restore an
> > FD on behalf of a client and instruct the kernel to charge the
> > associated resources to that client's PID.
>
> This wasn't quite the idea though..
>
> The sessions sub FD were intended to be passed directly to other
> processes though unix sockets and fd passing so they could run their
> own ioctls in their own context for both save and restore. The ioctls
> available on the sessions should be specifically narrowed to be safe
> for this.
>
> I can understand not implementing session FDs in the first version,
> but when sessions FD are available they should work like this and
> solve the namespace/cgroup/etc issues.
>
> Passing some PID in an ioctl is not a great idea...

Hi Jason,

I'm trying to understand the drawbacks of the PID-based approach.
Could you elaborate on why passing a PID in the RESTORE_FD ioctl is
not a good idea?

From my perspective, luod would have a live, open socket to the client
process requesting the restore. It can use SO_PEERCRED to securely
identify the client's PID at that moment. The flow would be:

1. Client connects and resumes its session with luod.
2. Client requests to restore TOKEN_X.
3. luod verifies the client owns TOKEN_X for its session.
4. luod calls the RESTORE_FD ioctl, telling the kernel: "Please
restore TOKEN_X and charge the resources to PID Y (which I just
verified is on the other end of this socket)."
5. The kernel performs the action.
6. luod receives the new FD from the kernel and passes it back to the
client over the socket.

In this flow, the client isn't providing an arbitrary PID; the trusted
luod agent is providing the PID of a process it has an active
connection with.

The idea was to let luod handle the session/security story, and the
kernel handle the core preservation mechanism. Adding sessions to the
kernel, delegates the management and part of the security model into
the kernel. I am not sure if it is necessary, what can be cleanly
managed in userspace should stay in userspace.

Thanks,
Pasha


>
> Jason

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox