Linux userland API discussions

Linux userland API discussions
 help / color / mirror / Atom feed

* Re: (subset) [PATCH v4 0/4] procfs: make reference pidns more user-visible
From: Christian Brauner @ 2025-09-02  9:54 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Christian Brauner, Andy Lutomirski, linux-kernel, linux-fsdevel,
	linux-api, linux-doc, linux-kselftest, Alexander Viro, Jan Kara,
	Jonathan Corbet, Shuah Khan
In-Reply-To: <20250805-procfs-pidns-api-v4-0-705f984940e7@cyphar.com>

On Tue, 05 Aug 2025 15:45:07 +1000, Aleksa Sarai wrote:
> Ever since the introduction of pid namespaces, procfs has had very
> implicit behaviour surrounding them (the pidns used by a procfs mount is
> auto-selected based on the mounting process's active pidns, and the
> pidns itself is basically hidden once the mount has been constructed).
> 
> /* pidns mount option for procfs */
> 
> [...]

Applied to the vfs-6.18.procfs branch of the vfs/vfs.git tree.
Patches in the vfs-6.18.procfs branch should appear in linux-next soon.

Please report any outstanding bugs that were missed during review in a
new review to the original patch series allowing us to drop it.

It's encouraged to provide Acked-bys and Reviewed-bys even though the
patch has now been applied. If possible patch trailers will be updated.

Note that commit hashes shown below are subject to change due to rebase,
trailer updates or similar. If in doubt, please check the listed branch.

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs.git
branch: vfs-6.18.procfs

[1/4] pidns: move is-ancestor logic to helper
      https://git.kernel.org/vfs/vfs/c/60d22c6ef41b
[2/4] procfs: add "pidns" mount option
      https://git.kernel.org/vfs/vfs/c/77e211dd1392
[4/4] selftests/proc: add tests for new pidns APIs
      https://git.kernel.org/vfs/vfs/c/568d4239002c

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Roberto Sassu @ 2025-09-02  8:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <3d89a03f31cacb53a2ed8017899f2dab10476b62.camel@huaweicloud.com>

On Mon, 2025-09-01 at 19:01 +0200, Roberto Sassu wrote:
> On Mon, 2025-09-01 at 09:25 -0700, Andy Lutomirski wrote:
> > Can you clarify this a bit for those of us who are not well-versed in
> > exactly what "measurement" does?

Ah, sorry, I missed that.

Measurement refers to the process of collecting the file digest and
storing it in the measurement list, as opposed to appraisal which
instead compares the collected file digest with a reference value
(assumed to be good), and denies access in case of a mismatch.

Integrity violations are detected and reported only for measurement.

Roberto

> > On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
> > <roberto.sassu@huaweicloud.com> wrote:
> > > > Now, in cases where you have IMA or something and you only permit signed
> > > > binaries to execute, you could argue there is a different race here (an
> > > > attacker creates a malicious script, runs it, and then replaces it with
> > > > a valid script's contents and metadata after the fact to get
> > > > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
> > > 
> > > Uhm, let's consider measurement, I'm more familiar with.
> > > 
> > > I think the race you wanted to express was that the attacker replaces
> > > the good script, verified with AT_EXECVE_CHECK, with the bad script
> > > after the IMA verification but before the interpreter reads it.
> > > 
> > > Fortunately, IMA is able to cope with this situation, since this race
> > > can happen for any file open, where of course a file can be not read-
> > > locked.
> > 
> > I assume you mean that this has nothing specifically to do with
> > scripts, as IMA tries to protect ordinary (non-"execute" file access)
> > as well.  Am I right?
> 
> Yes, correct, violations are checked for all open() and mmap()
> involving regular files. It would not be special to do it for scripts.
> 
> > > If the attacker tries to concurrently open the script for write in this
> > > race window, IMA will report this event (called violation) in the
> > > measurement list, and during remote attestation it will be clear that
> > > the interpreter did not read what was measured.
> > > 
> > > We just need to run the violation check for the BPRM_CHECK hook too
> > > (then, probably for us the O_DENY_WRITE flag or alternative solution
> > > would not be needed, for measurement).
> > 
> > This seems consistent with my interpretation above, but ...
> 
> The comment here [1] seems to be clear on why the violation check it is
> not done for execution (BPRM_CHECK hook). Since the OS read-locks the
> files during execution, this implicitly guarantees that there will not
> be concurrent writes, and thus no IMA violations.
> 
> However, recently, we took advantage of AT_EXECVE_CHECK to also
> evaluate the integrity of scripts (when not executed via ./). Since we
> are using the same hook for both executed files (read-locked) and
> scripts (I guess non-read-locked), then we need to do a violation check
> for BPRM_CHECK too, although it will be redundant for the first
> category.
> 
> > > Please, let us know when you apply patches like 2a010c412853 ("fs:
> > > don't block i_writecount during exec"). We had a discussion [1], but
> > > probably I missed when it was decided to be applied (I saw now it was
> > > in the same thread, but didn't get that at the time). We would have
> > > needed to update our code accordingly. In the future, we will try to
> > > clarify better our expectations from the VFS.
> > 
> > ... I didn't follow this.
> > 
> > Suppose there's some valid contents of /bin/sleep.  I execute
> > /bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
> > for write, not by replacing it), and the kernel in question doesn't do
> > ETXTBSY.  Then the sleep process reads (and executes) the modified
> > contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
> > needed?
> 
> Ok, this is actually a good opportunity to explain what it will be
> missing. If you do the operations in the order you proposed, actually a
> violation will be emitted, because the violating operation is an open()
> and the check is done for this system call.
> 
> However, if you do the opposite, first open for write and then
> execution, IMA will not be aware of that since it trusts the OS to not
> make it happen and will not check for violations.
> 
> So yes, in your case the remote attestation will fail (actually it is
> up to the remote verifier to decide...). But in the opposite case, the
> writer could wait for IMA to measure the genuine content and then
> modify the content conveniently. The remote attestation will succeed.
> 
> Adding the violation check on BPRM_CHECK should be sufficient to avoid
> such situation, but I would try to think if there are other
> implications for IMA of not read-locking the files on execution.
> 
> Roberto
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/integrity/ima/ima_main.c?h=v6.17-rc4#n565
> 


^ permalink raw reply

* Re: [PATCH v2] uapi/fcntl: define RENAME_* and AT_RENAME_* macros
From: Amir Goldstein @ 2025-09-02  6:58 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-fsdevel, patches, Jeff Layton, Chuck Lever, Alexander Aring,
	Josef Bacik, Aleksa Sarai, Jan Kara, Christian Brauner,
	Matthew Wilcox, David Howells, linux-api
In-Reply-To: <20250901231457.1179748-1-rdunlap@infradead.org>

On Tue, Sep 2, 2025 at 1:14 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> Define the RENAME_* and AT_RENAME_* macros exactly the same as in
> recent glibc <stdio.h> so that duplicate definition build errors in
> both samples/watch_queue/watch_test.c and samples/vfs/test-statx.c
> no longer happen. When they defined in exactly the same way in
> multiple places, the build errors are prevented.
>
> Defining only the AT_RENAME_* macros is not sufficient since they
> depend on the RENAME_* macros, which may not be defined when the
> AT_RENAME_* macros are used.
>
> Build errors being fixed:
>
> for samples/vfs/test-statx.c:
>
> In file included from ../samples/vfs/test-statx.c:23:
> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>   159 | #define AT_RENAME_NOREPLACE     0x0001
> In file included from ../samples/vfs/test-statx.c:13:
> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>   160 | #define AT_RENAME_EXCHANGE      0x0002
> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>   161 | #define AT_RENAME_WHITEOUT      0x0004
> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>
> for samples/watch_queue/watch_test.c:
>
> In file included from usr/include/linux/watch_queue.h:6,
>                  from ../samples/watch_queue/watch_test.c:19:
> usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
>   159 | #define AT_RENAME_NOREPLACE     0x0001
> In file included from ../samples/watch_queue/watch_test.c:11:
> /usr/include/stdio.h:171:10: note: this is the location of the previous definition
>   171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
> usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
>   160 | #define AT_RENAME_EXCHANGE      0x0002
> /usr/include/stdio.h:173:10: note: this is the location of the previous definition
>   173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
> usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
>   161 | #define AT_RENAME_WHITEOUT      0x0004
> /usr/include/stdio.h:175:10: note: this is the location of the previous definition
>   175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>
> Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> ---
> Cc: Amir Goldstein <amir73il@gmail.com>
> Cc: Jeff Layton <jlayton@kernel.org>
> Cc: Chuck Lever <chuck.lever@oracle.com>
> Cc: Alexander Aring <alex.aring@gmail.com>
> Cc: Josef Bacik <josef@toxicpanda.com>
> Cc: Aleksa Sarai <cyphar@cyphar.com>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Christian Brauner <brauner@kernel.org>
> Cc: Matthew Wilcox <willy@infradead.org>
> Cc: David Howells <dhowells@redhat.com>
> CC: linux-api@vger.kernel.org
> To: linux-fsdevel@vger.kernel.org
>
>  include/uapi/linux/fcntl.h |    9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> --- linux-next-20250819.orig/include/uapi/linux/fcntl.h
> +++ linux-next-20250819/include/uapi/linux/fcntl.h
> @@ -156,9 +156,12 @@
>   */
>
>  /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
> -#define AT_RENAME_NOREPLACE    0x0001
> -#define AT_RENAME_EXCHANGE     0x0002
> -#define AT_RENAME_WHITEOUT     0x0004
> +# define RENAME_NOREPLACE (1 << 0)
> +# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
> +# define RENAME_EXCHANGE (1 << 1)
> +# define AT_RENAME_EXCHANGE RENAME_EXCHANGE
> +# define RENAME_WHITEOUT (1 << 2)
> +# define AT_RENAME_WHITEOUT RENAME_WHITEOUT
>

This solution, apart from being terribly wrong (adjust the source to match
to value of its downstream copy), does not address the issue that Mathew
pointed out on v1 discussion [1]:

$ grep -r AT_RENAME_NOREPLACE /usr/include
/usr/include/linux/fcntl.h:#define AT_RENAME_NOREPLACE  0x0001

It's not in stdio.h at all.  This is with libc6 2.41-10

[1] https://lore.kernel.org/linux-fsdevel/aKxfGix_o4glz8-Z@casper.infradead.org/

I don't know how to resolve the mess that glibc has created.

Perhaps like this:

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index f291ab4f94ebc..dde14fa3c2007 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -155,10 +155,16 @@
  * as possible, so we can use them for generic bits in the future if necessary.
  */

-/* Flags for renameat2(2) (must match legacy RENAME_* flags). */
-#define AT_RENAME_NOREPLACE    0x0001
-#define AT_RENAME_EXCHANGE     0x0002
-#define AT_RENAME_WHITEOUT     0x0004
+/*
+ * The legacy renameat2(2) RENAME_* flags are conceptually also
syscall-specific
+ * flags, so it could makes sense to create the AT_RENAME_* aliases
for them and
+ * maybe later add support for generic AT_* flags to this syscall.
+ * However, following a mismatch of definitions in glibc and since no
kernel code
+ * currently uses the AT_RENAME_* aliases, we leave them undefined here.
+#define AT_RENAME_NOREPLACE    RENAME_NOREPLACE
+#define AT_RENAME_EXCHANGE     RENAME_EXCHANGE
+#define AT_RENAME_WHITEOUT     RENAME_WHITEOUT
+*/

 /* Flag for faccessat(2). */
 #define AT_EACCESS             0x200   /* Test access permitted for

^ permalink raw reply related

* Re: [PATCH v4] linux: Add openat2 (BZ 31664)
From: Arjun Shankar @ 2025-09-02  2:41 UTC (permalink / raw)
  To: Paul Eggert; +Cc: Aleksa Sarai, Adhemerval Zanella Netto, libc-alpha, linux-api
In-Reply-To: <cbbc9639-0443-4bf8-bbd1-9d3fdcb2fd37@cs.ucla.edu>

Hi Paul,

> On 2025-08-28 01:42, Aleksa Sarai wrote:
> >> I still fail to understand how a hypothetical "give me the supported flags"
> >> openat2 flag would be useful enough to justify complicating the openat2 API
> >> today.
> > My only concern is that it would break recompiles if/when we change it
> > back.
>
> OK, but from what I can see there's no identified possibility that
> openat2 will modify the objects its arguments point to, just as there's
> no identified possibility that plain openat will do so (in a
> hypothetical extension to remove unnecessary slashes from its filename
> argument, say).

While it is true that openat cannot be extended in this way, for
openat2 (whether or not it eventually materializes in Linux) there
already is the RFC patch series proposing CHECK_FIELDS that Aleksa
referred to earlier. And it's not just that: it has been mentioned as
a potential future direction even when the openat2 syscall was
implemented [1]. I think we should interpret this to mean that there
is indeed a possibility for openat2.

> In that case it's pretty clear that glibc should mark the open_how
> argument as pointer-to-const, just as glibc already marks the filename
> argument.

Unless the kernel marks open_how as const, glibc marking it as const
can lead to additional maintenance complications down the line: in the
future if the kernel starts modifying open_how, glibc's openat2
wrapper will no longer align with the kernel's behavior. At that
point, glibc will either need to discard the const (which will cause
any existing users of the wrapper to fail to recompile), or glibc will
need to handle the kernel's new behavior in the wrapper (which will
lead to further divergence from the behavior of the syscall that we
would claim to wrap). Neither of these seems problem-free. On the
other hand, following the kernel's declaration will mean that should
the kernel choose to mark it const, we can easily follow suit in glibc
without breaking recompiles.

Earlier on in this thread, Aleksa mentioned sched_setattr as
establishing precedent for the kernel modifying non-const objects. It
looks like glibc actually does provide a sched_setattr wrapper since
2.41. The relevant argument hasn't been marked as const and the kernel
does modify the contents, and glibc's syscall wrapper simply passes it
through. So we already do this.

Based on all this, I feel that leaving open_how as-is is the easier
and more maintenance-friendly choice for the syscall wrapper.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=fddb5d430ad9fa91b49b1d34d0202ffe2fa0e179

--
Arjun Shankar
he/him/his

^ permalink raw reply

* [PATCH v2] uapi/fcntl: define RENAME_* and AT_RENAME_* macros
From: Randy Dunlap @ 2025-09-01 23:14 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: patches, Randy Dunlap, Amir Goldstein, Jeff Layton, Chuck Lever,
	Alexander Aring, Josef Bacik, Aleksa Sarai, Jan Kara,
	Christian Brauner, Matthew Wilcox, David Howells, linux-api

Define the RENAME_* and AT_RENAME_* macros exactly the same as in
recent glibc <stdio.h> so that duplicate definition build errors in
both samples/watch_queue/watch_test.c and samples/vfs/test-statx.c
no longer happen. When they defined in exactly the same way in
multiple places, the build errors are prevented.

Defining only the AT_RENAME_* macros is not sufficient since they
depend on the RENAME_* macros, which may not be defined when the
AT_RENAME_* macros are used.

Build errors being fixed:

for samples/vfs/test-statx.c:

In file included from ../samples/vfs/test-statx.c:23:
usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
  159 | #define AT_RENAME_NOREPLACE     0x0001
In file included from ../samples/vfs/test-statx.c:13:
/usr/include/stdio.h:171:10: note: this is the location of the previous definition
  171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
  160 | #define AT_RENAME_EXCHANGE      0x0002
/usr/include/stdio.h:173:10: note: this is the location of the previous definition
  173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
  161 | #define AT_RENAME_WHITEOUT      0x0004
/usr/include/stdio.h:175:10: note: this is the location of the previous definition
  175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT

for samples/watch_queue/watch_test.c:

In file included from usr/include/linux/watch_queue.h:6,
                 from ../samples/watch_queue/watch_test.c:19:
usr/include/linux/fcntl.h:159:9: warning: ‘AT_RENAME_NOREPLACE’ redefined
  159 | #define AT_RENAME_NOREPLACE     0x0001
In file included from ../samples/watch_queue/watch_test.c:11:
/usr/include/stdio.h:171:10: note: this is the location of the previous definition
  171 | # define AT_RENAME_NOREPLACE RENAME_NOREPLACE
usr/include/linux/fcntl.h:160:9: warning: ‘AT_RENAME_EXCHANGE’ redefined
  160 | #define AT_RENAME_EXCHANGE      0x0002
/usr/include/stdio.h:173:10: note: this is the location of the previous definition
  173 | # define AT_RENAME_EXCHANGE RENAME_EXCHANGE
usr/include/linux/fcntl.h:161:9: warning: ‘AT_RENAME_WHITEOUT’ redefined
  161 | #define AT_RENAME_WHITEOUT      0x0004
/usr/include/stdio.h:175:10: note: this is the location of the previous definition
  175 | # define AT_RENAME_WHITEOUT RENAME_WHITEOUT

Fixes: b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be allocated")
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
---
Cc: Amir Goldstein <amir73il@gmail.com>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Howells <dhowells@redhat.com>
CC: linux-api@vger.kernel.org
To: linux-fsdevel@vger.kernel.org

 include/uapi/linux/fcntl.h |    9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

--- linux-next-20250819.orig/include/uapi/linux/fcntl.h
+++ linux-next-20250819/include/uapi/linux/fcntl.h
@@ -156,9 +156,12 @@
  */
 
 /* Flags for renameat2(2) (must match legacy RENAME_* flags). */
-#define AT_RENAME_NOREPLACE	0x0001
-#define AT_RENAME_EXCHANGE	0x0002
-#define AT_RENAME_WHITEOUT	0x0004
+# define RENAME_NOREPLACE (1 << 0)
+# define AT_RENAME_NOREPLACE RENAME_NOREPLACE
+# define RENAME_EXCHANGE (1 << 1)
+# define AT_RENAME_EXCHANGE RENAME_EXCHANGE
+# define RENAME_WHITEOUT (1 << 2)
+# define AT_RENAME_WHITEOUT RENAME_WHITEOUT
 
 /* Flag for faccessat(2). */
 #define AT_EACCESS		0x200	/* Test access permitted for

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pasha Tatashin @ 2025-09-01 19:02 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Mike Rapoport, Jason Gunthorpe, jasonmiu, graf, changyuanl,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <mafs03496w0kk.fsf@kernel.org>

> >> > This really wants some luo helper
> >> >
> >> > 'luo alloc array'
> >> > 'luo restore array'
> >> > 'luo free array'
> >>
> >> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
> >
> > The patch looks okay to me, but it doesn't support holes in vmap
> > areas. While that is likely acceptable for vmalloc, it could be a
> > problem if we want to preserve memfd with holes and using vmap
> > preservation as a method, which would require a different approach.
> > Still, this would help with preserving memfd.
>
> I agree. I think we should do it the other way round. Build a sparse
> array first, and then use that to build vmap preservation. Our emails

Yes, sparse array support would help both: vmalloc and memfd preservation.

> seem to have crossed, but see my reply to Mike [0] that describes my
> idea a bit more, along with WIP code.
>
> [0] https://lore.kernel.org/lkml/mafs0ldmyw1hp.fsf@kernel.org/
>
> >
> > However, I wonder if we should add a separate preservation library on
> > top of the kho and not as part of kho (or at least keep them in a
> > separate file from core logic). This would allow us to preserve more
> > advanced data structures such as this and define preservation version
> > control, similar to Jason's store_object/restore_object proposal.
>
> This is how I have done it in my code: created a separate file called
> kho_array.c. If we have enough such data structures, we can probably
> move it under kernel/liveupdate/lib/.

Yes, let's place it under kernel/liveupdate/lib/. We will add more
preservation types over time.

> As for the store_object/restore_object proposal: see an alternate idea
> at [1].
>
> [1] https://lore.kernel.org/lkml/mafs0h5xmw12a.fsf@kernel.org/

What you are proposing makes sense. We can update the LUO API to be
responsible for passing the compatible string outside of the data
payload. However, I think we first need to settle on the actual API
for storing and restoring a versioned blob of data and place that code
into kernel/liveupdate/lib/. Depending on which API we choose, we can
then modify the LUO to work accordingly.

>
> --
> Regards,
> Pratyush Yadav

^ permalink raw reply

* Re: [RFC PATCH v4 1/7] kernel/api: introduce kernel API specification framework
From: Randy Dunlap @ 2025-09-01 17:23 UTC (permalink / raw)
  To: Sasha Levin, linux-api, linux-doc, linux-kernel, tools
In-Reply-To: <20250825181434.3340805-2-sashal@kernel.org>

Hi Sasha,


On 8/25/25 11:14 AM, Sasha Levin wrote:
> Add a comprehensive framework for formally documenting kernel APIs with
> inline specifications. This framework provides:
> 
> - Structured API documentation with parameter specifications, return
>   values, error conditions, and execution context requirements
> - Runtime validation capabilities for debugging (CONFIG_KAPI_RUNTIME_CHECKS)
> - Export of specifications via debugfs for tooling integration
> - Support for both internal kernel APIs and system calls
> 
> The framework stores specifications in a dedicated ELF section and
> provides infrastructure for:
> - Compile-time validation of specifications
> - Runtime querying of API documentation
> - Machine-readable export formats
> - Integration with existing SYSCALL_DEFINE macros
> 
> This commit introduces the core infrastructure without modifying any
> existing APIs. Subsequent patches will add specifications to individual
> subsystems.
> 
> Signed-off-by: Sasha Levin <sashal@kernel.org>
> ---
>  .gitignore                                    |    1 +
>  Documentation/admin-guide/kernel-api-spec.rst |  507 ++++++

To me, none of this feels like Documentation/admin-guide/ material.
I don't think that many sysadmins will be using it.

Maybe Documentation/dev-tools/ ?
Closer to developer material that admin?


>  MAINTAINERS                                   |    9 +
>  arch/um/kernel/dyn.lds.S                      |    3 +
>  arch/um/kernel/uml.lds.S                      |    3 +
>  arch/x86/kernel/vmlinux.lds.S                 |    3 +
>  include/asm-generic/vmlinux.lds.h             |   20 +
>  include/linux/kernel_api_spec.h               | 1559 +++++++++++++++++
>  include/linux/syscall_api_spec.h              |  125 ++
>  include/linux/syscalls.h                      |   38 +
>  init/Kconfig                                  |    2 +
>  kernel/Makefile                               |    1 +
>  kernel/api/Kconfig                            |   35 +
>  kernel/api/Makefile                           |    7 +
>  kernel/api/kernel_api_spec.c                  | 1155 ++++++++++++
>  15 files changed, 3468 insertions(+)
>  create mode 100644 Documentation/admin-guide/kernel-api-spec.rst
>  create mode 100644 include/linux/kernel_api_spec.h
>  create mode 100644 include/linux/syscall_api_spec.h
>  create mode 100644 kernel/api/Kconfig
>  create mode 100644 kernel/api/Makefile
>  create mode 100644 kernel/api/kernel_api_spec.c
thanks.
-- 
~Randy


^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-01 17:21 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: Mike Rapoport, Jason Gunthorpe, pratyush, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <CA+CK2bC96fxHBb78DvNhyfdjsDfPCLY5J5cN8W0hUDt9KAPBJQ@mail.gmail.com>

Hi Pasha,

On Mon, Sep 01 2025, Pasha Tatashin wrote:

> On Mon, Sep 1, 2025 at 4:23 PM Mike Rapoport <rppt@kernel.org> wrote:
>>
>> On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
>> > On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>> >
>> > > +   /*
>> > > +    * Most of the space should be taken by preserved folios. So take its
>> > > +    * size, plus a page for other properties.
>> > > +    */
>> > > +   fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
>> > > +   if (!fdt) {
>> > > +           err = -ENOMEM;
>> > > +           goto err_unpin;
>> > > +   }
>> >
>> > This doesn't seem to have any versioning scheme, it really should..
>> >
>> > > +   err = fdt_property_placeholder(fdt, "folios", preserved_size,
>> > > +                                  (void **)&preserved_folios);
>> > > +   if (err) {
>> > > +           pr_err("Failed to reserve folios property in FDT: %s\n",
>> > > +                  fdt_strerror(err));
>> > > +           err = -ENOMEM;
>> > > +           goto err_free_fdt;
>> > > +   }
>> >
>> > Yuk.
>> >
>> > This really wants some luo helper
>> >
>> > 'luo alloc array'
>> > 'luo restore array'
>> > 'luo free array'
>>
>> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
>> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
>
> The patch looks okay to me, but it doesn't support holes in vmap
> areas. While that is likely acceptable for vmalloc, it could be a
> problem if we want to preserve memfd with holes and using vmap
> preservation as a method, which would require a different approach.
> Still, this would help with preserving memfd.

I agree. I think we should do it the other way round. Build a sparse
array first, and then use that to build vmap preservation. Our emails
seem to have crossed, but see my reply to Mike [0] that describes my
idea a bit more, along with WIP code.

[0] https://lore.kernel.org/lkml/mafs0ldmyw1hp.fsf@kernel.org/

>
> However, I wonder if we should add a separate preservation library on
> top of the kho and not as part of kho (or at least keep them in a
> separate file from core logic). This would allow us to preserve more
> advanced data structures such as this and define preservation version
> control, similar to Jason's store_object/restore_object proposal.

This is how I have done it in my code: created a separate file called
kho_array.c. If we have enough such data structures, we can probably
move it under kernel/liveupdate/lib/.

As for the store_object/restore_object proposal: see an alternate idea
at [1].

[1] https://lore.kernel.org/lkml/mafs0h5xmw12a.fsf@kernel.org/

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Roberto Sassu @ 2025-09-01 17:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <CALCETrUtJmWxKYSi6QQAGpQR_ETNfoBidCu_VEq8Lx9iJAOyEw@mail.gmail.com>

On Mon, 2025-09-01 at 09:25 -0700, Andy Lutomirski wrote:
> Can you clarify this a bit for those of us who are not well-versed in
> exactly what "measurement" does?
> 
> On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
> <roberto.sassu@huaweicloud.com> wrote:
> > > Now, in cases where you have IMA or something and you only permit signed
> > > binaries to execute, you could argue there is a different race here (an
> > > attacker creates a malicious script, runs it, and then replaces it with
> > > a valid script's contents and metadata after the fact to get
> > > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
> > 
> > Uhm, let's consider measurement, I'm more familiar with.
> > 
> > I think the race you wanted to express was that the attacker replaces
> > the good script, verified with AT_EXECVE_CHECK, with the bad script
> > after the IMA verification but before the interpreter reads it.
> > 
> > Fortunately, IMA is able to cope with this situation, since this race
> > can happen for any file open, where of course a file can be not read-
> > locked.
> 
> I assume you mean that this has nothing specifically to do with
> scripts, as IMA tries to protect ordinary (non-"execute" file access)
> as well.  Am I right?

Yes, correct, violations are checked for all open() and mmap()
involving regular files. It would not be special to do it for scripts.

> > If the attacker tries to concurrently open the script for write in this
> > race window, IMA will report this event (called violation) in the
> > measurement list, and during remote attestation it will be clear that
> > the interpreter did not read what was measured.
> > 
> > We just need to run the violation check for the BPRM_CHECK hook too
> > (then, probably for us the O_DENY_WRITE flag or alternative solution
> > would not be needed, for measurement).
> 
> This seems consistent with my interpretation above, but ...

The comment here [1] seems to be clear on why the violation check it is
not done for execution (BPRM_CHECK hook). Since the OS read-locks the
files during execution, this implicitly guarantees that there will not
be concurrent writes, and thus no IMA violations.

However, recently, we took advantage of AT_EXECVE_CHECK to also
evaluate the integrity of scripts (when not executed via ./). Since we
are using the same hook for both executed files (read-locked) and
scripts (I guess non-read-locked), then we need to do a violation check
for BPRM_CHECK too, although it will be redundant for the first
category.

> > Please, let us know when you apply patches like 2a010c412853 ("fs:
> > don't block i_writecount during exec"). We had a discussion [1], but
> > probably I missed when it was decided to be applied (I saw now it was
> > in the same thread, but didn't get that at the time). We would have
> > needed to update our code accordingly. In the future, we will try to
> > clarify better our expectations from the VFS.
> 
> ... I didn't follow this.
> 
> Suppose there's some valid contents of /bin/sleep.  I execute
> /bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
> for write, not by replacing it), and the kernel in question doesn't do
> ETXTBSY.  Then the sleep process reads (and executes) the modified
> contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
> needed?

Ok, this is actually a good opportunity to explain what it will be
missing. If you do the operations in the order you proposed, actually a
violation will be emitted, because the violating operation is an open()
and the check is done for this system call.

However, if you do the opposite, first open for write and then
execution, IMA will not be aware of that since it trusts the OS to not
make it happen and will not check for violations.

So yes, in your case the remote attestation will fail (actually it is
up to the remote verifier to decide...). But in the opposite case, the
writer could wait for IMA to measure the genuine content and then
modify the content conveniently. The remote attestation will succeed.

Adding the violation check on BPRM_CHECK should be sufficient to avoid
such situation, but I would try to think if there are other
implications for IMA of not read-locking the files on execution.

Roberto

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/security/integrity/ima/ima_main.c?h=v6.17-rc4#n565

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-01 17:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Pasha Tatashin, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250828124320.GB7333@nvidia.com>

Hi Jason,

On Thu, Aug 28 2025, Jason Gunthorpe wrote:

> On Wed, Aug 27, 2025 at 05:03:55PM +0200, Pratyush Yadav wrote:
>
>> I think we need something a luo_xarray data structure that users like
>> memfd (and later hugetlb and guest_memfd and maybe others) can build to
>> make serialization easier. It will cover both contiguous arrays and
>> arrays with some holes in them.
>
> I'm not sure xarray is the right way to go, it is very complex data
> structure and building a kho variation of it seems like it is a huge
> amount of work.
>
> I'd stick with simple kvalloc type approaches until we really run into
> trouble.
>
> You can always map a sparse xarray into a kvalloc linear list by
> including the xarray index in each entry.
>
> Especially for memfd where we don't actually expect any sparsity in
> real uses cases there is no reason to invest a huge effort to optimize
> for it..

Full xarray is too complex, sure. But I think a simple sparse array with
xarray-like properties (4-byte pointers, values using xa_mk_value()) is
fairly simple to implement. More advanced features of xarray like
multi-index entries can be added later if needed.

In fact, I have a WIP version of such an array and have used it for
memfd preservation, and it looks quite alright to me. You can find the
code at [0]. It is roughly 300 lines of code. I still need to clean it
up to make it post-able, but it does work.

Building kvalloc on top of this becomes trivial.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=cf4c04c1e9ac854e3297018ad6dada17c54a59af

>
>> As I explained above, the versioning is already there. Beyond that, why
>> do you think a raw C struct is better than FDT? It is just another way
>> of expressing the same information. FDT is a bit more cumbersome to
>> write and read, but comes at the benefit of more introspect-ability.
>
> Doesn't have the size limitations, is easier to work list, runs
> faster.
>
>> >  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>> >  luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);
>> 
>> I think what you describe here is essentially how LUO works currently,
>> just that the mechanisms are a bit different.
>
> The bit different is a very important bit though :)
>
> The versioning should be first class, not hidden away as some emergent
> property of registering multiple serializers or something like that.

That makes sense. How about some simple changes to the LUO interfaces to
make the version more prominent:

	int (*prepare)(struct liveupdate_file_handler *handler,
		       struct file *file, u64 *data, char **compatible);

This lets the subsystem fill in the compatible (AKA version) (string
here, but you can make it an integer if you want) when it serialized its
data.

And on restore side, LUO can pass in the compatible:

	int (*retrieve)(struct liveupdate_file_handler *handler,
			u64 data, char *compatible, struct file **file);


-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pratyush Yadav @ 2025-09-01 17:01 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Jason Gunthorpe, Pasha Tatashin, pratyush, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <aLXIcUwt0HVzRpYW@kernel.org>

Hi Mike,

On Mon, Sep 01 2025, Mike Rapoport wrote:

> On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
>> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>> 
>> > +	/*
>> > +	 * Most of the space should be taken by preserved folios. So take its
>> > +	 * size, plus a page for other properties.
>> > +	 */
>> > +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
>> > +	if (!fdt) {
>> > +		err = -ENOMEM;
>> > +		goto err_unpin;
>> > +	}
>> 
>> This doesn't seem to have any versioning scheme, it really should..
>> 
>> > +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
>> > +				       (void **)&preserved_folios);
>> > +	if (err) {
>> > +		pr_err("Failed to reserve folios property in FDT: %s\n",
>> > +		       fdt_strerror(err));
>> > +		err = -ENOMEM;
>> > +		goto err_free_fdt;
>> > +	}
>> 
>> Yuk.
>> 
>> This really wants some luo helper
>> 
>> 'luo alloc array'
>> 'luo restore array'
>> 'luo free array'
>
> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1
>
> Will wait for kbuild and then send proper patches.

I have been working on something similar, but in a more generic way.

I have implemented a sparse KHO-preservable array (called kho_array)
with xarray like properties. It can take in 4-byte aligned pointers and
supports saving non-pointer values similar to xa_mk_value(). For now it
doesn't support multi-index entries, but if needed the data format can
be extended to support it as well.

The structure is very similar to what you have implemented. It uses a
linked list of pages with some metadata at the head of each page.

I have used it for memfd preservation, and I think it is quite
versatile. For example, your kho_preserve_vmalloc() can be very easily
built on top of this kho_array by simply saving each physical page
address at consecutive indices in the array.

The code is still WIP and currently a bit hacky, but I will clean it up
in a couple days and I think it should be ready for posting. You can
find the current version at [0][1]. Would be good to hear your thoughts,
and if you agree with the approach, I can also port
kho_preserve_vmalloc() to work on top of kho_array as well.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=cf4c04c1e9ac854e3297018ad6dada17c54a59af
[1] https://git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/commit/?h=kho-array&id=5eb0d7316274a9c87acaeedd86941979fc4baf96

-- 
Regards,
Pratyush Yadav

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Pasha Tatashin @ 2025-09-01 16:54 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Jason Gunthorpe, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <aLXIcUwt0HVzRpYW@kernel.org>

On Mon, Sep 1, 2025 at 4:23 PM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> >
> > > +   /*
> > > +    * Most of the space should be taken by preserved folios. So take its
> > > +    * size, plus a page for other properties.
> > > +    */
> > > +   fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> > > +   if (!fdt) {
> > > +           err = -ENOMEM;
> > > +           goto err_unpin;
> > > +   }
> >
> > This doesn't seem to have any versioning scheme, it really should..
> >
> > > +   err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > > +                                  (void **)&preserved_folios);
> > > +   if (err) {
> > > +           pr_err("Failed to reserve folios property in FDT: %s\n",
> > > +                  fdt_strerror(err));
> > > +           err = -ENOMEM;
> > > +           goto err_free_fdt;
> > > +   }
> >
> > Yuk.
> >
> > This really wants some luo helper
> >
> > 'luo alloc array'
> > 'luo restore array'
> > 'luo free array'
>
> We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
> https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1

The patch looks okay to me, but it doesn't support holes in vmap
areas. While that is likely acceptable for vmalloc, it could be a
problem if we want to preserve memfd with holes and using vmap
preservation as a method, which would require a different approach.
Still, this would help with preserving memfd.

However, I wonder if we should add a separate preservation library on
top of the kho and not as part of kho (or at least keep them in a
separate file from core logic). This would allow us to preserve more
advanced data structures such as this and define preservation version
control, similar to Jason's store_object/restore_object proposal.

>
> Will wait for kbuild and then send proper patches.
>
>
> --
> Sincerely yours,
> Mike.

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-09-01 16:25 UTC (permalink / raw)
  To: Roberto Sassu
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Andy Lutomirski,
	Arnd Bergmann, Christian Heimes, Dmitry Vyukov, Elliott Hughes,
	Fan Wu, Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <54e27d05bae55749a975bc7cbe109b237b2b1323.camel@huaweicloud.com>

Can you clarify this a bit for those of us who are not well-versed in
exactly what "measurement" does?

On Mon, Sep 1, 2025 at 2:42 AM Roberto Sassu
<roberto.sassu@huaweicloud.com> wrote:
> > Now, in cases where you have IMA or something and you only permit signed
> > binaries to execute, you could argue there is a different race here (an
> > attacker creates a malicious script, runs it, and then replaces it with
> > a valid script's contents and metadata after the fact to get
> > AT_EXECVE_CHECK to permit the execution). However, I'm not sure that
>
> Uhm, let's consider measurement, I'm more familiar with.
>
> I think the race you wanted to express was that the attacker replaces
> the good script, verified with AT_EXECVE_CHECK, with the bad script
> after the IMA verification but before the interpreter reads it.
>
> Fortunately, IMA is able to cope with this situation, since this race
> can happen for any file open, where of course a file can be not read-
> locked.

I assume you mean that this has nothing specifically to do with
scripts, as IMA tries to protect ordinary (non-"execute" file access)
as well.  Am I right?

>
> If the attacker tries to concurrently open the script for write in this
> race window, IMA will report this event (called violation) in the
> measurement list, and during remote attestation it will be clear that
> the interpreter did not read what was measured.
>
> We just need to run the violation check for the BPRM_CHECK hook too
> (then, probably for us the O_DENY_WRITE flag or alternative solution
> would not be needed, for measurement).

This seems consistent with my interpretation above, but ...

>
> Please, let us know when you apply patches like 2a010c412853 ("fs:
> don't block i_writecount during exec"). We had a discussion [1], but
> probably I missed when it was decided to be applied (I saw now it was
> in the same thread, but didn't get that at the time). We would have
> needed to update our code accordingly. In the future, we will try to
> clarify better our expectations from the VFS.

... I didn't follow this.

Suppose there's some valid contents of /bin/sleep.  I execute
/bin/sleep 1m.  While it's running, I modify /bin/sleep (by opening it
for write, not by replacing it), and the kernel in question doesn't do
ETXTBSY.  Then the sleep process reads (and executes) the modified
contents.  Wouldn't a subsequent attestation fail?  Why is ETXTBSY
needed?

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Mike Rapoport @ 2025-09-01 16:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, dmatlack,
	rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl,
	masahiroy, akpm, tj, yoann.congal, mmaurer, roman.gushchin,
	chenridong, axboe, mark.rutland, jannh, vincent.guittot, hannes,
	dan.j.williams, david, joel.granados, rostedt, anna.schumaker,
	song, zhangguopeng, linux, linux-kernel, linux-doc, linux-mm,
	gregkh, tglx, mingo, bp, dave.hansen, x86, hpa, rafael, dakr,
	bartosz.golaszewski, cw00.choi, myungjoo.ham, yesanishhere,
	Jonathan.Cameron, quic_zijuhu, aleksander.lobakin, ira.weiny,
	andriy.shevchenko, leon, lukas, bhelgaas, wagi, djeffery,
	stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162019.GD2130239@nvidia.com>

On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> 
> > +	/*
> > +	 * Most of the space should be taken by preserved folios. So take its
> > +	 * size, plus a page for other properties.
> > +	 */
> > +	fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> > +	if (!fdt) {
> > +		err = -ENOMEM;
> > +		goto err_unpin;
> > +	}
> 
> This doesn't seem to have any versioning scheme, it really should..
> 
> > +	err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > +				       (void **)&preserved_folios);
> > +	if (err) {
> > +		pr_err("Failed to reserve folios property in FDT: %s\n",
> > +		       fdt_strerror(err));
> > +		err = -ENOMEM;
> > +		goto err_free_fdt;
> > +	}
> 
> Yuk.
> 
> This really wants some luo helper
> 
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'

We can just add kho_{preserve,restore}_vmalloc(). I've drafted it here:
https://git.kernel.org/pub/scm/linux/kernel/git/rppt/linux.git/log/?h=kho/vmalloc/v1

Will wait for kbuild and then send proper patches.
 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Andy Lutomirski @ 2025-09-01 16:01 UTC (permalink / raw)
  To: Jann Horn
  Cc: Serge E. Hallyn, Andy Lutomirski, Aleksa Sarai,
	Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <CAG48ez0p1B9nmG3ZyNRywaSYTtEULSpbxueia912nVpg2Q7WYA@mail.gmail.com>

On Mon, Sep 1, 2025 at 4:06 AM Jann Horn <jannh@google.com> wrote:
>
> On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> > > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > >
> > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > > > that this is work that needs to be done in userspace. Change the apps,
> > > > > > stop pushing more and more cruft into the VFS that has no business
> > > > > > there.
> > > > >
> > > > > It would be interesting to know how to patch user space to get the same
> > > > > guarantees...  Do you think I would propose a kernel patch otherwise?
> > > >
> > > > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > > > nice but IIRC there are ways to get around it anyway).
> > >
> > > Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> > > affecting the file, but I don't think that writes to the file will
> > > break the MAP_PRIVATE CoW if it's not already broken.
> > >
> > > IPython says:
> > >
> > > In [1]: import mmap, tempfile
> > >
> > > In [2]: f = tempfile.TemporaryFile()
> > >
> > > In [3]: f.write(b'initial contents')
> > > Out[3]: 16
> > >
> > > In [4]: f.flush()
> > >
> > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> > > prot=mmap.PROT_READ)
> > >
> > > In [6]: map[:]
> > > Out[6]: b'initial contents'
> > >
> > > In [7]: f.seek(0)
> > > Out[7]: 0
> > >
> > > In [8]: f.write(b'changed')
> > > Out[8]: 7
> > >
> > > In [9]: f.flush()
> > >
> > > In [10]: map[:]
> > > Out[10]: b'changed contents'
> >
> > That was surprising to me, however, if I split the reader
> > and writer into different processes, so
>
> Testing this in python is a terrible idea because it obfuscates the
> actual syscalls from you.
>
> > P1:
> > f = open("/tmp/3", "w")
> > f.write('initial contents')
> > f.flush()
> >
> > P2:
> > import mmap
> > f = open("/tmp/3", "r")
> > map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
> >
> > Back to P1:
> > f.seek(0)
> > f.write('changed')
> >
> > Back to P2:
> > map[:]
> >
> > Then P2 gives me:
> >
> > b'initial contents'
>
> Because when you executed `f.write('changed')`, Python internally
> buffered the write. "changed" is never actually written into the file
> in your example. If you add a `f.flush()` in P1 after this, running
> `map[:]` in P2 again will show you the new data.
>

These days, one can type in Python, ask an LLM to translate to C, and
get almost-correct output :)  Or one can use os.write(), which is
exactly what I should have done.

--Andy

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Serge E. Hallyn @ 2025-09-01 13:18 UTC (permalink / raw)
  To: Jann Horn
  Cc: Serge E. Hallyn, Andy Lutomirski, Aleksa Sarai,
	Mickaël Salaün, Christian Brauner, Al Viro, Kees Cook,
	Paul Moore, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <CAG48ez0p1B9nmG3ZyNRywaSYTtEULSpbxueia912nVpg2Q7WYA@mail.gmail.com>

On Mon, Sep 01, 2025 at 01:05:16PM +0200, Jann Horn wrote:
> On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> > > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > > >
> > > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > > > that this is work that needs to be done in userspace. Change the apps,
> > > > > > stop pushing more and more cruft into the VFS that has no business
> > > > > > there.
> > > > >
> > > > > It would be interesting to know how to patch user space to get the same
> > > > > guarantees...  Do you think I would propose a kernel patch otherwise?
> > > >
> > > > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > > > nice but IIRC there are ways to get around it anyway).
> > >
> > > Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> > > affecting the file, but I don't think that writes to the file will
> > > break the MAP_PRIVATE CoW if it's not already broken.
> > >
> > > IPython says:
> > >
> > > In [1]: import mmap, tempfile
> > >
> > > In [2]: f = tempfile.TemporaryFile()
> > >
> > > In [3]: f.write(b'initial contents')
> > > Out[3]: 16
> > >
> > > In [4]: f.flush()
> > >
> > > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> > > prot=mmap.PROT_READ)
> > >
> > > In [6]: map[:]
> > > Out[6]: b'initial contents'
> > >
> > > In [7]: f.seek(0)
> > > Out[7]: 0
> > >
> > > In [8]: f.write(b'changed')
> > > Out[8]: 7
> > >
> > > In [9]: f.flush()
> > >
> > > In [10]: map[:]
> > > Out[10]: b'changed contents'
> >
> > That was surprising to me, however, if I split the reader
> > and writer into different processes, so
> 
> Testing this in python is a terrible idea because it obfuscates the
> actual syscalls from you.

Hah, I was just trying to fit in :), but of course you're right.
Redoing it in straight c, I'm getting the updates.

-serge

// mmap-w.c (creates an overwrites)
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define FIRST "Initial contents"
#define SECOND "updated contents"

int main() {
	int fd, rc;
	char c;

	fd = open("/tmp/m", O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		printf("failed to open /tmp/m: %m\n");
		_exit(1);
	}
	rc = write(fd, FIRST, sizeof(FIRST));
	if (rc < 0) {
		printf("write failed: %m\n");
		_exit(1);
	}
	rc = fsync(fd);
	if (rc < 0) {
		printf("flush failed: %m\n");
		_exit(1);
	}

	read(STDIN_FILENO, &c, 1);

	printf("updating the contents\n");

	rc = lseek(fd, 0, SEEK_SET);
	if (rc < 0) {
		printf("seek failed; %m\n");
		_exit(1);
	}

	rc = write(fd, SECOND, sizeof(SECOND));
	if (fd < 0) {
		printf("write failed: %m\n");
		_exit(1);
	}
	rc = close(fd);
	if (rc < 0) {
		printf("close failed: %m\n");
		_exit(1);
	}
	printf("done\n");
}

// mmap-r.c (checks and re-checks contents)
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <string.h>

#define FIRST "Initial contents"
#define SECOND "Updated contents"

int main() {
	int fd, rc;
	char *m;
	char c;

	fd = open("/tmp/m", O_RDONLY);
	if (fd < 0) {
		printf("failed to open /tmp/m: %m\n");
		_exit(1);
	}

	m = mmap(NULL, 40, PROT_READ, MAP_PRIVATE, fd, 0);
	if (m == MAP_FAILED) {
		printf("mmap failed: %m\n");
		_exit(1);
	}

	if (strncmp(m, FIRST, 7) != 0) {
		printf("m is %c%c%c%c%c%c%c\n",
			m[0], m[1], m[2], m[3], m[4], m[5], m[6]);
		_exit(1);
	}

	read(STDIN_FILENO, &c, 1);

	if (strncmp(m, SECOND, 7) != 0) {
		printf("m is %c%c%c%c%c%c%c\n",
			m[0], m[1], m[2], m[3], m[4], m[5], m[6]);
		_exit(1);
	}

	printf("done\n");
}

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Jann Horn @ 2025-09-01 11:05 UTC (permalink / raw)
  To: Serge E. Hallyn
  Cc: Andy Lutomirski, Aleksa Sarai, Mickaël Salaün,
	Christian Brauner, Al Viro, Kees Cook, Paul Moore, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jeff Xu, Jonathan Corbet, Jordan R Abrahams,
	Lakshmi Ramasubramanian, Luca Boccassi, Matt Bobrowski,
	Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet, Robert Waite,
	Roberto Sassu, Scott Shell, Steve Dower, Steve Grubb,
	kernel-hardening, linux-api, linux-fsdevel, linux-integrity,
	linux-kernel, linux-security-module
In-Reply-To: <aLDDk4x7QBKxLmoi@mail.hallyn.com>

On Thu, Aug 28, 2025 at 11:01 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> > On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> > >
> > > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > > that this is work that needs to be done in userspace. Change the apps,
> > > > > stop pushing more and more cruft into the VFS that has no business
> > > > > there.
> > > >
> > > > It would be interesting to know how to patch user space to get the same
> > > > guarantees...  Do you think I would propose a kernel patch otherwise?
> > >
> > > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > > nice but IIRC there are ways to get around it anyway).
> >
> > Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> > affecting the file, but I don't think that writes to the file will
> > break the MAP_PRIVATE CoW if it's not already broken.
> >
> > IPython says:
> >
> > In [1]: import mmap, tempfile
> >
> > In [2]: f = tempfile.TemporaryFile()
> >
> > In [3]: f.write(b'initial contents')
> > Out[3]: 16
> >
> > In [4]: f.flush()
> >
> > In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> > prot=mmap.PROT_READ)
> >
> > In [6]: map[:]
> > Out[6]: b'initial contents'
> >
> > In [7]: f.seek(0)
> > Out[7]: 0
> >
> > In [8]: f.write(b'changed')
> > Out[8]: 7
> >
> > In [9]: f.flush()
> >
> > In [10]: map[:]
> > Out[10]: b'changed contents'
>
> That was surprising to me, however, if I split the reader
> and writer into different processes, so

Testing this in python is a terrible idea because it obfuscates the
actual syscalls from you.

> P1:
> f = open("/tmp/3", "w")
> f.write('initial contents')
> f.flush()
>
> P2:
> import mmap
> f = open("/tmp/3", "r")
> map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)
>
> Back to P1:
> f.seek(0)
> f.write('changed')
>
> Back to P2:
> map[:]
>
> Then P2 gives me:
>
> b'initial contents'

Because when you executed `f.write('changed')`, Python internally
buffered the write. "changed" is never actually written into the file
in your example. If you add a `f.flush()` in P1 after this, running
`map[:]` in P2 again will show you the new data.

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Roberto Sassu @ 2025-09-01  9:24 UTC (permalink / raw)
  To: Aleksa Sarai, Mickaël Salaün
  Cc: Christian Brauner, Al Viro, Kees Cook, Paul Moore, Serge Hallyn,
	Andy Lutomirski, Arnd Bergmann, Christian Heimes, Dmitry Vyukov,
	Elliott Hughes, Fan Wu, Florian Weimer, Jann Horn, Jeff Xu,
	Jonathan Corbet, Jordan R Abrahams, Lakshmi Ramasubramanian,
	Luca Boccassi, Matt Bobrowski, Miklos Szeredi, Mimi Zohar,
	Nicolas Bouchinet, Robert Waite, Roberto Sassu, Scott Shell,
	Steve Dower, Steve Grubb, kernel-hardening, linux-api,
	linux-fsdevel, linux-integrity, linux-kernel,
	linux-security-module
In-Reply-To: <2025-08-27-obscene-great-toy-diary-X1gVRV@cyphar.com>

On Thu, 2025-08-28 at 10:14 +1000, Aleksa Sarai wrote:
> On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > Nothing has changed in that regard and I'm not interested in stuffing
> > > the VFS APIs full of special-purpose behavior to work around the fact
> > > that this is work that needs to be done in userspace. Change the apps,
> > > stop pushing more and more cruft into the VFS that has no business
> > > there.
> > 
> > It would be interesting to know how to patch user space to get the same
> > guarantees...  Do you think I would propose a kernel patch otherwise?
> 
> You could mmap the script file with MAP_PRIVATE. This is the *actual*
> protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> nice but IIRC there are ways to get around it anyway). Of course, most
> interpreters don't mmap their scripts, but this is a potential solution.
> If the security policy is based on validating the script text in some
> way, this avoids the TOCTOU.
> 
> Now, in cases where you have IMA or something and you only permit signed
> binaries to execute, you could argue there is a different race here (an
> attacker creates a malicious script, runs it, and then replaces it with
> a valid script's contents and metadata after the fact to get
> AT_EXECVE_CHECK to permit the execution). However, I'm not sure that

Uhm, let's consider measurement, I'm more familiar with.

I think the race you wanted to express was that the attacker replaces
the good script, verified with AT_EXECVE_CHECK, with the bad script
after the IMA verification but before the interpreter reads it.

Fortunately, IMA is able to cope with this situation, since this race
can happen for any file open, where of course a file can be not read-
locked.

If the attacker tries to concurrently open the script for write in this
race window, IMA will report this event (called violation) in the
measurement list, and during remote attestation it will be clear that
the interpreter did not read what was measured.

We just need to run the violation check for the BPRM_CHECK hook too
(then, probably for us the O_DENY_WRITE flag or alternative solution
would not be needed, for measurement).

Please, let us know when you apply patches like 2a010c412853 ("fs:
don't block i_writecount during exec"). We had a discussion [1], but
probably I missed when it was decided to be applied (I saw now it was
in the same thread, but didn't get that at the time). We would have
needed to update our code accordingly. In the future, we will try to
clarify better our expectations from the VFS.

Thanks

Roberto

[1]: https://lore.kernel.org/linux-fsdevel/88d5a92379755413e1ec3c981d9a04e6796da110.camel@huaweicloud.com/#t

> this is even possible with IMA (can an unprivileged user even set
> security.ima?). But even then, I would expect users that really need
> this would also probably use fs-verity or dm-verity that would block
> this kind of attack since it would render the files read-only anyway.
> 
> This is why a more detailed threat model of what kinds of attacks are
> relevant is useful. I was there for the talk you gave and subsequent
> discussion at last year's LPC, but I felt that your threat model was
> not really fleshed out at all. I am still not sure what capabilities you
> expect the attacker to have nor what is being used to authenticate
> binaries (other than AT_EXECVE_CHECK). Maybe I'm wrong with my above
> assumptions, but I can't know without knowing what threat model you have
> in mind, *in detail*.
> 
> For example, if you are dealing with an attacker that has CAP_SYS_ADMIN,
> there are plenty of ways for an attacker to execute their own code
> without using interpreters (create a new tmpfs with fsopen(2) for
> instance). Executable memfds are even easier and don't require
> privileges on most systems (yes, you can block them with vm.memfd_noexec
> but CAP_SYS_ADMIN can disable that -- and there's always fsopen(2) or
> mount(2)).
> 
> (As an aside, it's a shame that AT_EXECVE_CHECK burned one of the
> top-level AT_* bits for a per-syscall flag -- the block comment I added
> in b4fef22c2fb9 ("uapi: explain how per-syscall AT_* flags should be
> allocated") was meant to avoid this happening but it seems you and the
> reviewers missed that...)
> 


^ permalink raw reply

* Re: [PATCH v3 2/2] man2/mount.2: tfix (mountpoint => mount point)
From: Alejandro Colomar @ 2025-08-31  9:16 UTC (permalink / raw)
  To: Askar Safin
  Cc: Aleksa Sarai, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <20250826083227.2611457-3-safinaskar@zohomail.com>

[-- Attachment #1: Type: text/plain, Size: 1001 bytes --]

On Tue, Aug 26, 2025 at 08:32:27AM +0000, Askar Safin wrote:
> Here we fix the only remaining mention of "mountpoint"
> in all man pages
> 
> Signed-off-by: Askar Safin <safinaskar@zohomail.com>

Patch applied.  Thanks!


Cheers,
Alex

> ---
>  man/man2/mount.2 | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/man/man2/mount.2 b/man/man2/mount.2
> index 599c2d6fa..9b11fff51 100644
> --- a/man/man2/mount.2
> +++ b/man/man2/mount.2
> @@ -311,7 +311,7 @@ Since Linux 2.6.16,
>  can be set or cleared on a per-mount-point basis as well as on
>  the underlying filesystem superblock.
>  The mounted filesystem will be writable only if neither the filesystem
> -nor the mountpoint are flagged as read-only.
> +nor the mount point are flagged as read-only.
>  .\"
>  .SS Remounting an existing mount
>  An existing mount may be remounted by specifying
> -- 
> 2.47.2
> 

-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v3 1/2] man2/mount.2: expand and clarify docs for MS_REMOUNT | MS_BIND
From: Alejandro Colomar @ 2025-08-31  9:15 UTC (permalink / raw)
  To: Aleksa Sarai
  Cc: Askar Safin, Alexander Viro, linux-api, linux-fsdevel,
	David Howells, Christian Brauner, linux-man
In-Reply-To: <2025-08-27.1756287489-unsure-quiet-flakes-gymnast-P41YdV@cyphar.com>

[-- Attachment #1: Type: text/plain, Size: 3379 bytes --]

Hi Aleksa, Askar,

On Wed, Aug 27, 2025 at 07:42:09PM +1000, Aleksa Sarai wrote:
> On 2025-08-26, Askar Safin <safinaskar@zohomail.com> wrote:
> > My edit is based on experiments and reading Linux code
> > 
> > Signed-off-by: Askar Safin <safinaskar@zohomail.com>

Thanks!  I've applied the patch, with some tweaks.
<https://www.alejandro-colomar.es/src/alx/linux/man-pages/man-pages.git/commit/?h=contrib&id=b479b1fe01569d4926cbc59fa31caab8cd01fdad>
(use port 80; this stops AI crawlers.)

> > ---
> >  man/man2/mount.2 | 27 ++++++++++++++++++++++++---
> >  1 file changed, 24 insertions(+), 3 deletions(-)
> > 
> > diff --git a/man/man2/mount.2 b/man/man2/mount.2
> > index 5d83231f9..599c2d6fa 100644
> > --- a/man/man2/mount.2
> > +++ b/man/man2/mount.2
> > @@ -405,7 +405,25 @@ flag can be used with
> >  to modify only the per-mount-point flags.
> >  .\" See https://lwn.net/Articles/281157/
> >  This is particularly useful for setting or clearing the "read-only"
> > -flag on a mount without changing the underlying filesystem.
> > +flag on a mount without changing the underlying filesystem parameters.
> 
> When reading the whole sentence, this feels a bit incomplete
> ("filesystem parameters ... of what?"). Maybe
> 
>   This is particularly useful for setting or clearing the "read-only"
>   flag on a mount without changing the underlying filesystem's
>   filesystem parameters.
> 
> or
> 
>   This is particularly useful for setting or clearing the "read-only"
>   flag on a mount without changing the filesystem parameters of the
>   underlying filesystem.
> 
> would be better?

Yep; I've taken the second proposal.

> 
> That one nit aside, feel free to take my
> 
> Reviewed-by: Aleksa Sarai <cyphar@cyphar.com>

Thanks!  Appended.

> > +The
> > +.I data
> > +argument is ignored if
> > +.B MS_REMOUNT
> > +and
> > +.B MS_BIND
> > +are specified.

I have removed the mention of MS_REMOUNT and MS_BIND, since the first
sentence in the paragraph already mentions them.  Otherwise, it felt a
bit confusing why some sentences mentioned it and others not.

> > +The mount point will
> > +have its existing per-mount-point flags

I have reworded this to use present instead of future, and also reversed
the order of the clauses; if feels more readable now.


Have a lovely day!
Alex

> > +cleared and replaced with those in
> > +.IR mountflags .
> > +This means that
> > +if you wish to preserve
> > +any existing per-mount-point flags,
> > +you need to include them in
> > +.IR mountflags ,
> > +along with the per-mount-point flags you wish to set
> > +(or with the flags you wish to clear missing).
> >  Specifying
> >  .I mountflags
> >  as:
> > @@ -416,8 +434,11 @@ MS_REMOUNT | MS_BIND | MS_RDONLY
> >  .EE
> >  .in
> >  .P
> > -will make access through this mountpoint read-only, without affecting
> > -other mounts.
> > +will make access through this mount point read-only
> > +(clearing all other per-mount-point flags),
> > +without affecting
> > +other mounts
> > +of this filesystem.
> >  .\"
> >  .SS Creating a bind mount
> >  If
> > -- 
> > 2.47.2
> > 
> 
> -- 
> Aleksa Sarai
> Senior Software Engineer (Containers)
> SUSE Linux GmbH
> https://www.cyphar.com/



-- 
<https://www.alejandro-colomar.es>
Use port 80 (that is, <...:80/>).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v3 09/30] liveupdate: kho: move to kernel/liveupdate
From: Mike Rapoport @ 2025-08-30  8:35 UTC (permalink / raw)
  To: Pasha Tatashin
  Cc: pratyush, jasonmiu, graf, changyuanl, dmatlack, rientjes, corbet,
	rdunlap, ilpo.jarvinen, kanie, ojeda, aliceryhl, masahiroy, akpm,
	tj, yoann.congal, mmaurer, roman.gushchin, chenridong, axboe,
	mark.rutland, jannh, vincent.guittot, hannes, dan.j.williams,
	david, joel.granados, rostedt, anna.schumaker, song, zhangguopeng,
	linux, linux-kernel, linux-doc, linux-mm, gregkh, tglx, mingo, bp,
	dave.hansen, x86, hpa, rafael, dakr, bartosz.golaszewski,
	cw00.choi, myungjoo.ham, yesanishhere, Jonathan.Cameron,
	quic_zijuhu, aleksander.lobakin, ira.weiny, andriy.shevchenko,
	leon, lukas, bhelgaas, wagi, djeffery, stuart.w.hayes, ptyadav,
	lennart, brauner, linux-api, linux-fsdevel, saeedm, ajayachandra,
	jgg, parav, leonro, witu
In-Reply-To: <20250807014442.3829950-10-pasha.tatashin@soleen.com>

On Thu, Aug 07, 2025 at 01:44:15AM +0000, Pasha Tatashin wrote:
> Move KHO to kernel/liveupdate/ in preparation of placing all Live Update
> core kernel related files to the same place.
> 
> Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com>
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>
>
> ---
> diff --git a/kernel/liveupdate/Makefile b/kernel/liveupdate/Makefile
> new file mode 100644
> index 000000000000..72cf7a8e6739
> --- /dev/null
> +++ b/kernel/liveupdate/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0
> +#
> +# Makefile for the linux kernel.

Nit: this line does not provide much, let's drop it

> +
> +obj-$(CONFIG_KEXEC_HANDOVER)		+= kexec_handover.o
> +obj-$(CONFIG_KEXEC_HANDOVER_DEBUG)	+= kexec_handover_debug.o
> diff --git a/kernel/kexec_handover.c b/kernel/liveupdate/kexec_handover.c
> similarity index 99%
> rename from kernel/kexec_handover.c
> rename to kernel/liveupdate/kexec_handover.c
> index 07755184f44b..05f5694ea057 100644
> --- a/kernel/kexec_handover.c
> +++ b/kernel/liveupdate/kexec_handover.c
> @@ -23,8 +23,8 @@
>   * KHO is tightly coupled with mm init and needs access to some of mm
>   * internal APIs.
>   */
> -#include "../mm/internal.h"
> -#include "kexec_internal.h"
> +#include "../../mm/internal.h"
> +#include "../kexec_internal.h"
>  #include "kexec_handover_internal.h"
>  
>  #define KHO_FDT_COMPATIBLE "kho-v1"
> @@ -824,7 +824,7 @@ static int __kho_finalize(void)
>  	err |= fdt_finish_reservemap(root);
>  	err |= fdt_begin_node(root, "");
>  	err |= fdt_property_string(root, "compatible", KHO_FDT_COMPATIBLE);
> -	/**
> +	/*
>  	 * Reserve the preserved-memory-map property in the root FDT, so
>  	 * that all property definitions will precede subnodes created by
>  	 * KHO callers.
> diff --git a/kernel/kexec_handover_debug.c b/kernel/liveupdate/kexec_handover_debug.c
> similarity index 100%
> rename from kernel/kexec_handover_debug.c
> rename to kernel/liveupdate/kexec_handover_debug.c
> diff --git a/kernel/kexec_handover_internal.h b/kernel/liveupdate/kexec_handover_internal.h
> similarity index 100%
> rename from kernel/kexec_handover_internal.h
> rename to kernel/liveupdate/kexec_handover_internal.h
> -- 
> 2.50.1.565.gc32cd1483b-goog
> 

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Chris Li @ 2025-08-29 19:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pasha Tatashin, pratyush, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250826162019.GD2130239@nvidia.com>

On Tue, Aug 26, 2025 at 9:20 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
>
> > +     /*
> > +      * Most of the space should be taken by preserved folios. So take its
> > +      * size, plus a page for other properties.
> > +      */
> > +     fdt = memfd_luo_create_fdt(PAGE_ALIGN(preserved_size) + PAGE_SIZE);
> > +     if (!fdt) {
> > +             err = -ENOMEM;
> > +             goto err_unpin;
> > +     }
>
> This doesn't seem to have any versioning scheme, it really should..
>
> > +     err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > +                                    (void **)&preserved_folios);
> > +     if (err) {
> > +             pr_err("Failed to reserve folios property in FDT: %s\n",
> > +                    fdt_strerror(err));
> > +             err = -ENOMEM;
> > +             goto err_free_fdt;
> > +     }
>
> Yuk.
>
> This really wants some luo helper
>
> 'luo alloc array'
> 'luo restore array'
> 'luo free array'

Yes, that will be one step forward.

Another idea is that having a middle layer manages the life cycle of
the reserved memory for you. Kind of like a slab allocator for the
preserved memory. It allows bulk free if there is an error on the live
update prepare(), you need to free all previously allocated memory
anyway. If there is some preserved memory that needs to stay after a
long term after the live update kernel boot up, use some special flags
to indicate so don't mix the free_all pool.
>
> Which would get a linearized list of pages in the vmap to hold the
> array and then allocate some structure to record the page list and
> return back the u64 of the phys_addr of the top of the structure to
> store in whatever.
>
> Getting fdt to allocate the array inside the fds is just not going to
> work for anything of size.
>
> > +     for (; i < nr_pfolios; i++) {
> > +             const struct memfd_luo_preserved_folio *pfolio = &pfolios[i];
> > +             phys_addr_t phys;
> > +             u64 index;
> > +             int flags;
> > +
> > +             if (!pfolio->foliodesc)
> > +                     continue;
> > +
> > +             phys = PFN_PHYS(PRESERVED_FOLIO_PFN(pfolio->foliodesc));
> > +             folio = kho_restore_folio(phys);
> > +             if (!folio) {
> > +                     pr_err("Unable to restore folio at physical address: %llx\n",
> > +                            phys);
> > +                     goto put_file;
> > +             }
> > +             index = pfolio->index;
> > +             flags = PRESERVED_FOLIO_FLAGS(pfolio->foliodesc);
> > +
> > +             /* Set up the folio for insertion. */
> > +             /*
> > +              * TODO: Should find a way to unify this and
> > +              * shmem_alloc_and_add_folio().
> > +              */
> > +             __folio_set_locked(folio);
> > +             __folio_set_swapbacked(folio);
> >
> > +             ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
> > +             if (ret) {
> > +                     pr_err("shmem: failed to charge folio index %d: %d\n",
> > +                            i, ret);
> > +                     goto unlock_folio;
> > +             }
>
> [..]
>
> > +             folio_add_lru(folio);
> > +             folio_unlock(folio);
> > +             folio_put(folio);
> > +     }
>
> Probably some consolidation will be needed to make this less
> duplicated..
>
> But overall I think just using the memfd_luo_preserved_folio as the
> serialization is entirely file, I don't think this needs anything more
> complicated.
>
> What it does need is an alternative to the FDT with versioning.
>
> Which seems to me to be entirely fine as:
>
>  struct memfd_luo_v0 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios;
>  };
>
>  struct memfd_luo_v0 memfd_luo_v0 = {.size = size, pos = file->f_pos, folios = folios};
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>
> Which also shows the actual data needing to be serialized comes from
> more than one struct and has to be marshaled in code, somehow, to a
> single struct.
>
> Then I imagine a fairly simple forwards/backwards story. If something
> new is needed that is non-optional, lets say you compress the folios
> list to optimize holes:
>
>  struct memfd_luo_v1 {
>     __aligned_u64 size;
>     __aligned_u64 pos;
>     __aligned_u64 folios_list_with_holes;
>  };
>
> Obviously a v0 kernel cannot parse this, but in this case a v1 aware
> kernel could optionally duplicate and write out the v0 format as well:
>
>  luo_store_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0);
>  luo_store_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1);

Question: Do we have a matching FDT node to match the memfd C
structure hierarchy? Otherwise all the C struct will lump into one FDT
node. Maybe one FDT node for all C struct is fine. Then there is a
risk of overflowing the 4K buffer limit on the FDT node.

I would like to get independent of FDT for the versioning.

FDT on the top level sounds OK. Not ideal but workable. We are getting
deeper and deeper into complex internal data structures. Do we still
want every data structure referenced by a FDT identifier?

> Then the rule is fairly simple, when the sucessor kernel goes to
> deserialize it asks luo for the versions it supports:
>
>  if (luo_restore_object(&memfd_luo_v1, sizeof(memfd_luo_v1), <.. identifier for this fd..>, /*version=*/1))
>     restore_v1(&memfd_luo_v1)
>  else if (luo_restore_object(&memfd_luo_v0, sizeof(memfd_luo_v0), <.. identifier for this fd..>, /*version=*/0))
>     restore_v0(&memfd_luo_v0)
>  else
>     luo_failure("Do not understand this");
>
> luo core just manages this list of versioned data per serialized
> object. There is only one version per object.

Obviously, this can be done.

Is that approach you want to expand to every other C struct as well?
See the above FDT node complexity.

I am getting the feeling that we are hand crafting screws to build an
airplane. Can it be done? Of course. Does it scale well? I am not
sure. There are many developers who are currently hand-crafting this
kind of screws to be used on the different components of the airplane.

We need a machine that can stamp out screws with our specifications,
faster. I want such a machine. Other developers might want one as
well.

The initial discussion of the idea of such a machine is pretty
discouraged. There are huge communication barriers because of the
fixation on hand crafted screws. I understand exploring such machine
ideas alone might distract the engineer from hand crafting more
screws, one of them might realize that, oh, I want such a machine as
well.

At this stage, do you see that exploring such a machine idea can be
beneficial or harmful to the project? If such an idea is considered
harmful, we should stop discussing such an idea at all. Go back to
building more batches of hand crafted screws, which are waiting by the
next critical component.

Also if such a machine can produce screws up to your specification,
but it has a different look and feel than the hand crafted screws. We
can stamp out the screw faster.  Would you consider putting such a
machined screw on your most critical component of the engine?

Best Regards,

Chris

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Chris Li @ 2025-08-29 18:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Jason Gunthorpe, Pasha Tatashin, pratyush, jasonmiu, graf,
	changyuanl, dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen,
	kanie, ojeda, aliceryhl, masahiroy, akpm, tj, yoann.congal,
	mmaurer, roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, ptyadav, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <aLABxkpPcbxyv6m_@kernel.org>

On Thu, Aug 28, 2025 at 12:14 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Tue, Aug 26, 2025 at 01:20:19PM -0300, Jason Gunthorpe wrote:
> > On Thu, Aug 07, 2025 at 01:44:35AM +0000, Pasha Tatashin wrote:
> >
> > > +   err = fdt_property_placeholder(fdt, "folios", preserved_size,
> > > +                                  (void **)&preserved_folios);
> > > +   if (err) {
> > > +           pr_err("Failed to reserve folios property in FDT: %s\n",
> > > +                  fdt_strerror(err));
> > > +           err = -ENOMEM;
> > > +           goto err_free_fdt;
> > > +   }
> >
> > Yuk.
> >
> > This really wants some luo helper
> >
> > 'luo alloc array'
> > 'luo restore array'
> > 'luo free array'
> >
> > Which would get a linearized list of pages in the vmap to hold the
> > array and then allocate some structure to record the page list and
> > return back the u64 of the phys_addr of the top of the structure to
> > store in whatever.
> >
> > Getting fdt to allocate the array inside the fds is just not going to
> > work for anything of size.
>
> I agree that we need a side-car structure for preserving large (potentially
> sparse) arrays, but I think it should be a part of KHO rather than LUO.

I agree this can be used by components outside of LUO as well. Ideally
as some helper library so every component can use it. I don't have a
strong opinion on KHO or the stand alone library. I am fine with both.

Chris

^ permalink raw reply

* Re: [PATCH v3 29/30] luo: allow preserving memfd
From: Chris Li @ 2025-08-28 23:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Pasha Tatashin, jasonmiu, graf, changyuanl, rppt,
	dmatlack, rientjes, corbet, rdunlap, ilpo.jarvinen, kanie, ojeda,
	aliceryhl, masahiroy, akpm, tj, yoann.congal, mmaurer,
	roman.gushchin, chenridong, axboe, mark.rutland, jannh,
	vincent.guittot, hannes, dan.j.williams, david, joel.granados,
	rostedt, anna.schumaker, song, zhangguopeng, linux, linux-kernel,
	linux-doc, linux-mm, gregkh, tglx, mingo, bp, dave.hansen, x86,
	hpa, rafael, dakr, bartosz.golaszewski, cw00.choi, myungjoo.ham,
	yesanishhere, Jonathan.Cameron, quic_zijuhu, aleksander.lobakin,
	ira.weiny, andriy.shevchenko, leon, lukas, bhelgaas, wagi,
	djeffery, stuart.w.hayes, lennart, brauner, linux-api,
	linux-fsdevel, saeedm, ajayachandra, parav, leonro, witu
In-Reply-To: <20250828124320.GB7333@nvidia.com>

On Thu, Aug 28, 2025 at 5:43 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Aug 27, 2025 at 05:03:55PM +0200, Pratyush Yadav wrote:
>
> > I think we need something a luo_xarray data structure that users like
> > memfd (and later hugetlb and guest_memfd and maybe others) can build to
> > make serialization easier. It will cover both contiguous arrays and
> > arrays with some holes in them.
>
> I'm not sure xarray is the right way to go, it is very complex data
> structure and building a kho variation of it seems like it is a huge
> amount of work.
>
> I'd stick with simple kvalloc type approaches until we really run into
> trouble.
>
> You can always map a sparse xarray into a kvalloc linear list by
> including the xarray index in each entry.

Each entry will be 16 byte, 8 for index and 8 for XAvalue, right?

> Especially for memfd where we don't actually expect any sparsity in
> real uses cases there is no reason to invest a huge effort to optimize
> for it..

Ack.

>
> > As I explained above, the versioning is already there. Beyond that, why
> > do you think a raw C struct is better than FDT? It is just another way
> > of expressing the same information. FDT is a bit more cumbersome to
> > write and read, but comes at the benefit of more introspect-ability.
>
> Doesn't have the size limitations, is easier to work list, runs
> faster.

Yes, especially when you have a large array.

Chris

^ permalink raw reply

* Re: [RFC PATCH v1 0/2] Add O_DENY_WRITE (complement AT_EXECVE_CHECK)
From: Serge E. Hallyn @ 2025-08-28 21:01 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Aleksa Sarai, Mickaël Salaün, Christian Brauner,
	Al Viro, Kees Cook, Paul Moore, Serge Hallyn, Arnd Bergmann,
	Christian Heimes, Dmitry Vyukov, Elliott Hughes, Fan Wu,
	Florian Weimer, Jann Horn, Jeff Xu, Jonathan Corbet,
	Jordan R Abrahams, Lakshmi Ramasubramanian, Luca Boccassi,
	Matt Bobrowski, Miklos Szeredi, Mimi Zohar, Nicolas Bouchinet,
	Robert Waite, Roberto Sassu, Scott Shell, Steve Dower,
	Steve Grubb, kernel-hardening, linux-api, linux-fsdevel,
	linux-integrity, linux-kernel, linux-security-module
In-Reply-To: <CALCETrWHKga33bvzUHnd-mRQUeNXTtXSS8Y8+40d5bxv-CqBhw@mail.gmail.com>

On Wed, Aug 27, 2025 at 05:32:02PM -0700, Andy Lutomirski wrote:
> On Wed, Aug 27, 2025 at 5:14 PM Aleksa Sarai <cyphar@cyphar.com> wrote:
> >
> > On 2025-08-26, Mickaël Salaün <mic@digikod.net> wrote:
> > > On Tue, Aug 26, 2025 at 11:07:03AM +0200, Christian Brauner wrote:
> > > > Nothing has changed in that regard and I'm not interested in stuffing
> > > > the VFS APIs full of special-purpose behavior to work around the fact
> > > > that this is work that needs to be done in userspace. Change the apps,
> > > > stop pushing more and more cruft into the VFS that has no business
> > > > there.
> > >
> > > It would be interesting to know how to patch user space to get the same
> > > guarantees...  Do you think I would propose a kernel patch otherwise?
> >
> > You could mmap the script file with MAP_PRIVATE. This is the *actual*
> > protection the kernel uses against overwriting binaries (yes, ETXTBSY is
> > nice but IIRC there are ways to get around it anyway).
> 
> Wait, really?  MAP_PRIVATE prevents writes to the mapping from
> affecting the file, but I don't think that writes to the file will
> break the MAP_PRIVATE CoW if it's not already broken.
> 
> IPython says:
> 
> In [1]: import mmap, tempfile
> 
> In [2]: f = tempfile.TemporaryFile()
> 
> In [3]: f.write(b'initial contents')
> Out[3]: 16
> 
> In [4]: f.flush()
> 
> In [5]: map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE,
> prot=mmap.PROT_READ)
> 
> In [6]: map[:]
> Out[6]: b'initial contents'
> 
> In [7]: f.seek(0)
> Out[7]: 0
> 
> In [8]: f.write(b'changed')
> Out[8]: 7
> 
> In [9]: f.flush()
> 
> In [10]: map[:]
> Out[10]: b'changed contents'

That was surprising to me, however, if I split the reader
and writer into different processes, so

P1:
f = open("/tmp/3", "w")
f.write('initial contents')
f.flush()

P2:
import mmap
f = open("/tmp/3", "r")
map = mmap.mmap(f.fileno(), f.tell(), flags=mmap.MAP_PRIVATE, prot=mmap.PROT_READ)

Back to P1:
f.seek(0)
f.write('changed')

Back to P2:
map[:]

Then P2 gives me:

b'initial contents'

-serge

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox