* Re: [PATCH v3 0/8] module: Move 'struct module_signature' to UAPI
From: Nicolas Schier @ 2026-03-20 20:06 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: David Howells, David Woodhouse, Luis Chamberlain, Petr Pavlu,
Daniel Gomez, Sami Tolvanen, Aaron Tomlin, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Mimi Zohar, Roberto Sassu, Dmitry Kasatkin,
Eric Snowberg, Paul Moore, James Morris, Serge E. Hallyn,
Nathan Chancellor, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
Hao Luo, Jiri Olsa, Shuah Khan, keyrings, linux-kernel,
linux-modules, linux-s390, linux-integrity, linux-security-module,
linux-kbuild, bpf, linux-kselftest
In-Reply-To: <20260305-module-signature-uapi-v3-0-92f45ea6028c@linutronix.de>
[-- Attachment #1: Type: text/plain, Size: 2929 bytes --]
On Thu, Mar 05, 2026 at 10:31:36AM +0100, Thomas Weißschuh wrote:
> This structure definition is used outside the kernel proper.
> For example in kmod and the kernel build environment.
>
> To allow reuse, move it to a new UAPI header.
>
> While it is not a true UAPI, it is a common practice to have
> non-UAPI interface definitions in the kernel's UAPI headers.
>
> This came up as part of my CONFIG_MODULE_HASHES series [0].
> But it is useful on its own and so we get it out of the way.
>
> [0] https://lore.kernel.org/lkml/aZ3OfJJSJgfOb0rJ@levanger/
>
> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
> ---
> Changes in v3:
> - Also adapt the include path for the custom sign-file rule in the bpf selftests.
> (My manual run of BPF CI still fails, due to an BUG() on s390,
> I don't see how this is due to this patch)
> - Link to v2: https://lore.kernel.org/r/20260305-module-signature-uapi-v2-0-dc4d81129dee@linutronix.de
>
> Changes in v2:
> - Drop spurious definition of MODULE_SIGNATURE_TYPE_MERKLE.
> - s/modules/module/ in two patch subjects.
> - Pick up review tags.
> - Link to v1: https://lore.kernel.org/r/20260302-module-signature-uapi-v1-0-207d955e0d69@linutronix.de
>
> ---
> Thomas Weißschuh (8):
> extract-cert: drop unused definition of PKEY_ID_PKCS7
> module: Drop unused signature types
> module: Give 'enum pkey_id_type' a more specific name
> module: Give MODULE_SIG_STRING a more descriptive name
> module: Move 'struct module_signature' to UAPI
> tools uapi headers: add linux/module_signature.h
> sign-file: use 'struct module_signature' from the UAPI headers
> selftests/bpf: verify_pkcs7_sig: Use 'struct module_signature' from the UAPI headers
>
> arch/s390/kernel/machine_kexec_file.c | 6 ++--
> certs/extract-cert.c | 2 --
> include/linux/module_signature.h | 30 +---------------
> include/uapi/linux/module_signature.h | 41 ++++++++++++++++++++++
> kernel/module/signing.c | 4 +--
> kernel/module_signature.c | 2 +-
> scripts/Makefile | 1 +
> scripts/sign-file.c | 19 +++-------
> security/integrity/ima/ima_modsig.c | 6 ++--
> tools/include/uapi/linux/module_signature.h | 41 ++++++++++++++++++++++
> tools/testing/selftests/bpf/Makefile | 1 +
> .../selftests/bpf/prog_tests/verify_pkcs7_sig.c | 28 ++-------------
> 12 files changed, 101 insertions(+), 80 deletions(-)
> ---
> base-commit: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
> change-id: 20260302-module-signature-uapi-61fa80b1e2bb
>
Thanks for these patches!
For the whole series:
Reviewed-by: Nicolas Schier <nsc@kernel.org>
--
Nicolas
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH v6 3/9] landlock: Control pathname UNIX domain socket resolution by path
From: Mickaël Salaün @ 2026-03-20 17:51 UTC (permalink / raw)
To: Günther Noack
Cc: John Johansen, Tingmao Wang, Justin Suess,
Sebastian Andrzej Siewior, Kuniyuki Iwashima, Jann Horn,
linux-security-module, Samasth Norway Ananda, Matthieu Buffet,
Mikhail Ivanov, konstantin.meskhidze, Demi Marie Obenour,
Alyssa Ross, Tahera Fahimi
In-Reply-To: <20260320.f59cddcb6c6b@gnoack.org>
On Fri, Mar 20, 2026 at 05:15:40PM +0100, Günther Noack wrote:
> Hello!
>
> On Wed, Mar 18, 2026 at 05:52:48PM +0100, Mickaël Salaün wrote:
> > On Sun, Mar 15, 2026 at 11:21:44PM +0100, Günther Noack wrote:
> > > * Add a new access right LANDLOCK_ACCESS_FS_RESOLVE_UNIX, which
> > > controls the look up operations for named UNIX domain sockets. The
> >
> > lookup
>
> Done.
>
>
> > > resolution happens during connect() and sendmsg() (depending on
> > > socket type).
> > > * Hook into the path lookup in unix_find_bsd() in af_unix.c, using a
> > > LSM hook. Make policy decisions based on the new access rights
> > > * Increment the Landlock ABI version.
> > > * Minor test adaptions to keep the tests working.
> >
> > adaptations
>
> Done.
>
>
> > > * Document the design rationale for scoped access rights,
> > > and cross-reference it from the header documentation.
> > >
> > > With this access right, access is granted if either of the following
> > > conditions is met:
> > >
> > > * The target socket's filesystem path was allow-listed using a
> > > LANDLOCK_RULE_PATH_BENEATH rule, *or*:
> > > * The target socket was created in the same Landlock domain in which
> > > LANDLOCK_ACCESS_FS_RESOLVE_UNIX was restricted.
> > >
> > > In case of a denial, connect() and sendmsg() return EACCES, which is
> > > the same error as it is returned if the user does not have the write
> > > bit in the traditional Unix file system permissions of that file.
> >
> > UNIX
>
> DONE
>
>
> > > Document the (possible future) interaction between scoped flags and
> > > other access rights in struct landlock_ruleset_attr, and summarize the
> > > rationale, as discussed in code review leading up to [2].
> > >
> > > This feature was created with substantial discussion and input from
> > > Justin Suess, Tingmao Wang and Mickaël Salaün.
> > >
> > > Cc: Tingmao Wang <m@maowtm.org>
> > > Cc: Justin Suess <utilityemal77@gmail.com>
> > > Cc: Mickaël Salaün <mic@digikod.net>
> > > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > > Cc: Kuniyuki Iwashima <kuniyu@google.com>
> > > Suggested-by: Jann Horn <jannh@google.com>
> > > Link[1]: https://github.com/landlock-lsm/linux/issues/36
> > > Link[2]: https://lore.kernel.org/all/20260205.8531e4005118@gnoack.org/
> > > Signed-off-by: Günther Noack <gnoack3000@gmail.com>
> > > ---
> > > Documentation/security/landlock.rst | 40 +++++++
> > > include/uapi/linux/landlock.h | 19 ++++
> > > security/landlock/access.h | 2 +-
> > > security/landlock/audit.c | 1 +
> > > security/landlock/fs.c | 110 ++++++++++++++++++-
> > > security/landlock/limits.h | 2 +-
> > > security/landlock/syscalls.c | 2 +-
> > > tools/testing/selftests/landlock/base_test.c | 2 +-
> > > tools/testing/selftests/landlock/fs_test.c | 5 +-
> > > 9 files changed, 176 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/Documentation/security/landlock.rst b/Documentation/security/landlock.rst
> > > index 3e4d4d04cfae..4bbe250a6829 100644
> > > --- a/Documentation/security/landlock.rst
> > > +++ b/Documentation/security/landlock.rst
> > > @@ -89,6 +89,46 @@ this is required to keep access controls consistent over the whole system, and
> > > this avoids unattended bypasses through file descriptor passing (i.e. confused
> > > deputy attack).
> > >
> > > +.. _scoped-flags-interaction:
> > > +
> > > +Interaction between scoped flags and other access rights
> > > +--------------------------------------------------------
> > > +
> > > +The ``scoped`` flags in ``struct landlock_ruleset_attr`` restrict the
> > > +use of *outgoing* IPC from the created Landlock domain, while they
> > > +permit reaching out to IPC endpoints *within* the created Landlock
> > > +domain.
> > > +
> > > +In the future, scoped flags *may* interact with other access rights,
> > > +e.g. so that abstract UNIX sockets can be allow-listed by name, or so
> > > +that signals can be allow-listed by signal number or target process.
> > > +
> > > +When introducing ``LANDLOCK_ACCESS_FS_RESOLVE_UNIX``, we defined it to
> > > +implicitly have the same scoping semantics as a
> > > +``LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET`` flag would have: connecting to
> > > +UNIX sockets within the same domain (where
> > > +``LANDLOCK_ACCESS_FS_RESOLVE_UNIX`` is used) is unconditionally
> > > +allowed.
> > > +
> > > +The reasoning is:
> > > +
> > > +* Like other IPC mechanisms, connecting to named UNIX sockets in the
> > > + same domain should be expected and harmless. (If needed, users can
> > > + further refine their Landlock policies with nested domains or by
> > > + restricting ``LANDLOCK_ACCESS_FS_MAKE_SOCK``.)
> > > +* We reserve the option to still introduce
> > > + ``LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET`` in the future. (This would
> > > + be useful if we wanted to have a Landlock rule to permit IPC access
> > > + to other Landlock domains.)
> > > +* But we can postpone the point in time when users have to deal with
> > > + two interacting flags visible in the userspace API. (In particular,
> > > + it is possible that it won't be needed in practice, in which case we
> > > + can avoid the second flag altogether.)
> > > +* If we *do* introduce ``LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET`` in the
> > > + future, setting this scoped flag in a ruleset does *not reduce* the
> > > + restrictions, because access within the same scope is already
> > > + allowed based on ``LANDLOCK_ACCESS_FS_RESOLVE_UNIX``.
> > > +
> > > Tests
> > > =====
> > >
> > > diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> > > index f88fa1f68b77..751e3c143cba 100644
> > > --- a/include/uapi/linux/landlock.h
> > > +++ b/include/uapi/linux/landlock.h
> > > @@ -248,6 +248,24 @@ struct landlock_net_port_attr {
> > > *
> > > * This access right is available since the fifth version of the Landlock
> > > * ABI.
> > > + * - %LANDLOCK_ACCESS_FS_RESOLVE_UNIX: Look up pathname UNIX domain sockets
> > > + * (:manpage:`unix(7)`). On UNIX domain sockets, this restricts both calls to
> > > + * :manpage:`connect(2)` as well as calls to :manpage:`sendmsg(2)` with an
> > > + * explicit recipient address.
> > > + *
> > > + * This access right only applies to connections to UNIX server sockets which
> > > + * were created outside of the newly created Landlock domain (e.g. from within
> > > + * a parent domain or from an unrestricted process). Newly created UNIX
> > > + * servers within the same Landlock domain continue to be accessible. In this
> > > + * regard, %LANDLOCK_ACCESS_RESOLVE_UNIX has the same semantics as the
> >
> > LANDLOCK_ACCESS_FS_RESOLVE_UNIX
>
> Whoops, done.
>
>
> > > + * ``LANDLOCK_SCOPE_*`` flags.
> > > + *
> > > + * If a resolve attempt is denied, the operation returns an ``EACCES`` error,
> > > + * in line with other filesystem access rights (but different to denials for
> > > + * abstract UNIX domain sockets).
> >
> > This access right is available since the ninth version of the Landlock ABI.
>
> Thanks, added.
>
>
> > > + *
> > > + * The rationale for this design is described in
> > > + * :ref:`Documentation/security/landlock.rst <scoped-flags-interaction>`.
> > > *
> > > * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
> > > * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
> > > @@ -333,6 +351,7 @@ struct landlock_net_port_attr {
> > > #define LANDLOCK_ACCESS_FS_REFER (1ULL << 13)
> > > #define LANDLOCK_ACCESS_FS_TRUNCATE (1ULL << 14)
> > > #define LANDLOCK_ACCESS_FS_IOCTL_DEV (1ULL << 15)
> > > +#define LANDLOCK_ACCESS_FS_RESOLVE_UNIX (1ULL << 16)
> > > /* clang-format on */
> > >
> > > /**
> > > diff --git a/security/landlock/access.h b/security/landlock/access.h
> > > index 42c95747d7bd..89dc8e7b93da 100644
> > > --- a/security/landlock/access.h
> > > +++ b/security/landlock/access.h
> > > @@ -34,7 +34,7 @@
> > > LANDLOCK_ACCESS_FS_IOCTL_DEV)
> > > /* clang-format on */
> > >
> > > -typedef u16 access_mask_t;
> > > +typedef u32 access_mask_t;
> >
> > This change and the underlying implications are not explained in the
> > commit message, especially regarding the stack delta.
>
> Thanks, will add it.
>
>
> > > /* Makes sure all filesystem access rights can be stored. */
> > > static_assert(BITS_PER_TYPE(access_mask_t) >= LANDLOCK_NUM_ACCESS_FS);
> > > diff --git a/security/landlock/audit.c b/security/landlock/audit.c
> > > index 60ff217ab95b..8d0edf94037d 100644
> > > --- a/security/landlock/audit.c
> > > +++ b/security/landlock/audit.c
> > > @@ -37,6 +37,7 @@ static const char *const fs_access_strings[] = {
> > > [BIT_INDEX(LANDLOCK_ACCESS_FS_REFER)] = "fs.refer",
> > > [BIT_INDEX(LANDLOCK_ACCESS_FS_TRUNCATE)] = "fs.truncate",
> > > [BIT_INDEX(LANDLOCK_ACCESS_FS_IOCTL_DEV)] = "fs.ioctl_dev",
> > > + [BIT_INDEX(LANDLOCK_ACCESS_FS_RESOLVE_UNIX)] = "fs.resolve_unix",
> > > };
> > >
> > > static_assert(ARRAY_SIZE(fs_access_strings) == LANDLOCK_NUM_ACCESS_FS);
> > > diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> > > index 97065d51685a..0486f5ab06c9 100644
> > > --- a/security/landlock/fs.c
> > > +++ b/security/landlock/fs.c
> > > @@ -27,6 +27,7 @@
> > > #include <linux/lsm_hooks.h>
> > > #include <linux/mount.h>
> > > #include <linux/namei.h>
> > > +#include <linux/net.h>
> > > #include <linux/path.h>
> > > #include <linux/pid.h>
> > > #include <linux/rcupdate.h>
> > > @@ -36,6 +37,7 @@
> > > #include <linux/types.h>
> > > #include <linux/wait_bit.h>
> > > #include <linux/workqueue.h>
> > > +#include <net/af_unix.h>
> > > #include <uapi/linux/fiemap.h>
> > > #include <uapi/linux/landlock.h>
> > >
> > > @@ -314,7 +316,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
> > > LANDLOCK_ACCESS_FS_WRITE_FILE | \
> > > LANDLOCK_ACCESS_FS_READ_FILE | \
> > > LANDLOCK_ACCESS_FS_TRUNCATE | \
> > > - LANDLOCK_ACCESS_FS_IOCTL_DEV)
> > > + LANDLOCK_ACCESS_FS_IOCTL_DEV | \
> > > + LANDLOCK_ACCESS_FS_RESOLVE_UNIX)
> > > /* clang-format on */
> > >
> > > /*
> > > @@ -1557,6 +1560,110 @@ static int hook_path_truncate(const struct path *const path)
> > > return current_check_access_path(path, LANDLOCK_ACCESS_FS_TRUNCATE);
> > > }
> > >
> > > +/**
> > > + * unmask_scoped_access - Remove access right bits in @masks in all layers
> > > + * where @client and @server have the same domain
> > > + *
> > > + * This does the same as domain_is_scoped(), but unmasks bits in @masks.
> > > + * It can not return early as domain_is_scoped() does.
> >
> > I'd like a summary of your previous excellent explanation of
> > unmask_scoped_access() in this comment.
>
> Adding:
>
> A scoped access for a given access right bit is allowed iff, for all
> layer depths where the access bit is set, the client and server
> domain are the same. This function clears the access rights @access
> in @masks at all layer depths where the client and server domain are
> the same, so that, when they are all cleared, the access is allowed.
>
> It's not as detailed as drawing a picture in the other mail, but I
> hope it helps.
Good
>
>
> > > + * @client: Client domain
> > > + * @server: Server domain
> > > + * @masks: Layer access masks to unmask
> > > + * @access: Access bit that controls scoping
> > > + */
> > > +static void unmask_scoped_access(const struct landlock_ruleset *const client,
> > > + const struct landlock_ruleset *const server,
> > > + struct layer_access_masks *const masks,
> > > + const access_mask_t access)
> > > +{
> > > + int client_layer, server_layer;
> > > + const struct landlock_hierarchy *client_walker, *server_walker;
> > > +
> > > + /* This should not happen. */
> > > + if (WARN_ON_ONCE(!client))
> > > + return;
> > > +
> > > + /* Server has no Landlock domain; nothing to clear. */
> > > + if (!server)
> > > + return;
> > > +
> >
> > Please also copy the BUILD_BUG_ON() from domain_is_scoped().
>
> I don't understand what this check is good for. It says:
>
> /*
> * client_layer must be a signed integer with greater capacity
> * than client->num_layers to ensure the following loop stops.
> */
> BUILD_BUG_ON(sizeof(client_layer) > sizeof(client->num_layers));
>
> For the loop to terminate, in my understanding, client_layer must be
> able to store client->num_layers - 1 down to - 1, but that is anyway a
> given since num_layers can't exceed 16 and client_layer is signed. It
> seems that the termination of this would anyway be caught in our tests
> as well?
>
> Could you please clarify what this BUILD_BUG_ON() is trying to assert?
The intent was to make sure client_layer is indeed an int and not an
unsigned int for instance. Hopefully tests catch that but using a
build-time assert catch the potential issue and document it. Also, it
would be weird to not have the same checks in both copies of code.
>
>
> > > + client_layer = client->num_layers - 1;
> > > + client_walker = client->hierarchy;
> > > + server_layer = server->num_layers - 1;
> > > + server_walker = server->hierarchy;
> > > +
> > > + /*
> > > + * Clears the access bits at all layers where the client domain is the
> > > + * same as the server domain. We start the walk at min(client_layer,
> > > + * server_layer). The layer bits until there can not be cleared because
> > > + * either the client or the server domain is missing.
> > > + */
> > > + for (; client_layer > server_layer; client_layer--)
> > > + client_walker = client_walker->parent;
> > > +
> > > + for (; server_layer > client_layer; server_layer--)
> > > + server_walker = server_walker->parent;
> > > +
> > > + for (; client_layer >= 0; client_layer--) {
> > > + if (masks->access[client_layer] & access &&
> > > + client_walker == server_walker)
> > > + masks->access[client_layer] &= ~access;
> > > +
> > > + client_walker = client_walker->parent;
> > > + server_walker = server_walker->parent;
> > > + }
> > > +}
> > > +
> > > +static int hook_unix_find(const struct path *const path, struct sock *other,
> > > + int flags)
> > > +{
> > > + const struct landlock_ruleset *dom_other;
> > > + const struct landlock_cred_security *subject;
> > > + struct layer_access_masks layer_masks;
> > > + struct landlock_request request = {};
> > > + static const struct access_masks fs_resolve_unix = {
> > > + .fs = LANDLOCK_ACCESS_FS_RESOLVE_UNIX,
> > > + };
> > > +
> > > + /* Lookup for the purpose of saving coredumps is OK. */
> > > + if (unlikely(flags & SOCK_COREDUMP))
> > > + return 0;
> > > +
> > > + /* Access to the same (or a lower) domain is always allowed. */
> >
> > This comment is related to the unmask_scoped_access() call.
>
> Thanks, I moved it down.
>
>
> > > + subject = landlock_get_applicable_subject(current_cred(),
> > > + fs_resolve_unix, NULL);
> > > +
> > > + if (!subject)
> > > + return 0;
> > > +
> > > + if (!landlock_init_layer_masks(subject->domain, fs_resolve_unix.fs,
> > > + &layer_masks, LANDLOCK_KEY_INODE))
> >
> > This case is not possible because landlock_get_applicable_subject()
> > already check it. Other hooks just ignore the returned value in this
> > case.
>
> Hm, fair enough. I added a comment to explain why we are ignoring the
> return value, as it wasn't as obvious to me. In the other places, we
> are using the result of the landlock_init_layer_masks() function
> (because in the generic case, it can be a subset of the original
> access rights).
Another way to document it would be to use a WARN_ON_ONCE(), but that
would not be very useful in this case and add code which cannot be
tested/covered.
>
>
> > > + return 0;
> > > +
> > > + /* Checks the layers in which we are connecting within the same domain. */
> > > + unix_state_lock(other);
> > > + if (unlikely(sock_flag(other, SOCK_DEAD) || !other->sk_socket ||
> > > + !other->sk_socket->file)) {
> > > + unix_state_unlock(other);
> > > + return 0;
> >
> > Is it safe to not return -ECONNREFUSED?
>
> Yes. My reasoning is:
>
> In all three places where this gets called in af_unix.c (stream
> connect, dgram connect, dgram send), these functions check for socket
> death shortly after, and if they find the socket to be SOCK_DEAD, they
> will *retry* the UNIX lookup. The code commentary about this says
> that this is for a race condition where the VFS has "overslept" the
> socket death, so I presume that the retry aims at getting a race-free
> sitation on the next attempt.
>
> Since sock_orphan() is a one-way teardown operation, when we observe
> SOCK_DEAD in our hook, we can be sure that the caller will see it as
> well when it does the same check a bit later after our hook.
>
> If we *were* to return -ECONNREFUSED, the caller would immediately
> return an error though, and it would not retry as it normally does
> when it encounters this race condition. So we have to return 0 here.
OK, sound good.
>
>
> > > + }
> > > + dom_other = landlock_cred(other->sk_socket->file->f_cred)->domain;
> > > + unix_state_unlock(other);
> > > +
> > > + unmask_scoped_access(subject->domain, dom_other, &layer_masks,
> > > + fs_resolve_unix.fs);
> >
> > dom_other is not safe to use without the lock.
>
> Thanks, fixed by extending the lock scope across that function call,
> as discussed in other thread in more detail.
>
>
> > > + /* Checks the connections to allow-listed paths. */
> > > + if (is_access_to_paths_allowed(subject->domain, path,
> > > + fs_resolve_unix.fs, &layer_masks,
> > > + &request, NULL, 0, NULL, NULL, NULL))
> > > + return 0;
> > > +
> > > + landlock_log_denial(subject, &request);
> > > + return -EACCES;
> > > +}
> > > +
> > > /* File hooks */
> > >
> > > /**
> > > @@ -1834,6 +1941,7 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = {
> > > LSM_HOOK_INIT(path_unlink, hook_path_unlink),
> > > LSM_HOOK_INIT(path_rmdir, hook_path_rmdir),
> > > LSM_HOOK_INIT(path_truncate, hook_path_truncate),
> > > + LSM_HOOK_INIT(unix_find, hook_unix_find),
> > >
> > > LSM_HOOK_INIT(file_alloc_security, hook_file_alloc_security),
> > > LSM_HOOK_INIT(file_open, hook_file_open),
> > > diff --git a/security/landlock/limits.h b/security/landlock/limits.h
> > > index eb584f47288d..b454ad73b15e 100644
> > > --- a/security/landlock/limits.h
> > > +++ b/security/landlock/limits.h
> > > @@ -19,7 +19,7 @@
> > > #define LANDLOCK_MAX_NUM_LAYERS 16
> > > #define LANDLOCK_MAX_NUM_RULES U32_MAX
> > >
> > > -#define LANDLOCK_LAST_ACCESS_FS LANDLOCK_ACCESS_FS_IOCTL_DEV
> > > +#define LANDLOCK_LAST_ACCESS_FS LANDLOCK_ACCESS_FS_RESOLVE_UNIX
> > > #define LANDLOCK_MASK_ACCESS_FS ((LANDLOCK_LAST_ACCESS_FS << 1) - 1)
> > > #define LANDLOCK_NUM_ACCESS_FS __const_hweight64(LANDLOCK_MASK_ACCESS_FS)
> > >
> > > diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> > > index 3b33839b80c7..a6e23657f3ce 100644
> > > --- a/security/landlock/syscalls.c
> > > +++ b/security/landlock/syscalls.c
> > > @@ -166,7 +166,7 @@ static const struct file_operations ruleset_fops = {
> > > * If the change involves a fix that requires userspace awareness, also update
> > > * the errata documentation in Documentation/userspace-api/landlock.rst .
> > > */
> > > -const int landlock_abi_version = 8;
> > > +const int landlock_abi_version = 9;
> > >
> > > /**
> > > * sys_landlock_create_ruleset - Create a new ruleset
> > > diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
> > > index 0fea236ef4bd..30d37234086c 100644
> > > --- a/tools/testing/selftests/landlock/base_test.c
> > > +++ b/tools/testing/selftests/landlock/base_test.c
> > > @@ -76,7 +76,7 @@ TEST(abi_version)
> > > const struct landlock_ruleset_attr ruleset_attr = {
> > > .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
> > > };
> > > - ASSERT_EQ(8, landlock_create_ruleset(NULL, 0,
> > > + ASSERT_EQ(9, landlock_create_ruleset(NULL, 0,
> > > LANDLOCK_CREATE_RULESET_VERSION));
> > >
> > > ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0,
> > > diff --git a/tools/testing/selftests/landlock/fs_test.c b/tools/testing/selftests/landlock/fs_test.c
> > > index 968a91c927a4..b318627e7561 100644
> > > --- a/tools/testing/selftests/landlock/fs_test.c
> > > +++ b/tools/testing/selftests/landlock/fs_test.c
> > > @@ -575,9 +575,10 @@ TEST_F_FORK(layout1, inval)
> > > LANDLOCK_ACCESS_FS_WRITE_FILE | \
> > > LANDLOCK_ACCESS_FS_READ_FILE | \
> > > LANDLOCK_ACCESS_FS_TRUNCATE | \
> > > - LANDLOCK_ACCESS_FS_IOCTL_DEV)
> > > + LANDLOCK_ACCESS_FS_IOCTL_DEV | \
> > > + LANDLOCK_ACCESS_FS_RESOLVE_UNIX)
> > >
> > > -#define ACCESS_LAST LANDLOCK_ACCESS_FS_IOCTL_DEV
> > > +#define ACCESS_LAST LANDLOCK_ACCESS_FS_RESOLVE_UNIX
> > >
> > > #define ACCESS_ALL ( \
> > > ACCESS_FILE | \
> > > --
> > > 2.53.0
> > >
> > >
>
> –Günther
>
^ permalink raw reply
* Re: [PATCH v3 3/3] ima: Add support for staging measurements for deletion
From: Roberto Sassu @ 2026-03-20 17:26 UTC (permalink / raw)
To: steven chen, Mimi Zohar, corbet, skhan, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu
In-Reply-To: <8f66014c-d7c8-4a33-be7b-cfd945af4a3a@linux.microsoft.com>
On Fri, 2026-03-20 at 10:24 -0700, steven chen wrote:
> On 3/20/2026 10:10 AM, Roberto Sassu wrote:
> > On Fri, 2026-03-20 at 09:58 -0700, steven chen wrote:
> > > On 3/20/2026 5:41 AM, Mimi Zohar wrote:
> > > > On Thu, 2026-03-19 at 14:31 -0700, steven chen wrote:
> > > >
> > > > > > - Support for deleting N measurement records (and pre-pending the remaining
> > > > > > measurement records)
> > > > > Is there any problem to bring work of "stage" step together to the
> > > > > deletion step?
> > > > >
> > > > > "Trim N" method does everything that "staged" method can do, right?
> > > > > what's the "stage" method can do but "trim N" method can't do?
> > > > >
> > > > > in user space, if in "staged" state, no other user space agent can
> > > > > access the IMA measure list, right?
> > > > >
> > > > > Could you explain the benefit of bringing the "stage" step?
> > > > The performance improvement is because "staging" the IMA measurement list takes
> > > > the lock in order to move the measurement list pointer and then releases it.
> > > > New measurements can then be appended to a new measurement list. Deleting
> > > > records is done without taking the lock to walk the staged measurement list.
> > > >
> > > > Without staging the measurement list, walking the measurement list to trim N
> > > > records requires taking and holding the lock. The performance is dependent on
> > > > the size of the measurement list.
> > > >
> > > > Your question isn't really about "staging" the measurement list records, but
> > > > requiring a userspace signal to delete them. To answer that question, deleting
> > > > N records (third patch) could imply staging all the measurement records and
> > > > immediately deleting N records without an explicit userspace signal.
> > > >
> > > > I expect the requested "documentation" patch will provide the motivation for the
> > > > delayed deletion of the measurement list.
> > > >
> > > > Mimi
> > > "Staging" is great on reducing kernel IMA measurement list locking time.
> > >
> > > How about just do "stage N" entries and then delete the staged list in
> > > one shot?
> > > It means merge two APIs into one API
> > > int ima_queue_stage(void)
> > > int ima_queue_delete_staged(unsigned long req_value)
> > >
> > > The kernel lock time will be the same. And user space lock time will be
> > > reduced.
> > It is not the same. The walk on the staged list is done without holding
> > ima_extend_list_mutex.
> >
> > Roberto
>
> Is it possible to merge two APIs work into one API?
> int ima_queue_stage(void)
> int ima_queue_delete_staged(unsigned long req_value)
It will be done transparently for the user. IMA will call both
functions for the same securityfs write.
Roberto
^ permalink raw reply
* Re: [PATCH v3 3/3] ima: Add support for staging measurements for deletion
From: steven chen @ 2026-03-20 17:40 UTC (permalink / raw)
To: Roberto Sassu, Mimi Zohar, corbet, skhan, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu
In-Reply-To: <a523c0cf05e10838bf08e4d2e9a05df402f4c9b0.camel@huaweicloud.com>
On 3/20/2026 10:26 AM, Roberto Sassu wrote:
> On Fri, 2026-03-20 at 10:24 -0700, steven chen wrote:
>> On 3/20/2026 10:10 AM, Roberto Sassu wrote:
>>> On Fri, 2026-03-20 at 09:58 -0700, steven chen wrote:
>>>> On 3/20/2026 5:41 AM, Mimi Zohar wrote:
>>>>> On Thu, 2026-03-19 at 14:31 -0700, steven chen wrote:
>>>>>
>>>>>>> - Support for deleting N measurement records (and pre-pending the remaining
>>>>>>> measurement records)
>>>>>> Is there any problem to bring work of "stage" step together to the
>>>>>> deletion step?
>>>>>>
>>>>>> "Trim N" method does everything that "staged" method can do, right?
>>>>>> what's the "stage" method can do but "trim N" method can't do?
>>>>>>
>>>>>> in user space, if in "staged" state, no other user space agent can
>>>>>> access the IMA measure list, right?
>>>>>>
>>>>>> Could you explain the benefit of bringing the "stage" step?
>>>>> The performance improvement is because "staging" the IMA measurement list takes
>>>>> the lock in order to move the measurement list pointer and then releases it.
>>>>> New measurements can then be appended to a new measurement list. Deleting
>>>>> records is done without taking the lock to walk the staged measurement list.
>>>>>
>>>>> Without staging the measurement list, walking the measurement list to trim N
>>>>> records requires taking and holding the lock. The performance is dependent on
>>>>> the size of the measurement list.
>>>>>
>>>>> Your question isn't really about "staging" the measurement list records, but
>>>>> requiring a userspace signal to delete them. To answer that question, deleting
>>>>> N records (third patch) could imply staging all the measurement records and
>>>>> immediately deleting N records without an explicit userspace signal.
>>>>>
>>>>> I expect the requested "documentation" patch will provide the motivation for the
>>>>> delayed deletion of the measurement list.
>>>>>
>>>>> Mimi
>>>> "Staging" is great on reducing kernel IMA measurement list locking time.
>>>>
>>>> How about just do "stage N" entries and then delete the staged list in
>>>> one shot?
>>>> It means merge two APIs into one API
>>>> int ima_queue_stage(void)
>>>> int ima_queue_delete_staged(unsigned long req_value)
>>>>
>>>> The kernel lock time will be the same. And user space lock time will be
>>>> reduced.
>>> It is not the same. The walk on the staged list is done without holding
>>> ima_extend_list_mutex.
>>>
>>> Roberto
>> Is it possible to merge two APIs work into one API?
>> int ima_queue_stage(void)
>> int ima_queue_delete_staged(unsigned long req_value)
> It will be done transparently for the user. IMA will call both
> functions for the same securityfs write.
>
> Roberto
If merge two APIs into one API, it will reduce user space measurement
list lock time.
Thanks,
Steven
^ permalink raw reply
* Re: [PATCH v3 3/3] ima: Add support for staging measurements for deletion
From: Roberto Sassu @ 2026-03-20 17:10 UTC (permalink / raw)
To: steven chen, Mimi Zohar, corbet, skhan, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu
In-Reply-To: <c9258708-2db2-4c08-998f-e67a681781da@linux.microsoft.com>
On Fri, 2026-03-20 at 09:58 -0700, steven chen wrote:
> On 3/20/2026 5:41 AM, Mimi Zohar wrote:
> > On Thu, 2026-03-19 at 14:31 -0700, steven chen wrote:
> >
> > > > - Support for deleting N measurement records (and pre-pending the remaining
> > > > measurement records)
> > > Is there any problem to bring work of "stage" step together to the
> > > deletion step?
> > >
> > > "Trim N" method does everything that "staged" method can do, right?
> > > what's the "stage" method can do but "trim N" method can't do?
> > >
> > > in user space, if in "staged" state, no other user space agent can
> > > access the IMA measure list, right?
> > >
> > > Could you explain the benefit of bringing the "stage" step?
> > The performance improvement is because "staging" the IMA measurement list takes
> > the lock in order to move the measurement list pointer and then releases it.
> > New measurements can then be appended to a new measurement list. Deleting
> > records is done without taking the lock to walk the staged measurement list.
> >
> > Without staging the measurement list, walking the measurement list to trim N
> > records requires taking and holding the lock. The performance is dependent on
> > the size of the measurement list.
> >
> > Your question isn't really about "staging" the measurement list records, but
> > requiring a userspace signal to delete them. To answer that question, deleting
> > N records (third patch) could imply staging all the measurement records and
> > immediately deleting N records without an explicit userspace signal.
> >
> > I expect the requested "documentation" patch will provide the motivation for the
> > delayed deletion of the measurement list.
> >
> > Mimi
>
> "Staging" is great on reducing kernel IMA measurement list locking time.
>
> How about just do "stage N" entries and then delete the staged list in
> one shot?
> It means merge two APIs into one API
> int ima_queue_stage(void)
> int ima_queue_delete_staged(unsigned long req_value)
>
> The kernel lock time will be the same. And user space lock time will be
> reduced.
It is not the same. The walk on the staged list is done without holding
ima_extend_list_mutex.
Roberto
^ permalink raw reply
* Re: [PATCH v3 3/3] ima: Add support for staging measurements for deletion
From: steven chen @ 2026-03-20 17:24 UTC (permalink / raw)
To: Roberto Sassu, Mimi Zohar, corbet, skhan, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu, steven chen
In-Reply-To: <332fc1447c03893988620189a40501cccaa8b4c5.camel@huaweicloud.com>
On 3/20/2026 10:10 AM, Roberto Sassu wrote:
> On Fri, 2026-03-20 at 09:58 -0700, steven chen wrote:
>> On 3/20/2026 5:41 AM, Mimi Zohar wrote:
>>> On Thu, 2026-03-19 at 14:31 -0700, steven chen wrote:
>>>
>>>>> - Support for deleting N measurement records (and pre-pending the remaining
>>>>> measurement records)
>>>> Is there any problem to bring work of "stage" step together to the
>>>> deletion step?
>>>>
>>>> "Trim N" method does everything that "staged" method can do, right?
>>>> what's the "stage" method can do but "trim N" method can't do?
>>>>
>>>> in user space, if in "staged" state, no other user space agent can
>>>> access the IMA measure list, right?
>>>>
>>>> Could you explain the benefit of bringing the "stage" step?
>>> The performance improvement is because "staging" the IMA measurement list takes
>>> the lock in order to move the measurement list pointer and then releases it.
>>> New measurements can then be appended to a new measurement list. Deleting
>>> records is done without taking the lock to walk the staged measurement list.
>>>
>>> Without staging the measurement list, walking the measurement list to trim N
>>> records requires taking and holding the lock. The performance is dependent on
>>> the size of the measurement list.
>>>
>>> Your question isn't really about "staging" the measurement list records, but
>>> requiring a userspace signal to delete them. To answer that question, deleting
>>> N records (third patch) could imply staging all the measurement records and
>>> immediately deleting N records without an explicit userspace signal.
>>>
>>> I expect the requested "documentation" patch will provide the motivation for the
>>> delayed deletion of the measurement list.
>>>
>>> Mimi
>> "Staging" is great on reducing kernel IMA measurement list locking time.
>>
>> How about just do "stage N" entries and then delete the staged list in
>> one shot?
>> It means merge two APIs into one API
>> int ima_queue_stage(void)
>> int ima_queue_delete_staged(unsigned long req_value)
>>
>> The kernel lock time will be the same. And user space lock time will be
>> reduced.
> It is not the same. The walk on the staged list is done without holding
> ima_extend_list_mutex.
>
> Roberto
Is it possible to merge two APIs work into one API?
int ima_queue_stage(void)
int ima_queue_delete_staged(unsigned long req_value)
Thank,
Steven
^ permalink raw reply
* Re: [PATCH v6 9/9] landlock: Document FS access right for pathname UNIX sockets
From: Günther Noack @ 2026-03-20 17:04 UTC (permalink / raw)
To: Mickaël Salaün
Cc: John Johansen, Justin Suess, linux-security-module, Tingmao Wang,
Samasth Norway Ananda, Matthieu Buffet, Mikhail Ivanov,
konstantin.meskhidze, Demi Marie Obenour, Alyssa Ross, Jann Horn,
Tahera Fahimi, Sebastian Andrzej Siewior, Kuniyuki Iwashima
In-Reply-To: <20260318.geom7Taxahpi@digikod.net>
On Wed, Mar 18, 2026 at 05:54:19PM +0100, Mickaël Salaün wrote:
> Please always add some minimal description.
Done.
> Also, as already requested, could you run the check-linux.sh all on each
> patch? That would avoid me to fix things like the date (which would now
> be OK because of the new patch in my next branch, but still).
Will do.
> On Sun, Mar 15, 2026 at 11:21:50PM +0100, Günther Noack wrote:
> > Cc: Justin Suess <utilityemal77@gmail.com>
> > Cc: Mickaël Salaün <mic@digikod.net>
> > Signed-off-by: Günther Noack <gnoack3000@gmail.com>
> > ---
> > Documentation/userspace-api/landlock.rst | 15 ++++++++++++++-
> > 1 file changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst
> > index 13134bccdd39..e60ebd07c5cc 100644
> > --- a/Documentation/userspace-api/landlock.rst
> > +++ b/Documentation/userspace-api/landlock.rst
> > @@ -77,7 +77,8 @@ to be explicit about the denied-by-default access rights.
> > LANDLOCK_ACCESS_FS_MAKE_SYM |
> > LANDLOCK_ACCESS_FS_REFER |
> > LANDLOCK_ACCESS_FS_TRUNCATE |
> > - LANDLOCK_ACCESS_FS_IOCTL_DEV,
> > + LANDLOCK_ACCESS_FS_IOCTL_DEV |
> > + LANDLOCK_ACCESS_FS_RESOLVE_UNIX,
> > .handled_access_net =
> > LANDLOCK_ACCESS_NET_BIND_TCP |
> > LANDLOCK_ACCESS_NET_CONNECT_TCP,
> > @@ -127,6 +128,11 @@ version, and only use the available subset of access rights:
> > /* Removes LANDLOCK_SCOPE_* for ABI < 6 */
> > ruleset_attr.scoped &= ~(LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET |
> > LANDLOCK_SCOPE_SIGNAL);
> > + __attribute__((fallthrough));
>
> Case 6 should be handled too:
>
> case 6 ... 8:
Thank you, good catch!
>
> > + case 7:
> > + case 8:
> > + /* Removes LANDLOCK_ACCESS_FS_RESOLVE_UNIX for ABI < 9 */
> > + ruleset_attr.handled_access_fs &= ~LANDLOCK_ACCESS_FS_RESOLVE_UNIX;
> > }
> >
-Günther
^ permalink raw reply
* Re: [PATCH v3 3/3] ima: Add support for staging measurements for deletion
From: steven chen @ 2026-03-20 16:58 UTC (permalink / raw)
To: Mimi Zohar, Roberto Sassu, corbet, skhan, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu, steven chen
In-Reply-To: <ffe1d4645a66a690892163be8e16c4b5d24a690d.camel@linux.ibm.com>
On 3/20/2026 5:41 AM, Mimi Zohar wrote:
> On Thu, 2026-03-19 at 14:31 -0700, steven chen wrote:
>
>>> - Support for deleting N measurement records (and pre-pending the remaining
>>> measurement records)
>> Is there any problem to bring work of "stage" step together to the
>> deletion step?
>>
>> "Trim N" method does everything that "staged" method can do, right?
>> what's the "stage" method can do but "trim N" method can't do?
>>
>> in user space, if in "staged" state, no other user space agent can
>> access the IMA measure list, right?
>>
>> Could you explain the benefit of bringing the "stage" step?
> The performance improvement is because "staging" the IMA measurement list takes
> the lock in order to move the measurement list pointer and then releases it.
> New measurements can then be appended to a new measurement list. Deleting
> records is done without taking the lock to walk the staged measurement list.
>
> Without staging the measurement list, walking the measurement list to trim N
> records requires taking and holding the lock. The performance is dependent on
> the size of the measurement list.
>
> Your question isn't really about "staging" the measurement list records, but
> requiring a userspace signal to delete them. To answer that question, deleting
> N records (third patch) could imply staging all the measurement records and
> immediately deleting N records without an explicit userspace signal.
>
> I expect the requested "documentation" patch will provide the motivation for the
> delayed deletion of the measurement list.
>
> Mimi
"Staging" is great on reducing kernel IMA measurement list locking time.
How about just do "stage N" entries and then delete the staged list in
one shot?
It means merge two APIs into one API
int ima_queue_stage(void)
int ima_queue_delete_staged(unsigned long req_value)
The kernel lock time will be the same. And user space lock time will be
reduced.
Thanks,
Steven
>
>
>
>
>
^ permalink raw reply
* Re: [PATCH v6 7/9] landlock/selftests: Check that coredump sockets stay unrestricted
From: Günther Noack @ 2026-03-20 16:44 UTC (permalink / raw)
To: Mickaël Salaün
Cc: John Johansen, linux-security-module, Tingmao Wang, Justin Suess,
Samasth Norway Ananda, Matthieu Buffet, Mikhail Ivanov,
konstantin.meskhidze, Demi Marie Obenour, Alyssa Ross, Jann Horn,
Tahera Fahimi, Sebastian Andrzej Siewior, Kuniyuki Iwashima
In-Reply-To: <20260318.bahDieJohf1u@digikod.net>
On Wed, Mar 18, 2026 at 05:53:59PM +0100, Mickaël Salaün wrote:
> On Sun, Mar 15, 2026 at 11:21:48PM +0100, Günther Noack wrote:
> > Even when a process is restricted with the new
> > LANDLOCK_ACCESS_FS_RESOLVE_SOCKET right, the kernel can continue
>
> LANDLOCK_ACCESS_FS_RESOLVE_UNIX (twice)
Thanks, well spotted! Fixed.
–Günther
^ permalink raw reply
* Re: [PATCH v6 3/9] landlock: Control pathname UNIX domain socket resolution by path
From: Günther Noack @ 2026-03-20 16:15 UTC (permalink / raw)
To: Mickaël Salaün
Cc: John Johansen, Tingmao Wang, Justin Suess,
Sebastian Andrzej Siewior, Kuniyuki Iwashima, Jann Horn,
linux-security-module, Samasth Norway Ananda, Matthieu Buffet,
Mikhail Ivanov, konstantin.meskhidze, Demi Marie Obenour,
Alyssa Ross, Tahera Fahimi
In-Reply-To: <20260318.peecoo2Ooyep@digikod.net>
Hello!
On Wed, Mar 18, 2026 at 05:52:48PM +0100, Mickaël Salaün wrote:
> On Sun, Mar 15, 2026 at 11:21:44PM +0100, Günther Noack wrote:
> > * Add a new access right LANDLOCK_ACCESS_FS_RESOLVE_UNIX, which
> > controls the look up operations for named UNIX domain sockets. The
>
> lookup
Done.
> > resolution happens during connect() and sendmsg() (depending on
> > socket type).
> > * Hook into the path lookup in unix_find_bsd() in af_unix.c, using a
> > LSM hook. Make policy decisions based on the new access rights
> > * Increment the Landlock ABI version.
> > * Minor test adaptions to keep the tests working.
>
> adaptations
Done.
> > * Document the design rationale for scoped access rights,
> > and cross-reference it from the header documentation.
> >
> > With this access right, access is granted if either of the following
> > conditions is met:
> >
> > * The target socket's filesystem path was allow-listed using a
> > LANDLOCK_RULE_PATH_BENEATH rule, *or*:
> > * The target socket was created in the same Landlock domain in which
> > LANDLOCK_ACCESS_FS_RESOLVE_UNIX was restricted.
> >
> > In case of a denial, connect() and sendmsg() return EACCES, which is
> > the same error as it is returned if the user does not have the write
> > bit in the traditional Unix file system permissions of that file.
>
> UNIX
DONE
> > Document the (possible future) interaction between scoped flags and
> > other access rights in struct landlock_ruleset_attr, and summarize the
> > rationale, as discussed in code review leading up to [2].
> >
> > This feature was created with substantial discussion and input from
> > Justin Suess, Tingmao Wang and Mickaël Salaün.
> >
> > Cc: Tingmao Wang <m@maowtm.org>
> > Cc: Justin Suess <utilityemal77@gmail.com>
> > Cc: Mickaël Salaün <mic@digikod.net>
> > Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> > Cc: Kuniyuki Iwashima <kuniyu@google.com>
> > Suggested-by: Jann Horn <jannh@google.com>
> > Link[1]: https://github.com/landlock-lsm/linux/issues/36
> > Link[2]: https://lore.kernel.org/all/20260205.8531e4005118@gnoack.org/
> > Signed-off-by: Günther Noack <gnoack3000@gmail.com>
> > ---
> > Documentation/security/landlock.rst | 40 +++++++
> > include/uapi/linux/landlock.h | 19 ++++
> > security/landlock/access.h | 2 +-
> > security/landlock/audit.c | 1 +
> > security/landlock/fs.c | 110 ++++++++++++++++++-
> > security/landlock/limits.h | 2 +-
> > security/landlock/syscalls.c | 2 +-
> > tools/testing/selftests/landlock/base_test.c | 2 +-
> > tools/testing/selftests/landlock/fs_test.c | 5 +-
> > 9 files changed, 176 insertions(+), 7 deletions(-)
> >
> > diff --git a/Documentation/security/landlock.rst b/Documentation/security/landlock.rst
> > index 3e4d4d04cfae..4bbe250a6829 100644
> > --- a/Documentation/security/landlock.rst
> > +++ b/Documentation/security/landlock.rst
> > @@ -89,6 +89,46 @@ this is required to keep access controls consistent over the whole system, and
> > this avoids unattended bypasses through file descriptor passing (i.e. confused
> > deputy attack).
> >
> > +.. _scoped-flags-interaction:
> > +
> > +Interaction between scoped flags and other access rights
> > +--------------------------------------------------------
> > +
> > +The ``scoped`` flags in ``struct landlock_ruleset_attr`` restrict the
> > +use of *outgoing* IPC from the created Landlock domain, while they
> > +permit reaching out to IPC endpoints *within* the created Landlock
> > +domain.
> > +
> > +In the future, scoped flags *may* interact with other access rights,
> > +e.g. so that abstract UNIX sockets can be allow-listed by name, or so
> > +that signals can be allow-listed by signal number or target process.
> > +
> > +When introducing ``LANDLOCK_ACCESS_FS_RESOLVE_UNIX``, we defined it to
> > +implicitly have the same scoping semantics as a
> > +``LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET`` flag would have: connecting to
> > +UNIX sockets within the same domain (where
> > +``LANDLOCK_ACCESS_FS_RESOLVE_UNIX`` is used) is unconditionally
> > +allowed.
> > +
> > +The reasoning is:
> > +
> > +* Like other IPC mechanisms, connecting to named UNIX sockets in the
> > + same domain should be expected and harmless. (If needed, users can
> > + further refine their Landlock policies with nested domains or by
> > + restricting ``LANDLOCK_ACCESS_FS_MAKE_SOCK``.)
> > +* We reserve the option to still introduce
> > + ``LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET`` in the future. (This would
> > + be useful if we wanted to have a Landlock rule to permit IPC access
> > + to other Landlock domains.)
> > +* But we can postpone the point in time when users have to deal with
> > + two interacting flags visible in the userspace API. (In particular,
> > + it is possible that it won't be needed in practice, in which case we
> > + can avoid the second flag altogether.)
> > +* If we *do* introduce ``LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET`` in the
> > + future, setting this scoped flag in a ruleset does *not reduce* the
> > + restrictions, because access within the same scope is already
> > + allowed based on ``LANDLOCK_ACCESS_FS_RESOLVE_UNIX``.
> > +
> > Tests
> > =====
> >
> > diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> > index f88fa1f68b77..751e3c143cba 100644
> > --- a/include/uapi/linux/landlock.h
> > +++ b/include/uapi/linux/landlock.h
> > @@ -248,6 +248,24 @@ struct landlock_net_port_attr {
> > *
> > * This access right is available since the fifth version of the Landlock
> > * ABI.
> > + * - %LANDLOCK_ACCESS_FS_RESOLVE_UNIX: Look up pathname UNIX domain sockets
> > + * (:manpage:`unix(7)`). On UNIX domain sockets, this restricts both calls to
> > + * :manpage:`connect(2)` as well as calls to :manpage:`sendmsg(2)` with an
> > + * explicit recipient address.
> > + *
> > + * This access right only applies to connections to UNIX server sockets which
> > + * were created outside of the newly created Landlock domain (e.g. from within
> > + * a parent domain or from an unrestricted process). Newly created UNIX
> > + * servers within the same Landlock domain continue to be accessible. In this
> > + * regard, %LANDLOCK_ACCESS_RESOLVE_UNIX has the same semantics as the
>
> LANDLOCK_ACCESS_FS_RESOLVE_UNIX
Whoops, done.
> > + * ``LANDLOCK_SCOPE_*`` flags.
> > + *
> > + * If a resolve attempt is denied, the operation returns an ``EACCES`` error,
> > + * in line with other filesystem access rights (but different to denials for
> > + * abstract UNIX domain sockets).
>
> This access right is available since the ninth version of the Landlock ABI.
Thanks, added.
> > + *
> > + * The rationale for this design is described in
> > + * :ref:`Documentation/security/landlock.rst <scoped-flags-interaction>`.
> > *
> > * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
> > * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
> > @@ -333,6 +351,7 @@ struct landlock_net_port_attr {
> > #define LANDLOCK_ACCESS_FS_REFER (1ULL << 13)
> > #define LANDLOCK_ACCESS_FS_TRUNCATE (1ULL << 14)
> > #define LANDLOCK_ACCESS_FS_IOCTL_DEV (1ULL << 15)
> > +#define LANDLOCK_ACCESS_FS_RESOLVE_UNIX (1ULL << 16)
> > /* clang-format on */
> >
> > /**
> > diff --git a/security/landlock/access.h b/security/landlock/access.h
> > index 42c95747d7bd..89dc8e7b93da 100644
> > --- a/security/landlock/access.h
> > +++ b/security/landlock/access.h
> > @@ -34,7 +34,7 @@
> > LANDLOCK_ACCESS_FS_IOCTL_DEV)
> > /* clang-format on */
> >
> > -typedef u16 access_mask_t;
> > +typedef u32 access_mask_t;
>
> This change and the underlying implications are not explained in the
> commit message, especially regarding the stack delta.
Thanks, will add it.
> > /* Makes sure all filesystem access rights can be stored. */
> > static_assert(BITS_PER_TYPE(access_mask_t) >= LANDLOCK_NUM_ACCESS_FS);
> > diff --git a/security/landlock/audit.c b/security/landlock/audit.c
> > index 60ff217ab95b..8d0edf94037d 100644
> > --- a/security/landlock/audit.c
> > +++ b/security/landlock/audit.c
> > @@ -37,6 +37,7 @@ static const char *const fs_access_strings[] = {
> > [BIT_INDEX(LANDLOCK_ACCESS_FS_REFER)] = "fs.refer",
> > [BIT_INDEX(LANDLOCK_ACCESS_FS_TRUNCATE)] = "fs.truncate",
> > [BIT_INDEX(LANDLOCK_ACCESS_FS_IOCTL_DEV)] = "fs.ioctl_dev",
> > + [BIT_INDEX(LANDLOCK_ACCESS_FS_RESOLVE_UNIX)] = "fs.resolve_unix",
> > };
> >
> > static_assert(ARRAY_SIZE(fs_access_strings) == LANDLOCK_NUM_ACCESS_FS);
> > diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> > index 97065d51685a..0486f5ab06c9 100644
> > --- a/security/landlock/fs.c
> > +++ b/security/landlock/fs.c
> > @@ -27,6 +27,7 @@
> > #include <linux/lsm_hooks.h>
> > #include <linux/mount.h>
> > #include <linux/namei.h>
> > +#include <linux/net.h>
> > #include <linux/path.h>
> > #include <linux/pid.h>
> > #include <linux/rcupdate.h>
> > @@ -36,6 +37,7 @@
> > #include <linux/types.h>
> > #include <linux/wait_bit.h>
> > #include <linux/workqueue.h>
> > +#include <net/af_unix.h>
> > #include <uapi/linux/fiemap.h>
> > #include <uapi/linux/landlock.h>
> >
> > @@ -314,7 +316,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
> > LANDLOCK_ACCESS_FS_WRITE_FILE | \
> > LANDLOCK_ACCESS_FS_READ_FILE | \
> > LANDLOCK_ACCESS_FS_TRUNCATE | \
> > - LANDLOCK_ACCESS_FS_IOCTL_DEV)
> > + LANDLOCK_ACCESS_FS_IOCTL_DEV | \
> > + LANDLOCK_ACCESS_FS_RESOLVE_UNIX)
> > /* clang-format on */
> >
> > /*
> > @@ -1557,6 +1560,110 @@ static int hook_path_truncate(const struct path *const path)
> > return current_check_access_path(path, LANDLOCK_ACCESS_FS_TRUNCATE);
> > }
> >
> > +/**
> > + * unmask_scoped_access - Remove access right bits in @masks in all layers
> > + * where @client and @server have the same domain
> > + *
> > + * This does the same as domain_is_scoped(), but unmasks bits in @masks.
> > + * It can not return early as domain_is_scoped() does.
>
> I'd like a summary of your previous excellent explanation of
> unmask_scoped_access() in this comment.
Adding:
A scoped access for a given access right bit is allowed iff, for all
layer depths where the access bit is set, the client and server
domain are the same. This function clears the access rights @access
in @masks at all layer depths where the client and server domain are
the same, so that, when they are all cleared, the access is allowed.
It's not as detailed as drawing a picture in the other mail, but I
hope it helps.
> > + * @client: Client domain
> > + * @server: Server domain
> > + * @masks: Layer access masks to unmask
> > + * @access: Access bit that controls scoping
> > + */
> > +static void unmask_scoped_access(const struct landlock_ruleset *const client,
> > + const struct landlock_ruleset *const server,
> > + struct layer_access_masks *const masks,
> > + const access_mask_t access)
> > +{
> > + int client_layer, server_layer;
> > + const struct landlock_hierarchy *client_walker, *server_walker;
> > +
> > + /* This should not happen. */
> > + if (WARN_ON_ONCE(!client))
> > + return;
> > +
> > + /* Server has no Landlock domain; nothing to clear. */
> > + if (!server)
> > + return;
> > +
>
> Please also copy the BUILD_BUG_ON() from domain_is_scoped().
I don't understand what this check is good for. It says:
/*
* client_layer must be a signed integer with greater capacity
* than client->num_layers to ensure the following loop stops.
*/
BUILD_BUG_ON(sizeof(client_layer) > sizeof(client->num_layers));
For the loop to terminate, in my understanding, client_layer must be
able to store client->num_layers - 1 down to - 1, but that is anyway a
given since num_layers can't exceed 16 and client_layer is signed. It
seems that the termination of this would anyway be caught in our tests
as well?
Could you please clarify what this BUILD_BUG_ON() is trying to assert?
> > + client_layer = client->num_layers - 1;
> > + client_walker = client->hierarchy;
> > + server_layer = server->num_layers - 1;
> > + server_walker = server->hierarchy;
> > +
> > + /*
> > + * Clears the access bits at all layers where the client domain is the
> > + * same as the server domain. We start the walk at min(client_layer,
> > + * server_layer). The layer bits until there can not be cleared because
> > + * either the client or the server domain is missing.
> > + */
> > + for (; client_layer > server_layer; client_layer--)
> > + client_walker = client_walker->parent;
> > +
> > + for (; server_layer > client_layer; server_layer--)
> > + server_walker = server_walker->parent;
> > +
> > + for (; client_layer >= 0; client_layer--) {
> > + if (masks->access[client_layer] & access &&
> > + client_walker == server_walker)
> > + masks->access[client_layer] &= ~access;
> > +
> > + client_walker = client_walker->parent;
> > + server_walker = server_walker->parent;
> > + }
> > +}
> > +
> > +static int hook_unix_find(const struct path *const path, struct sock *other,
> > + int flags)
> > +{
> > + const struct landlock_ruleset *dom_other;
> > + const struct landlock_cred_security *subject;
> > + struct layer_access_masks layer_masks;
> > + struct landlock_request request = {};
> > + static const struct access_masks fs_resolve_unix = {
> > + .fs = LANDLOCK_ACCESS_FS_RESOLVE_UNIX,
> > + };
> > +
> > + /* Lookup for the purpose of saving coredumps is OK. */
> > + if (unlikely(flags & SOCK_COREDUMP))
> > + return 0;
> > +
> > + /* Access to the same (or a lower) domain is always allowed. */
>
> This comment is related to the unmask_scoped_access() call.
Thanks, I moved it down.
> > + subject = landlock_get_applicable_subject(current_cred(),
> > + fs_resolve_unix, NULL);
> > +
> > + if (!subject)
> > + return 0;
> > +
> > + if (!landlock_init_layer_masks(subject->domain, fs_resolve_unix.fs,
> > + &layer_masks, LANDLOCK_KEY_INODE))
>
> This case is not possible because landlock_get_applicable_subject()
> already check it. Other hooks just ignore the returned value in this
> case.
Hm, fair enough. I added a comment to explain why we are ignoring the
return value, as it wasn't as obvious to me. In the other places, we
are using the result of the landlock_init_layer_masks() function
(because in the generic case, it can be a subset of the original
access rights).
> > + return 0;
> > +
> > + /* Checks the layers in which we are connecting within the same domain. */
> > + unix_state_lock(other);
> > + if (unlikely(sock_flag(other, SOCK_DEAD) || !other->sk_socket ||
> > + !other->sk_socket->file)) {
> > + unix_state_unlock(other);
> > + return 0;
>
> Is it safe to not return -ECONNREFUSED?
Yes. My reasoning is:
In all three places where this gets called in af_unix.c (stream
connect, dgram connect, dgram send), these functions check for socket
death shortly after, and if they find the socket to be SOCK_DEAD, they
will *retry* the UNIX lookup. The code commentary about this says
that this is for a race condition where the VFS has "overslept" the
socket death, so I presume that the retry aims at getting a race-free
sitation on the next attempt.
Since sock_orphan() is a one-way teardown operation, when we observe
SOCK_DEAD in our hook, we can be sure that the caller will see it as
well when it does the same check a bit later after our hook.
If we *were* to return -ECONNREFUSED, the caller would immediately
return an error though, and it would not retry as it normally does
when it encounters this race condition. So we have to return 0 here.
> > + }
> > + dom_other = landlock_cred(other->sk_socket->file->f_cred)->domain;
> > + unix_state_unlock(other);
> > +
> > + unmask_scoped_access(subject->domain, dom_other, &layer_masks,
> > + fs_resolve_unix.fs);
>
> dom_other is not safe to use without the lock.
Thanks, fixed by extending the lock scope across that function call,
as discussed in other thread in more detail.
> > + /* Checks the connections to allow-listed paths. */
> > + if (is_access_to_paths_allowed(subject->domain, path,
> > + fs_resolve_unix.fs, &layer_masks,
> > + &request, NULL, 0, NULL, NULL, NULL))
> > + return 0;
> > +
> > + landlock_log_denial(subject, &request);
> > + return -EACCES;
> > +}
> > +
> > /* File hooks */
> >
> > /**
> > @@ -1834,6 +1941,7 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = {
> > LSM_HOOK_INIT(path_unlink, hook_path_unlink),
> > LSM_HOOK_INIT(path_rmdir, hook_path_rmdir),
> > LSM_HOOK_INIT(path_truncate, hook_path_truncate),
> > + LSM_HOOK_INIT(unix_find, hook_unix_find),
> >
> > LSM_HOOK_INIT(file_alloc_security, hook_file_alloc_security),
> > LSM_HOOK_INIT(file_open, hook_file_open),
> > diff --git a/security/landlock/limits.h b/security/landlock/limits.h
> > index eb584f47288d..b454ad73b15e 100644
> > --- a/security/landlock/limits.h
> > +++ b/security/landlock/limits.h
> > @@ -19,7 +19,7 @@
> > #define LANDLOCK_MAX_NUM_LAYERS 16
> > #define LANDLOCK_MAX_NUM_RULES U32_MAX
> >
> > -#define LANDLOCK_LAST_ACCESS_FS LANDLOCK_ACCESS_FS_IOCTL_DEV
> > +#define LANDLOCK_LAST_ACCESS_FS LANDLOCK_ACCESS_FS_RESOLVE_UNIX
> > #define LANDLOCK_MASK_ACCESS_FS ((LANDLOCK_LAST_ACCESS_FS << 1) - 1)
> > #define LANDLOCK_NUM_ACCESS_FS __const_hweight64(LANDLOCK_MASK_ACCESS_FS)
> >
> > diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> > index 3b33839b80c7..a6e23657f3ce 100644
> > --- a/security/landlock/syscalls.c
> > +++ b/security/landlock/syscalls.c
> > @@ -166,7 +166,7 @@ static const struct file_operations ruleset_fops = {
> > * If the change involves a fix that requires userspace awareness, also update
> > * the errata documentation in Documentation/userspace-api/landlock.rst .
> > */
> > -const int landlock_abi_version = 8;
> > +const int landlock_abi_version = 9;
> >
> > /**
> > * sys_landlock_create_ruleset - Create a new ruleset
> > diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
> > index 0fea236ef4bd..30d37234086c 100644
> > --- a/tools/testing/selftests/landlock/base_test.c
> > +++ b/tools/testing/selftests/landlock/base_test.c
> > @@ -76,7 +76,7 @@ TEST(abi_version)
> > const struct landlock_ruleset_attr ruleset_attr = {
> > .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
> > };
> > - ASSERT_EQ(8, landlock_create_ruleset(NULL, 0,
> > + ASSERT_EQ(9, landlock_create_ruleset(NULL, 0,
> > LANDLOCK_CREATE_RULESET_VERSION));
> >
> > ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0,
> > diff --git a/tools/testing/selftests/landlock/fs_test.c b/tools/testing/selftests/landlock/fs_test.c
> > index 968a91c927a4..b318627e7561 100644
> > --- a/tools/testing/selftests/landlock/fs_test.c
> > +++ b/tools/testing/selftests/landlock/fs_test.c
> > @@ -575,9 +575,10 @@ TEST_F_FORK(layout1, inval)
> > LANDLOCK_ACCESS_FS_WRITE_FILE | \
> > LANDLOCK_ACCESS_FS_READ_FILE | \
> > LANDLOCK_ACCESS_FS_TRUNCATE | \
> > - LANDLOCK_ACCESS_FS_IOCTL_DEV)
> > + LANDLOCK_ACCESS_FS_IOCTL_DEV | \
> > + LANDLOCK_ACCESS_FS_RESOLVE_UNIX)
> >
> > -#define ACCESS_LAST LANDLOCK_ACCESS_FS_IOCTL_DEV
> > +#define ACCESS_LAST LANDLOCK_ACCESS_FS_RESOLVE_UNIX
> >
> > #define ACCESS_ALL ( \
> > ACCESS_FILE | \
> > --
> > 2.53.0
> >
> >
–Günther
^ permalink raw reply
* [PATCH v7] backing_file: store an internal O_PATH user_path_file
From: Amir Goldstein @ 2026-03-20 13:53 UTC (permalink / raw)
To: Christian Brauner
Cc: Al Viro, Miklos Szeredi, Paul Moore, Gao Xiang,
linux-security-module, selinux, linux-erofs, linux-fsdevel,
linux-unionfs
Instead of storing the user_path, store an internal O_PATH file,
embedded in the backing_file struct for the user_path with a copy of
the original user file creds and with an independent a security context.
This internal O_PATH file is going to be used by LSM mprotect hooks
to check the permissions to mprotect against a copy of the original
mmaped file credentials.
The internal user_path_file is only exported as a const pointer and
its refcnt is initialized to FILE_REF_DEAD, because it is not an actual
refcounted object. The file_ref_init() helper was changed to accept
the FILE_REF_ constant instead of the fake +1 integer count.
The internal O_PATH file is opened using a new kernel helper
kernel_path_file_open(), which skips all the generic code in
do_dentry_open() and does only the essentials, so that the internal
O_PATH file could be opened with a negative dentry.
This is needed for backing_tmpfile_open() to open a backing O_PATH
tmpfile before instantiating the dentry.
The callers of backing_tmpfile_open() are responsible for calling
backing_tmpfile_finish() after making the path positive.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
Christian,
7th time is a charm (?).
Here is another try to introduce backing_file_user_path_file(),
after dealing with the the syzbot report [1] v6 [2] introduces.
Thanks,
Amir.
Changes since v6:
- Create helper for internal O_PATH open with negative path
- Create backing_tmpfile_finish() API to fixup the negative path
[1] https://syzkaller.appspot.com/bug?extid=f34aab278bf5d664e2be
[2] https://lore.kernel.org/linux-fsdevel/20260318131258.1457101-1-amir73il@gmail.com/
fs/backing-file.c | 32 +++++++++++-------
fs/erofs/ishare.c | 13 ++++++--
fs/file_table.c | 63 +++++++++++++++++++++++++++++-------
fs/fuse/passthrough.c | 3 +-
fs/internal.h | 12 +++++--
fs/open.c | 34 ++++++++++++++++---
fs/overlayfs/dir.c | 5 ++-
fs/overlayfs/file.c | 1 +
include/linux/backing-file.h | 30 +++++++++++++++--
include/linux/file_ref.h | 4 +--
10 files changed, 159 insertions(+), 38 deletions(-)
diff --git a/fs/backing-file.c b/fs/backing-file.c
index 45da8600d5644..77935c93333ab 100644
--- a/fs/backing-file.c
+++ b/fs/backing-file.c
@@ -11,6 +11,7 @@
#include <linux/fs.h>
#include <linux/backing-file.h>
#include <linux/splice.h>
+#include <linux/uio.h>
#include <linux/mm.h>
#include "internal.h"
@@ -18,9 +19,10 @@
/**
* backing_file_open - open a backing file for kernel internal use
* @user_path: path that the user reuqested to open
+ * @user_cred: credentials that the user used for open
* @flags: open flags
* @real_path: path of the backing file
- * @cred: credentials for open
+ * @cred: credentials for open of the backing file
*
* Open a backing file for a stackable filesystem (e.g., overlayfs).
* @user_path may be on the stackable filesystem and @real_path on the
@@ -29,20 +31,21 @@
* returned file into a container structure that also stores the stacked
* file's path, which can be retrieved using backing_file_user_path().
*/
-struct file *backing_file_open(const struct path *user_path, int flags,
+struct file *backing_file_open(const struct path *user_path,
+ const struct cred *user_cred, int flags,
const struct path *real_path,
const struct cred *cred)
{
struct file *f;
int error;
- f = alloc_empty_backing_file(flags, cred);
+ f = alloc_empty_backing_file(flags, cred, user_cred);
if (IS_ERR(f))
return f;
- path_get(user_path);
- backing_file_set_user_path(f, user_path);
- error = vfs_open(real_path, f);
+ error = backing_file_open_user_path(f, user_path);
+ if (!error)
+ error = vfs_open(real_path, f);
if (error) {
fput(f);
f = ERR_PTR(error);
@@ -52,7 +55,8 @@ struct file *backing_file_open(const struct path *user_path, int flags,
}
EXPORT_SYMBOL_GPL(backing_file_open);
-struct file *backing_tmpfile_open(const struct path *user_path, int flags,
+struct file *backing_tmpfile_open(const struct path *user_path,
+ const struct cred *user_cred, int flags,
const struct path *real_parentpath,
umode_t mode, const struct cred *cred)
{
@@ -60,13 +64,13 @@ struct file *backing_tmpfile_open(const struct path *user_path, int flags,
struct file *f;
int error;
- f = alloc_empty_backing_file(flags, cred);
+ f = alloc_empty_backing_file(flags, cred, user_cred);
if (IS_ERR(f))
return f;
- path_get(user_path);
- backing_file_set_user_path(f, user_path);
- error = vfs_tmpfile(real_idmap, real_parentpath, f, mode);
+ error = backing_file_open_user_path(f, user_path);
+ if (!error)
+ error = vfs_tmpfile(real_idmap, real_parentpath, f, mode);
if (error) {
fput(f);
f = ERR_PTR(error);
@@ -75,6 +79,12 @@ struct file *backing_tmpfile_open(const struct path *user_path, int flags,
}
EXPORT_SYMBOL(backing_tmpfile_open);
+void backing_tmpfile_finish(struct file *file)
+{
+ backing_file_set_user_path_inode(file);
+}
+EXPORT_SYMBOL_GPL(backing_tmpfile_finish);
+
struct backing_aio {
struct kiocb iocb;
refcount_t ref;
diff --git a/fs/erofs/ishare.c b/fs/erofs/ishare.c
index 829d50d5c717d..f3a5fb0bffaf0 100644
--- a/fs/erofs/ishare.c
+++ b/fs/erofs/ishare.c
@@ -103,18 +103,25 @@ static int erofs_ishare_file_open(struct inode *inode, struct file *file)
{
struct inode *sharedinode = EROFS_I(inode)->sharedinode;
struct file *realfile;
+ int err;
if (file->f_flags & O_DIRECT)
return -EINVAL;
- realfile = alloc_empty_backing_file(O_RDONLY|O_NOATIME, current_cred());
+ realfile = alloc_empty_backing_file(O_RDONLY|O_NOATIME, current_cred(),
+ file->f_cred);
if (IS_ERR(realfile))
return PTR_ERR(realfile);
+
+ err = backing_file_open_user_path(realfile, &file->f_path);
+ if (err) {
+ fput(realfile);
+ return err;
+ }
+
ihold(sharedinode);
realfile->f_op = &erofs_file_fops;
realfile->f_inode = sharedinode;
realfile->f_mapping = sharedinode->i_mapping;
- path_get(&file->f_path);
- backing_file_set_user_path(realfile, &file->f_path);
file_ra_state_init(&realfile->f_ra, file->f_mapping);
realfile->private_data = EROFS_I(inode);
diff --git a/fs/file_table.c b/fs/file_table.c
index aaa5faaace1e9..a4d1064d50896 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -27,6 +27,7 @@
#include <linux/task_work.h>
#include <linux/swap.h>
#include <linux/kmemleak.h>
+#include <linux/backing-file.h>
#include <linux/atomic.h>
@@ -43,11 +44,11 @@ static struct kmem_cache *bfilp_cachep __ro_after_init;
static struct percpu_counter nr_files __cacheline_aligned_in_smp;
-/* Container for backing file with optional user path */
+/* Container for backing file with optional user path file */
struct backing_file {
struct file file;
union {
- struct path user_path;
+ struct file user_path_file;
freeptr_t bf_freeptr;
};
};
@@ -56,24 +57,54 @@ struct backing_file {
const struct path *backing_file_user_path(const struct file *f)
{
- return &backing_file(f)->user_path;
+ return &backing_file(f)->user_path_file.f_path;
}
EXPORT_SYMBOL_GPL(backing_file_user_path);
-void backing_file_set_user_path(struct file *f, const struct path *path)
+const struct file *backing_file_user_path_file(const struct file *f)
{
- backing_file(f)->user_path = *path;
+ return &backing_file(f)->user_path_file;
}
-EXPORT_SYMBOL_GPL(backing_file_set_user_path);
+EXPORT_SYMBOL_GPL(backing_file_user_path_file);
-static inline void file_free(struct file *f)
+int backing_file_open_user_path(struct file *f, const struct path *path)
+{
+ if (WARN_ON(!(f->f_mode & FMODE_BACKING)))
+ return -EIO;
+ kernel_path_file_open(&backing_file(f)->user_path_file, path);
+ return 0;
+}
+EXPORT_SYMBOL_GPL(backing_file_open_user_path);
+
+void backing_file_set_user_path_inode(struct file *f)
+{
+ if (WARN_ON(!(f->f_mode & FMODE_BACKING)))
+ return;
+ file_set_d_inode(&backing_file(f)->user_path_file);
+}
+EXPORT_SYMBOL_GPL(backing_file_set_user_path_inode);
+
+static void destroy_file(struct file *f)
{
security_file_free(f);
+ put_cred(f->f_cred);
+}
+
+static inline void file_free(struct file *f)
+{
+ destroy_file(f);
if (likely(!(f->f_mode & FMODE_NOACCOUNT)))
percpu_counter_dec(&nr_files);
- put_cred(f->f_cred);
if (unlikely(f->f_mode & FMODE_BACKING)) {
- path_put(backing_file_user_path(f));
+ struct file *user_path_file = &backing_file(f)->user_path_file;
+
+ /*
+ * no refcount on the user_path_file - they die together,
+ * so __fput() is not called for user_path_file. path_put()
+ * is the only relevant cleanup from __fput().
+ */
+ destroy_file(user_path_file);
+ path_put(&user_path_file->__f_path);
kmem_cache_free(bfilp_cachep, backing_file(f));
} else {
kmem_cache_free(filp_cachep, f);
@@ -201,7 +232,7 @@ static int init_file(struct file *f, int flags, const struct cred *cred)
* fget-rcu pattern users need to be able to handle spurious
* refcount bumps we should reinitialize the reused file first.
*/
- file_ref_init(&f->f_ref, 1);
+ file_ref_init(&f->f_ref, FILE_REF_ONEREF);
return 0;
}
@@ -290,7 +321,8 @@ struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred)
* This is only for kernel internal use, and the allocate file must not be
* installed into file tables or such.
*/
-struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
+struct file *alloc_empty_backing_file(int flags, const struct cred *cred,
+ const struct cred *user_cred)
{
struct backing_file *ff;
int error;
@@ -305,6 +337,15 @@ struct file *alloc_empty_backing_file(int flags, const struct cred *cred)
return ERR_PTR(error);
}
+ error = init_file(&ff->user_path_file, O_PATH, user_cred);
+ /* user_path_file is not refcounterd - it dies with the backing file */
+ file_ref_init(&ff->user_path_file.f_ref, FILE_REF_DEAD);
+ if (unlikely(error)) {
+ destroy_file(&ff->file);
+ kmem_cache_free(bfilp_cachep, ff);
+ return ERR_PTR(error);
+ }
+
ff->file.f_mode |= FMODE_BACKING | FMODE_NOACCOUNT;
return &ff->file;
}
diff --git a/fs/fuse/passthrough.c b/fs/fuse/passthrough.c
index 72de97c03d0ee..60018c6359342 100644
--- a/fs/fuse/passthrough.c
+++ b/fs/fuse/passthrough.c
@@ -10,6 +10,7 @@
#include <linux/file.h>
#include <linux/backing-file.h>
#include <linux/splice.h>
+#include <linux/uio.h>
static void fuse_file_accessed(struct file *file)
{
@@ -167,7 +168,7 @@ struct fuse_backing *fuse_passthrough_open(struct file *file, int backing_id)
goto out;
/* Allocate backing file per fuse file to store fuse path */
- backing_file = backing_file_open(&file->f_path, file->f_flags,
+ backing_file = backing_file_open(&file->f_path, file->f_cred, file->f_flags,
&fb->file->f_path, fb->cred);
err = PTR_ERR(backing_file);
if (IS_ERR(backing_file)) {
diff --git a/fs/internal.h b/fs/internal.h
index cbc384a1aa096..4a9e5e00678d9 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -106,8 +106,16 @@ extern void chroot_fs_refs(const struct path *, const struct path *);
*/
struct file *alloc_empty_file(int flags, const struct cred *cred);
struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred);
-struct file *alloc_empty_backing_file(int flags, const struct cred *cred);
-void backing_file_set_user_path(struct file *f, const struct path *path);
+struct file *alloc_empty_backing_file(int flags, const struct cred *cred,
+ const struct cred *user_cred);
+int backing_file_open_user_path(struct file *f, const struct path *path);
+void backing_file_set_user_path_inode(struct file *f);
+void kernel_path_file_open(struct file *f, const struct path *path);
+
+static inline void file_set_d_inode(struct file *f)
+{
+ f->f_inode = d_inode(f->f_path.dentry);
+}
static inline void file_put_write_access(struct file *file)
{
diff --git a/fs/open.c b/fs/open.c
index 91f1139591abe..a7b3b04cd9ae7 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -884,10 +884,38 @@ static inline int file_get_write_access(struct file *f)
return error;
}
+static const struct file_operations empty_fops = {};
+
+static void do_path_file_open(struct file *f)
+{
+ f->f_mode = FMODE_PATH | FMODE_OPENED;
+ file_set_fsnotify_mode(f, FMODE_NONOTIFY);
+ f->f_op = &empty_fops;
+}
+
+/**
+ * kernel_path_file_open - open an O_PATH file for kernel internal use
+ * @f: pre-allocated file with f_flags and f_cred initialized
+ * @path: path to reference (may have a negative dentry)
+ *
+ * Open a minimal O_PATH file that only references a path.
+ * Unlike vfs_open(), this does not require a positive dentry and does not
+ * set up f_mapping and other fields not needed for O_PATH.
+ * If path is negative at the time of this call, the caller is responsible for
+ * callingn backing_file_set_user_path_inode() after making the path positive.
+
+ */
+void kernel_path_file_open(struct file *f, const struct path *path)
+{
+ f->__f_path = *path;
+ path_get(&f->f_path);
+ file_set_d_inode(f);
+ do_path_file_open(f);
+}
+
static int do_dentry_open(struct file *f,
int (*open)(struct inode *, struct file *))
{
- static const struct file_operations empty_fops = {};
struct inode *inode = f->f_path.dentry->d_inode;
int error;
@@ -898,9 +926,7 @@ static int do_dentry_open(struct file *f,
f->f_sb_err = file_sample_sb_err(f);
if (unlikely(f->f_flags & O_PATH)) {
- f->f_mode = FMODE_PATH | FMODE_OPENED;
- file_set_fsnotify_mode(f, FMODE_NONOTIFY);
- f->f_op = &empty_fops;
+ do_path_file_open(f);
return 0;
}
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index ff3dbd1ca61f2..24e961f165a78 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -1374,7 +1374,8 @@ static int ovl_create_tmpfile(struct file *file, struct dentry *dentry,
return PTR_ERR(cred);
ovl_path_upper(dentry->d_parent, &realparentpath);
- realfile = backing_tmpfile_open(&file->f_path, flags, &realparentpath,
+ realfile = backing_tmpfile_open(&file->f_path, file->f_cred,
+ flags, &realparentpath,
mode, current_cred());
err = PTR_ERR_OR_ZERO(realfile);
pr_debug("tmpfile/open(%pd2, 0%o) = %i\n", realparentpath.dentry, mode, err);
@@ -1392,6 +1393,8 @@ static int ovl_create_tmpfile(struct file *file, struct dentry *dentry,
err = ovl_instantiate(dentry, inode, newdentry, false, file);
if (!err) {
file->private_data = of;
+ /* user_path_file was opened with a negative path */
+ backing_tmpfile_finish(realfile);
} else {
dput(newdentry);
ovl_file_free(of);
diff --git a/fs/overlayfs/file.c b/fs/overlayfs/file.c
index 97bed2286030d..767c128407fcc 100644
--- a/fs/overlayfs/file.c
+++ b/fs/overlayfs/file.c
@@ -49,6 +49,7 @@ static struct file *ovl_open_realfile(const struct file *file,
flags &= ~O_NOATIME;
realfile = backing_file_open(file_user_path(file),
+ file_user_cred(file),
flags, realpath, current_cred());
}
}
diff --git a/include/linux/backing-file.h b/include/linux/backing-file.h
index 1476a6ed1bfd7..52ac51ada6ff9 100644
--- a/include/linux/backing-file.h
+++ b/include/linux/backing-file.h
@@ -9,21 +9,45 @@
#define _LINUX_BACKING_FILE_H
#include <linux/file.h>
-#include <linux/uio.h>
#include <linux/fs.h>
+/*
+ * When mmapping a file on a stackable filesystem (e.g., overlayfs), the file
+ * stored in ->vm_file is a backing file whose f_inode is on the underlying
+ * filesystem.
+ *
+ * LSM can use file_user_path_file() to store context related to the user path
+ * that was opened and mmaped.
+ */
+const struct file *backing_file_user_path_file(const struct file *f);
+
+static inline const struct file *file_user_path_file(const struct file *f)
+{
+ if (f && unlikely(f->f_mode & FMODE_BACKING))
+ return backing_file_user_path_file(f);
+ return f;
+}
+
+static inline const struct cred *file_user_cred(const struct file *f)
+{
+ return file_user_path_file(f)->f_cred;
+}
+
struct backing_file_ctx {
const struct cred *cred;
void (*accessed)(struct file *file);
void (*end_write)(struct kiocb *iocb, ssize_t);
};
-struct file *backing_file_open(const struct path *user_path, int flags,
+struct file *backing_file_open(const struct path *user_path,
+ const struct cred *user_cred, int flags,
const struct path *real_path,
const struct cred *cred);
-struct file *backing_tmpfile_open(const struct path *user_path, int flags,
+struct file *backing_tmpfile_open(const struct path *user_path,
+ const struct cred *user_cred, int flags,
const struct path *real_parentpath,
umode_t mode, const struct cred *cred);
+void backing_tmpfile_finish(struct file *file);
ssize_t backing_file_read_iter(struct file *file, struct iov_iter *iter,
struct kiocb *iocb, int flags,
struct backing_file_ctx *ctx);
diff --git a/include/linux/file_ref.h b/include/linux/file_ref.h
index 31551e4cb8f34..c7512ce70f9c4 100644
--- a/include/linux/file_ref.h
+++ b/include/linux/file_ref.h
@@ -54,11 +54,11 @@ typedef struct {
/**
* file_ref_init - Initialize a file reference count
* @ref: Pointer to the reference count
- * @cnt: The initial reference count typically '1'
+ * @cnt: The initial reference count typically FILE_REF_ONEREF
*/
static inline void file_ref_init(file_ref_t *ref, unsigned long cnt)
{
- atomic_long_set(&ref->refcnt, cnt - 1);
+ atomic_long_set(&ref->refcnt, cnt);
}
bool __file_ref_put(file_ref_t *ref, unsigned long cnt);
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v3 3/3] ima: Add support for staging measurements for deletion
From: Mimi Zohar @ 2026-03-20 12:41 UTC (permalink / raw)
To: steven chen, Roberto Sassu, corbet, skhan, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu
In-Reply-To: <25e0a273-9044-4e0d-9812-0171ec99e1b7@linux.microsoft.com>
On Thu, 2026-03-19 at 14:31 -0700, steven chen wrote:
> > - Support for deleting N measurement records (and pre-pending the remaining
> > measurement records)
>
> Is there any problem to bring work of "stage" step together to the
> deletion step?
>
> "Trim N" method does everything that "staged" method can do, right?
> what's the "stage" method can do but "trim N" method can't do?
>
> in user space, if in "staged" state, no other user space agent can
> access the IMA measure list, right?
>
> Could you explain the benefit of bringing the "stage" step?
The performance improvement is because "staging" the IMA measurement list takes
the lock in order to move the measurement list pointer and then releases it.
New measurements can then be appended to a new measurement list. Deleting
records is done without taking the lock to walk the staged measurement list.
Without staging the measurement list, walking the measurement list to trim N
records requires taking and holding the lock. The performance is dependent on
the size of the measurement list.
Your question isn't really about "staging" the measurement list records, but
requiring a userspace signal to delete them. To answer that question, deleting
N records (third patch) could imply staging all the measurement records and
immediately deleting N records without an explicit userspace signal.
I expect the requested "documentation" patch will provide the motivation for the
delayed deletion of the measurement list.
Mimi
^ permalink raw reply
* Re: [PATCH v2] fs: allow backing file code to open an O_PATH file with negative dentry
From: Amir Goldstein @ 2026-03-20 12:33 UTC (permalink / raw)
To: Christian Brauner
Cc: Al Viro, Miklos Szeredi, Paul Moore, Gao Xiang,
linux-security-module, selinux, linux-erofs, linux-fsdevel,
linux-unionfs, syzbot+f34aab278bf5d664e2be
In-Reply-To: <20260320122918.1726043-1-amir73il@gmail.com>
On Fri, Mar 20, 2026 at 1:29 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> The fields f_mapping and f_wb_err are irrelevant for the O_PATH file
> in backing_file_user_path_file().
>
> Create a dedicated helper kernel_path_file_open(), which skips all the
> generic code in do_dentry_open() and does only the essentials, so that
> the internal O_PATH file could be opened with a negative dentry.
>
> This is needed for backing_tmpfile_open() to open a backing O_PATH
> tmpfile before instantiating the dentry.
>
> The callers of backing_tmpfile_open() are responsible for calling
> backing_tmpfile_finish() after making the path positive.
>
> Reported-by: syzbot+f34aab278bf5d664e2be@syzkaller.appspotmail.com
> Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> ---
>
> Christian,
>
> This patch fixes the syzbot report [1] that the
> backing_file_user_path_file() patch [2] introduces.
Sorry, forgot to mention that this patch applies on top of [2]
so the vfs CI is likely to fail to apply it.
The previous patch + fix are also available on my github
https://github.com/amir73il/linux/commits/user_path_file/
Thanks,
Amir.
>
> Following your feedback on v1, this version makes an effort to stay
> out of the way of main vfs execution paths and restrict the changes
> to backing_file users.
>
> This still introduced a temporary state of an O_PATH file with negative
> path, but only for a short time and only for backing_file users and
> ones that use backing_tmpfile_open() (i.e. only overlayfs), so the
> risk is minimal.
>
> WDYT?
>
> Thanks,
> Amir.
>
> Changes since v1:
> - Create helper for internal O_PATH open with negative path
> - Create backing_tmpfile_finish() API to fixup the negative path
>
> [1] https://syzkaller.appspot.com/bug?extid=f34aab278bf5d664e2be
> [2] https://lore.kernel.org/linux-fsdevel/20260318131258.1457101-1-amir73il@gmail.com/
>
> fs/backing-file.c | 6 ++++++
> fs/file_table.c | 14 ++++++++++++--
> fs/internal.h | 7 +++++++
> fs/open.c | 34 ++++++++++++++++++++++++++++++----
> fs/overlayfs/dir.c | 2 ++
> include/linux/backing-file.h | 1 +
> 6 files changed, 58 insertions(+), 6 deletions(-)
>
> diff --git a/fs/backing-file.c b/fs/backing-file.c
> index d0a64c2103907..3357d624eac96 100644
> --- a/fs/backing-file.c
> +++ b/fs/backing-file.c
> @@ -80,6 +80,12 @@ struct file *backing_tmpfile_open(const struct path *user_path,
> }
> EXPORT_SYMBOL(backing_tmpfile_open);
>
> +void backing_tmpfile_finish(struct file *file)
> +{
> + backing_file_set_user_path_inode(file);
> +}
> +EXPORT_SYMBOL_GPL(backing_tmpfile_finish);
> +
> struct backing_aio {
> struct kiocb iocb;
> refcount_t ref;
> diff --git a/fs/file_table.c b/fs/file_table.c
> index e8b4eb2bbff85..a4d1064d50896 100644
> --- a/fs/file_table.c
> +++ b/fs/file_table.c
> @@ -69,11 +69,21 @@ EXPORT_SYMBOL_GPL(backing_file_user_path_file);
>
> int backing_file_open_user_path(struct file *f, const struct path *path)
> {
> - /* open an O_PATH file to reference the user path - should not fail */
> - return WARN_ON(vfs_open(path, &backing_file(f)->user_path_file));
> + if (WARN_ON(!(f->f_mode & FMODE_BACKING)))
> + return -EIO;
> + kernel_path_file_open(&backing_file(f)->user_path_file, path);
> + return 0;
> }
> EXPORT_SYMBOL_GPL(backing_file_open_user_path);
>
> +void backing_file_set_user_path_inode(struct file *f)
> +{
> + if (WARN_ON(!(f->f_mode & FMODE_BACKING)))
> + return;
> + file_set_d_inode(&backing_file(f)->user_path_file);
> +}
> +EXPORT_SYMBOL_GPL(backing_file_set_user_path_inode);
> +
> static void destroy_file(struct file *f)
> {
> security_file_free(f);
> diff --git a/fs/internal.h b/fs/internal.h
> index 7c44a58627ba3..4a9e5e00678d9 100644
> --- a/fs/internal.h
> +++ b/fs/internal.h
> @@ -109,6 +109,13 @@ struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred);
> struct file *alloc_empty_backing_file(int flags, const struct cred *cred,
> const struct cred *user_cred);
> int backing_file_open_user_path(struct file *f, const struct path *path);
> +void backing_file_set_user_path_inode(struct file *f);
> +void kernel_path_file_open(struct file *f, const struct path *path);
> +
> +static inline void file_set_d_inode(struct file *f)
> +{
> + f->f_inode = d_inode(f->f_path.dentry);
> +}
>
> static inline void file_put_write_access(struct file *file)
> {
> diff --git a/fs/open.c b/fs/open.c
> index 91f1139591abe..a7b3b04cd9ae7 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -884,10 +884,38 @@ static inline int file_get_write_access(struct file *f)
> return error;
> }
>
> +static const struct file_operations empty_fops = {};
> +
> +static void do_path_file_open(struct file *f)
> +{
> + f->f_mode = FMODE_PATH | FMODE_OPENED;
> + file_set_fsnotify_mode(f, FMODE_NONOTIFY);
> + f->f_op = &empty_fops;
> +}
> +
> +/**
> + * kernel_path_file_open - open an O_PATH file for kernel internal use
> + * @f: pre-allocated file with f_flags and f_cred initialized
> + * @path: path to reference (may have a negative dentry)
> + *
> + * Open a minimal O_PATH file that only references a path.
> + * Unlike vfs_open(), this does not require a positive dentry and does not
> + * set up f_mapping and other fields not needed for O_PATH.
> + * If path is negative at the time of this call, the caller is responsible for
> + * callingn backing_file_set_user_path_inode() after making the path positive.
> +
> + */
> +void kernel_path_file_open(struct file *f, const struct path *path)
> +{
> + f->__f_path = *path;
> + path_get(&f->f_path);
> + file_set_d_inode(f);
> + do_path_file_open(f);
> +}
> +
> static int do_dentry_open(struct file *f,
> int (*open)(struct inode *, struct file *))
> {
> - static const struct file_operations empty_fops = {};
> struct inode *inode = f->f_path.dentry->d_inode;
> int error;
>
> @@ -898,9 +926,7 @@ static int do_dentry_open(struct file *f,
> f->f_sb_err = file_sample_sb_err(f);
>
> if (unlikely(f->f_flags & O_PATH)) {
> - f->f_mode = FMODE_PATH | FMODE_OPENED;
> - file_set_fsnotify_mode(f, FMODE_NONOTIFY);
> - f->f_op = &empty_fops;
> + do_path_file_open(f);
> return 0;
> }
>
> diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
> index 5fd32ccc134d2..4010c87e10196 100644
> --- a/fs/overlayfs/dir.c
> +++ b/fs/overlayfs/dir.c
> @@ -1408,6 +1408,8 @@ static int ovl_create_tmpfile(struct file *file, struct dentry *dentry,
> err = ovl_instantiate(dentry, inode, newdentry, false, file);
> if (!err) {
> file->private_data = of;
> + /* user_path_file was opened with a negative path */
> + backing_tmpfile_finish(realfile);
> } else {
> dput(newdentry);
> ovl_file_free(of);
> diff --git a/include/linux/backing-file.h b/include/linux/backing-file.h
> index 8afba93f3ce07..52ac51ada6ff9 100644
> --- a/include/linux/backing-file.h
> +++ b/include/linux/backing-file.h
> @@ -47,6 +47,7 @@ struct file *backing_tmpfile_open(const struct path *user_path,
> const struct cred *user_cred, int flags,
> const struct path *real_parentpath,
> umode_t mode, const struct cred *cred);
> +void backing_tmpfile_finish(struct file *file);
> ssize_t backing_file_read_iter(struct file *file, struct iov_iter *iter,
> struct kiocb *iocb, int flags,
> struct backing_file_ctx *ctx);
> --
> 2.53.0
>
^ permalink raw reply
* [PATCH v2] fs: allow backing file code to open an O_PATH file with negative dentry
From: Amir Goldstein @ 2026-03-20 12:29 UTC (permalink / raw)
To: Christian Brauner
Cc: Al Viro, Miklos Szeredi, Paul Moore, Gao Xiang,
linux-security-module, selinux, linux-erofs, linux-fsdevel,
linux-unionfs, syzbot+f34aab278bf5d664e2be
The fields f_mapping and f_wb_err are irrelevant for the O_PATH file
in backing_file_user_path_file().
Create a dedicated helper kernel_path_file_open(), which skips all the
generic code in do_dentry_open() and does only the essentials, so that
the internal O_PATH file could be opened with a negative dentry.
This is needed for backing_tmpfile_open() to open a backing O_PATH
tmpfile before instantiating the dentry.
The callers of backing_tmpfile_open() are responsible for calling
backing_tmpfile_finish() after making the path positive.
Reported-by: syzbot+f34aab278bf5d664e2be@syzkaller.appspotmail.com
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
---
Christian,
This patch fixes the syzbot report [1] that the
backing_file_user_path_file() patch [2] introduces.
Following your feedback on v1, this version makes an effort to stay
out of the way of main vfs execution paths and restrict the changes
to backing_file users.
This still introduced a temporary state of an O_PATH file with negative
path, but only for a short time and only for backing_file users and
ones that use backing_tmpfile_open() (i.e. only overlayfs), so the
risk is minimal.
WDYT?
Thanks,
Amir.
Changes since v1:
- Create helper for internal O_PATH open with negative path
- Create backing_tmpfile_finish() API to fixup the negative path
[1] https://syzkaller.appspot.com/bug?extid=f34aab278bf5d664e2be
[2] https://lore.kernel.org/linux-fsdevel/20260318131258.1457101-1-amir73il@gmail.com/
fs/backing-file.c | 6 ++++++
fs/file_table.c | 14 ++++++++++++--
fs/internal.h | 7 +++++++
fs/open.c | 34 ++++++++++++++++++++++++++++++----
fs/overlayfs/dir.c | 2 ++
include/linux/backing-file.h | 1 +
6 files changed, 58 insertions(+), 6 deletions(-)
diff --git a/fs/backing-file.c b/fs/backing-file.c
index d0a64c2103907..3357d624eac96 100644
--- a/fs/backing-file.c
+++ b/fs/backing-file.c
@@ -80,6 +80,12 @@ struct file *backing_tmpfile_open(const struct path *user_path,
}
EXPORT_SYMBOL(backing_tmpfile_open);
+void backing_tmpfile_finish(struct file *file)
+{
+ backing_file_set_user_path_inode(file);
+}
+EXPORT_SYMBOL_GPL(backing_tmpfile_finish);
+
struct backing_aio {
struct kiocb iocb;
refcount_t ref;
diff --git a/fs/file_table.c b/fs/file_table.c
index e8b4eb2bbff85..a4d1064d50896 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -69,11 +69,21 @@ EXPORT_SYMBOL_GPL(backing_file_user_path_file);
int backing_file_open_user_path(struct file *f, const struct path *path)
{
- /* open an O_PATH file to reference the user path - should not fail */
- return WARN_ON(vfs_open(path, &backing_file(f)->user_path_file));
+ if (WARN_ON(!(f->f_mode & FMODE_BACKING)))
+ return -EIO;
+ kernel_path_file_open(&backing_file(f)->user_path_file, path);
+ return 0;
}
EXPORT_SYMBOL_GPL(backing_file_open_user_path);
+void backing_file_set_user_path_inode(struct file *f)
+{
+ if (WARN_ON(!(f->f_mode & FMODE_BACKING)))
+ return;
+ file_set_d_inode(&backing_file(f)->user_path_file);
+}
+EXPORT_SYMBOL_GPL(backing_file_set_user_path_inode);
+
static void destroy_file(struct file *f)
{
security_file_free(f);
diff --git a/fs/internal.h b/fs/internal.h
index 7c44a58627ba3..4a9e5e00678d9 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@ -109,6 +109,13 @@ struct file *alloc_empty_file_noaccount(int flags, const struct cred *cred);
struct file *alloc_empty_backing_file(int flags, const struct cred *cred,
const struct cred *user_cred);
int backing_file_open_user_path(struct file *f, const struct path *path);
+void backing_file_set_user_path_inode(struct file *f);
+void kernel_path_file_open(struct file *f, const struct path *path);
+
+static inline void file_set_d_inode(struct file *f)
+{
+ f->f_inode = d_inode(f->f_path.dentry);
+}
static inline void file_put_write_access(struct file *file)
{
diff --git a/fs/open.c b/fs/open.c
index 91f1139591abe..a7b3b04cd9ae7 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -884,10 +884,38 @@ static inline int file_get_write_access(struct file *f)
return error;
}
+static const struct file_operations empty_fops = {};
+
+static void do_path_file_open(struct file *f)
+{
+ f->f_mode = FMODE_PATH | FMODE_OPENED;
+ file_set_fsnotify_mode(f, FMODE_NONOTIFY);
+ f->f_op = &empty_fops;
+}
+
+/**
+ * kernel_path_file_open - open an O_PATH file for kernel internal use
+ * @f: pre-allocated file with f_flags and f_cred initialized
+ * @path: path to reference (may have a negative dentry)
+ *
+ * Open a minimal O_PATH file that only references a path.
+ * Unlike vfs_open(), this does not require a positive dentry and does not
+ * set up f_mapping and other fields not needed for O_PATH.
+ * If path is negative at the time of this call, the caller is responsible for
+ * callingn backing_file_set_user_path_inode() after making the path positive.
+
+ */
+void kernel_path_file_open(struct file *f, const struct path *path)
+{
+ f->__f_path = *path;
+ path_get(&f->f_path);
+ file_set_d_inode(f);
+ do_path_file_open(f);
+}
+
static int do_dentry_open(struct file *f,
int (*open)(struct inode *, struct file *))
{
- static const struct file_operations empty_fops = {};
struct inode *inode = f->f_path.dentry->d_inode;
int error;
@@ -898,9 +926,7 @@ static int do_dentry_open(struct file *f,
f->f_sb_err = file_sample_sb_err(f);
if (unlikely(f->f_flags & O_PATH)) {
- f->f_mode = FMODE_PATH | FMODE_OPENED;
- file_set_fsnotify_mode(f, FMODE_NONOTIFY);
- f->f_op = &empty_fops;
+ do_path_file_open(f);
return 0;
}
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 5fd32ccc134d2..4010c87e10196 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -1408,6 +1408,8 @@ static int ovl_create_tmpfile(struct file *file, struct dentry *dentry,
err = ovl_instantiate(dentry, inode, newdentry, false, file);
if (!err) {
file->private_data = of;
+ /* user_path_file was opened with a negative path */
+ backing_tmpfile_finish(realfile);
} else {
dput(newdentry);
ovl_file_free(of);
diff --git a/include/linux/backing-file.h b/include/linux/backing-file.h
index 8afba93f3ce07..52ac51ada6ff9 100644
--- a/include/linux/backing-file.h
+++ b/include/linux/backing-file.h
@@ -47,6 +47,7 @@ struct file *backing_tmpfile_open(const struct path *user_path,
const struct cred *user_cred, int flags,
const struct path *real_parentpath,
umode_t mode, const struct cred *cred);
+void backing_tmpfile_finish(struct file *file);
ssize_t backing_file_read_iter(struct file *file, struct iov_iter *iter,
struct kiocb *iocb, int flags,
struct backing_file_ctx *ctx);
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v6 3/9] landlock: Control pathname UNIX domain socket resolution by path
From: Günther Noack @ 2026-03-20 12:28 UTC (permalink / raw)
To: Mickaël Salaün
Cc: Justin Suess, Sebastian Andrzej Siewior, John Johansen,
Tingmao Wang, Kuniyuki Iwashima, Jann Horn, linux-security-module,
Samasth Norway Ananda, Matthieu Buffet, Mikhail Ivanov,
konstantin.meskhidze, Demi Marie Obenour, Alyssa Ross,
Tahera Fahimi
In-Reply-To: <20260318.aequoaDaeb7h@digikod.net>
On Wed, Mar 18, 2026 at 06:52:57PM +0100, Mickaël Salaün wrote:
> On Wed, Mar 18, 2026 at 12:43:55PM -0400, Justin Suess wrote:
> > On Wed, Mar 18, 2026 at 05:26:20PM +0100, Mickaël Salaün wrote:
> > > On Wed, Mar 18, 2026 at 04:05:59PM +0100, Sebastian Andrzej Siewior wrote:
> > > > On 2026-03-18 10:14:52 [-0400], Justin Suess wrote:
> > > > > Sebastian,
> > > > Justin,
> > > >
> > > > > In short: dom_other is a pointer to a landlock-owned refcounted struct.
> > > > …
> > > > >
> > > > > But we copy the domain pointer, which points to a landlock allocated
> > > > > and controlled object.
> > > >
> > > > and this is not going away while we are here and preempted after
> > > > dropping the lock? (if the landlock policy is updated/ changed/ …)
> > >
> > > I agree with Sebastian, this is a bug, see my original proposal:
> > > https://lore.kernel.org/all/20260217.lievaS8eeng8@digikod.net/
> > Mickaël,
> >
> > Just to make sure we're speaking of the same thing (I spotted a bug
> > shortly after replying to Sebastian).
> >
> > This is a potential UAF if the dom_other is freed before the access
> > check takes place correct?
>
> Yes
>
> >
> > dom_other = landlock_cred(other->sk_socket->file->f_cred)->domain;
> > unix_state_unlock(other);
> >
> > unmask_scoped_access(subject->domain, dom_other, &layer_masks,
> > fs_resolve_unix.fs);
> >
> > If the dom_other->usage reaches zero, then the domain could be
> > freed after the unix_state_unlock while we're checking access??
> >
> > (I guess I assumed the sock_hold on the @other would prevent the task
> > @other belongs to from being freed.)
> >
> > Would it be better to move the access check under the unix_state_lock or
> > to acquire another reference to the ruleset (or something else)?
>
> Because the unmask_scoped_access() only read a bounded array, it's
> simpler to unlock just after.
>
> The other alternatives would be to use an RCU lock or a new reference
> but I don't think it's worth it.
Thank you, Sebastian, Mickaël and Justin for spotting this!
I agree, holding the existing lock across the unmask_scoped_access()
call seems like the simplest solution. This function only walks a
previously loaded bounded memory structure, so it does not seem worth
switching to a different lock for that.
I'm changing my WIP for V7 to hold the unix_state_lock across that
function call.
–Günther
^ permalink raw reply
* Re: [PATCH] fs: allow vfs code to open an O_PATH file with negative dentry
From: Amir Goldstein @ 2026-03-20 12:15 UTC (permalink / raw)
To: Christian Brauner
Cc: Al Viro, Miklos Szeredi, Gao Xiang, linux-security-module,
selinux, linux-erofs, linux-fsdevel, linux-unionfs,
syzbot+f34aab278bf5d664e2be, Paul Moore
In-Reply-To: <CAOQ4uxinSLXKWjMgwyA2A_UU5e+6ZjbdsFUY-+f9DMfQcxH0qA@mail.gmail.com>
On Thu, Mar 19, 2026 at 7:30 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Thu, Mar 19, 2026 at 4:55 PM Paul Moore <paul@paul-moore.com> wrote:
> >
> > On Thu, Mar 19, 2026 at 10:50 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > On Thu, Mar 19, 2026 at 2:13 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > On Thu, Mar 19, 2026 at 01:46:16PM +0100, Amir Goldstein wrote:
> > > > > The fields f_mapping, f_wb_err, f_sb_err are irrelevant for O_PATH file.
> > > > > Skip setting them for O_PATH file, so that the O_PATH file could be
> > > > > opened with a negative dentry.
> > > > >
> > > > > This is not something that a user should be able to do, but vfs code,
> > > > > such as ovl_tmpfile() can use this to open a backing O_PATH tmpfile
> > > > > before instantiating the dentry.
> > > > >
> > > > > Reported-by: syzbot+f34aab278bf5d664e2be@syzkaller.appspotmail.com
> > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > > ---
> > > > >
> > > > > Christian,
> > > > >
> > > > > This patch fixes the syzbot report [1] that the
> > > > > backing_file_user_path_file() patch [2] introduces.
> > > > >
> > > > > This is not the only possible fix, but it is the cleanest one IMO.
> > > > > There is a small risk in introducing a state of an O_PATH file with
> > > > > NULL f_inode, but I (and the bots that I asked) did not find any
> > > > > obvious risk in this state.
> > > > >
> > > > > Note that specifically, the user path inode is accessed via d_inode()
> > > > > and not via file_inode(), which makes this safe for file_user_inode()
> > > > > callers.
> > > > >
> > > > > BTW, I missed this regression with the original patch because I
> > > > > only ran the quick overlayfs sanity test.
> > > > >
> > > > > Now I ran a full quick fstest cycle and verified that the O_TMPFILE
> > > > > test case is covered and that the bug is detected.
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Thanks,
> > > > > Amir.
> > > > >
> > > > > [1] https://syzkaller.appspot.com/bug?extid=f34aab278bf5d664e2be
> > > > > [2] https://lore.kernel.org/linux-fsdevel/20260318131258.1457101-1-amir73il@gmail.com/
> > > > >
> > > > > fs/open.c | 7 ++++---
> > > > > 1 file changed, 4 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/fs/open.c b/fs/open.c
> > > > > index 91f1139591abe..2004a8c0d9c97 100644
> > > > > --- a/fs/open.c
> > > > > +++ b/fs/open.c
> > > > > @@ -893,9 +893,6 @@ static int do_dentry_open(struct file *f,
> > > > >
> > > > > path_get(&f->f_path);
> > > > > f->f_inode = inode;
> > > > > - f->f_mapping = inode->i_mapping;
> > > > > - f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> > > > > - f->f_sb_err = file_sample_sb_err(f);
> > > > >
> > > > > if (unlikely(f->f_flags & O_PATH)) {
> > > > > f->f_mode = FMODE_PATH | FMODE_OPENED;
> > > > > @@ -904,6 +901,10 @@ static int do_dentry_open(struct file *f,
> > > > > return 0;
> > > > > }
> > > > >
> > > > > + f->f_mapping = inode->i_mapping;
> > > > > + f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> > > > > + f->f_sb_err = file_sample_sb_err(f);
> > > > > +
> > > > > if ((f->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) {
> > > > > i_readcount_inc(inode);
> > > > > } else if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) {
> > > >
> > > > I think this is really ugly and I'm really unhappy that we should adjust
> > > > initialization of generic vfs code for this. My preference is to push
> > > > the pain into the backing file stuff. And my ultimate preference is for
> > > > this backing file stuff to be removed again for a simple struct path.
> > > > We're working around some more fundamental cleanup here imho.
> > >
> > > Fair enough, we can rip the entire thing from vfs if you don't like it.
> > > The user path file can be opened and stored internally by selinux
> > > without adding all the associated risks in vfs.
> > >
> > > Paul,
> > >
> > > Please see compile tested code at:
> > > https://github.com/amir73il/linux/commits/user_path_file/
> >
> > No. Definitely no. Ignoring the fact that there is no reason we
> > should pushing this into the LSM, doing it in this way means it is
> > very likely that each LSM wanting to provide mmap/mprotect controls on
> > overlayfs will have to create a new O_PATH file. No.
> >
> > ... and let me preemptively comment that this doesn't belong in the
> > LSM framework either.
> >
> > As Christian already mentioned, this really needs to be addressed in
> > the backing file code, please do it there.
> >
>
> OK, will give it another try.
>
Christian,
I pushed another version of the syzbot ovl O_TMPFILE crash fix to:
https://github.com/amir73il/linux/commits/user_path_file/
In a nutshell, created a helper kernel_path_file_open()
instead of modifying do_dentry_open(), with a bit more magic in
ovl_tmpfile().
It's not super pretty, but at least it does not touch any non-backing_file
code paths, so maybe you will be ok with it.
Thanks,
Amir.
^ permalink raw reply
* [PATCH] smack: simplify write handlers of sysfs entries
From: Dmitry Antipov @ 2026-03-20 11:31 UTC (permalink / raw)
To: Casey Schaufler, Paul Moore, James Morris, Serge E. Hallyn,
Konstantin Andreev
Cc: linux-security-module, Dmitry Antipov
Use the convenient 'kstrto{u,s}32_from_user()' to simplify write
handlers of /smack/{doi,direct,mapped,logging,ptrace} sysfs entries.
Signed-off-by: Dmitry Antipov <dmantipov@yandex.ru>
---
security/smack/smackfs.c | 81 +++++++++++-----------------------------
1 file changed, 22 insertions(+), 59 deletions(-)
diff --git a/security/smack/smackfs.c b/security/smack/smackfs.c
index 6e62dcb36f74..f60d5469043e 100644
--- a/security/smack/smackfs.c
+++ b/security/smack/smackfs.c
@@ -1598,24 +1598,17 @@ static ssize_t smk_read_doi(struct file *filp, char __user *buf,
static ssize_t smk_write_doi(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
- char temp[80];
- unsigned long u;
+ int ret;
+ u32 u;
if (!smack_privileged(CAP_MAC_ADMIN))
return -EPERM;
- if (count >= sizeof(temp) || count == 0)
- return -EINVAL;
-
- if (copy_from_user(temp, buf, count) != 0)
- return -EFAULT;
-
- temp[count] = '\0';
+ ret = kstrtou32_from_user(buf, count, 10, &u);
+ if (unlikely(ret))
+ return ret;
- if (kstrtoul(temp, 10, &u))
- return -EINVAL;
-
- if (u == CIPSO_V4_DOI_UNKNOWN || u > U32_MAX)
+ if (u == CIPSO_V4_DOI_UNKNOWN)
return -EINVAL;
return smk_cipso_doi(u, GFP_KERNEL) ? : count;
@@ -1664,22 +1657,14 @@ static ssize_t smk_write_direct(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
struct smack_known *skp;
- char temp[80];
- int i;
+ int i, ret;
if (!smack_privileged(CAP_MAC_ADMIN))
return -EPERM;
- if (count >= sizeof(temp) || count == 0)
- return -EINVAL;
-
- if (copy_from_user(temp, buf, count) != 0)
- return -EFAULT;
-
- temp[count] = '\0';
-
- if (sscanf(temp, "%d", &i) != 1)
- return -EINVAL;
+ ret = kstrtos32_from_user(buf, count, 10, &i);
+ if (unlikely(ret))
+ return ret;
/*
* Don't do anything if the value hasn't actually changed.
@@ -1742,22 +1727,14 @@ static ssize_t smk_write_mapped(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
struct smack_known *skp;
- char temp[80];
- int i;
+ int i, ret;
if (!smack_privileged(CAP_MAC_ADMIN))
return -EPERM;
- if (count >= sizeof(temp) || count == 0)
- return -EINVAL;
-
- if (copy_from_user(temp, buf, count) != 0)
- return -EFAULT;
-
- temp[count] = '\0';
-
- if (sscanf(temp, "%d", &i) != 1)
- return -EINVAL;
+ ret = kstrtos32_from_user(buf, count, 10, &i);
+ if (unlikely(ret))
+ return ret;
/*
* Don't do anything if the value hasn't actually changed.
@@ -2179,22 +2156,15 @@ static ssize_t smk_read_logging(struct file *filp, char __user *buf,
static ssize_t smk_write_logging(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
- char temp[32];
- int i;
+ int i, ret;
if (!smack_privileged(CAP_MAC_ADMIN))
return -EPERM;
- if (count >= sizeof(temp) || count == 0)
- return -EINVAL;
-
- if (copy_from_user(temp, buf, count) != 0)
- return -EFAULT;
+ ret = kstrtos32_from_user(buf, count, 10, &i);
+ if (unlikely(ret))
+ return ret;
- temp[count] = '\0';
-
- if (sscanf(temp, "%d", &i) != 1)
- return -EINVAL;
if (i < 0 || i > 3)
return -EINVAL;
log_policy = i;
@@ -2838,22 +2808,15 @@ static ssize_t smk_read_ptrace(struct file *filp, char __user *buf,
static ssize_t smk_write_ptrace(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
- char temp[32];
- int i;
+ int i, ret;
if (!smack_privileged(CAP_MAC_ADMIN))
return -EPERM;
- if (*ppos != 0 || count >= sizeof(temp) || count == 0)
- return -EINVAL;
-
- if (copy_from_user(temp, buf, count) != 0)
- return -EFAULT;
+ ret = kstrtos32_from_user(buf, count, 10, &i);
+ if (unlikely(ret))
+ return ret;
- temp[count] = '\0';
-
- if (sscanf(temp, "%d", &i) != 1)
- return -EINVAL;
if (i < SMACK_PTRACE_DEFAULT || i > SMACK_PTRACE_MAX)
return -EINVAL;
smack_ptrace_rule = i;
--
2.53.0
^ permalink raw reply related
* Re: [PATCH v6 5/9] landlock/selftests: Test LANDLOCK_ACCESS_FS_RESOLVE_UNIX
From: Günther Noack @ 2026-03-20 10:51 UTC (permalink / raw)
To: Mickaël Salaün
Cc: John Johansen, Justin Suess, Tingmao Wang, linux-security-module,
Samasth Norway Ananda, Matthieu Buffet, Mikhail Ivanov,
konstantin.meskhidze, Demi Marie Obenour, Alyssa Ross, Jann Horn,
Tahera Fahimi, Sebastian Andrzej Siewior, Kuniyuki Iwashima
In-Reply-To: <20260318.eghahf0Chiez@digikod.net>
On Wed, Mar 18, 2026 at 05:53:15PM +0100, Mickaël Salaün wrote:
> The subject's prefix is swapped, it should be "selftests/landlock".
Thanks, done!
–Günther
^ permalink raw reply
* Re: [PATCH v6 2/9] landlock: use mem_is_zero() in is_layer_masks_allowed()
From: Günther Noack @ 2026-03-20 10:50 UTC (permalink / raw)
To: Mickaël Salaün
Cc: John Johansen, linux-security-module, Tingmao Wang, Justin Suess,
Samasth Norway Ananda, Matthieu Buffet, Mikhail Ivanov,
konstantin.meskhidze, Demi Marie Obenour, Alyssa Ross, Jann Horn,
Tahera Fahimi, Sebastian Andrzej Siewior, Kuniyuki Iwashima
In-Reply-To: <20260318.ePee0acaeng8@digikod.net>
On Wed, Mar 18, 2026 at 05:52:10PM +0100, Mickaël Salaün wrote:
> Subject should be "landlock: Use..."
Thanks, done.
–Günther
^ permalink raw reply
* Re: [PATCH RESEND] lsm: Fix the crash issue in xfrm_decode_session
From: Feng Yang @ 2026-03-20 3:24 UTC (permalink / raw)
To: stephen.smalley.work, casey, jmorris, paul, serge
Cc: linux-kernel, linux-security-module, yangfeng59949
In-Reply-To: <CAEjxPJ5aA01in+Z1yLF1cwe-3uqL_E8SKGK4J294D5eRG5__5Q@mail.gmail.com>
On Thu, 19 Mar 2026 14:22:06 -0400, Stephen Smalley wrote:
> On Thu, Mar 19, 2026 at 1:52 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >
> > On 3/18/2026 7:22 PM, Feng Yang wrote:
> > > On Wed, 18 Mar 2026 10:09:47 -0700, Casey Schaufler wrote:
> > >> On 3/17/2026 11:19 PM, Feng Yang wrote:
> > >>> From: Feng Yang <yangfeng@kylinos.cn>
> > >>>
> > >>> After hooking the following BPF program:
> > >>> SEC("lsm/xfrm_decode_session")
> > >>> int BPF_PROG(lsm_hook_xfrm_decode_session, struct sk_buff *skb, u32 *secid, int ckall)
> > >>> {
> > >>> return 1; // Any non-zero value
> > >>> }
> > >>> Subsequent packet transmission triggers will cause a kernel panic:
> > >> LSM hooks that use or provide secids cannot be stacked. That is,
> > >> you can't provide a BPF LSM hook and an SELinux LSM hook and expect
> > >> correct behavior. Your proposed "fix" removes a legitimate check.
> > > I'm very sorry, I didn't quite understand what you meant.
> > >
> > > Maybe my commit message wasn't clear. I only used a BPF LSM hook without SELinux stacking enabled.
> >
> > Do i understand correctly that you do not have SELinux enabled?
> >
> > > Therefore, is it the expected behavior that simply using
> > > SEC("lsm/xfrm_decode_session")
> > > int BPF_PROG(lsm_hook_xfrm_decode_session, struct sk_buff *skb, u32 *secid, int ckall) {
> > > return -1;
> > > }
> > > would cause a kernel panic?
> >
> > Yes. As the infrastructure is written, any return other than 0 is erroneous.
> > You will need to query the SELinux developers about why they decided to
> > implement the xfrm handling in this way.
>
> Looks like this BUG_ON() has just been preserved since first being
> introduced by:
> https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=beb8d13bed80f8388f1a9a107d07ddd342e627e8
Yes, there is a comment in https://lore.kernel.org/all/Pine.LNX.4.64.0607122149070.573@d.namei/:
+static inline void security_xfrm_skb_secid(struct sk_buff *skb, u32 *secid)
{
- return security_ops->xfrm_decode_session(skb, fl);
+ BUG_ON(security_ops->xfrm_decode_session(skb, secid, 0));
BUG_ON looks wrong here, in that you don't know why the LSM returned an
error, and why should the box panic at this point at all?
But it seems no one has answered this question before.
> I think we can remove it at this point - it evidently doesn't ever
> trigger for SELinux or we would have seen it by now and it is
> definitely unsafe for BPF LSM.
Yes, the call in the SELinux module is as follows.
void security_skb_classify_flow(struct sk_buff *skb, struct flowi_common *flic)
{
int rc = call_int_hook(xfrm_decode_session, skb, &flic->flowic_secid,
0);
BUG_ON(rc);
}
int selinux_xfrm_decode_session(struct sk_buff *skb, u32 *sid, int ckall)
{
if (skb == NULL) {
*sid = SECSID_NULL;
return 0;
}
return selinux_xfrm_skb_sid_ingress(skb, sid, ckall);
}
static int selinux_xfrm_skb_sid_ingress(struct sk_buff *skb,
u32 *sid, int ckall)
{
u32 sid_session = SECSID_NULL;
struct sec_path *sp = skb_sec_path(skb);
if (sp) {
int i;
for (i = sp->len - 1; i >= 0; i--) {
struct xfrm_state *x = sp->xvec[i];
if (selinux_authorizable_xfrm(x)) {
struct xfrm_sec_ctx *ctx = x->security;
if (sid_session == SECSID_NULL) {
sid_session = ctx->ctx_sid;
if (!ckall)
goto out;
} else if (sid_session != ctx->ctx_sid) {
*sid = SECSID_NULL;
return -EINVAL;
}
}
}
}
out:
*sid = sid_session;
return 0;
}
Since ckall is 0, selinux_xfrm_skb_sid_ingress will only return 0.
Therefore, the selinux_xfrm_decode_session function will only return 0 when ckall is 0, and BUG_ON will never be triggered.
However, BPF can be exploited to cause a system crash.
^ permalink raw reply
* Re: [PATCH] lsm: Fix the crash issue in xfrm_decode_session
From: Feng Yang @ 2026-03-20 3:20 UTC (permalink / raw)
To: stephen.smalley.work, casey, jmorris; +Cc: linux-security-module, yangfeng59949
In-Reply-To: <CAEjxPJ5aA01in+Z1yLF1cwe-3uqL_E8SKGK4J294D5eRG5__5Q@mail.gmail.com>
On Thu, 19 Mar 2026 14:22:06 -0400, Stephen Smalley wrote:
> On Thu, Mar 19, 2026 at 1:52 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
> >
> > On 3/18/2026 7:22 PM, Feng Yang wrote:
> > > On Wed, 18 Mar 2026 10:09:47 -0700, Casey Schaufler wrote:
> > >> On 3/17/2026 11:19 PM, Feng Yang wrote:
> > >>> From: Feng Yang <yangfeng@kylinos.cn>
> > >>>
> > >>> After hooking the following BPF program:
> > >>> SEC("lsm/xfrm_decode_session")
> > >>> int BPF_PROG(lsm_hook_xfrm_decode_session, struct sk_buff *skb, u32 *secid, int ckall)
> > >>> {
> > >>> return 1; // Any non-zero value
> > >>> }
> > >>> Subsequent packet transmission triggers will cause a kernel panic:
> > >> LSM hooks that use or provide secids cannot be stacked. That is,
> > >> you can't provide a BPF LSM hook and an SELinux LSM hook and expect
> > >> correct behavior. Your proposed "fix" removes a legitimate check.
> > > I'm very sorry, I didn't quite understand what you meant.
> > >
> > > Maybe my commit message wasn't clear. I only used a BPF LSM hook without SELinux stacking enabled.
> >
> > Do i understand correctly that you do not have SELinux enabled?
> >
> > > Therefore, is it the expected behavior that simply using
> > > SEC("lsm/xfrm_decode_session")
> > > int BPF_PROG(lsm_hook_xfrm_decode_session, struct sk_buff *skb, u32 *secid, int ckall) {
> > > return -1;
> > > }
> > > would cause a kernel panic?
> >
> > Yes. As the infrastructure is written, any return other than 0 is erroneous.
> > You will need to query the SELinux developers about why they decided to
> > implement the xfrm handling in this way.
>
> Looks like this BUG_ON() has just been preserved since first being
> introduced by:
> https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=beb8d13bed80f8388f1a9a107d07ddd342e627e8
Yes, there is a comment in https://lore.kernel.org/all/Pine.LNX.4.64.0607122149070.573@d.namei/:
+static inline void security_xfrm_skb_secid(struct sk_buff *skb, u32 *secid)
{
- return security_ops->xfrm_decode_session(skb, fl);
+ BUG_ON(security_ops->xfrm_decode_session(skb, secid, 0));
BUG_ON looks wrong here, in that you don't know why the LSM returned an
error, and why should the box panic at this point at all?
But it seems no one has answered this question before.
> I think we can remove it at this point - it evidently doesn't ever
> trigger for SELinux or we would have seen it by now and it is
> definitely unsafe for BPF LSM.
Yes, the call in the SELinux module is as follows.
void security_skb_classify_flow(struct sk_buff *skb, struct flowi_common *flic)
{
int rc = call_int_hook(xfrm_decode_session, skb, &flic->flowic_secid,
0);
BUG_ON(rc);
}
int selinux_xfrm_decode_session(struct sk_buff *skb, u32 *sid, int ckall)
{
if (skb == NULL) {
*sid = SECSID_NULL;
return 0;
}
return selinux_xfrm_skb_sid_ingress(skb, sid, ckall);
}
static int selinux_xfrm_skb_sid_ingress(struct sk_buff *skb,
u32 *sid, int ckall)
{
u32 sid_session = SECSID_NULL;
struct sec_path *sp = skb_sec_path(skb);
if (sp) {
int i;
for (i = sp->len - 1; i >= 0; i--) {
struct xfrm_state *x = sp->xvec[i];
if (selinux_authorizable_xfrm(x)) {
struct xfrm_sec_ctx *ctx = x->security;
if (sid_session == SECSID_NULL) {
sid_session = ctx->ctx_sid;
if (!ckall)
goto out;
} else if (sid_session != ctx->ctx_sid) {
*sid = SECSID_NULL;
return -EINVAL;
}
}
}
}
out:
*sid = sid_session;
return 0;
}
Since ckall is 0, selinux_xfrm_skb_sid_ingress will only return 0.
Therefore, the selinux_xfrm_decode_session function will only return 0 when ckall is 0, and BUG_ON will never be triggered.
However, BPF can be exploited to cause a system crash.
^ permalink raw reply
* Re: [PATCH] lsm: Fix the crash issue in xfrm_decode_session
From: Feng Yang @ 2026-03-20 3:03 UTC (permalink / raw)
To: casey
Cc: jmorris, linux-kernel, linux-security-module, paul, serge,
yangfeng59949
In-Reply-To: <80ff003b-51fd-4ec8-9505-dfc1c5890ced@schaufler-ca.com>
On Thu, 19 Mar 2026 10:51:24 -0700, Casey Schaufler wrote:
> On 3/18/2026 7:22 PM, Feng Yang wrote:
> > On Wed, 18 Mar 2026 10:09:47 -0700, Casey Schaufler wrote:
> >> On 3/17/2026 11:19 PM, Feng Yang wrote:
> >>> From: Feng Yang <yangfeng@kylinos.cn>
> >>>
> >>> After hooking the following BPF program:
> >>> SEC("lsm/xfrm_decode_session")
> >>> int BPF_PROG(lsm_hook_xfrm_decode_session, struct sk_buff *skb, u32 *secid, int ckall)
> >>> {
> >>> return 1; // Any non-zero value
> >>> }
> >>> Subsequent packet transmission triggers will cause a kernel panic:
> >> LSM hooks that use or provide secids cannot be stacked. That is,
> >> you can't provide a BPF LSM hook and an SELinux LSM hook and expect
> >> correct behavior. Your proposed "fix" removes a legitimate check.
> > I'm very sorry, I didn't quite understand what you meant.
> >
> > Maybe my commit message wasn't clear. I only used a BPF LSM hook without SELinux stacking enabled.
>
> Do i understand correctly that you do not have SELinux enabled?
Yes.
>
> > Therefore, is it the expected behavior that simply using
> > SEC("lsm/xfrm_decode_session")
> > int BPF_PROG(lsm_hook_xfrm_decode_session, struct sk_buff *skb, u32 *secid, int ckall) {
> > return -1;
> > }
> > would cause a kernel panic?
>
> Yes. As the infrastructure is written, any return other than 0 is erroneous.
> You will need to query the SELinux developers about why they decided to
> implement the xfrm handling in this way.
>
> > If not, and if the BUG_ON check is still necessary,
> > then does it mean we need to modify the return value validation logic in the BPF
> > verifier to ensure that only BPF programs returning 0 are accepted for this hook?
> >
> > Thanks.
> >
> >
^ permalink raw reply
* Re: [PATCH v3 3/3] ima: Add support for staging measurements for deletion
From: steven chen @ 2026-03-19 21:31 UTC (permalink / raw)
To: Mimi Zohar, Roberto Sassu, corbet, skhan, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu, steven chen
In-Reply-To: <587e11bf4d29552bbbfc029f716146e8ebfca1eb.camel@linux.ibm.com>
On 3/17/2026 2:03 PM, Mimi Zohar wrote:
> Hi Roberto,
>
> On Wed, 2026-03-11 at 18:19 +0100, Roberto Sassu wrote:
>> From: Roberto Sassu <roberto.sassu@huawei.com>
>>
>> Introduce the ability of staging the IMA measurement list for deletion.
>> Staging means moving the current content of the measurement list to a
>> separate location, and allowing users to read and delete it. This causes
>> the measurement list to be atomically truncated before new measurements can
>> be added.
> I really like this design of atomically moving and subsequently deleting the
> measurement list. However this is a solution, not the motivation for the patch.
> Please include the motivation for the patch, before describing the solution.
>
>> Staging can be done only once at a time. In the event of kexec(),
>> staging is reverted and staged entries will be carried over to the new
>> kernel.
>> Staged measurements can be deleted entirely, or partially, with the
>> non-deleted ones added back to the IMA measurements list.
> This patch description is really long, which is an indication that the patch
> needs to be split up. Adding support for partially deleting the measurement
> list records, by prepending the remaining measurement records, should be a
> separate patch.
>
>> This allows the
>> remote attestation agents to easily separate the measurements that where
>> verified (staged and deleted) from those that weren't due to the race
>> between taking a TPM quote and reading the measurements list.
>>
>> User space is responsible to concatenate the staged IMA measurements list
>> portions (excluding the measurements added back to the IMA measurements
>> list) following the temporal order in which the operations were done,
>> together with the current measurement list. Then, it can send the collected
>> data to the remote verifiers.
> This belongs in a Documentation patch.
>
>> The benefit of staging and deleting is the ability to free precious kernel
>> memory,
> This is the motivation for the patch.
>
>> in exchange of delegating user space to reconstruct the full
>> measurement list from the chunks. No trust needs to be given to user space,
>> since the integrity of the measurement list is protected by the TPM.
> Agreed the measurement list, itself, is protected by the TPM. However, relying
> on userspace to reassemble the chunks is another concern. Support for staging
> and deleting the measurement list should be configurable. Defining a Kconfig
> should be part of this initial patch.
>
>> By default, staging the measurements list does not alter the hash table.
>> When staging and deleting are done, IMA is still able to detect collisions
>> on the staged and later deleted measurement entries, by keeping the entry
>> digests (only template data are freed).
>>
>> However, since during the measurements list serialization only the SHA1
>> digest is passed, and since there are no template data to recalculate the
>> other digests from, the hash table is currently not populated with digests
>> from staged/deleted entries after kexec().
>>
>> Introduce the new kernel option ima_flush_htable to decide whether or not
>> the digests of staged measurement entries are flushed from the hash table,
>> when they are deleted. Flushing the hash table is supported only when
>> deleting all the staged measurements, since in that case the old hash table
>> can be quickly swapped with a blank one (otherwise entries would have to be
>> removed one by one for partial deletion).
> Allowing the hash table to be deleted would be an example of another patch.
>
>> Then, introduce ascii_runtime_measurements_<algo>_staged and
>> binary_runtime_measurements_<algo>_staged interfaces to stage and delete
>> the measurements. Use 'echo A > <IMA interface>' and
>> 'echo D > <IMA interface>' to respectively stage and delete the entire
>> measurements list. Use 'echo N > <IMA interface>', with N between 1 and
>> ULONG_MAX - 1, to delete the selected staged portion of the measurements
>> list.
>>
>> The ima_measure_users counter (protected by the ima_measure_mutex mutex)
>> has been introduced to protect access to the measurements list and the
>> staged part. The open method of all the measurement interfaces has been
>> extended to allow only one writer at a time or, in alternative, multiple
>> readers. The write permission is used to stage and delete the measurements,
>> the read permission to read them. Write requires also the CAP_SYS_ADMIN
>> capability.
> Yes, this is part of the initial patch that adds support for staging the
> measurement list.
>
>> Finally, introduce the binary_lists enum and make binary_runtime_size
>> and ima_num_entries as arrays, to keep track of their values for the
>> current IMA measurements list (BINARY), current list plus staged
>> measurements (BINARY_STAGED) and the cumulative list since IMA
>> initialization (BINARY_FULL).
>>
>> Use BINARY in ima_show_measurements_count(), BINARY_STAGED in
>> ima_add_kexec_buffer() and BINARY_FULL in ima_measure_kexec_event().
>>
>> It should be noted that the BINARY_FULL counter is not passed through
>> kexec. Thus, the number of entries included in the kexec critical data
>> records refers to the entries since the previous kexec records.
>>
>> Note: This code derives from the Alt-IMA Huawei project, whose license is
>> GPL-2.0 OR MIT.
>>
>> Link: https://github.com/linux-integrity/linux/issues/1
>> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
> The design looks good. As I mentioned above, this patch description is quite
> long, which is an indication that the patch needs to be split up. One method of
> breaking it up would be:
>
> - (Basic) support for staging measurements for deletion (based on a Kconfig)
> - Support for removing the hash table
Great work on performance improvement and hash table redesign.
If update the following "Trim N" method patch with your current patch
lock time performance
improvement plus the hash table change, "Trim N" method can do the same
kernel
measurement list lock time as staged method do, right?
https://lore.kernel.org/linux-integrity/20260205235849.7086-1-chenste@linux.microsoft.com/
> - Support for deleting N measurement records (and pre-pending the remaining
> measurement records)
Is there any problem to bring work of "stage" step together to the
deletion step?
"Trim N" method does everything that "staged" method can do, right?
what's the "stage"
method can do but "trim N" method can't do?
in user space, if in "staged" state, no other user space agent can
access the IMA measure list, right?
Could you explain the benefit of bringing the "stage" step?
Thanks,
Steven
> - Adding documentation
>
> thanks,
>
> Mimi
^ permalink raw reply
* Re: [PATCH] fs: allow vfs code to open an O_PATH file with negative dentry
From: Paul Moore @ 2026-03-19 18:44 UTC (permalink / raw)
To: Amir Goldstein
Cc: Christian Brauner, Al Viro, Miklos Szeredi, Gao Xiang,
linux-security-module, selinux, linux-erofs, linux-fsdevel,
linux-unionfs, syzbot+f34aab278bf5d664e2be
In-Reply-To: <CAOQ4uxinSLXKWjMgwyA2A_UU5e+6ZjbdsFUY-+f9DMfQcxH0qA@mail.gmail.com>
On Thu, Mar 19, 2026 at 2:30 PM Amir Goldstein <amir73il@gmail.com> wrote:
> On Thu, Mar 19, 2026 at 4:55 PM Paul Moore <paul@paul-moore.com> wrote:
> > On Thu, Mar 19, 2026 at 10:50 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > > On Thu, Mar 19, 2026 at 2:13 PM Christian Brauner <brauner@kernel.org> wrote:
> > > > On Thu, Mar 19, 2026 at 01:46:16PM +0100, Amir Goldstein wrote:
> > > > > The fields f_mapping, f_wb_err, f_sb_err are irrelevant for O_PATH file.
> > > > > Skip setting them for O_PATH file, so that the O_PATH file could be
> > > > > opened with a negative dentry.
> > > > >
> > > > > This is not something that a user should be able to do, but vfs code,
> > > > > such as ovl_tmpfile() can use this to open a backing O_PATH tmpfile
> > > > > before instantiating the dentry.
> > > > >
> > > > > Reported-by: syzbot+f34aab278bf5d664e2be@syzkaller.appspotmail.com
> > > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > > ---
> > > > >
> > > > > Christian,
> > > > >
> > > > > This patch fixes the syzbot report [1] that the
> > > > > backing_file_user_path_file() patch [2] introduces.
> > > > >
> > > > > This is not the only possible fix, but it is the cleanest one IMO.
> > > > > There is a small risk in introducing a state of an O_PATH file with
> > > > > NULL f_inode, but I (and the bots that I asked) did not find any
> > > > > obvious risk in this state.
> > > > >
> > > > > Note that specifically, the user path inode is accessed via d_inode()
> > > > > and not via file_inode(), which makes this safe for file_user_inode()
> > > > > callers.
> > > > >
> > > > > BTW, I missed this regression with the original patch because I
> > > > > only ran the quick overlayfs sanity test.
> > > > >
> > > > > Now I ran a full quick fstest cycle and verified that the O_TMPFILE
> > > > > test case is covered and that the bug is detected.
> > > > >
> > > > > WDYT?
> > > > >
> > > > > Thanks,
> > > > > Amir.
> > > > >
> > > > > [1] https://syzkaller.appspot.com/bug?extid=f34aab278bf5d664e2be
> > > > > [2] https://lore.kernel.org/linux-fsdevel/20260318131258.1457101-1-amir73il@gmail.com/
> > > > >
> > > > > fs/open.c | 7 ++++---
> > > > > 1 file changed, 4 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/fs/open.c b/fs/open.c
> > > > > index 91f1139591abe..2004a8c0d9c97 100644
> > > > > --- a/fs/open.c
> > > > > +++ b/fs/open.c
> > > > > @@ -893,9 +893,6 @@ static int do_dentry_open(struct file *f,
> > > > >
> > > > > path_get(&f->f_path);
> > > > > f->f_inode = inode;
> > > > > - f->f_mapping = inode->i_mapping;
> > > > > - f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> > > > > - f->f_sb_err = file_sample_sb_err(f);
> > > > >
> > > > > if (unlikely(f->f_flags & O_PATH)) {
> > > > > f->f_mode = FMODE_PATH | FMODE_OPENED;
> > > > > @@ -904,6 +901,10 @@ static int do_dentry_open(struct file *f,
> > > > > return 0;
> > > > > }
> > > > >
> > > > > + f->f_mapping = inode->i_mapping;
> > > > > + f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> > > > > + f->f_sb_err = file_sample_sb_err(f);
> > > > > +
> > > > > if ((f->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) {
> > > > > i_readcount_inc(inode);
> > > > > } else if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) {
> > > >
> > > > I think this is really ugly and I'm really unhappy that we should adjust
> > > > initialization of generic vfs code for this. My preference is to push
> > > > the pain into the backing file stuff. And my ultimate preference is for
> > > > this backing file stuff to be removed again for a simple struct path.
> > > > We're working around some more fundamental cleanup here imho.
> > >
> > > Fair enough, we can rip the entire thing from vfs if you don't like it.
> > > The user path file can be opened and stored internally by selinux
> > > without adding all the associated risks in vfs.
> > >
> > > Paul,
> > >
> > > Please see compile tested code at:
> > > https://github.com/amir73il/linux/commits/user_path_file/
> >
> > No. Definitely no. Ignoring the fact that there is no reason we
> > should pushing this into the LSM, doing it in this way means it is
> > very likely that each LSM wanting to provide mmap/mprotect controls on
> > overlayfs will have to create a new O_PATH file. No.
> >
> > ... and let me preemptively comment that this doesn't belong in the
> > LSM framework either.
> >
> > As Christian already mentioned, this really needs to be addressed in
> > the backing file code, please do it there.
>
> OK, will give it another try.
Thanks, please let me know if there is anything I can do to help, e.g.
testing, etc.
--
paul-moore.com
^ permalink raw reply
* Re: [PATCH] fs: allow vfs code to open an O_PATH file with negative dentry
From: Amir Goldstein @ 2026-03-19 18:30 UTC (permalink / raw)
To: Paul Moore
Cc: Christian Brauner, Al Viro, Miklos Szeredi, Gao Xiang,
linux-security-module, selinux, linux-erofs, linux-fsdevel,
linux-unionfs, syzbot+f34aab278bf5d664e2be
In-Reply-To: <CAHC9VhQDXHh_=X-OKD9oN8fapVXBwLakNp4rEHLUL0cQ1RSteA@mail.gmail.com>
On Thu, Mar 19, 2026 at 4:55 PM Paul Moore <paul@paul-moore.com> wrote:
>
> On Thu, Mar 19, 2026 at 10:50 AM Amir Goldstein <amir73il@gmail.com> wrote:
> > On Thu, Mar 19, 2026 at 2:13 PM Christian Brauner <brauner@kernel.org> wrote:
> > > On Thu, Mar 19, 2026 at 01:46:16PM +0100, Amir Goldstein wrote:
> > > > The fields f_mapping, f_wb_err, f_sb_err are irrelevant for O_PATH file.
> > > > Skip setting them for O_PATH file, so that the O_PATH file could be
> > > > opened with a negative dentry.
> > > >
> > > > This is not something that a user should be able to do, but vfs code,
> > > > such as ovl_tmpfile() can use this to open a backing O_PATH tmpfile
> > > > before instantiating the dentry.
> > > >
> > > > Reported-by: syzbot+f34aab278bf5d664e2be@syzkaller.appspotmail.com
> > > > Signed-off-by: Amir Goldstein <amir73il@gmail.com>
> > > > ---
> > > >
> > > > Christian,
> > > >
> > > > This patch fixes the syzbot report [1] that the
> > > > backing_file_user_path_file() patch [2] introduces.
> > > >
> > > > This is not the only possible fix, but it is the cleanest one IMO.
> > > > There is a small risk in introducing a state of an O_PATH file with
> > > > NULL f_inode, but I (and the bots that I asked) did not find any
> > > > obvious risk in this state.
> > > >
> > > > Note that specifically, the user path inode is accessed via d_inode()
> > > > and not via file_inode(), which makes this safe for file_user_inode()
> > > > callers.
> > > >
> > > > BTW, I missed this regression with the original patch because I
> > > > only ran the quick overlayfs sanity test.
> > > >
> > > > Now I ran a full quick fstest cycle and verified that the O_TMPFILE
> > > > test case is covered and that the bug is detected.
> > > >
> > > > WDYT?
> > > >
> > > > Thanks,
> > > > Amir.
> > > >
> > > > [1] https://syzkaller.appspot.com/bug?extid=f34aab278bf5d664e2be
> > > > [2] https://lore.kernel.org/linux-fsdevel/20260318131258.1457101-1-amir73il@gmail.com/
> > > >
> > > > fs/open.c | 7 ++++---
> > > > 1 file changed, 4 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/fs/open.c b/fs/open.c
> > > > index 91f1139591abe..2004a8c0d9c97 100644
> > > > --- a/fs/open.c
> > > > +++ b/fs/open.c
> > > > @@ -893,9 +893,6 @@ static int do_dentry_open(struct file *f,
> > > >
> > > > path_get(&f->f_path);
> > > > f->f_inode = inode;
> > > > - f->f_mapping = inode->i_mapping;
> > > > - f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> > > > - f->f_sb_err = file_sample_sb_err(f);
> > > >
> > > > if (unlikely(f->f_flags & O_PATH)) {
> > > > f->f_mode = FMODE_PATH | FMODE_OPENED;
> > > > @@ -904,6 +901,10 @@ static int do_dentry_open(struct file *f,
> > > > return 0;
> > > > }
> > > >
> > > > + f->f_mapping = inode->i_mapping;
> > > > + f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> > > > + f->f_sb_err = file_sample_sb_err(f);
> > > > +
> > > > if ((f->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ) {
> > > > i_readcount_inc(inode);
> > > > } else if (f->f_mode & FMODE_WRITE && !special_file(inode->i_mode)) {
> > >
> > > I think this is really ugly and I'm really unhappy that we should adjust
> > > initialization of generic vfs code for this. My preference is to push
> > > the pain into the backing file stuff. And my ultimate preference is for
> > > this backing file stuff to be removed again for a simple struct path.
> > > We're working around some more fundamental cleanup here imho.
> >
> > Fair enough, we can rip the entire thing from vfs if you don't like it.
> > The user path file can be opened and stored internally by selinux
> > without adding all the associated risks in vfs.
> >
> > Paul,
> >
> > Please see compile tested code at:
> > https://github.com/amir73il/linux/commits/user_path_file/
>
> No. Definitely no. Ignoring the fact that there is no reason we
> should pushing this into the LSM, doing it in this way means it is
> very likely that each LSM wanting to provide mmap/mprotect controls on
> overlayfs will have to create a new O_PATH file. No.
>
> ... and let me preemptively comment that this doesn't belong in the
> LSM framework either.
>
> As Christian already mentioned, this really needs to be addressed in
> the backing file code, please do it there.
>
OK, will give it another try.
Thanks,
Amir.
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox