[RFC] Landlock: mutable domains (and supervisor notification uAPI options)

* [RFC] Landlock: mutable domains (and supervisor notification uAPI options)
@ 2026-02-15  2:54 Tingmao Wang
  2026-02-15 21:23 ` Justin Suess
  2026-02-22 18:04 ` Tingmao Wang
  0 siblings, 2 replies; 6+ messages in thread
From: Tingmao Wang @ 2026-02-15  2:54 UTC
  To: Günther Noack, Mickaël Salaün
  Cc: Justin Suess, Amir Goldstein, Jan Kara, Song Liu, Tetsuo Handa,
	Jann Horn, linux-security-module

Hi,

Recently I have been continuing work on the previously proposed Landlock
supervise feature (context below).  While I do have some rough PoCs, and
I'm aware that sometimes code is better than talk, because of the amount
of work involved, I would like to get some early feedback on the design
before continuing.

Scrappy demo (just 2-3 min screencasts):

- user-space implemented "permissive mode":
    https://fileshare.maowtm.org/landlock-20260214/demo.mp4
- mutable domains based on a reloadable config file:
    https://fileshare.maowtm.org/landlock-20260213/demo.mp4

While I would be glad to receive reviews from anyone (and I've added
people who have replied to the previous RFC in CC), Günther, when you are
not too busy, can you kindly give this a review?  A lot of this has
already been discussed with Mickaël, in fact a large part of this design
was from his suggestions.  I apologize in advance for the length of this
email - please feel free to respond to any part of it, and whenever you
have time to.

The PoC code used in the above videos is largely generated, somewhat
buggy, and unreviewed, but it is available:

- mutable domains:
    https://github.com/micromaomao/linux-dev/pull/26/changes
- supervisor notification:
    https://github.com/micromaomao/linux-dev/pull/27/changes

The motivations listed in [1] are still relevant, and to add to that, here
are some additional examples of things we can do with the supervisor
feature (all from unprivileged applications):

- Implementing a version of StemJail [2] which does not rely on bind
  mounts and LD_PRELOAD (for the notification part, not for access
  control).  Or in fact, any other uses of LD_PRELOAD for the purpose of
  finding out what files are accessed.

- For island [3], some sort of denial logging tied to the context,
  integrated in the tool itself (rather than through kernel audit) and
  live config reload.

- Use in a non-security related context, such as automated build
  dependency tracking.

[1]: https://lore.kernel.org/all/cover.1741047969.git.m@maowtm.org/
[2]: https://github.com/stemjail/stemjail
[3]: https://github.com/landlock-lsm/island

Background
----------

A while ago I sent a "Landlock supervise" RFC patch series [1], in which I
proposed to extend Landlock with additional functionality to support
"interactive" rule enforcement.  In discussion with Mickaël, we decided to
split this work into 3 stages:  quiet flag, mutable domains, and finally
supervisor notification.  Relevant discussions are at [4] and in replies
to [1].

The patch for quiet flag [5] has gone through multiple review iterations
already.  It is useful on its own, but it was also motivated by the
eventual use in controlling supervisor notification.

The next stage is to introduce "mutable domains".  The motivation for this
is twofold:

1. This allows the supervisor to allow access to (large) file hierarchies
   without needing to be woken up again for each access.
2. Because we cannot block within security_path_mknod and other
   directory-modification related hooks [6], the proposal was to return
   immediately from those hooks after queuing the supervisor notification,
   then wait in a separate task_work.  This however means that we cannot
   directly "allow" access (and even if we can, it may introduce TOCTOU
   problems).  In order to allow access to requested files, the supervisor
   has to add additional rules to the (now mutable) domain which will
   allow the required access.

[1]: https://lore.kernel.org/all/cover.1741047969.git.m@maowtm.org/
[4]: https://github.com/landlock-lsm/linux/issues/44
[5]: https://lore.kernel.org/all/cover.1766330134.git.m@maowtm.org/
[6]: https://lore.kernel.org/all/20250311.Ti7bi9ahshuu@digikod.net/

Proposed changes
----------------

This patchset introduces the concept of "supervisor" and "supervisee"
rulesets (alternative names for this are "static"/"dynamic",
"mutable"/"immutable" etc), which are Landlock rulesets that are joined
together when enforced.  The supervisee ruleset can be thought of as the
"static" part of a domain, and the supervisor ruleset can be thought of as
the "dynamic" part.  The two rulesets can have different rules and access
rights for individual rules, but they internally have the same sets of
handled access and scope bits.  When an access request is evaluated for
processes in such domains, the access is allowed if, for each layer,
either the supervisee or the supervisor ruleset of that domain allows the
access.

A Landlock supervisor will first create the supervisor ruleset, which
internally creates a ref-counted landlock_supervisor which the unmerged
(and in fact, unmergeable, to prevent accidental misuse) landlock_ruleset
will point to.  Through a new ioctl, the user can get a supervisee ruleset
with the attached supervisor (this relationship does not necessarily have
to be 1-1), which can then be passed to landlock_restrict_self() by a
child process.  The supervisor can also at any time (before the ioctl,
before the landlock_restrict_self() call, or after it) modify the
supervisor ruleset to add or remove (via a new "intersect" flag) rules or
change access rights, and commit those changes through a flag passed to
landlock_add_rule() (although maybe this would be better done as an
ioctl() on the supervisor?), after which the changes start affecting the
child.

The supervisee ruleset is immutable; it is basically the current
landlock_ruleset, and internally we continue to "fold" rules from parents
into the child's rbtree.  However, since all ancestor supervisor rulesets
are mutable, we cannot simply fold the supervisor rules from parents into
their children at enforce time, as they may be removed or changed later at
a parent layer.  Therefore, if an access is denied by some layer's
supervisee ruleset (which is quick to check thanks to the "folding" of the
supervisee rules), Landlock will then have to check that the access is
allowed by the supervisor rulesets of all the denying layers. (The access
is also denied if any of the denying layers does not have a supervisor
ruleset; in this case we don't even have to check the other supervisor
rulesets.)
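
To make the rule above concrete, here is a small user-space model of the
per-layer decision.  It is only a sketch: the struct and function names
are made up for illustration and do not exist in the kernel, the rights
found for the object are assumed to be already collected into one mask
per layer, and it assumes that the supervisee and supervisor rights of a
layer combine bit by bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model types; not kernel structures. */
typedef uint64_t access_mask_t;

struct model_layer {
    /* Rights granted by the static (supervisee) rules of this layer. */
    access_mask_t supervisee_allowed;
    /* Whether a supervisor ruleset is attached to this layer at all. */
    bool has_supervisor;
    /* Rights granted by the committed supervisor rules, if any. */
    access_mask_t supervisor_allowed;
};

/*
 * Returns true if every layer grants all bits of @request, either through
 * its supervisee rules or through its supervisor rules.
 */
static bool model_access_allowed(const struct model_layer *layers,
                                 size_t num_layers, access_mask_t request)
{
    assert(layers != NULL || num_layers == 0);

    for (size_t i = 0; i < num_layers; i++) {
        access_mask_t missing = request & ~layers[i].supervisee_allowed;

        if (!missing)
            continue;       /* The static rules of this layer suffice. */
        if (!layers[i].has_supervisor)
            return false;   /* Denying layer without a supervisor. */
        if (missing & ~layers[i].supervisor_allowed)
            return false;   /* The supervisor does not cover the rest. */
    }
    return true;
}
```

The model checks the supervisee mask first in each layer, so a supervisor
ruleset is only consulted for layers that would otherwise deny.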

To enable removing rules from a ruleset, we also implement the
LANDLOCK_ADD_RULE_INTERSECT flag for landlock_add_rule().  If this is
passed, instead of adding rules, the corresponding rule, if it exists, is
updated to be the intersection of the existing access rights and the
specified access rights.  If the result is zero, the rule is removed.  For
API consistency, the LANDLOCK_ADD_RULE_INTERSECT flag will be supported
for both supervisor and supervisee (i.e. existing) rulesets, but it is
probably only useful for supervisor rulesets.
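
As an illustration of these semantics, the following user-space model
keeps rules in a small array instead of the kernel's rbtree.  All names
are hypothetical, and treating a plain add with access=0 as a no-op is a
choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_RULES 8

/* Hypothetical model of a rule: a key plus its allowed access bits. */
struct model_rule {
    uint64_t object_id; /* stands in for the inode or port key */
    uint64_t access;    /* never 0 for a stored rule */
};

struct model_ruleset {
    struct model_rule rules[MODEL_MAX_RULES];
    size_t num_rules;
};

static struct model_rule *model_find(struct model_ruleset *rs, uint64_t id)
{
    for (size_t i = 0; i < rs->num_rules; i++)
        if (rs->rules[i].object_id == id)
            return &rs->rules[i];
    return NULL;
}

/* Plain add: OR the new bits in.  Returns 0 on success, -1 if full. */
static int model_add_rule(struct model_ruleset *rs, uint64_t id,
                          uint64_t access)
{
    struct model_rule *rule = model_find(rs, id);

    if (rule) {
        rule->access |= access;
        return 0;
    }
    if (!access)
        return 0;   /* Nothing to store (model choice). */
    if (rs->num_rules == MODEL_MAX_RULES)
        return -1;
    rs->rules[rs->num_rules++] = (struct model_rule){ id, access };
    return 0;
}

/* Intersect: AND the bits of an existing rule, drop it if nothing is left. */
static void model_intersect_rule(struct model_ruleset *rs, uint64_t id,
                                 uint64_t access)
{
    struct model_rule *rule;

    assert(rs != NULL);
    rule = model_find(rs, id);
    if (!rule)
        return;     /* No such rule: nothing to narrow. */
    rule->access &= access;
    if (!rule->access) {
        /* Remove by moving the last rule into the freed slot. */
        *rule = rs->rules[rs->num_rules - 1];
        rs->num_rules--;
    }
}
```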

(I'm not very certain about this intersect flag - see below for
alternative designs)

Later on, a supervisor notification mechanism can be implemented to allow
the supervisor to be notified when an access is denied by its supervised
layer, but this is not in scope for the "mutable domains" feature on its
own (although it does make it significantly more useful).  This will be
the step after mutable domains, if we keep with the plan previously
discussed with Mickaël.

uAPI example
------------

```c
/*
 * This landlock_ruleset_attr controls the handled/quiet/scope bits for
 * this layer (internally shared by both the supervisor and supervisee
 * rulesets).
 */
struct landlock_ruleset_attr attr = {
    .handled_access_fs = ...,
    /* ... */
};

/* supervisor_fd defaults to CLOEXEC */
int supervisor_fd = landlock_create_ruleset(
    &attr, sizeof(attr), LANDLOCK_CREATE_RULESET_SUPERVISOR);
if (supervisor_fd < 0)
    perror("landlock_create_ruleset");

/*
 * supervisor_fd can then be passed to landlock_add_rule, but it does not
 * work with landlock_restrict_self.  Not working for restrict_self means
 * that if a sandboxer accidentally passes the supervisor fd to the child,
 * it would not work in the same way as the supervisee fd, and therefore
 * the error is more discoverable.
 */
if (landlock_add_rule(supervisor_fd, ...) < 0)
    perror("landlock_add_rule");

/*
 * Any changes to the supervisor ruleset must be committed, even before
 * any child calls landlock_restrict_self().  Without committing, the
 * supervisor ruleset still behaves as if it is empty.
 */
if (landlock_add_rule(supervisor_fd, ..., ...,
        LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR) < 0)
    perror("landlock_add_rule(COMMIT)");

/* Creates the supervisee ruleset */
int supervisee_fd = ioctl(supervisor_fd,
        LANDLOCK_IOCTL_GET_SUPERVISEE_RULESET, /* flags= */ 0);
if (supervisee_fd < 0)
    perror("ioctl(LANDLOCK_IOCTL_GET_SUPERVISEE_RULESET)");

pid_t child = fork();
if (child == 0) {
    /* The supervisor should not leak supervisor_fd to any untrusted code. */
    close(supervisor_fd);
    if (landlock_restrict_self(supervisee_fd, 0) < 0) {
        perror("landlock_restrict_self");
        /* Do not run the program unsandboxed. */
        _exit(1);
    }
    execve(...);
    perror("execve");
    _exit(1);
} else {
    close(supervisee_fd);
    /*
     * Here, the supervisor can add rules via landlock_add_rule(), or
     * remove rules via landlock_add_rule() with
     * LANDLOCK_ADD_RULE_INTERSECT.
     *
     * Added rules don't come into effect until a final
     * landlock_add_rule() with the commit flag (which may also just add
     * a dummy rule with access=0):
     */
    if (landlock_add_rule(supervisor_fd, ..., ...,
            LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR) < 0)
        perror("landlock_add_rule(COMMIT)");
}
```

Discussion on LANDLOCK_ADD_RULE_INTERSECT
-----------------------------------------

This was initially proposed by Mickaël, although now after writing some
example code against it [7], I'm not 100% sure that it is the most useful
uAPI.  For a supervisor based on some sort of config file, it already has
to track which rules are added to know what to remove, and thus I feel
that it would be easier (both to use and to implement) to have an API that
simply "replaces" a rule, rather than do a bitwise AND on the access.

Another alternative is to simply have a "clear all rules in this ruleset"
flag.  This allows the supervisor to not have to track what is already
allowed - if it reloads the config file, it can simply clear the ruleset,
re-add all rules based on the config, then commit it.  Although I worry
that this might make implementing some other use cases more difficult.

(We can of course implement both)
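
To illustrate the bookkeeping involved, here is a sketch of what a
config-reloading supervisor would have to compute for one rule when only
"add" and "intersect" are available.  The names are hypothetical; the
point is that the supervisor needs the old access mask to plan the update,
whereas a "replace" operation would only need the new one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical plan for moving one rule from old_access to new_access. */
struct model_update {
    bool need_intersect;        /* Some old bits must be taken away. */
    uint64_t intersect_mask;    /* Mask to pass with the intersect flag. */
    uint64_t add_mask;          /* Bits to add afterwards, 0 if none. */
};

static struct model_update model_plan_update(uint64_t old_access,
                                             uint64_t new_access)
{
    struct model_update u;

    /* Intersecting with new_access keeps exactly old & new. */
    u.intersect_mask = new_access;
    u.need_intersect = (old_access & ~new_access) != 0;
    /* Whatever the new config grants beyond the old one is added. */
    u.add_mask = new_access & ~old_access;

    /* Intersect followed by add must end up at new_access. */
    assert(((old_access & u.intersect_mask) | u.add_mask) == new_access);
    return u;
}
```

With a "replace" style API the same update would be a single call carrying
new_access, and with "clear all" the supervisor would not need old_access
at all.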

[7]: https://github.com/micromaomao/linux-dev/blob/94477974c616126762f24cc268967d7f989cc96d/samples/landlock/supervisor_sandboxer.c#L437-L481

Why require a commit operation?
-------------------------------

This is not a strictly necessary requirement with an rbtree based
implementation - it can be made thread-safe with RCU while still allowing
lockless access checks without too much overhead (although the code is
indeed more tricky to write).  However, there is a possibility that the
domain lookup might become a hashtable with some future enhancement [8],
at which point it would be better to have an explicit commit operation to
avoid rebuilding the hashtable for every landlock_add_rule().  Having a
commit operation will likely also make some atomicity properties easier to
achieve, depending on the supervisor's needs.

I've actually previously implemented hashtable domains [9], but after
benchmarking it I did not find a very significant performance improvement
(2.2% with 10 dir depth and 10 rules, 8.6% with 29 depth and 1000 rules) [10]
especially considering the complexity of the changes required.  After
discussion with Mickaël I've decided to not pursue it for now, but I'm
open to suggestions.  If Mickaël and Günther are open to taking it, I can
revive the patch.

[8]:  https://github.com/landlock-lsm/linux/issues/1
[9]:  https://lore.kernel.org/all/cover.1751814658.git.m@maowtm.org/
      Note that the benchmark posted here was inaccurate, due to the
      relatively high cost of kfunc probes compared to the work required
      to handle one openat().  For a more proper benchmark, refer to the
      comment below:
[10]: https://github.com/landlock-lsm/landlock-test-tools/pull/17#issuecomment-3594121269
      See specifically the collapsed section "parse-microbench.py
      base-vm.log arraydomain-vm.log"

Proposed implementation
-----------------------

In order to store additional data and locks for the supervisor, we create
a new `struct landlock_supervisor`.  Both the supervisor and supervisee
rulesets, and the landlock_hierarchy of each layer, will point to this
struct.  (A future revision may optimize on this to reduce pointer chasing
when needing to check supervisor rulesets of parent layers.)

One of the main tricky areas of this work is the implementation of
LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR and the access checks.  We want:

- atomic commit: the supervised program should not "experience" any rule
  changes until they are committed, and once it is committed it should see
  all the changes together

- lockless access checks (even when the supervisee ruleset does not allow
  the access, necessitating checking the supervisor rulesets, this should
  still not involve any locks)

- atomic access checks: an access check should either be completely based
  on the "old" rules or the "new" rules, even if a commit happens in the
  middle of a path walk.  This prevents incorrect denials when a commit
  moves a rule from /a to /a/b when we've just finished checking /a/b and
  are about to check /a.

In order to achieve atomic commit, the supervisor fd cannot actually point
to (and thus allow editing) the "live" ruleset.  Instead, when a
`LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR` is requested, a new `struct
landlock_ruleset` is created, the rules are copied over from the existing
supervisor ruleset, and the pointer in the landlock_supervisor is swapped.

In order to keep access checks lockless (as it is currently), the live
ruleset pointer needs to be RCU-protected.  To reduce complexity, this
initial implementation uses synchronize_rcu() directly in the calling
thread of `LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR`, and frees the old
supervisor ruleset afterwards, but this can be rewritten to use call_rcu()
in a future iteration if necessary (which will allow quicker commits,
which can be quite impactful if we use this to auto-generate rulesets).
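
The copy-and-swap idea can be sketched in user space as follows.  This is
a single-threaded model using C11 atomics in place of RCU, with
hypothetical names throughout; in particular, freeing the old ruleset
right after the swap is only safe here because the model has no
concurrent readers, which is what synchronize_rcu() would take care of in
the kernel.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define MODEL_MAX_RULES 8

struct model_ruleset {
    size_t num_rules;
    uint64_t object_id[MODEL_MAX_RULES];
    uint64_t access[MODEL_MAX_RULES];
};

struct model_supervisor {
    struct model_ruleset staging;           /* edited by the supervisor */
    _Atomic(struct model_ruleset *) live;   /* read by access checks */
};

static void model_supervisor_init(struct model_supervisor *sup)
{
    memset(&sup->staging, 0, sizeof(sup->staging));
    atomic_init(&sup->live, NULL);  /* NULL behaves as an empty ruleset */
}

/* Edits only touch the staging copy and stay invisible to lookups. */
static int model_stage_rule(struct model_supervisor *sup, uint64_t id,
                            uint64_t access)
{
    struct model_ruleset *s = &sup->staging;

    if (s->num_rules == MODEL_MAX_RULES)
        return -1;
    s->object_id[s->num_rules] = id;
    s->access[s->num_rules] = access;
    s->num_rules++;
    return 0;
}

/* Commit: copy the staging rules and publish the copy with one swap. */
static int model_commit(struct model_supervisor *sup)
{
    struct model_ruleset *fresh, *old;

    assert(sup != NULL);
    fresh = malloc(sizeof(*fresh));
    if (!fresh)
        return -1;
    *fresh = sup->staging;
    old = atomic_exchange(&sup->live, fresh);
    /* The kernel would wait for a grace period before freeing. */
    free(old);
    return 0;
}

/* Lookup as an access check would do it: only the live copy is read. */
static uint64_t model_lookup(struct model_supervisor *sup, uint64_t id)
{
    const struct model_ruleset *rs = atomic_load(&sup->live);
    uint64_t access = 0;

    if (!rs)
        return 0;
    for (size_t i = 0; i < rs->num_rules; i++)
        if (rs->object_id[i] == id)
            access |= rs->access[i];
    return access;
}
```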

During access checks, for each step of the path walk, after
landlock_unmask_layers()-ing the supervisee rule, if the access is not
already allowed, we check for rules in the supervisor ruleset and
effectively do landlock_unmask_layers() on them too.

In order to have atomic access checks, we need to pre-capture the
supervisor committed ruleset pointers for all layers at the start of the
path walk (in `is_access_to_paths_allowed`).  Storing this on the stack,
this takes the space of 16 pointers, hence 128 bytes on 64-bit (I'm keen
to hear suggestions on how best to mitigate this).  Another effect of this
"caching" is that in order to be able to release the RCU read lock in the
path walk (which is required for the path_put()), we actually need to take
a refcount on the committed ruleset (and release it at the end of
is_access_to_paths_allowed).
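
A rough user-space model of this snapshot is shown below.  The names are
hypothetical, each ruleset holds a single rule for brevity, and a plain
counter stands in for the kernel's refcounting.  The walk only ever reads
the snapshot, so a commit that swaps the live pointer midway cannot mix
old and new rules within one check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_LAYERS 16

/* Hypothetical committed ruleset holding one rule. */
struct model_ruleset {
    uint64_t object_id;
    uint64_t access;
    unsigned int refcount;
};

/* One pinned pointer per layer, captured at the start of the walk. */
struct model_snapshot {
    struct model_ruleset *committed[MODEL_MAX_LAYERS];
};

/* 16 pointers: 128 bytes of stack on a 64-bit machine. */
_Static_assert(sizeof(struct model_snapshot) ==
               MODEL_MAX_LAYERS * sizeof(void *), "one pointer per layer");

static void model_snapshot_take(struct model_snapshot *snap,
                                struct model_ruleset *const live[],
                                size_t num_layers)
{
    assert(num_layers <= MODEL_MAX_LAYERS);
    for (size_t i = 0; i < MODEL_MAX_LAYERS; i++) {
        snap->committed[i] = i < num_layers ? live[i] : NULL;
        if (snap->committed[i])
            snap->committed[i]->refcount++; /* pin for the whole walk */
    }
}

static void model_snapshot_put(struct model_snapshot *snap)
{
    for (size_t i = 0; i < MODEL_MAX_LAYERS; i++)
        if (snap->committed[i])
            snap->committed[i]->refcount--;
}

/* Walk from the leaf to the root, collecting rights for one layer. */
static bool model_walk_allowed(const struct model_snapshot *snap,
                               size_t layer, const uint64_t *path,
                               size_t depth, uint64_t request)
{
    const struct model_ruleset *rs = snap->committed[layer];
    uint64_t granted = 0;

    if (!rs)
        return false;
    for (size_t i = 0; i < depth; i++)
        if (path[i] == rs->object_id)
            granted |= rs->access;
    return (request & ~granted) == 0;
}
```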

Optional accesses
-----------------

Optional access (truncate and ioctl) handling is also tricky.  There are
two possible alternatives:

- The allowed optional actions are still entirely determined at file open
  time.  This likely works in the majority of cases, where truncate (and
  maybe also ioctl) are given or taken away together with write access.
  However, this may mean that we need to send an access request
  notification immediately at open() time if e.g. write access is given
  but truncate (or ioctl) is not, even if truncate (or ioctl) is not
  attempted yet, since the supervisor would not be able to allow it later.
  (or alternatively we can choose to not send this notification, and the
  supervisor will just have to "know" to add truncate/ioctl rights if
  required, in advance.)

- The allowed optional actions are considered to be determined at
  operation time (even though for a static ruleset it is cached).  This
  means that for supervised layers, we will always have to re-check their
  supervisor rulesets, whether or not the access was initially allowed,
  which will involve doing a path walk.  This does however mean that the
  supervisor can be notified "in the moment" when a truncate (or more
  likely to be relevant - ioctl) is attempted.

The PoC partially implements the second one (but has bugs), but I'm not
sure which is best.  The second one is most flexible and makes more sense
to me from a user perspective, but does come with performance
implications.

(Disallowing) self-supervision
------------------------------

We should figure out a way to ensure that a process cannot call
landlock_restrict_self() with a ruleset that has a supervisor to which it
has access (i.e. via a supervisor ruleset fd).  This prevents
accidental misuse, and also prevents deadlocks as discussed in [11].  I'm
not sure if this will be easy to implement, however.

[11]: https://lore.kernel.org/all/cc3e131f-f9a3-417b-9267-907b45083dc3@maowtm.org/

Supervisor notification
-----------------------

The above RFC only covers mutable domains.  The natural next stage of this
work is to send notification to the supervisor on access denials, so that
it can decide whether to allow the access or not.  For that, there are
also lots of questions at this stage:

- Should we in fact implement that first, before mutable domains?  This
  means that the supervisor would only be able to find out about denials,
  but not allow them without a sandbox restart.  We still eventually want
  the mutable domains, since that makes this a lot more useful, but I can
  see some use cases for just the notification part (e.g. island denial
  log), and I can't see a likely use case for just mutable domains, aside
  from live reload of landlock-config (maybe that _is_ useful on its own,
  considering that you can also find out about denials from the kernel
  audit log, and add missing rules based on that).

- Earlier when implementing the Landlock supervise v1 RFC, I basically
  came up with an ad-hoc uAPI for the notification [12], and the PoC code
  linked to above also uses this uAPI.  There are of course many problems
  with this as it stands, e.g. it only has one destname, which means
  that for rename, fd1 needs to be the child being moved, which does
  not align with the VFS semantics and how Landlock treats it (i.e. the
  thing being updated here is the parent directory, not the child itself).
  The same applies to delete, which currently sends the child as fd1.

  But also, in discussion with Mickaël last year, he mentioned that we
  could reuse the fsnotify infrastructure, and perhaps additionally, use
  fanotify to deliver these notifications.  I do think there is some
  potential here, as fanotify already implements an event header, a
  mechanism for receiving and replying to events, etc.  We could possibly
  extend it to send Landlock specific notifications via a new kind of mark
  (FAN_MARK_LANDLOCK_DOMAIN ??) and add one or more new corresponding
  event types.  Mickaël mentioned mount notifications [13] as an example
  of using fanotify to send notifications other than file/dir
  modifications.

  I'm not sure if directly extending the fanotify uAPI is a good idea though,
  considering that Landlock is not a feature specific to the filesystem -
  we will also have denial events for net_port rules, and perhaps more in
  the future.  However, Mickaël mentioned that there might be some
  internal infrastructure which we can re-use (even if we have our own
  notification uAPI).

- The other uAPI alternative which I have been thinking of is to extend
  seccomp-unotify.  For example, a Landlock denial could result in the
  syscall being trapped and a `struct seccomp_notif` being sent to the
  seccomp supervisor (via the existing mechanism), with additional
  information (mostly, the file(s) / net ports being accessed and access
  rights requested) attached to the notification _somehow_.  Then the
  supervisor can use the same kind of responses one would use for
  seccomp-unotify to cause the syscall to either be retried (possibly via
  `SECCOMP_USER_NOTIF_FLAG_CONTINUE`) or return with an error code of its
  choice (or alternatively, carry out the operation on behalf of the
  child, and pretend that the syscall succeeded, which might be useful to
  implement an "allow file creation but only this file" / "allow `mktemp
  -d` but not arbitrary create on anything under /tmp").

  Looking at `struct seccomp_notif` and `struct seccomp_data` however, I'm
  not sure how feasible / doable this extension would be.  Also,
  seccomp-unotify is supposed to trigger before a syscall is actually
  executed, whereas if we use it this way, we will want it to trigger
  after we're already midway through the syscall (in the LSM hook).  This
  might make it hard to implement (and also twists a bit the uAPI
  semantics of seccomp-unotify).
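
For reference, the two kinds of replies mentioned in the seccomp-unotify
option look roughly like this with the existing uAPI.  This sketch only
fills in `struct seccomp_notif_resp` from <linux/seccomp.h>; the helper
names are made up, and how Landlock-specific information would be attached
to the notification is deliberately not shown, since that part does not
exist.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <linux/seccomp.h>

/* Reply that lets the trapped syscall continue in the kernel. */
static void model_reply_continue(const struct seccomp_notif *req,
                                 struct seccomp_notif_resp *resp)
{
    assert(req != NULL && resp != NULL);
    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;     /* must echo the notification id */
    /* With this flag set, error and val have to stay 0. */
    resp->flags = SECCOMP_USER_NOTIF_FLAG_CONTINUE;
}

/* Reply that makes the syscall fail with the given errno value. */
static void model_reply_error(const struct seccomp_notif *req,
                              struct seccomp_notif_resp *resp, int err)
{
    assert(req != NULL && resp != NULL);
    memset(resp, 0, sizeof(*resp));
    resp->id = req->id;
    resp->error = -err;     /* negated errno, e.g. -EACCES */
}
```

A supervisor would obtain `req` with ioctl(SECCOMP_IOCTL_NOTIF_RECV) on the
notification fd and deliver the filled-in reply with
ioctl(SECCOMP_IOCTL_NOTIF_SEND).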

Are there any immediate reasons, from Landlock's perspective, to rule out
either of them?  (I will probably wait for at least a first review from
the Landlock side before directing this explicitly to the fanotify and/or
seccomp-unotify maintainers, in case the plan significantly changes, but
if somehow a maintainer/reviewer from either of those areas is already
reading this, firstly thanks, and feedback would be very valuable :D )

[12]: https://lore.kernel.org/all/cde6bbf0b52710b33170f2787fdcb11538e40813.1741047969.git.m@maowtm.org/#iZ31include:uapi:linux:landlock.h
[13]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.15-rc1&id=fd101da676362aaa051b4f5d8a941bd308603041


* Re: [RFC] Landlock: mutable domains (and supervisor notification uAPI options)
  2026-02-15  2:54 [RFC] Landlock: mutable domains (and supervisor notification uAPI options) Tingmao Wang
@ 2026-02-15 21:23 ` Justin Suess
  2026-02-16 21:27   ` Justin Suess
  2026-02-22 18:04   ` Tingmao Wang
  2026-02-22 18:04 ` Tingmao Wang
  1 sibling, 2 replies; 6+ messages in thread
From: Justin Suess @ 2026-02-15 21:23 UTC
  To: m
  Cc: amir73il, gnoack, jack, jannh, linux-security-module, mic,
	penguin-kernel, song, utilityemal77

On Sun, Feb 15, 2026 at 02:54:08AM +0000, Tingmao Wang wrote:
Hello Tingmao,

Thank you for sending this.

I've read the proposal and had some time to gather thoughts on it. I'm
planning to break this feedback into multiple parts.

This first part addresses the intersect flag.

> To enable removing rules from a ruleset, we also implement the
> LANDLOCK_ADD_RULE_INTERSECT flag for landlock_add_rule().  If this is
> passed, instead of adding rules, the corresponding rule, if it exists, is
> updated to be the intersection of the existing access rights and the
> specified access rights.  If the result is zero, the rule is removed.  For
> API consistency, the LANDLOCK_ADD_RULE_INTERSECT flag will be supported
> for both supervisor and supervisee (i.e. existing) rulesets, but it is
> probably only useful for supervisor rulesets.
> 
> (I'm not very certain about this intersect flag - see below for
> alternative designs)
> 
> Later on, a supervisor notification mechanism can be implemented to allow
> the supervisor to be notified when an access is denied by its supervised
> layer, but this is not in scope for the "mutable domains" feature on its
> own (although it does make it significantly more useful).  This will be
> the step after mutable domains, if we keep with the plan previously
> discussed with Mickaël.
> 
> 
> uAPI example
> ------------
> 
> ```c
> /*
>  * This landlock_ruleset_attr controls the handled/quiet/scope bits for
>  * this layer (internally shared by both the supervisor and supervisee
>  * rulesets).
>  */
> struct landlock_ruleset_attr attr = {
>     .handled_access_fs = ...,
>     /* ... */
> };
> 
> /* supervisor_fd default to CLOEXEC */
> int supervisor_fd = landlock_create_ruleset(
>     &attr, sizeof(attr), LANDLOCK_CREATE_RULESET_SUPERVISOR);
> if (supervisor_fd < 0)
>     perror("landlock_create_ruleset");
> 
> /*
>  * supervisor_fd can then be passed to landlock_add_rule, but it does not
>  * work with landlock_restrict_self.  Not working for restrict_self means
>  * that if a sandboxer accidentally passes the supervisor fd to the child,
>  * it would not work in the same way as the supervisee fd, and therefore
>  * the error is more discoverable.
>  */
> if (landlock_add_rule(supervisor_fd, ...) < 0)
>     perror("landlock_add_rule");
> 
> /*
>  * Any changes to the supervisor ruleset must be committed, even before
>  * any child calls landlock_restrict_self().  Without committing, the
>  * supervisor ruleset still behaves as if it is empty.
>  */
> if (landlock_add_rule(supervisor_fd, ..., ...,
>         LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR) < 0)
>     perror("landlock_add_rule(COMMIT)");
> 
> /* Creates the supervisee ruleset */
> int supervisee_fd = ioctl(supervisor_fd,
>         LANDLOCK_IOCTL_GET_SUPERVISEE_RULESET, /* flags= */ 0);
> if (supervisee_fd < 0)
>     perror("ioctl(LANDLOCK_IOCTL_GET_SUPERVISEE_RULESET)");
> 
> pid_t child = fork();
> if (child == 0) {
>     /* The supervisor should not leak supervisor_fd to any untrusted code. */
>     close(supervisor_fd);
>     if (landlock_restrict_self(supervisee_fd, 0) < 0)
>         perror("landlock_restrict_self");
>     execve(...);
>     perror("execve");
> } else {
>     close(supervisee_fd);
>     /*
>      * Here, the supervisor can add rules via landlock_add_rule(), or
>      * remove rules via landlock_add_rule() with
>      * LANDLOCK_ADD_RULE_INTERSECT.
>      *
>      * Added rules don't come into effect until a final
>      * landlock_add_rule() with the commit flag (which may also just add
>      * a dummy rule with access=0):
>      */
>     if (landlock_add_rule(supervisor_fd, ..., ...,
>             LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR) < 0)
>         perror("landlock_add_rule(COMMIT)");
> }
> ```
> 
> 
> Discussion on LANDLOCK_ADD_RULE_INTERSECT
> -----------------------------------------
> 
> This was initially proposed by Mickaël, although now after writing some
> example code against it [7], I'm not 100% sure that it is the most useful
> uAPI.  For a supervisor based on some sort of config file, it already has
> to track which rules are added to know what to remove, and thus I feel
> that it would be easier (both to use and to implement) to have an API that
> simply "replaces" a rule, rather than do a bitwise AND on the access.
> 
Instead of intersection being done at the rule level via
landlock_add_rule, would it be better for intersection to be done at the
ruleset_fd/ruleset level?

So instead of intersecting individual rules, you can intersect entire
rulesets, with the added benefit of being able to intersect handled
accesses as well. (so you could handle an access initially, and not
handle it later).

Intersecting at the ruleset level allows for grouping the intersection rules
together, so you could create an unenforced ruleset for the sole purpose
of intersecting with rulesets, and intersect all the rule(s) at once.

That way, the ruleset fd can be reused for this purpose later with other
supervisees, instead of creating ruleset, intersecting individual rules,
repeat.

I also think that having a function called "landlock_add_rule" actually
remove accesses (when the intersect flag is set) is confusing, because
we're not really *add*-ing anything, we're removing.

ALTERNATIVE #1

Maybe the best way to do it is instead continue treating rulesets as
immutable, but allow composition of them at ruleset creation time.

This would look something like:

Ruleset C = Ruleset A & Ruleset B

Ruleset A and B are never modified, but instead a new Ruleset C is
created that is the intersection of A and B. This could be done in a
variety of ways (LANDLOCK_CREATE_RULESET_INTERSECT? new IOCTL?)

An example API for what this might look like:

  struct landlock_ruleset_attr ruleset_attr = {
          // other fields for handled accesses must be blank.
          .left_fd = existing_fd,
          .right_fd = other_existing_fd,
  };
  int new_ruleset_fd = syscall(SYS_landlock_create_ruleset, &ruleset_attr, 
    sizeof(ruleset_attr), LANDLOCK_CREATE_RULESET_INTERSECT);

And then the resulting ruleset which is the intersection of existing_fd
and other_existing_fd could be returned.

Similarly, we could do:

  int new_ruleset_fd = syscall(SYS_landlock_create_ruleset, &ruleset_attr, 
      sizeof(ruleset_attr), LANDLOCK_CREATE_RULESET_UNION);

Which would be convenient for creating unions of rulesets.
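
To make the intended semantics of "C = A & B" and "C = A | B" concrete,
here is a rough user-space model (not kernel code). Everything in it is
hypothetical: the `model_*` names, the flat integer `key` standing in for
an object, and the fixed-size array. In particular it ignores that real
path rules apply to whole hierarchies, so intersecting a rule on /a with
a rule on /a/b is not captured here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAX_RULES 16

/* Hypothetical model: one rule grants `access` bits on the object `key`. */
struct model_rule {
	int key;
	uint64_t access;
};

struct model_ruleset {
	struct model_rule rules[MODEL_MAX_RULES];
	size_t count;
};

/* Returns the access bits granted on `key`, or 0 if there is no rule. */
uint64_t model_lookup(const struct model_ruleset *rs, int key)
{
	for (size_t i = 0; i < rs->count; i++)
		if (rs->rules[i].key == key)
			return rs->rules[i].access;
	return 0;
}

/* Adds `access` to the rule for `key`, creating the rule if needed. */
void model_add(struct model_ruleset *rs, int key, uint64_t access)
{
	for (size_t i = 0; i < rs->count; i++) {
		if (rs->rules[i].key == key) {
			rs->rules[i].access |= access;
			return;
		}
	}
	assert(rs->count < MODEL_MAX_RULES);
	rs->rules[rs->count].key = key;
	rs->rules[rs->count].access = access;
	rs->count++;
}

/* C = A & B: keep a rule only where both inputs share at least one bit. */
struct model_ruleset model_intersect(const struct model_ruleset *a,
				     const struct model_ruleset *b)
{
	struct model_ruleset c = { .count = 0 };

	for (size_t i = 0; i < a->count; i++) {
		uint64_t common = a->rules[i].access &
				  model_lookup(b, a->rules[i].key);

		if (common)
			model_add(&c, a->rules[i].key, common);
	}
	return c;
}

/* C = A | B: start from a copy of A and merge in every rule of B. */
struct model_ruleset model_union(const struct model_ruleset *a,
				 const struct model_ruleset *b)
{
	struct model_ruleset c = *a;

	for (size_t i = 0; i < b->count; i++)
		model_add(&c, b->rules[i].key, b->rules[i].access);
	return c;
}
```

Neither input is modified in this model, which is the property the
immutable-ruleset approach relies on.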

Then, instead of mutating rulesets, we commit/replace an entirely new ruleset:

  ioctl(supervisee_fd, LANDLOCK_IOCTL_COMMIT_RULESET, &new_ruleset_fd);

This has the following benefits:

1. Clearer semantics: "landlock_add_rule" is just for adding rules, not
removing.

2. Intersection of all ruleset attributes, not just individual rule
attributes.

3. Better logical grouping of rules for the purpose of intersection, and
better composition.

It does have drawbacks:

1. Intersecting individual rules requires making an entire ruleset for
that one rule.

2. Users must be responsible for closing the unused/old rulesets that
they may no longer need.

ALTERNATIVE #2

A middle ground is to keep the ruleset mutation via landlock_add_rule,
but have it be done at the ruleset_fd level.

Something like this:

  struct landlock_ruleset_operand intersection = {
    .operand = other_ruleset_fd
  };
  landlock_add_rule(ruleset_fd, LANDLOCK_RULE_INTERSECT_RULESET, &intersection, 0);

I think this is also a valid way to do things, and increases the
reusability of rulesets.

It does have drawbacks:

1. Again, having landlock_add_rule be used to actually remove rules
is confusing.

2. I'm unsure if we can change handled accesses after ruleset creation,
so we might not be able to intersect the handled accesses like we can in
ALTERNATIVE #1.

> Another alternative is to simply have a "clear all rules in this ruleset"
> flag.  This allows the supervisor to not have to track what is already
> allowed - if it reloads the config file, it can simply clear the ruleset,
> re-add all rules based on the config, then commit it.  Although I worry
> that this might make implementing some other use cases more difficult.

At a minimum, it is cumbersome, and I worry about file descriptors
becoming inaccessible (due to bind mounts / namespace changes in the
supervisor's environment).

Of course they can just hold those file descriptors open for the purposes
of future intersections, but this is annoying and error prone.

> [...]

* Re: [RFC] Landlock: mutable domains (and supervisor notification uAPI options)
  2026-02-15 21:23 ` Justin Suess
@ 2026-02-16 21:27   ` Justin Suess
  2026-02-22 18:04     ` Tingmao Wang
  2026-02-22 18:04   ` Tingmao Wang
  1 sibling, 1 reply; 6+ messages in thread
From: Justin Suess @ 2026-02-16 21:27 UTC (permalink / raw)
  To: utilityemal77
  Cc: amir73il, gnoack, jack, jannh, linux-security-module, m, mic,
	penguin-kernel, song

On Sun, Feb 15, 2026 at 02:54:08AM +0000, Tingmao Wang wrote:
> [...]
> Background
> ----------
> 
> A while ago I sent a "Landlock supervise" RFC patch series [1], in which I
> proposed to extend Landlock with additional functionality to support
> "interactive" rule enforcement.  In discussion with Mickaël, we decided to
> split this work into 3 stages:  quiet flag, mutable domains, and finally
> supervisor notification.  Relevant discussions are at [4] and in replies
> to [1].
> 
> The patch for quiet flag [5] has gone through multiple review iterations
> already.  It is useful on its own, but it was also motivated by the
> eventual use in controlling supervisor notification.
> 
> The next stage is to introduce "mutable domains".  The motivation for this
> is twofold:
> 
> 1. This allows the supervisor to allow access to (large) file hierarchies
>    without needing to be woken up again for each access.
> 2. Because we cannot block within security_path_mknod and other
>    directory-modification related hooks [6], the proposal was to return
>    immediately from those hooks after queuing the supervisor notification,
>    then wait in a separate task_work.  This however means that we cannot
>    directly "allow" access (and even if we can, it may introduce TOCTOU
>    problems).  In order to allow access to requested files, the supervisor
>    has to add additional rules to the (now mutable) domain which will
>    allow the required access.

Is blocking during connect(2) allowed either, if the socket is non-blocking?

This may be another case that needs to be handled differently from calls
we can safely block in.

> [...]
> Why require a commit operation?
> -------------------------------
> 
> This is not a strictly necessary requirement with an rbtree based
> implementation - it can be made thread-safe with RCU while still allowing
> lockless access checks without too much overhead (although the code is
> indeed more tricky to write).  However, there is a possibility that the
> domain lookup might become a hashtable with some future enhancement [8],
> at which point it would be better to have an explicit commit operation to
> avoid rebuilding the hashtable for every landlock_add_rule().  Having a
> commit operation will likely also make some atomicity properties easier to
> achieve, depending on the supervisor's needs.
> 
> I've actually previously implemented hashtable domains [9], but after
> benchmarking it I did not find a very significant performance improvement
> (2.2% with 10 dir depth and 10 rules, 8.6% with 29 depth and 1000 rules) [10]
> especially considering the complexity of the changes required.  After
> discussion with Mickaël I've decided to not pursue it for now, but I'm
> open to suggestions.  If Mickaël and Günther are open to taking it, I can
> revive the patch.
> 
> [8]:  https://github.com/landlock-lsm/linux/issues/1
> [9]:  https://lore.kernel.org/all/cover.1751814658.git.m@maowtm.org/
>       Note that the benchmark posted here was inaccurate, due to the
>       relatively high cost of kfunc probes compared to the work required
>       to handle one openat().  For a more proper benchmark, refer to the
>       comment below:
> [10]: https://github.com/landlock-lsm/landlock-test-tools/pull/17#issuecomment-3594121269
>       See specifically the collapsed section "parse-microbench.py
>       base-vm.log arraydomain-vm.log"
> 
> 
> Proposed implementation
> -----------------------
> 
> In order to store additional data and locks for the supervisor, we create
> a new `struct landlock_supervisor`.  Both the supervisor and supervisee
> rulesets, and the landlock_hierarchy of each layer, will point to this
> struct.  (A future revision may optimize on this to reduce pointer chasing
> when needing to check supervisor rulesets of parent layers.)
> 
> One of the main tricky areas of this work is the implementation of
> LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR and the access checks.  We want:
> 
> - atomic commit: the supervised program should not "experience" any rule
>   changes until they are committed, and once it is committed it should see
>   all the changes together
> 
> - lockless access checks (even when the supervisee ruleset does not allow
>   the access, necessitating checking the supervisor rulesets, this should
>   still not involve any locks)
> 
> - atomic access checks: an access check should either be completely based
>   on the "old" rules or the "new" rules, even if a commit happens in the
>   middle of a path walk.  This prevents incorrect denials when a commit
>   moves a rule from /a to /a/b when we've just finished checking /a/b and
>   about to check /a.
> 
> In order to achieve atomic commit, the supervisor fd cannot actually point
> to (and thus allow editing) the "live" ruleset.  Instead, when a
> `LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR` is requested, a new `struct
> landlock_ruleset` is created, the rules are copied over from the existing
> supervisor ruleset, and the pointer in the landlock_supervisor is swapped.
> 
> In order to keep access checks lockless (as it is currently), the live
> ruleset pointer needs to be RCU-protected.  To reduce complexity, this
> initial implementation uses synchronize_rcu() directly in the calling
> thread of `LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR`, and frees the old
> supervisor ruleset afterwards, but this can be rewritten to use call_rcu()
> in a future iteration if necessary (which will allow quicker commits,
> which can be quite impactful if we use this to auto-generate rulesets).
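
As a rough illustration of the copy-then-swap commit described above,
here is a single-process user-space model using C11 atomics. It is only
a sketch under assumptions: all `model_*` names are made up, rules are
reduced to bare access masks, and handing the old snapshot back to the
caller stands in for the kernel waiting out readers with
synchronize_rcu() before freeing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MODEL_MAX_RULES 8

struct model_rules {
	size_t count;
	uint64_t access[MODEL_MAX_RULES];
};

struct model_supervisor {
	struct model_rules staged;          /* edited by the supervisor */
	_Atomic(struct model_rules *) live; /* read by access checks */
};

void model_supervisor_init(struct model_supervisor *s)
{
	s->staged.count = 0;
	atomic_init(&s->live, NULL);
}

/* Stages a rule; it stays invisible to model_check() until committed. */
int model_stage_rule(struct model_supervisor *s, uint64_t access)
{
	if (s->staged.count >= MODEL_MAX_RULES)
		return -1;
	s->staged.access[s->staged.count++] = access;
	return 0;
}

/*
 * Publishes a copy of the staged rules with one pointer swap.  The
 * previous live copy is returned through *old; the caller frees it once
 * no reader can still hold it (the kernel would wait for an RCU grace
 * period here).
 */
int model_commit(struct model_supervisor *s, struct model_rules **old)
{
	struct model_rules *snap = malloc(sizeof(*snap));

	assert(old != NULL);
	if (!snap)
		return -1;
	*snap = s->staged;
	*old = atomic_exchange_explicit(&s->live, snap, memory_order_acq_rel);
	return 0;
}

/* Lockless check: one pointer load, then only that snapshot is read. */
int model_check(struct model_supervisor *s, uint64_t request)
{
	const struct model_rules *r =
		atomic_load_explicit(&s->live, memory_order_acquire);

	if (!r)
		return 0; /* nothing committed yet: behaves as empty */
	for (size_t i = 0; i < r->count; i++)
		if ((r->access[i] & request) == request)
			return 1;
	return 0;
}
```

Staged rules never affect model_check() until model_commit() runs, and a
commit makes all of them visible at once.
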
> 
> During access checks, for each step of the path walk, after
> landlock_unmask_layers()-ing the supervisee rule, if the access is not
> already allowed, we check for rules in the supervisor ruleset and
> effectively do landlock_unmask_layers() on them too.
> 
> In order to have atomic access checks, we need to pre-capture the
> supervisor committed ruleset pointers for all layers at the start of the
> path walk (in `is_access_to_paths_allowed`).  Storing this on the stack,
> this takes the space of 16 pointers, hence 128 bytes on 64-bit (I'm keen
> to hear suggestions on how best to mitigate this).  Another effect of this
> "caching" is that in order to be able to release rcu in the path walk
> (which is required for the path_put()), we actually need to take refcount
> on the committed ruleset (and free it at the end of
> is_access_to_paths_allowed).
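
The /a versus /a/b race described above can be reproduced in a small
single-threaded model. This is a sketch under assumptions, not the PoC
code: the `model_*` names, the integer directory ids, and the
`commit_at`/`next` parameters (which simulate a commit landing between
two steps of the walk) are all invented for illustration, and the
refcounting of the captured ruleset is left out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_RULES 4

/* Hypothetical ruleset: the set of directory ids that grant the access. */
struct model_ruleset {
	int dirs[MODEL_MAX_RULES];
	size_t count;
};

/* Stand-in for one layer's RCU-protected "committed ruleset" pointer. */
struct model_layer {
	const struct model_ruleset *committed;
};

static int model_has_rule(const struct model_ruleset *rs, int dir)
{
	for (size_t i = 0; i < rs->count; i++)
		if (rs->dirs[i] == dir)
			return 1;
	return 0;
}

/*
 * Walks `walk` from the leaf towards the root.  With `snapshot` set, the
 * committed pointer is captured once before the walk; otherwise it is
 * re-read at every step.  If `next` is non-NULL, a commit of `next` is
 * simulated right after step `commit_at`.
 */
int model_walk_allowed(struct model_layer *layer, const int *walk,
		       size_t depth, int snapshot, size_t commit_at,
		       const struct model_ruleset *next)
{
	const struct model_ruleset *captured = layer->committed;

	assert(depth > 0);
	for (size_t i = 0; i < depth; i++) {
		const struct model_ruleset *rs =
			snapshot ? captured : layer->committed;

		if (model_has_rule(rs, walk[i]))
			return 1;
		if (next && i == commit_at)
			layer->committed = next; /* simulated commit */
	}
	return 0;
}
```

With the walk {/a/b, /a, /}, an old ruleset holding only /a and a new
one holding only /a/b, the re-reading variant denies the access if the
commit lands after the first step, even though both rulesets allow it.
The snapshot variant decides entirely on the old rules and allows it.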
> 
> 
> Optional accesses
> -----------------
> 
> Optional access (truncate and ioctl) handling is also tricky.  There are
> two possible alternatives:
> 
> - The allowed optional actions are still entirely determined at file open
>   time.  This likely works in the majority of cases, where truncate (and
>   maybe also ioctl) are given or taken away together with write access.
>   However, this may mean that we need to send an access request
>   notification immediately at open() time if e.g. write access is given
>   but truncate (or ioctl) is not, even if truncate (or ioctl) is not
>   attempted yet, since the supervisor would not be able to allow it later.
>   (or alternatively we can choose to not send this notification, and the
>   supervisor will just have to "know" to add truncate/ioctl rights if
>   required, in advance.)
> 
> - The allowed optional actions are considered to be determined at
>   operation time (even though for a static ruleset it is cached).  This
>   means that for supervised layers, we will always have to re-check their
>   supervisor rulesets, whether or not the access was initially allowed,
>   which will involve doing a path walk.  This does however mean that the
>   supervisor can be notified "in the moment" when a truncate (or more
>   likely to be relevant - ioctl) is attempted.
> 
> The PoC partially implements the second one (but has bugs), but I'm not
> sure which is best.  The second one is most flexible and makes more sense
> to me from a user perspective, but does come with performance
> implications.
> 
> 
> (Disallowing) self-supervision
> ------------------------------
> 
> We should figure out a way to ensure that a process cannot call
> landlock_restrict_self() with a ruleset that has a supervisor to which it
> has access (i.e. via a supervisor ruleset fd).  This prevents
> accidental misuse, and also prevents deadlocks as discussed in [11].  I'm
> not sure if this will be easy to implement, however.

This seems like a graph acyclicity problem.

Here are a couple cases to consider:

1. LANDLOCK_RESTRICT_SELF_TSYNC misuse:

In the case where a user wants to use this supervisor to supervise other
threads within the same process, they could naively pass
LANDLOCK_RESTRICT_SELF_TSYNC (merged into 7.0) when enforcing the
supervisee_fd. This would enforce the same policy on the thread running
the supervisor as on the supervisee.

2. Transfer of the supervisee_fd (SCM_RIGHTS)

It's possible to transfer file descriptors over unix domain sockets. If
we had a supervisor daemon that used this form of IPC to send precooked
supervisee_fds to other threads, and one of those ended up in a parent
process of the supervisor, we could inadvertently end up with problems.

3. Blocking in other LSMs (pointed out in your source [11])

This is the hardest case to deal with: other LSMs like TOMOYO can also
block and cause dependency cycles.

---

This gets tricky, and I don't know if just checking parent / child
relationships would work, because the supervisor and supervisee rulesets
are just file descriptors, and there is a potentially unlimited number of
ways these FDs could be transferred or instantiated.

I think the best way to deal with this is constraining the problem space:

An idea (binding supervisors/supervisees to domains on first use)

Whenever landlock_restrict_self(supervisee_fd, ...) is called, check the
current domain credentials and verify that the domain is a *proper
subset* of the supervisor's domain. Then permanently close the
supervisee_fd and never allow it to be enforced again. Similarly, once a
supervisor_fd is created, never allow committing from a context where
"current landlock domain != original landlock domain at creation".

This prevents post-enforcement usage of the supervisee_fd by a parent
domain, and post-commit usage of a supervisor_fd by any subdomain.

I'm not sure if it's possible to check whether one domain is a
proper subset of another (i.e. supervisor domain includes but *doesn't equal*
supervisee domain), but I think that's one way to do it.
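
One possible reading of "proper subset" is "strictly nested below". If
domains keep a parent pointer to the domain they were stacked on (an
assumption of this sketch, as are all the `model_*` names), the check
becomes a walk up the parent chain. NULL stands for "no Landlock domain
at all".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a domain: only the link to its parent matters. */
struct model_domain {
	const struct model_domain *parent;
};

/*
 * Returns 1 if `ancestor` is strictly above `node` in the nesting chain,
 * 0 otherwise (including when the two are equal).
 */
int model_is_strict_ancestor(const struct model_domain *ancestor,
			     const struct model_domain *node)
{
	const struct model_domain *d;

	if (!node)
		return 0; /* nothing is strictly above "no domain" */
	for (d = node->parent; ; d = d->parent) {
		if (d == ancestor)
			return 1;
		if (!d)
			return 0; /* reached the top without a match */
	}
}
```

The equal-domain case returns 0, which is what would reject a thread
trying to enforce a supervisee ruleset on the domain its own supervisor
runs in. It does nothing for case 3 above.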

This idea would help, but doesn't address case 3 above.

> 
> [11]: https://lore.kernel.org/all/cc3e131f-f9a3-417b-9267-907b45083dc3@maowtm.org/
> 
> 
> Supervisor notification
> -----------------------
> 
> The above RFC only covers mutable domains.  The natural next stage of this
> work is to send notification to the supervisor on access denials, so that
> it can decide whether to allow the access or not.  For that, there are
> also lots of questions at this stage:
> 
> 
> - Should we in fact implement that first, before mutable domains?  This
>   means that the supervisor would only be able to find out about denials,
>   but not allow them without a sandbox restart.  We still eventually want
>   the mutable domains, since that makes this a lot more useful, but I can
>   see some use cases for just the notification part (e.g. island denial
>   log), and I can't see a likely use case for just mutable domains, aside
>   from live reload of landlock-config (maybe that _is_ useful on its own,
>   considering that you can also find out about denials from the kernel
>   audit log, and add missing rules based on that).
> 
> 
> - Earlier when implementing the Landlock supervise v1 RFC, I basically
>   came up with an ad-hoc uAPI for the notification [12], and the PoC code
>   linked to above also uses this uAPI.  There are of course many problems
>   with this as it stands, e.g. it only having one destname, which means
>   that for rename, the fd1 needs to be the child being moved, which does
>   not align with the vfs semantic and how Landlock treats it (i.e. the
>   thing being updated here is the parent directory, not the child itself).
>   Same for delete, which currently sends the child as fd1.
> 
>   But also, in discussion with Mickaël last year, he mentioned that we
>   could reuse the fsnotify infrastructure, and perhaps additionally, use
>   fanotify to deliver these notifications.  I do think there is some
>   potential here, as fanotify already implements an event header, a
>   mechanism for receiving and replying to events, etc.  We could possibly
>   extend it to send Landlock specific notifications via a new kind of mark
>   (FAN_MARK_LANDLOCK_DOMAIN ??) and add one or more new corresponding
>   event types.  Mickaël mentioned mount notifications [13] as an example
>   of using fanotify to send notifications other than file/dir
>   modifications.
> 
>   I'm not sure if directly extending the fanotify uAPI is a good idea tho,
>   considering that Landlock is not a feature specific to the filesystem -
>   we will also have denial events for net_port rules, and perhaps more in
>   the future.  However, Mickaël mentioned that there might be some
>   internal infrastructure which we can re-use (even if we have our own
>   notification uAPI).

I think that a new FAN_MARK would be required to use the fanotify uAPI.

I have a couple of questions about this (if we extend fanotify):

1. What FAN_CLASS_* would notifications use?

FAN_CLASS_* specifies the notification class, i.e. when notifications are
triggered and whether permission decisions are possible.

See [1] for the current classes.

If we want interactive, pre-access blocking, that would correspond to
FAN_CLASS_PRE_CONTENT or FAN_CLASS_CONTENT, both of which currently
require CAP_SYS_ADMIN regardless of FAN_MARK. Supervisors would
therefore need CAP_SYS_ADMIN if the current requirements remain in
place.

(If we don't have interactive blocking denials, we could just use
FAN_CLASS_NOTIF)

2. How would fanotify events be encoded?

Events in fanotify use this structure for event data (one or more of the
following must be received in a notification) [2]:

           struct fanotify_event_metadata {
               __u32 event_len;
               __u8 vers;
               __u8 reserved;
               __u16 metadata_len;
               __aligned_u64 mask;
               __s32 fd;
               __s32 pid;
           };

There are access classes landlock restricts that might not have an fd at
all, like abstract unix sockets, tcp ports, signals etc.

Good news is fanotify supports multiple types of additional information
records, and we could potentially extend fanotify to support new ones as
you alluded to.

For examples of this, see struct fanotify_event_info_mnt,
fanotify_event_info_pidfd.

These records get attached to the event so they could be used to pass
landlock access data.
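
As a sketch of what consuming such a record could look like from user
space: struct fanotify_event_metadata and struct
fanotify_event_info_header are the existing uAPI, but the Landlock info
type value, the record layout and the `model_*` names below are purely
hypothetical and exist nowhere today.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/fanotify.h>

/* HYPOTHETICAL: fanotify has no Landlock info type; value is made up. */
#define MODEL_FAN_EVENT_INFO_TYPE_LANDLOCK 0x7f

/* HYPOTHETICAL record carrying the denied Landlock access. */
struct model_fanotify_event_info_landlock {
	struct fanotify_event_info_header hdr;
	__u32 rule_type;      /* e.g. path vs. net port (made up) */
	__aligned_u64 access; /* requested access bits (made up) */
};

/*
 * Walks the info records that follow the fixed metadata of one event and
 * returns the first record of the given type, or NULL if there is none
 * or a record is malformed.
 */
const struct fanotify_event_info_header *
model_find_info(const struct fanotify_event_metadata *ev, __u8 type)
{
	const char *p, *end;

	assert(ev != NULL);
	if (ev->metadata_len > ev->event_len)
		return NULL;
	p = (const char *)ev + ev->metadata_len;
	end = (const char *)ev + ev->event_len;

	while ((size_t)(end - p) >= sizeof(struct fanotify_event_info_header)) {
		const struct fanotify_event_info_header *h = (const void *)p;

		if (h->len < sizeof(*h) || h->len > (size_t)(end - p))
			return NULL; /* malformed record */
		if (h->info_type == type)
			return h;
		p += h->len;
	}
	return NULL;
}
```

An event for an access without any file (abstract unix socket, TCP port,
signal) could then set fd to FAN_NOFD and carry everything in the
record.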

3. If we support interactive permission decisions (even for a
subset of landlock access rights only), do we use the response code? 
(question might be moot if we don't do blocking/responses at all)

From [2]:

       For permission events, the application must write(2) a structure
       of the following form to the fanotify file descriptor:

           struct fanotify_response {
               __s32 fd;
               __u32 response;
           };


response is a FAN_ALLOW or FAN_DENY. This is used by fanotify as a
one-time access decision. Would this be used to do one-off exceptions to
policy, or would we require policy decisions to go through the
supervisor_fd and ignore the response code?

4. How would we reconcile the disparity between fanotify access rights
and landlock access rights?

There's no clean 1:1 mapping between fanotify access rights and landlock
access rights as Mickaël pointed out. [2] [3]

Many fs rights (creation, deletion, rename, linking) are not handled or
implemented by fanotify (not even considering network/unix/signal
scoping), so we'd be adding all these landlock specific rights.

We could make a "catch-all" FAN_LANDLOCK_ACCESS or similar and ignore
all the existing rights, and put the actual access data in the event
record. It's awkward either way.

---

In conclusion, I think extending fanotify is more viable than seccomp
from a purely technical standpoint, because it seems extensible
and because it runs post-lsm hooks.

That being said, it's awkward, requires large extensions to the API, and
definition of permissions that are specific to landlock.

Whether or not landlock makes sense in fanotify from a semantic point of
view is an entirely different question. There's no precedent for
non-filesystem access controls in fanotify, so it's a little... out-of-place
for an LSM to expose features on a filesystem access notification api?

Curious what people think.

[1]: https://man7.org/linux/man-pages/man2/fanotify_init.2.html
[2]: https://man7.org/linux/man-pages/man7/fanotify.7.html
[3]: https://lore.kernel.org/all/20250304.Choo7foe2eoj@digikod.net/
> 
> 
> - The other uAPI alternative which I have been thinking of is to extend
>   seccomp-unotify.  For example, a Landlock denial could result in the
>   syscall being trapped and a `struct seccomp_notif` being sent to the
>   seccomp supervisor (via the existing mechanism), with additional
>   information (mostly, the file(s) / net ports being accessed and access
>   rights requested) attached to the notification _somehow_.  Then the
>   supervisor can use the same kind of responses one would use for
>   seccomp-unotify to cause the syscall to either be retried (possibly via
>   `SECCOMP_USER_NOTIF_FLAG_CONTINUE`) or return with an error code of its
>   choice (or alternatively, carry out the operation on behalf of the
>   child, and pretend that the syscall succeeded, which might be useful to
>   implement an "allow file creation but only this file" / "allow `mktemp
>   -d` but not arbitrary create on anything under /tmp").
> 
>   Looking at `struct seccomp_notif` and `struct seccomp_data` however, I'm
>   not sure how feasible / doable this extension would be.  Also,
>   seccomp-unotify is supposed to trigger before a syscall is actually
>   executed, whereas if we use it this way, we will want it to trigger
>   after we're already midway through the syscall (in the LSM hook).  This
>   might make it hard to implement (and also twists a bit the uAPI
>   semantics of seccomp-unotify).
>

(Some of the stuff discussed with seccomp below is derived from a side
conversation with Tingmao over this proposal)

There are some problems with extending seccomp unotify. Passing the
full context needed through this api to the supervisor is problematic.
seccomp unotify notifications look like this [4]:

           struct seccomp_notif {
               __u64  id;              /* Cookie */
               __u32  pid;             /* TID of target thread */
               __u32  flags;           /* Currently unused (0) */
               struct seccomp_data data;   /* See seccomp(2) */
           };

And struct seccomp_data [5]:

           struct seccomp_data {
               int   nr;                   /* System call number */
               __u32 arch;                 /* AUDIT_ARCH_* value
                                              (see <linux/audit.h>) */
               __u64 instruction_pointer;  /* CPU instruction pointer */
               __u64 args[6];              /* Up to 6 system call arguments */
           };

Even if we pass the syscall data, for userspace to actually decode the
arguments and figure out what the access is doing, we have two
critical problems (1,2) and one annoyance (3):

1. The syscall itself doesn't necessarily contain the full context of the access.

2. We cannot decode the pointer-based arguments from userspace for a syscall
in seccomp without TOCTOU. It also requires reaching into userspace
memory. [6]

3. Decoding the syscall number is an arch-specific operation that we now have
to expect userspace to deal with.

So unless there's something I'm missing, extending seccomp unotify doesn't
really make sense. It's not as extensible an API as fanotify.

Unless we artificially trigger some notification after the fact, and figure out how
to jam the relevant access information into the notification or pass it through
a side channel, it's gonna be a difficult path forward to use seccomp directly.

[4]: https://man7.org/linux/man-pages/man2/seccomp_unotify.2.html
[5]: https://man7.org/linux/man-pages/man2/seccomp.2.html
[6]: https://blog.skepticfx.com/post/seccomp-pointers/
> 
> Are there any immediate reasons, from Landlock's perspective, to rule out
> either of them?  (I will probably wait for at least a first review from

I think direct extensions to seccomp are awkward at best, and it's
difficult to reason about an extension that would make sense.

fanotify seems more viable, but would require heavy extensions
(new record types, permission types) and adding landlock to it would be
inconsistent with the existing implementation semantically. (landlock is
not VFS specific).

I think the most viable path forward if this is to be done is a
dedicated uAPI. That being said, I think what Mickaël said about reusing
the internals is viable.

> the Landlock side before directing this explicitly to the fanotify and/or
> seccomp-unotify maintainers, in case the plan significantly changes, but
> if somehow a maintainer/reviewer from either of those areas is already
> reading this, firstly thanks, and feedback would be very valuable :D )
> 
> [12]: https://lore.kernel.org/all/cde6bbf0b52710b33170f2787fdcb11538e40813.1741047969.git.m@maowtm.org/#iZ31include:uapi:linux:landlock.h
> [13]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.15-rc1&id=fd101da676362aaa051b4f5d8a941bd308603041

* Re: [RFC] Landlock: mutable domains (and supervisor notification uAPI options)
  2026-02-15  2:54 [RFC] Landlock: mutable domains (and supervisor notification uAPI options) Tingmao Wang
  2026-02-15 21:23 ` Justin Suess
@ 2026-02-22 18:04 ` Tingmao Wang
  1 sibling, 0 replies; 6+ messages in thread
From: Tingmao Wang @ 2026-02-22 18:04 UTC (permalink / raw)
  To: Günther Noack, Mickaël Salaün, Justin Suess
  Cc: linux-security-module

On 2/15/26 02:54, Tingmao Wang wrote:
> PoC code used in the above videos are largely generated, somewhat buggy,
> and unreviewed, but they are available:
>
> - mutable domains:
>     https://github.com/micromaomao/linux-dev/pull/26/changes
> - supervisor notification:
>     https://github.com/micromaomao/linux-dev/pull/27/changes

btw, on second thought I should have clarified that I don't expect anyone
to review any of the code here.  Those were done purely to ensure that the
design I'm asking for review here actually works, and I'm open to any
changes.

* Re: [RFC] Landlock: mutable domains (and supervisor notification uAPI options)
  2026-02-15 21:23 ` Justin Suess
  2026-02-16 21:27   ` Justin Suess
@ 2026-02-22 18:04   ` Tingmao Wang
  1 sibling, 0 replies; 6+ messages in thread
From: Tingmao Wang @ 2026-02-22 18:04 UTC (permalink / raw)
  To: Justin Suess, Günther Noack, Mickaël Salaün
  Cc: amir73il, jack, jannh, linux-security-module, penguin-kernel,
	song

On 2/15/26 21:23, Justin Suess wrote:
> On Sun, Feb 15, 2026 at 02:54:08AM +0000, Tingmao Wang wrote:
> [...]
>> Discussion on LANDLOCK_ADD_RULE_INTERSECT
>> -----------------------------------------
>>
>> This was initially proposed by Mickaël, although now after writing some
>> example code against it [7], I'm not 100% sure that it is the most useful
>> uAPI.  For a supervisor based on some sort of config file, it already has
>> to track which rules are added to know what to remove, and thus I feel
>> that it would be easier (both to use and to implement) to have an API that
>> simply "replaces" a rule, rather than do a bitwise AND on the access.
>>
> Instead of intersection being done at the rule level via
> landlock_add_rule, would it be better for intersection to be done at the
> ruleset_fd/ruleset level?
>
> So instead of intersecting individual rules, you can intersect entire
> rulesets, with the added benefit of being able to intersect handled
> accesses as well. (so you could handle an access initially, and not
> handle it later).

Personally I don't think making the list of handled accesses mutable would
add a lot of value (after all, a sandbox would usually handle all accesses
that it knows of), and I would like to avoid the complexity of making the
list of handled accesses mutable.  The semantic of "intersection" and
"union" of handled accesses is also not trivial: if ruleset A handles
read, and ruleset B handles read+write, their "intersection", if
interpreted as "only allow accesses allowed by both rulesets", would in
fact handle read+write (and their "union" would handle read only).

In the second (union) case, there is also the problem of what to do if
ruleset B has write access rules - these rules would technically become
invalid (although to no negative effect) in a ruleset that doesn't handle
write.
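
To spell out the read / read+write example, here is a small model for a
single object. It assumes an access is denied exactly when the ruleset
handles it and no rule allows it; the `model_*` names and the two-bit
access space are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_READ  (UINT64_C(1) << 0)
#define MODEL_WRITE (UINT64_C(1) << 1)

/* Hypothetical model of one ruleset as seen from one fixed object. */
struct model_layer {
	uint64_t handled; /* handled access bits */
	uint64_t allowed; /* bits granted by rules on this object */
};

/* An access is denied iff it is handled and not allowed by a rule. */
int model_permits(const struct model_layer *l, uint64_t request)
{
	uint64_t denied = l->handled & ~l->allowed;

	return (request & denied) == 0;
}

/* "Allowed by both": the result must handle whatever either one handles. */
struct model_layer model_intersect(struct model_layer a, struct model_layer b)
{
	uint64_t denied = (a.handled & ~a.allowed) | (b.handled & ~b.allowed);
	struct model_layer c;

	c.handled = a.handled | b.handled;
	c.allowed = c.handled & ~denied;
	assert((c.allowed & ~c.handled) == 0);
	return c;
}

/* "Allowed by either": only bits handled by both can still be denied. */
struct model_layer model_union(struct model_layer a, struct model_layer b)
{
	struct model_layer c;

	c.handled = a.handled & b.handled;
	/* Rules for bits outside c.handled are dropped, they have no effect. */
	c.allowed = c.handled & (a.allowed | b.allowed);
	return c;
}
```

With A = {handled: read} and B = {handled: read+write, allowed: write},
the intersection handles read+write and the union handles read only,
with B's write rule falling outside the handled set.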

I do see the benefit of modifying scope bits (and maybe also
quiet_access_* bits), but I'm still worried about the extra complexity
(and thus also testing / docs needed etc)

>
> Intersecting at the ruleset level allows for grouping the intersection rules
> together, so you could create an unenforced ruleset for the sole purpose
> of intersecting with rulesets, and intersect all the rule(s) at once.
>
> That way, the ruleset fd can be reused for this purpose later with other
> supervisees, instead of creating ruleset, intersecting individual rules,
> repeat.
>
> I also think the semantics of having a function called
> "landlock_add_rule" actually removing accesses (when the intersect flag
> is added) are confusing, because we're not really *add*-ing
> anything, we're removing.
>
> ALTERNATIVE #1
>
> Maybe the best way to do it is instead continue treating rulesets as
> immutable, but allow composition of them at ruleset creation time.
>
> This would look something like:
>
> Ruleset C = Ruleset A & Ruleset B
>
> Ruleset A and B are never modified, but instead a new Ruleset C is
> created that is the intersection of A and B. This could be done in a
> variety of ways (LANDLOCK_CREATE_RULESET_INTERSECT? new IOCTL?)
>
> An example API for what this might look like:
>
>   struct landlock_ruleset_attr ruleset_attr = {
>           // other fields for handled accesses must be blank.
>           .left_fd = existing_fd,
>           .right_fd = other_existing_fd,
>   };
>   int new_ruleset_fd = syscall(SYS_landlock_create_ruleset, &ruleset_attr,
>     sizeof(ruleset_attr), LANDLOCK_CREATE_RULESET_INTERSECT);
>
> And then the resulting ruleset which is the intersection of existing_fd
> and other_existing_fd could be returned.
>
> Similarly, we could:
>
>   int new_ruleset_fd = syscall(SYS_landlock_create_ruleset, &ruleset_attr,
>       sizeof(ruleset_attr), LANDLOCK_CREATE_RULESET_UNION);

If we do stick with the "intersect" way of removing rules (instead of
replace / clear all), this does seem like an interesting idea.  However,
it is more complex to implement (it will probably require traversing two
rbtrees at once to be implemented efficiently), and I'm not sure how much
utility this would add compared to just LANDLOCK_ADD_RULE_INTERSECT.  See
below for more reasoning.
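
As a rough illustration of "traversing two rbtrees at once", the sketch
below intersects two rule lists that are already sorted by key, which is
the order an in-order walk of each tree would produce.  It is a
user-space model only: struct toy_rule and toy_intersect_rules() are
hypothetical names.  It also assumes both rulesets handle the same
accesses, and it ignores the hierarchical nature of real Landlock path
rules, where a rule on a parent directory also covers its children.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy rule: a key standing in for the rule's object, plus its accesses. */
struct toy_rule {
	uintptr_t key;
	uint32_t access;
};

/*
 * Intersect two rule lists sorted by ascending key, in one pass.
 * A key present in only one list is dropped (the other side grants
 * nothing for it); a key present in both keeps the AND of the two masks,
 * unless that AND is empty.  Returns the number of rules written to
 * out, which must have room for min(na, nb) entries.
 */
static size_t toy_intersect_rules(const struct toy_rule *a, size_t na,
				  const struct toy_rule *b, size_t nb,
				  struct toy_rule *out)
{
	size_t i = 0, j = 0, n = 0;

	assert(a && b && out);
	while (i < na && j < nb) {
		if (a[i].key < b[j].key) {
			i++;
		} else if (a[i].key > b[j].key) {
			j++;
		} else {
			uint32_t both = a[i].access & b[j].access;

			if (both)
				out[n++] = (struct toy_rule){ a[i].key, both };
			i++;
			j++;
		}
	}
	return n;
}
```

The walk is linear in the total number of rules, but the kernel version
would additionally have to deal with allocation, locking and the layer
information attached to each rule, none of which is modelled here.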

>
> Which would be convenient for creating unions of rulesets.
>
> Then, instead of mutating rulesets, we commit/replace an entirely new ruleset.
>
> ioctl(supervisee_fd, LANDLOCK_IOCTL_COMMIT_RULESET, &new_ruleset_fd);

Using a dedicated ioctl to commit is also a potentially better idea - I
find having the commit be a side effect of landlock_add_rule() via a
flag a bit unwieldy, as it would either require the supervisor to track
when it adds the last rule, or to add an "empty" rule just to commit.

Mickaël, you initially suggested the LANDLOCK_ADD_RULE_COMMIT_SUPERVISOR
flag, but do you think this is better?

>
> This has the following benefits:
>
> 1. Clearer semantics: "landlock_add_rule" is just for adding rules, not
> removing.
>
> 2. Intersection of all ruleset attributes, not just individual rule
> attributes.
>
> 3. Better logical grouping of rules for the purpose of intersection, and
> better composition.
>
> It does have drawbacks:
>
> 1. Intersecting individual rules requires making an entire ruleset for
> that one rule.
>
> 2. Users must be responsible for closing the unused/old rulesets that
> they may no longer need.
>
> ALTERNATIVE #2
>
> A middle ground is to keep the ruleset mutation via landlock_add_rule,
> but have it be done at the ruleset_fd level.
>
> Something like this:
>
>   struct landlock_ruleset_operand intersection = {
>     .operand = other_ruleset_fd
>   };
>   landlock_add_rule(ruleset_fd, LANDLOCK_RULE_INTERSECT_RULESET, &intersection, 0);
>
> I think this is also a valid way to do things, and increases the
> reusability of rulesets.
>
> 1. Again, having landlock_add_rule being used to actually remove rules
> is confusing.

In this case, wouldn't we also be removing rules via landlock_add_rule()?
Personally I feel like this inconsistency is tolerable (it's easy enough
to explain), but I guess we could also change this to an ioctl if this is
a problem.

>
> 2. I'm unsure if we can change handled accesses after ruleset creation,
> so we might not be able to intersect the handled accesses like we can in
> the ALTERNATIVE #1.
>
>> Another alternative is to simply have a "clear all rules in this ruleset"
>> flag.  This allows the supervisor to not have to track what is already
>> allowed - if it reloads the config file, it can simply clear the ruleset,
>> re-add all rules based on the config, then commit it.  Although I worry
>> that this might make implementing some other use cases more difficult.
>
> At a minimum, it is cumbersome, and I worry about file descriptors
> becoming inaccessible (due to bind mounts / namespace changes in the
> supervisor's environment).
>
> Of course they can just hold those file descriptors open for the purposes
> of future intersections, but this is annoying and error prone.

If a supervisor doesn't care about potential renames / mount / namespace
changes making the sandboxed application lose access to previously
accessible files, the "clear all rules" approach would not force it to
keep any fds open in order to remove rules (i.e. it can clear everything,
then re-open the fds to add the rules back).  On the other hand, with the
"intersect" approach, it would have to keep the fds open in all cases to
correctly remove previously added rules, so I think this "clear all" is
not more cumbersome.

There is a general consideration here about how much we want to design the
API to advantage / disadvantage particular ways of using it.  For example,
having ruleset-ruleset intersection / union operations would (in theory,
setting aside the fact that to remain compatible with older kernels it
cannot do this for constructions of existing static rulesets) work very
well for something like island [1] / landlock-config, where we compose
rulesets by intersecting / unioning different landlock configuration files
together.  However, it will introduce more complexity to someone who just
wants to allow access as they come up (e.g. something like a "permissive /
learning mode"), as they now have to, every time when an access is denied,
create a new ruleset, add one single rule, and do a union.

IMO, there is also a general preference for keeping complicated logic
in user-space rather than implementing it in the kernel.  In this case,
one can argue that for someone that wants to compose rulesets via logical
operations, this should really be handled by a Landlock library, which
could, for static rulesets, do it completely internally in-memory, and do
one landlock_create_ruleset() + n landlock_add_rule()s in the end.  For
live modifications, it could then use the more low-level "intersect" /
"add rule" / "clear all" uAPI.  Compared to intersecting single rules,
having the kernel do the logical operations on entire rulesets also
doesn't reduce the number of syscalls needed (you still need one
landlock_add_rule() for each rule to be modified), so there is not a
performance argument either

(although for the use case where the supervisor wants to incrementally add
and remove rules, there is a performance benefit to intersect vs "clear
all".  But I do wonder how often in practice this would be implemented for
a supervisor that can remove rules - because it needs to keep an in-memory
table for the fds it has open anyway, in order to correctly remove rules,
a simpler approach would be to simply clear all, then re-add whatever it
wants based on what it still has in that table).
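
A minimal user-space model of that "clear all, re-add, commit" reload
might look like the following.  Everything here is hypothetical (struct
toy_domain, toy_reload() and friends are names invented for the sketch)
and only mirrors the flow described above; it is not the proposed uAPI.
Staged edits stay invisible to the supervisee until the single commit at
the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_RULES 16

/* One entry of the supervisor's in-memory table: an open fd and a mask. */
struct toy_rule {
	int fd;
	uint32_t access;
};

struct toy_ruleset {
	struct toy_rule rules[TOY_MAX_RULES];
	size_t n;
};

/* Toy mutable domain: "staged" is edited, "active" is what is enforced. */
struct toy_domain {
	struct toy_ruleset staged;
	struct toy_ruleset active;
};

static void toy_clear(struct toy_domain *d)
{
	d->staged.n = 0;
}

static int toy_add_rule(struct toy_domain *d, int fd, uint32_t access)
{
	if (d->staged.n == TOY_MAX_RULES)
		return -1;
	d->staged.rules[d->staged.n++] = (struct toy_rule){ fd, access };
	return 0;
}

static void toy_commit(struct toy_domain *d)
{
	d->active = d->staged;
}

/* Reload: clear, re-add every table entry, then commit exactly once. */
static int toy_reload(struct toy_domain *d, const struct toy_rule *table,
		      size_t n)
{
	assert(d && (table || n == 0));
	toy_clear(d);
	for (size_t i = 0; i < n; i++)
		if (toy_add_rule(d, table[i].fd, table[i].access))
			return -1; /* active rules are left untouched */
	toy_commit(d);
	return 0;
}

/* Accesses currently enforced as allowed for a given fd. */
static uint32_t toy_active_access(const struct toy_domain *d, int fd)
{
	uint32_t access = 0;

	for (size_t i = 0; i < d->active.n; i++)
		if (d->active.rules[i].fd == fd)
			access |= d->active.rules[i].access;
	return access;
}
```

The supervisor never has to compute a diff against the previous
configuration here; the cost is that every reload re-adds all rules.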

[1]: https://github.com/landlock-lsm/island

* Re: [RFC] Landlock: mutable domains (and supervisor notification uAPI options)
  2026-02-16 21:27   ` Justin Suess
@ 2026-02-22 18:04     ` Tingmao Wang
  0 siblings, 0 replies; 6+ messages in thread
From: Tingmao Wang @ 2026-02-22 18:04 UTC (permalink / raw)
  To: Justin Suess, Günther Noack, Mickaël Salaün
  Cc: amir73il, jack, jannh, linux-security-module, penguin-kernel,
	song

On 2/16/26 21:27, Justin Suess wrote:
> On Sun, Feb 15, 2026 at 02:54:08AM +0000, Tingmao Wang wrote:
>> [...]
>> The next stage is to introduce "mutable domains".  The motivation for this
>> is twofold:
>>
>> 1. This allows the supervisor to allow access to (large) file hierarchies
>>    without needing to be woken up again for each access.
>> 2. Because we cannot block within security_path_mknod and other
>>    directory-modification related hooks [6], the proposal was to return
>>    immediately from those hooks after queuing the supervisor notification,
>>    then wait in a separate task_work.  This however means that we cannot
>>    directly "allow" access (and even if we can, it may introduce TOCTOU
>>    problems).  In order to allow access to requested files, the supervisor
>>    has to add additional rules to the (now mutable) domain which will
>>    allow the required access.
>
> Is blocking during connect(2) allowed either if the socket is non-blocking?
>
> This may be another example case that needs to be handled differently than calls
> we can block in safely.

I think the non-blocking socket case is worth considering but is
orthogonal to the above discussion - even when the socket is marked
non-blocking, we can still "safely" block in the connect hook (as long as
there isn't another reason, like locks, that prevents us from doing so),
because it only affects the calling process.  Although if we're doing the
-ERESTARTNOINTR approach for fs, we might as well do the same here and
return -ERESTARTNOINTR from the connect hook (but that would still "block"
the syscall).  Also, even for the file create/delete case, while we don't
block in the hook, from the user-space perspective, we're still blocking
(i.e. in the task_work), since the syscall doesn't immediately return (but
instead waits for a response from the supervisor).

However we could consider implementing some special case for
non-blocking-capable operations (like socket connect) such that we return
from the syscall without waiting for a supervisor response (and we return
a value that a "try again later" response would normally return), but then
we have to figure out how to cause a "you can retry now" notification that
would normally be sent for the non-blocking operation.

Related, there is the question of how we want to handle "non-blocking
open()s".  Technically there isn't a non-blocking open() for regular files
from the user-space perspective, but ideally, for example, a wait on a
Landlock supervisor would not prevent an io_uring_enter() processing
multiple openat()s from continuing on to other requests.  But I suspect
this would be very hard to implement.

> [...]
>> (Disallowing) self-supervision
>> ------------------------------
>>
>> We should figure out a way to ensure that a process cannot call
>> landlock_restrict_self() with a ruleset that has a supervisor to which it
>> has access (i.e. via a supervisor ruleset fd).  This prevents
>> accidental misuse, and also prevents deadlocks as discussed in [11].  I'm
>> not sure if this will be easy to implement, however.
>
> This seems like a graph acyclicity problem.
>
> Here are a couple cases to consider:
>
> 1. LANDLOCK_RESTRICT_SELF_TSYNC misuse:
>
> In the case where a user wants to use this supervisor to supervise other
> threads within the same process, a user could naively pass
> LANDLOCK_RESTRICT_SELF_TSYNC (merged into 7.0) when enforcing the
> supervisee_fd. This would enforce the same policy on the thread running
> the supervisor and the supervisee.

Yes, but if the intention is to supervise threads within the same process
(which IMO is already a slightly questionable use case, maybe it can be
used as a tracing method, but certainly not for security), calling
restrict with LANDLOCK_RESTRICT_SELF_TSYNC is itself a mistake.  There is
a case for us to try to detect and prevent this, but I'm thinking it would
be more to prevent "useless sandboxing" than to prevent deadlocks
(after all, it might not deadlock if someone else also has a fd for that
supervisor).

>
> 2. Transfer of the supervisee_fd (SCM_RIGHTS)
>
> It's possible to transfer file descriptors over unix domain sockets. If
> we had a supervisor daemon that used this form of IPC to send precooked
> supervisee_fds to other threads, and one of those ended up in a parent
> process of the supervisor, we could inadvertently end up with problems.

Note that being a parent process of the supervisor itself is not a
problem, what matters here is the Landlock domain relationships.  It will
only be a problem if the fd is then used to make the supervisor be
supervised by itself (but if it's an unrelated parent process that is
supervised, it should still be fine).

>
> 3. Blocking in other LSMs (pointed out in your source [11])
>
> The hardest case to deal with, other LSMs like TOMOYO can also block and
> cause dependency cycles.
>
> ---
>
> This gets tricky, and I don't know if just checking parent / child
> relationships would work, because the supervisor and supervisee rulesets
> are just file descriptors, and there is a potentially unlimited number of
> ways these FDs could be transferred or instantiated.
>
> I think the best way to deal with this is constraining the problem space:
>
> An idea (binding supervisors/supervisees to domains on first use)
>
> Whenever landlock_restrict_self(supervisee_fd,...) is called, check the
> current domain credentials and verify that the domain is a *proper
> subset* of the supervisor's domain. Then permanently close the
> supervisee_fd and never allow re-enforcement. Similarly, once a
> supervisor_fd is created, never allow committing from a context with
> "current landlock domain != original landlock domain at creation".
>
> This prevents post-enforcement usage of the supervisee_fd by a parent
> domain, and post-commit usage of a supervisee_fd by any subdomain.

It's not the supervisee fd that matters, right?  For example, a supervisor
can still call landlock_restrict_self() with the supervisee_fd it acquired
from the ioctl() (after all, this call itself is fine, it's only the fact
that there is no other process holding the supervisor fd that means this
gets us into a deadlock situation).  I would think that if we
automatically close any fds, it would be the supervisor fd that we would
close.  In this particular case this then means that there is no longer
any process holding the supervisor fd, which is a case we can detect, and
just deny any access request by the supervisee from that point forward
(preventing deadlock).
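
The "no one is left to answer" check can be sketched as a simple holder
count.  This is a user-space toy with hypothetical names
(struct toy_supervisor, toy_on_denied_access()), it counts open
supervisor fds rather than processes, and it says nothing about how the
kernel would actually track them.

```c
#include <assert.h>

/* Toy supervisor object: counts the open supervisor fds that refer to it. */
struct toy_supervisor {
	unsigned int holders;
};

enum toy_verdict {
	TOY_WAIT_FOR_SUPERVISOR, /* queue a notification and wait for a reply */
	TOY_DENY_NOW,            /* nobody can ever reply: deny immediately */
};

/* A new fd referring to the supervisor was created or duplicated. */
static void toy_supervisor_get(struct toy_supervisor *s)
{
	s->holders++;
}

/* An fd referring to the supervisor was closed. */
static void toy_supervisor_put(struct toy_supervisor *s)
{
	assert(s->holders > 0);
	s->holders--;
}

/* Decide what a denied access in the supervised domain should do. */
static enum toy_verdict toy_on_denied_access(const struct toy_supervisor *s)
{
	return s->holders ? TOY_WAIT_FOR_SUPERVISOR : TOY_DENY_NOW;
}
```

Once the last holder goes away, every later denial resolves immediately
instead of waiting, so the supervisee cannot block forever on a
supervisor that no longer exists.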

>
> I'm not sure if it's possible to check whether one domain is a
> proper subset of another (ie supervisor domain includes but *doesn't equal*
> supervisee domain), but I think that's one way to do it.
>
> This idea would help, but doesn't address case 3 above.
>
>>
>> [11]: https://lore.kernel.org/all/cc3e131f-f9a3-417b-9267-907b45083dc3@maowtm.org/
>>
>>
>> Supervisor notification
>> -----------------------
>>
>> The above RFC only covers mutable domains.  The natural next stage of this
>> work is to send notification to the supervisor on access denials, so that
>> it can decide whether to allow the access or not.  For that, there are
>> also lots of questions at this stage:
>>
>>
>> - Should we in fact implement that first, before mutable domains?  This
>>   means that the supervisor would only be able to find out about denials,
>>   but not allow them without a sandbox restart.  We still eventually want
>>   the mutable domains, since that makes this a lot more useful, but I can
>>   see some use cases for just the notification part (e.g. island denial
>>   log), and I can't see a likely use case for just mutable domains, aside
>>   from live reload of landlock-config (maybe that _is_ useful on its own,
>>   considering that you can also find out about denials from the kernel
>>   audit log, and add missing rules based on that).
>>
>>
>> - Earlier when implementing the Landlock supervise v1 RFC, I basically
>>   came up with an ad-hoc uAPI for the notification [12], and the PoC code
>>   linked to above also uses this uAPI.  There are of course many problems
>>   with this as it stands, e.g. it only having one destname, which means
>>   that for rename, the fd1 needs to be the child being moved, which does
>>   not align with the VFS semantics and how Landlock treats it (i.e. the
>>   thing being updated here is the parent directory, not the child itself).
>>   Same for delete, which currently sends the child as fd1.
>>
>>   But also, in discussion with Mickaël last year, he mentioned that we
>>   could reuse the fsnotify infrastructure, and perhaps additionally, use
>>   fanotify to deliver these notifications.  I do think there is some
>>   potential here, as fanotify already implements an event header, a
>>   mechanism for receiving and replying to events, etc.  We could possibly
>>   extend it to send Landlock specific notifications via a new kind of mark
>>   (FAN_MARK_LANDLOCK_DOMAIN ??) and add one or more new corresponding
>>   event types.  Mickaël mentioned mount notifications [13] as an example
>>   of using fanotify to send notifications other than file/dir
>>   modifications.
>>
>>   I'm not sure if directly extending the fanotify uAPI is a good idea tho,
>>   considering that Landlock is not a feature specific to the filesystem -
>>   we will also have denial events for net_port rules, and perhaps more in
>>   the future.  However, Mickaël mentioned that there might be some
>>   internal infrastructure which we can re-use (even if we have our own
>>   notification uAPI).
> I think that a new FAN_MARK would be required to use fanotify uAPI.
>
> There are a couple questions I have with this: (if we extend fanotify)
>
> 1. What FAN_CLASS_* would notifications use?
>
> FAN_CLASS_* specifies the type of notification and when the notification
> is triggered.
>
> See [1] for the current classes.
>
> If we want interactive, pre-access blocking, that would correspond to
> FAN_CLASS_PRE_CONTENT or FAN_CLASS_CONTENT, both of which currently
> require CAP_SYS_ADMIN regardless of FAN_MARK.  Supervisors would
> therefore need CAP_SYS_ADMIN if the current requirements remain in
> place.
>
> (If we don't have interactive blocking denials, we could just use
> FAN_CLASS_NOTIF)

I'm not sure.  Using fanotify is just a vague idea at this point and I
don't have any concrete design to offer.  This is indeed a good point -
we definitely don't want to require CAP_SYS_ADMIN (since being
unprivileged is one advantage of Landlock).

>
> 2. How would fanotify events be encoded?
>
> Events in fanotify use this structure for event data (one or more of the
> following must be received in a notification) [2]
>
>            struct fanotify_event_metadata {
>                __u32 event_len;
>                __u8 vers;
>                __u8 reserved;
>                __u16 metadata_len;
>                __aligned_u64 mask;
>                __s32 fd;
>                __s32 pid;
>            };
>
> There are access classes landlock restricts that might not have an fd at
> all, like abstract unix sockets, tcp ports, signals etc.

This was also one of my concerns.  I guess we might still need a cookie to
identify such events.  This is one of the reasons why I think extending
seccomp-unotify makes slightly more sense (if we have to choose one of
these two, but of course we can still create a standalone uAPI).

>
> Good news is fanotify supports multiple types of additional information
> records, and we could potentially extend fanotify to support new ones as
> you alluded to.
>
> For examples of this, see struct fanotify_event_info_mnt,
> fanotify_event_info_pidfd.
>
> These records get attached to the event so they could be used to pass
> landlock access data.
>
> 3. If we support interactive permission decisions (even for a
> subset of landlock access rights only), do we use the response code?
> (question might be moot if we don't do blocking/responses at all)

Regardless of uAPI choice, I would really like to support blocking (even
though it makes this whole thing much more difficult).

>
> From [2]:
>
>        For permission events, the application must write(2) a structure
>        of the following form to the fanotify file descriptor:
>
>            struct fanotify_response {
>                __s32 fd;
>                __u32 response;
>            };
>
>
> response is a FAN_ALLOW or FAN_DENY. This is used by fanotify as a
> one-time access decision. Would this be used to do one-off exceptions to
> policy, or would we require policy decisions to go through the
> supervisor_fd and ignore the response code?

I don't think we would be able to do one-off allow decisions (but we
should still be able to do deny) regardless of the uAPI we use, due to the
problem with inode locks.

>
> 4. How would we reconcile the disparity between fanotify access rights
> and landlock access rights?
>
> There's no clean 1:1 mapping between fanotify access rights and landlock
> access rights as Mickaël pointed out. [2] [3]
>
> Many fs rights (creation, deletion, rename, linking) are not handled or
> implemented (not even considering network/unix/signal scoping), so we'd
> be adding all these Landlock-specific rights.
>
> We could make a "catch-all" FAN_LANDLOCK_ACCESS or similar and ignore
> all the existing rights, and put the actual access data in the event
> record. It's awkward either way.
> ---
>
> In conclusion, I think extending fanotify is more viable than seccomp
> from a purely technical standpoint, because it seems extensible
> and because it runs post-lsm hooks.
>
> That being said, it's awkward, requires large extensions to the API, and
> definition of permissions that are specific to landlock.
>
> Whether or not landlock makes sense in fanotify from a semantic point of
> view is an entirely different question. There's no precedent for
> non-filesystem access controls in fanotify, so it's a little... out-of-place
> for an LSM to expose features on a filesystem access notification api?
>
> Curious on what people think.
>
> [1]: https://man7.org/linux/man-pages/man2/fanotify_init.2.html
> [2]: https://man7.org/linux/man-pages/man7/fanotify.7.html
> [3]: https://lore.kernel.org/all/20250304.Choo7foe2eoj@digikod.net/
>>
>>
>> - The other uAPI alternative which I have been thinking of is to extend
>>   seccomp-unotify.  For example, a Landlock denial could result in the
>>   syscall being trapped and a `struct seccomp_notif` being sent to the
>>   seccomp supervisor (via the existing mechanism), with additional
>>   information (mostly, the file(s) / net ports being accessed and access
>>   rights requested) attached to the notification _somehow_.  Then the
>>   supervisor can use the same kind of responses one would use for
>>   seccomp-unotify to cause the syscall to either be retried (possibly via
>>   `SECCOMP_USER_NOTIF_FLAG_CONTINUE`) or return with an error code of its
>>   choice (or alternatively, carry out the operation on behalf of the
>>   child, and pretend that the syscall succeeded, which might be useful to
>>   implement an "allow file creation but only this file" / "allow `mktemp
>>   -d` but not arbitrary create on anything under /tmp").
>>
>>   Looking at `struct seccomp_notif` and `struct seccomp_data` however, I'm
>>   not sure how feasible / doable this extension would be.  Also,
>>   seccomp-unotify is supposed to trigger before a syscall is actually
>>   executed, whereas if we use it this way, we will want it to trigger
>>   after we're already midway through the syscall (in the LSM hook).  This
>>   might make it hard to implement (and also twists a bit the uAPI
>>   semantics of seccomp-unotify).
>>
>
> (Some of the stuff discussed with seccomp below is derived from a side
> conversation with Tingmao over this proposal)
>
> There are some problems with extending seccomp unotify. Passing the
> full context needed through this api to the supervisor is problematic.
> seccomp unotify notifications look like this [4]:
>
>            struct seccomp_notif {
>                __u64  id;              /* Cookie */
>                __u32  pid;             /* TID of target thread */
>                __u32  flags;           /* Currently unused (0) */
>                struct seccomp_data data;   /* See seccomp(2) */
>            };
>
> And struct seccomp_data [5]:
>
>            struct seccomp_data {
>                int   nr;                   /* System call number */
>                __u32 arch;                 /* AUDIT_ARCH_* value
>                                               (see <linux/audit.h>) */
>                __u64 instruction_pointer;  /* CPU instruction pointer */
>                __u64 args[6];              /* Up to 6 system call arguments */
>            };
>
> Even if we pass the syscall data, for the userspace to actually decode
> the arguments to figure out what the access is doing we have two
> critical problems (1,2) and one annoyance (3):
>
> 1. The syscall itself doesn't necessarily contain the full context of the access.
>
> 2. We cannot decode the pointer-based arguments from userspace for a syscall
> in seccomp without TOCTOU. It also requires reaching into userspace
> memory. [6]
>
> 3. Decoding the syscall number is an arch-specific operation that we now have
> to expect userspace to deal with.

We would probably need to attach additional (potentially variable length,
like file names for mknod / link / rename requests) Landlock-specific
information.  The supervisor should not have to do any syscall decoding
(otherwise this partially defeats the point of Landlock supervise - they
can just use seccomp-unotify instead, and handle all "monitored"
syscalls on behalf of the supervisee to prevent TOCTOU).

Also, if we do not allow one-off allows (but require the supervisor to
modify the ruleset), this cannot lead to TOCTOU because if the supervised
process tries to change the argument, it will cause another Landlock
denial and notification.

Thanks for the review!
Tingmao
