* [PATCH v3 5/5] selftests/landlock: Fix format warning for __u64 in net_test
From: Mickaël Salaün @ 2026-04-02 19:26 UTC (permalink / raw)
To: Günther Noack
Cc: Mickaël Salaün, linux-security-module, Justin Suess,
Tingmao Wang, stable, kernel test robot
In-Reply-To: <20260402192608.1458252-1-mic@digikod.net>
On architectures where __u64 is unsigned long (e.g. powerpc64), using
%llx to format a __u64 triggers a -Wformat warning because %llx expects
unsigned long long. Cast the argument to unsigned long long.
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Fixes: a549d055a22e ("selftests/landlock: Add network tests")
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/r/202604020206.62zgOTeP-lkp@intel.com/
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
Changes since v2:
- New patch.
---
tools/testing/selftests/landlock/net_test.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/landlock/net_test.c b/tools/testing/selftests/landlock/net_test.c
index b34b139b3f89..4c528154ea92 100644
--- a/tools/testing/selftests/landlock/net_test.c
+++ b/tools/testing/selftests/landlock/net_test.c
@@ -1356,7 +1356,7 @@ TEST_F(mini, network_access_rights)
&net_port, 0))
{
TH_LOG("Failed to add rule with access 0x%llx: %s",
- access, strerror(errno));
+ (unsigned long long)access, strerror(errno));
}
}
EXPECT_EQ(0, close(ruleset_fd));
--
2.53.0
^ permalink raw reply related
* [PATCH v3 4/5] selftests/landlock: Skip stale records in audit_match_record()
From: Mickaël Salaün @ 2026-04-02 19:26 UTC (permalink / raw)
To: Günther Noack
Cc: Mickaël Salaün, linux-security-module, Justin Suess,
Tingmao Wang, stable
In-Reply-To: <20260402192608.1458252-1-mic@digikod.net>
Domain deallocation records are emitted asynchronously from kworker
threads (via free_ruleset_work()). Stale deallocation records from a
previous test can arrive during the current test's deallocation read
loop and be picked up by audit_match_record() instead of the expected
record, causing a domain ID mismatch. The audit.layers test (which
creates 16 nested domains) is particularly vulnerable because it reads
16 deallocation records in sequence, providing a large window for stale
records to interleave.
The same issue affects audit_flags.signal, where deallocation records
from a previous test (audit.layers) can leak into the next test and be
picked up by audit_match_record() instead of the expected record.
Fix this by continuing to read records when the type matches but the
content pattern does not. Stale records are silently consumed, and the
loop only stops when both type and pattern match (or the socket times
out with -EAGAIN).
Additionally, extend matches_log_domain_deallocated() with an
expected_domain_id parameter. When set, the regex pattern includes the
specific domain ID as a literal hex value, so that deallocation records
for a different domain do not match the pattern at all. This handles
the case where the stale record has the same denial count as the
expected one (e.g. both have denials=1), which the type+pattern loop
alone cannot distinguish. Callers that already know the expected domain
ID (from a prior denial or allocation record) now pass it to filter
precisely.
When expected_domain_id is set, matches_log_domain_deallocated() also
temporarily increases the socket timeout to audit_tv_dom_drop (1 second)
to wait for the asynchronous kworker deallocation, and restores
audit_tv_default afterward. This removes the need for callers to manage
the timeout switch manually.
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Fixes: 6a500b22971c ("selftests/landlock: Add tests for audit flags and domain IDs")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
Changes since v2:
https://lore.kernel.org/r/20260401161503.1136946-1-mic@digikod.net
- Fix __u64 format warnings on powerpc64 (cast to unsigned long long
for %llx).
Changes since v1:
https://lore.kernel.org/r/20260312100444.2609563-8-mic@digikod.net
- New patch.
---
tools/testing/selftests/landlock/audit.h | 82 ++++++++++++++-----
tools/testing/selftests/landlock/audit_test.c | 34 ++++----
2 files changed, 77 insertions(+), 39 deletions(-)
diff --git a/tools/testing/selftests/landlock/audit.h b/tools/testing/selftests/landlock/audit.h
index 74e1c3d763be..834005b2b0f0 100644
--- a/tools/testing/selftests/landlock/audit.h
+++ b/tools/testing/selftests/landlock/audit.h
@@ -249,9 +249,9 @@ static __maybe_unused char *regex_escape(const char *const src, char *dst,
static int audit_match_record(int audit_fd, const __u16 type,
const char *const pattern, __u64 *domain_id)
{
- struct audit_message msg;
+ struct audit_message msg, last_mismatch = {};
int ret, err = 0;
- bool matches_record = !type;
+ int num_type_match = 0;
regmatch_t matches[2];
regex_t regex;
@@ -259,21 +259,35 @@ static int audit_match_record(int audit_fd, const __u16 type,
if (ret)
return -EINVAL;
- do {
+ /*
+ * Reads records until one matches both the expected type and the
+ * pattern. Type-matching records with non-matching content are
+ * silently consumed, which handles stale domain deallocation records
+ * from a previous test emitted asynchronously by kworker threads.
+ */
+ while (true) {
memset(&msg, 0, sizeof(msg));
err = audit_recv(audit_fd, &msg);
- if (err)
+ if (err) {
+ if (num_type_match) {
+ printf("DATA: %s\n", last_mismatch.data);
+ printf("ERROR: %d record(s) matched type %u"
+ " but not pattern: %s\n",
+ num_type_match, type, pattern);
+ }
goto out;
+ }
- if (msg.header.nlmsg_type == type)
- matches_record = true;
- } while (!matches_record);
+ if (type && msg.header.nlmsg_type != type)
+ continue;
- ret = regexec(®ex, msg.data, ARRAY_SIZE(matches), matches, 0);
- if (ret) {
- printf("DATA: %s\n", msg.data);
- printf("ERROR: no match for pattern: %s\n", pattern);
- err = -ENOENT;
+ ret = regexec(®ex, msg.data, ARRAY_SIZE(matches), matches,
+ 0);
+ if (!ret)
+ break;
+
+ num_type_match++;
+ last_mismatch = msg;
}
if (domain_id) {
@@ -316,21 +330,49 @@ static int __maybe_unused matches_log_domain_allocated(int audit_fd, pid_t pid,
domain_id);
}
-static int __maybe_unused matches_log_domain_deallocated(
- int audit_fd, unsigned int num_denials, __u64 *domain_id)
+/*
+ * Matches a domain deallocation record. When expected_domain_id is non-zero,
+ * the pattern includes the specific domain ID so that stale deallocation
+ * records from a previous test (with a different domain ID) are skipped by
+ * audit_match_record(), and the socket timeout is temporarily increased to
+ * audit_tv_dom_drop to wait for the asynchronous kworker deallocation.
+ */
+static int __maybe_unused
+matches_log_domain_deallocated(int audit_fd, unsigned int num_denials,
+ __u64 expected_domain_id, __u64 *domain_id)
{
static const char log_template[] = REGEX_LANDLOCK_PREFIX
" status=deallocated denials=%u$";
- char log_match[sizeof(log_template) + 10];
- int log_match_len;
+ static const char log_template_with_id[] =
+ "^audit([0-9.:]\\+): domain=\\(%llx\\)"
+ " status=deallocated denials=%u$";
+ char log_match[sizeof(log_template_with_id) + 32];
+ int log_match_len, err;
+
+ if (expected_domain_id)
+ log_match_len = snprintf(log_match, sizeof(log_match),
+ log_template_with_id,
+ (unsigned long long)expected_domain_id,
+ num_denials);
+ else
+ log_match_len = snprintf(log_match, sizeof(log_match),
+ log_template, num_denials);
- log_match_len = snprintf(log_match, sizeof(log_match), log_template,
- num_denials);
if (log_match_len >= sizeof(log_match))
return -E2BIG;
- return audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
- domain_id);
+ if (expected_domain_id)
+ setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO,
+ &audit_tv_dom_drop, sizeof(audit_tv_dom_drop));
+
+ err = audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
+ domain_id);
+
+ if (expected_domain_id)
+ setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO, &audit_tv_default,
+ sizeof(audit_tv_default));
+
+ return err;
}
struct audit_records {
diff --git a/tools/testing/selftests/landlock/audit_test.c b/tools/testing/selftests/landlock/audit_test.c
index f92ba6774faa..60de97bd0153 100644
--- a/tools/testing/selftests/landlock/audit_test.c
+++ b/tools/testing/selftests/landlock/audit_test.c
@@ -139,23 +139,24 @@ TEST_F(audit, layers)
WEXITSTATUS(status) != EXIT_SUCCESS)
_metadata->exit_code = KSFT_FAIL;
- /* Purges log from deallocated domains. */
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_dom_drop, sizeof(audit_tv_dom_drop)));
+ /*
+ * Purges log from deallocated domains. Records arrive in LIFO order
+ * (innermost domain first) because landlock_put_hierarchy() walks the
+ * chain sequentially in a single kworker context.
+ */
for (i = ARRAY_SIZE(*domain_stack) - 1; i >= 0; i--) {
__u64 deallocated_dom = 2;
EXPECT_EQ(0, matches_log_domain_deallocated(self->audit_fd, 1,
+ (*domain_stack)[i],
&deallocated_dom));
EXPECT_EQ((*domain_stack)[i], deallocated_dom)
{
TH_LOG("Failed to match domain %llx (#%d)",
- (*domain_stack)[i], i);
+ (unsigned long long)(*domain_stack)[i], i);
}
}
EXPECT_EQ(0, munmap(domain_stack, sizeof(*domain_stack)));
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_default, sizeof(audit_tv_default)));
EXPECT_EQ(0, close(ruleset_fd));
}
@@ -270,13 +271,9 @@ TEST_F(audit, thread)
EXPECT_EQ(0, close(pipe_parent[1]));
ASSERT_EQ(0, pthread_join(thread, NULL));
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_dom_drop, sizeof(audit_tv_dom_drop)));
- EXPECT_EQ(0, matches_log_domain_deallocated(self->audit_fd, 1,
- &deallocated_dom));
+ EXPECT_EQ(0, matches_log_domain_deallocated(
+ self->audit_fd, 1, denial_dom, &deallocated_dom));
EXPECT_EQ(denial_dom, deallocated_dom);
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_default, sizeof(audit_tv_default)));
}
FIXTURE(audit_flags)
@@ -432,22 +429,21 @@ TEST_F(audit_flags, signal)
if (variant->restrict_flags &
LANDLOCK_RESTRICT_SELF_LOG_SAME_EXEC_OFF) {
+ /*
+ * No deallocation record: denials=0 never matches a real
+ * record.
+ */
EXPECT_EQ(-EAGAIN,
- matches_log_domain_deallocated(self->audit_fd, 0,
+ matches_log_domain_deallocated(self->audit_fd, 0, 0,
&deallocated_dom));
EXPECT_EQ(deallocated_dom, 2);
} else {
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_dom_drop,
- sizeof(audit_tv_dom_drop)));
EXPECT_EQ(0, matches_log_domain_deallocated(self->audit_fd, 2,
+ *self->domain_id,
&deallocated_dom));
EXPECT_NE(deallocated_dom, 2);
EXPECT_NE(deallocated_dom, 0);
EXPECT_EQ(deallocated_dom, *self->domain_id);
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_default,
- sizeof(audit_tv_default)));
}
}
--
2.53.0
^ permalink raw reply related
* [PATCH v3 1/5] selftests/landlock: Fix snprintf truncation checks in audit helpers
From: Mickaël Salaün @ 2026-04-02 19:26 UTC (permalink / raw)
To: Günther Noack
Cc: Mickaël Salaün, linux-security-module, Justin Suess,
Tingmao Wang, stable
In-Reply-To: <20260402192608.1458252-1-mic@digikod.net>
snprintf() returns the number of characters that would have been
written, excluding the terminating NUL byte. When the output is
truncated, this return value equals or exceeds the buffer size. Fix
matches_log_domain_allocated() and matches_log_domain_deallocated() to
detect truncation with ">=" instead of ">".
Cc: Günther Noack <gnoack@google.com>
Cc: stable@vger.kernel.org
Fixes: 6a500b22971c ("selftests/landlock: Add tests for audit flags and domain IDs")
Reviewed-by: Günther Noack <gnoack@google.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
Changes since v1:
https://lore.kernel.org/r/20260312100444.2609563-8-mic@digikod.net
- New patch (split from the drain fix).
---
tools/testing/selftests/landlock/audit.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/landlock/audit.h b/tools/testing/selftests/landlock/audit.h
index 44eb433e9666..1049a0582af5 100644
--- a/tools/testing/selftests/landlock/audit.h
+++ b/tools/testing/selftests/landlock/audit.h
@@ -309,7 +309,7 @@ static int __maybe_unused matches_log_domain_allocated(int audit_fd, pid_t pid,
log_match_len =
snprintf(log_match, sizeof(log_match), log_template, pid);
- if (log_match_len > sizeof(log_match))
+ if (log_match_len >= sizeof(log_match))
return -E2BIG;
return audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
@@ -326,7 +326,7 @@ static int __maybe_unused matches_log_domain_deallocated(
log_match_len = snprintf(log_match, sizeof(log_match), log_template,
num_denials);
- if (log_match_len > sizeof(log_match))
+ if (log_match_len >= sizeof(log_match))
return -E2BIG;
return audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
--
2.53.0
^ permalink raw reply related
* Re: LSM namespacing API
From: Casey Schaufler @ 2026-04-02 17:49 UTC (permalink / raw)
To: Dr. Greg, Paul Moore
Cc: Stephen Smalley, Ondrej Mosnacek, linux-security-module, selinux,
John Johansen, Casey Schaufler
In-Reply-To: <ac5MKr4lFQhc44i6@wind.enjellic.com>
On 4/2/2026 3:59 AM, Dr. Greg wrote:
> That still leaves the question of whether or not CAP_MAC_ADMIN is
> appropriate for gating the creation of a new security namespace.
That will have to be up to the individual LSMs. Not all LSMs implement
Mandatory Access Controls. It would be inappropriate for an LSM that
provides finer grain privilege than capabilities do to be gated by
CAP_MAC_ADMIN. An LSM that implements a novel access control list scheme
would fall under CAP_DAC_SOMETHING, not CAP_MAC_ADMIN. While a time-of-day
access scheme might require CAP_MAC_ADMIN, it might not. Implying that all
LSMs enforce a MAC policy is not a good idea.
^ permalink raw reply
* Re: [PATCH] landlock: Document fallocate(2) as another truncation corner case
From: Mickaël Salaün @ 2026-04-02 18:16 UTC (permalink / raw)
To: Günther Noack; +Cc: linux-security-module
In-Reply-To: <ac1SP3cGuEeIZFmM@google.com>
On Wed, Apr 01, 2026 at 07:13:35PM +0200, Günther Noack wrote:
> On Wed, Apr 01, 2026 at 06:30:28PM +0200, Mickaël Salaün wrote:
> > On Wed, Apr 01, 2026 at 05:09:10PM +0200, Günther Noack wrote:
> > > Reinforce the already stated policy that LANDLOCK_ACCESS_FS_TRUNCATE should
> > > always go hand in hand with LANDLOCK_ACCESS_FS_WRITE_FILE, as their
> > > meanings and enforcement overlap in counterintuitive ways.
> > >
> > > On many common file systems, fallocate(2) offers a way to shorten files as
> > > long as the file is opened for writing, side-stepping the
> > > LANDLOCK_ACCESS_FS_TRUNCATE right.
> > >
> > > Assisted-by: Gemini-CLI:gemini-3.1
> > > Signed-off-by: Günther Noack <gnoack@google.com>
> > > ---
> > > Documentation/userspace-api/landlock.rst | 8 ++++++--
> > > 1 file changed, 6 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst
> > > index 7f86d7a37dc2..d5691ec136cc 100644
> > > --- a/Documentation/userspace-api/landlock.rst
> > > +++ b/Documentation/userspace-api/landlock.rst
> > > @@ -378,8 +378,8 @@ Truncating files
> > >
> > > The operations covered by ``LANDLOCK_ACCESS_FS_WRITE_FILE`` and
> > > ``LANDLOCK_ACCESS_FS_TRUNCATE`` both change the contents of a file and sometimes
> > > -overlap in non-intuitive ways. It is recommended to always specify both of
> > > -these together.
> > > +overlap in non-intuitive ways. It is strongly recommended to always specify
> > > +both of these together (either granting both, or granting none).
> > >
> > > A particularly surprising example is :manpage:`creat(2)`. The name suggests
> > > that this system call requires the rights to create and write files. However,
> > > @@ -391,6 +391,10 @@ It should also be noted that truncating files does not require the
> > > system call, this can also be done through :manpage:`open(2)` with the flags
> > > ``O_RDONLY | O_TRUNC``.
> > >
> > > +At the same time, on some filesystems, :manpage:`fallocate(2)` offers a way to
> > > +shorten file contents with ``FALLOC_FL_COLLAPSE_RANGE`` when the file is opened
> > > +for writing, sidestepping the ``LANDLOCK_ACCESS_FS_TRUNCATE`` right.
> >
> > Interesting, which filesystems? Shouldn't it be fixed in the code
> > instead?
>
> It works on ext4, and I also see mentions of FALLOC_FL_COLLAPSE_RANGE
> in XFS, F2FS, SMB and NTFS3.
>
> I should mention, it is not *exactly* the same as a truncation, but
> you can remove a chunk of the file from the middle, which also leads
> to a shorter file. For example, assuming a block size of 1024:
>
> 1. Make a file with 2*1024 bytes: 1024*'A', then 1024*'B'
> 2. fallocate(collapse range, 0, 1024)
>
> Resulting file is 1024*'B', and the file is shortened to 1024 bytes.
>
> So this is not *exactly* a truncation. (The man page says that an
> attempt to remove the end of a file results in EINVAL, so you have to
> take it from the middle, and it needs to align with block boundaries.)
>
> But it's quite similar, also shortens the file, and it does not
> require the Landlock truncation access right.
>
> I agree, another way would potentially be to call the LSM ftruncate
> hook. I suspect this would stay compatible with other LSMs, because
> the LSM ftruncate hook is a relatively recent addition (but have not
> checked in detail).
>
> The implementation of fallocate is vfs_fallocate() in fs/open.c - I
> only had a tentative look now; it checks that the file->f_mode is open
> for writing and calls security_file_permission() with MAY_WRITE.
>
> I always saw LANDLOCK_ACCESS_FS_WRITE_FILE and
> LANDLOCK_ACCESS_FS_TRUNCATE as rights that should always go together,
> so I suspect that it does not make a big difference in practice, and
> that is why I am suggesting to just document it more clearly for now.
OK, I agree, I'll take this patch. Thanks!
>
> —Günther
>
^ permalink raw reply
* Re: [PATCH v8 04/12] landlock: Control pathname UNIX domain socket resolution by path
From: Kuniyuki Iwashima @ 2026-04-02 18:09 UTC (permalink / raw)
To: Günther Noack
Cc: Mickaël Salaün, John Johansen, Tingmao Wang,
Justin Suess, Sebastian Andrzej Siewior, Jann Horn,
linux-security-module, Samasth Norway Ananda, Matthieu Buffet,
Mikhail Ivanov, konstantin.meskhidze, Demi Marie Obenour,
Alyssa Ross, Tahera Fahimi, Georgia Garcia
In-Reply-To: <20260327164838.38231-5-gnoack3000@gmail.com>
On Fri, Mar 27, 2026 at 9:49 AM Günther Noack <gnoack3000@gmail.com> wrote:
>
> * Add a new access right LANDLOCK_ACCESS_FS_RESOLVE_UNIX, which
> controls the lookup operations for named UNIX domain sockets. The
> resolution happens during connect() and sendmsg() (depending on
> socket type).
> * Change access_mask_t from u16 to u32 (see below)
> * Hook into the path lookup in unix_find_bsd() in af_unix.c, using a
> LSM hook. Make policy decisions based on the new access rights
> * Increment the Landlock ABI version.
> * Minor test adaptations to keep the tests working.
> * Document the design rationale for scoped access rights,
> and cross-reference it from the header documentation.
>
> With this access right, access is granted if either of the following
> conditions is met:
>
> * The target socket's filesystem path was allow-listed using a
> LANDLOCK_RULE_PATH_BENEATH rule, *or*:
> * The target socket was created in the same Landlock domain in which
> LANDLOCK_ACCESS_FS_RESOLVE_UNIX was restricted.
>
> In case of a denial, connect() and sendmsg() return EACCES, which is
> the same error as it is returned if the user does not have the write
> bit in the traditional UNIX file system permissions of that file.
>
> The access_mask_t type grows from u16 to u32 to make space for the new
> access right. This also doubles the size of struct layer_access_masks
> from 32 byte to 64 byte.
>
> Document the (possible future) interaction between scoped flags and
> other access rights in struct landlock_ruleset_attr, and summarize the
> rationale, as discussed in code review leading up to [2].
>
> This feature was created with substantial discussion and input from
> Justin Suess, Tingmao Wang and Mickaël Salaün.
>
> Cc: Tingmao Wang <m@maowtm.org>
> Cc: Justin Suess <utilityemal77@gmail.com>
> Cc: Mickaël Salaün <mic@digikod.net>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Jann Horn <jannh@google.com>
> Link[1]: https://github.com/landlock-lsm/linux/issues/36
> Link[2]: https://lore.kernel.org/all/20260205.8531e4005118@gnoack.org/
> Signed-off-by: Günther Noack <gnoack3000@gmail.com>
> ---
> Documentation/security/landlock.rst | 42 +++++-
> Documentation/userspace-api/landlock.rst | 2 +-
> include/uapi/linux/landlock.h | 21 +++
> security/landlock/access.h | 2 +-
> security/landlock/audit.c | 1 +
> security/landlock/fs.c | 130 ++++++++++++++++++-
> security/landlock/limits.h | 2 +-
> security/landlock/syscalls.c | 2 +-
> tools/testing/selftests/landlock/base_test.c | 2 +-
> tools/testing/selftests/landlock/fs_test.c | 5 +-
> 10 files changed, 200 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/security/landlock.rst b/Documentation/security/landlock.rst
> index 3e4d4d04cfae..c3f8f43073a7 100644
> --- a/Documentation/security/landlock.rst
> +++ b/Documentation/security/landlock.rst
> @@ -7,7 +7,7 @@ Landlock LSM: kernel documentation
> ==================================
>
> :Author: Mickaël Salaün
> -:Date: September 2025
> +:Date: March 2026
>
> Landlock's goal is to create scoped access-control (i.e. sandboxing). To
> harden a whole system, this feature should be available to any process,
> @@ -89,6 +89,46 @@ this is required to keep access controls consistent over the whole system, and
> this avoids unattended bypasses through file descriptor passing (i.e. confused
> deputy attack).
>
> +.. _scoped-flags-interaction:
> +
> +Interaction between scoped flags and other access rights
> +--------------------------------------------------------
> +
> +The ``scoped`` flags in ``struct landlock_ruleset_attr`` restrict the
> +use of *outgoing* IPC from the created Landlock domain, while they
> +permit reaching out to IPC endpoints *within* the created Landlock
> +domain.
> +
> +In the future, scoped flags *may* interact with other access rights,
> +e.g. so that abstract UNIX sockets can be allow-listed by name, or so
> +that signals can be allow-listed by signal number or target process.
> +
> +When introducing ``LANDLOCK_ACCESS_FS_RESOLVE_UNIX``, we defined it to
> +implicitly have the same scoping semantics as a
> +``LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET`` flag would have: connecting to
> +UNIX sockets within the same domain (where
> +``LANDLOCK_ACCESS_FS_RESOLVE_UNIX`` is used) is unconditionally
> +allowed.
> +
> +The reasoning is:
> +
> +* Like other IPC mechanisms, connecting to named UNIX sockets in the
> + same domain should be expected and harmless. (If needed, users can
> + further refine their Landlock policies with nested domains or by
> + restricting ``LANDLOCK_ACCESS_FS_MAKE_SOCK``.)
> +* We reserve the option to still introduce
> + ``LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET`` in the future. (This would
> + be useful if we wanted to have a Landlock rule to permit IPC access
> + to other Landlock domains.)
> +* But we can postpone the point in time when users have to deal with
> + two interacting flags visible in the userspace API. (In particular,
> + it is possible that it won't be needed in practice, in which case we
> + can avoid the second flag altogether.)
> +* If we *do* introduce ``LANDLOCK_SCOPE_PATHNAME_UNIX_SOCKET`` in the
> + future, setting this scoped flag in a ruleset does *not reduce* the
> + restrictions, because access within the same scope is already
> + allowed based on ``LANDLOCK_ACCESS_FS_RESOLVE_UNIX``.
> +
> Tests
> =====
>
> diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst
> index 13134bccdd39..1490f879f621 100644
> --- a/Documentation/userspace-api/landlock.rst
> +++ b/Documentation/userspace-api/landlock.rst
> @@ -8,7 +8,7 @@ Landlock: unprivileged access control
> =====================================
>
> :Author: Mickaël Salaün
> -:Date: January 2026
> +:Date: March 2026
>
> The goal of Landlock is to enable restriction of ambient rights (e.g. global
> filesystem or network access) for a set of processes. Because Landlock
> diff --git a/include/uapi/linux/landlock.h b/include/uapi/linux/landlock.h
> index f88fa1f68b77..3157d257555b 100644
> --- a/include/uapi/linux/landlock.h
> +++ b/include/uapi/linux/landlock.h
> @@ -248,6 +248,26 @@ struct landlock_net_port_attr {
> *
> * This access right is available since the fifth version of the Landlock
> * ABI.
> + * - %LANDLOCK_ACCESS_FS_RESOLVE_UNIX: Look up pathname UNIX domain sockets
> + * (:manpage:`unix(7)`). On UNIX domain sockets, this restricts both calls to
> + * :manpage:`connect(2)` as well as calls to :manpage:`sendmsg(2)` with an
> + * explicit recipient address.
> + *
> + * This access right only applies to connections to UNIX server sockets which
> + * were created outside of the newly created Landlock domain (e.g. from within
> + * a parent domain or from an unrestricted process). Newly created UNIX
> + * servers within the same Landlock domain continue to be accessible. In this
> + * regard, %LANDLOCK_ACCESS_FS_RESOLVE_UNIX has the same semantics as the
> + * ``LANDLOCK_SCOPE_*`` flags.
> + *
> + * If a resolve attempt is denied, the operation returns an ``EACCES`` error,
> + * in line with other filesystem access rights (but different to denials for
> + * abstract UNIX domain sockets).
> + *
> + * This access right is available since the ninth version of the Landlock ABI.
> + *
> + * The rationale for this design is described in
> + * :ref:`Documentation/security/landlock.rst <scoped-flags-interaction>`.
> *
> * Whether an opened file can be truncated with :manpage:`ftruncate(2)` or used
> * with `ioctl(2)` is determined during :manpage:`open(2)`, in the same way as
> @@ -333,6 +353,7 @@ struct landlock_net_port_attr {
> #define LANDLOCK_ACCESS_FS_REFER (1ULL << 13)
> #define LANDLOCK_ACCESS_FS_TRUNCATE (1ULL << 14)
> #define LANDLOCK_ACCESS_FS_IOCTL_DEV (1ULL << 15)
> +#define LANDLOCK_ACCESS_FS_RESOLVE_UNIX (1ULL << 16)
> /* clang-format on */
>
> /**
> diff --git a/security/landlock/access.h b/security/landlock/access.h
> index 277b6ed7f7bb..99c709f7979e 100644
> --- a/security/landlock/access.h
> +++ b/security/landlock/access.h
> @@ -34,7 +34,7 @@
> LANDLOCK_ACCESS_FS_IOCTL_DEV)
> /* clang-format on */
>
> -typedef u16 access_mask_t;
> +typedef u32 access_mask_t;
>
> /* Makes sure all filesystem access rights can be stored. */
> static_assert(BITS_PER_TYPE(access_mask_t) >= LANDLOCK_NUM_ACCESS_FS);
> diff --git a/security/landlock/audit.c b/security/landlock/audit.c
> index 60ff217ab95b..8d0edf94037d 100644
> --- a/security/landlock/audit.c
> +++ b/security/landlock/audit.c
> @@ -37,6 +37,7 @@ static const char *const fs_access_strings[] = {
> [BIT_INDEX(LANDLOCK_ACCESS_FS_REFER)] = "fs.refer",
> [BIT_INDEX(LANDLOCK_ACCESS_FS_TRUNCATE)] = "fs.truncate",
> [BIT_INDEX(LANDLOCK_ACCESS_FS_IOCTL_DEV)] = "fs.ioctl_dev",
> + [BIT_INDEX(LANDLOCK_ACCESS_FS_RESOLVE_UNIX)] = "fs.resolve_unix",
> };
>
> static_assert(ARRAY_SIZE(fs_access_strings) == LANDLOCK_NUM_ACCESS_FS);
> diff --git a/security/landlock/fs.c b/security/landlock/fs.c
> index 97065d51685a..fcf69b3d734d 100644
> --- a/security/landlock/fs.c
> +++ b/security/landlock/fs.c
> @@ -27,6 +27,7 @@
> #include <linux/lsm_hooks.h>
> #include <linux/mount.h>
> #include <linux/namei.h>
> +#include <linux/net.h>
> #include <linux/path.h>
> #include <linux/pid.h>
> #include <linux/rcupdate.h>
> @@ -36,6 +37,7 @@
> #include <linux/types.h>
> #include <linux/wait_bit.h>
> #include <linux/workqueue.h>
> +#include <net/af_unix.h>
> #include <uapi/linux/fiemap.h>
> #include <uapi/linux/landlock.h>
>
> @@ -314,7 +316,8 @@ static struct landlock_object *get_inode_object(struct inode *const inode)
> LANDLOCK_ACCESS_FS_WRITE_FILE | \
> LANDLOCK_ACCESS_FS_READ_FILE | \
> LANDLOCK_ACCESS_FS_TRUNCATE | \
> - LANDLOCK_ACCESS_FS_IOCTL_DEV)
> + LANDLOCK_ACCESS_FS_IOCTL_DEV | \
> + LANDLOCK_ACCESS_FS_RESOLVE_UNIX)
> /* clang-format on */
>
> /*
> @@ -1557,6 +1560,130 @@ static int hook_path_truncate(const struct path *const path)
> return current_check_access_path(path, LANDLOCK_ACCESS_FS_TRUNCATE);
> }
>
> +/**
> + * unmask_scoped_access - Remove access right bits in @masks in all layers
> + * where @client and @server have the same domain
> + *
> + * This does the same as domain_is_scoped(), but unmasks bits in @masks.
> + * It can not return early as domain_is_scoped() does.
> + *
> + * A scoped access for a given access right bit is allowed iff, for all layer
> + * depths where the access bit is set, the client and server domain are the
> + * same. This function clears the access rights @access in @masks at all layer
> + * depths where the client and server domain are the same, so that, when they
> + * are all cleared, the access is allowed.
> + *
> + * @client: Client domain
> + * @server: Server domain
> + * @masks: Layer access masks to unmask
> + * @access: Access bits that control scoping
> + */
> +static void unmask_scoped_access(const struct landlock_ruleset *const client,
> + const struct landlock_ruleset *const server,
> + struct layer_access_masks *const masks,
> + const access_mask_t access)
> +{
> + int client_layer, server_layer;
> + const struct landlock_hierarchy *client_walker, *server_walker;
> +
> + /* This should not happen. */
> + if (WARN_ON_ONCE(!client))
> + return;
> +
> + /* Server has no Landlock domain; nothing to clear. */
> + if (!server)
> + return;
> +
> + /*
> + * client_layer must be a signed integer with greater capacity
> + * than client->num_layers to ensure the following loop stops.
> + */
> + BUILD_BUG_ON(sizeof(client_layer) > sizeof(client->num_layers));
> +
> + client_layer = client->num_layers - 1;
> + client_walker = client->hierarchy;
> + server_layer = server->num_layers - 1;
> + server_walker = server->hierarchy;
> +
> + /*
> + * Clears the access bits at all layers where the client domain is the
> + * same as the server domain. We start the walk at min(client_layer,
> + * server_layer). The layer bits until there can not be cleared because
> + * either the client or the server domain is missing.
> + */
> + for (; client_layer > server_layer; client_layer--)
> + client_walker = client_walker->parent;
> +
> + for (; server_layer > client_layer; server_layer--)
> + server_walker = server_walker->parent;
> +
> + for (; client_layer >= 0; client_layer--) {
> + if (masks->access[client_layer] & access &&
> + client_walker == server_walker)
> + masks->access[client_layer] &= ~access;
> +
> + client_walker = client_walker->parent;
> + server_walker = server_walker->parent;
> + }
> +}
> +
> +static int hook_unix_find(const struct path *const path, struct sock *other,
> + int flags)
> +{
> + const struct landlock_ruleset *dom_other;
> + const struct landlock_cred_security *subject;
> + struct layer_access_masks layer_masks;
> + struct landlock_request request = {};
> + static const struct access_masks fs_resolve_unix = {
> + .fs = LANDLOCK_ACCESS_FS_RESOLVE_UNIX,
> + };
> +
> + /* Lookup for the purpose of saving coredumps is OK. */
> + if (unlikely(flags & SOCK_COREDUMP))
> + return 0;
> +
> + subject = landlock_get_applicable_subject(current_cred(),
> + fs_resolve_unix, NULL);
> +
> + if (!subject)
> + return 0;
> +
> + /*
> + * Ignoring return value: that the domains apply was already checked in
> + * landlock_get_applicable_subject() above.
> + */
> + landlock_init_layer_masks(subject->domain, fs_resolve_unix.fs,
> + &layer_masks, LANDLOCK_KEY_INODE);
> +
> + /* Checks the layers in which we are connecting within the same domain. */
> + unix_state_lock(other);
> + if (unlikely(sock_flag(other, SOCK_DEAD) || !other->sk_socket ||
> + !other->sk_socket->file)) {
When will the latter two condition be true when !SOCK_DEAD ?
unix_find_bsd() should not find embryo sockets.
> + unix_state_unlock(other);
> + /*
> + * We rely on the caller to catch the (non-reversible) SOCK_DEAD
> + * condition and retry the lookup. If we returned an error
> + * here, the lookup would not get retried.
> + */
> + return 0;
> + }
> + dom_other = landlock_cred(other->sk_socket->file->f_cred)->domain;
> +
> + /* Access to the same (or a lower) domain is always allowed. */
> + unmask_scoped_access(subject->domain, dom_other, &layer_masks,
> + fs_resolve_unix.fs);
> + unix_state_unlock(other);
> +
> + /* Checks the connections to allow-listed paths. */
> + if (is_access_to_paths_allowed(subject->domain, path,
> + fs_resolve_unix.fs, &layer_masks,
> + &request, NULL, 0, NULL, NULL, NULL))
> + return 0;
> +
> + landlock_log_denial(subject, &request);
> + return -EACCES;
> +}
> +
> /* File hooks */
>
> /**
> @@ -1834,6 +1961,7 @@ static struct security_hook_list landlock_hooks[] __ro_after_init = {
> LSM_HOOK_INIT(path_unlink, hook_path_unlink),
> LSM_HOOK_INIT(path_rmdir, hook_path_rmdir),
> LSM_HOOK_INIT(path_truncate, hook_path_truncate),
> + LSM_HOOK_INIT(unix_find, hook_unix_find),
>
> LSM_HOOK_INIT(file_alloc_security, hook_file_alloc_security),
> LSM_HOOK_INIT(file_open, hook_file_open),
> diff --git a/security/landlock/limits.h b/security/landlock/limits.h
> index eb584f47288d..b454ad73b15e 100644
> --- a/security/landlock/limits.h
> +++ b/security/landlock/limits.h
> @@ -19,7 +19,7 @@
> #define LANDLOCK_MAX_NUM_LAYERS 16
> #define LANDLOCK_MAX_NUM_RULES U32_MAX
>
> -#define LANDLOCK_LAST_ACCESS_FS LANDLOCK_ACCESS_FS_IOCTL_DEV
> +#define LANDLOCK_LAST_ACCESS_FS LANDLOCK_ACCESS_FS_RESOLVE_UNIX
> #define LANDLOCK_MASK_ACCESS_FS ((LANDLOCK_LAST_ACCESS_FS << 1) - 1)
> #define LANDLOCK_NUM_ACCESS_FS __const_hweight64(LANDLOCK_MASK_ACCESS_FS)
>
> diff --git a/security/landlock/syscalls.c b/security/landlock/syscalls.c
> index 3b33839b80c7..a6e23657f3ce 100644
> --- a/security/landlock/syscalls.c
> +++ b/security/landlock/syscalls.c
> @@ -166,7 +166,7 @@ static const struct file_operations ruleset_fops = {
> * If the change involves a fix that requires userspace awareness, also update
> * the errata documentation in Documentation/userspace-api/landlock.rst .
> */
> -const int landlock_abi_version = 8;
> +const int landlock_abi_version = 9;
>
> /**
> * sys_landlock_create_ruleset - Create a new ruleset
> diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
> index 0fea236ef4bd..30d37234086c 100644
> --- a/tools/testing/selftests/landlock/base_test.c
> +++ b/tools/testing/selftests/landlock/base_test.c
> @@ -76,7 +76,7 @@ TEST(abi_version)
> const struct landlock_ruleset_attr ruleset_attr = {
> .handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
> };
> - ASSERT_EQ(8, landlock_create_ruleset(NULL, 0,
> + ASSERT_EQ(9, landlock_create_ruleset(NULL, 0,
> LANDLOCK_CREATE_RULESET_VERSION));
>
> ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr, 0,
> diff --git a/tools/testing/selftests/landlock/fs_test.c b/tools/testing/selftests/landlock/fs_test.c
> index 968a91c927a4..b318627e7561 100644
> --- a/tools/testing/selftests/landlock/fs_test.c
> +++ b/tools/testing/selftests/landlock/fs_test.c
> @@ -575,9 +575,10 @@ TEST_F_FORK(layout1, inval)
> LANDLOCK_ACCESS_FS_WRITE_FILE | \
> LANDLOCK_ACCESS_FS_READ_FILE | \
> LANDLOCK_ACCESS_FS_TRUNCATE | \
> - LANDLOCK_ACCESS_FS_IOCTL_DEV)
> + LANDLOCK_ACCESS_FS_IOCTL_DEV | \
> + LANDLOCK_ACCESS_FS_RESOLVE_UNIX)
>
> -#define ACCESS_LAST LANDLOCK_ACCESS_FS_IOCTL_DEV
> +#define ACCESS_LAST LANDLOCK_ACCESS_FS_RESOLVE_UNIX
>
> #define ACCESS_ALL ( \
> ACCESS_FILE | \
> --
> 2.53.0
>
^ permalink raw reply
* Re: LSM namespacing API
From: Dr. Greg @ 2026-04-02 10:59 UTC (permalink / raw)
To: Paul Moore
Cc: Stephen Smalley, Ondrej Mosnacek, linux-security-module, selinux,
John Johansen
In-Reply-To: <CAHC9VhR1R7TcC2a2wZ9-G8dmXTuhcDK1YedDduq0sFgPC8QxFw@mail.gmail.com>
On Sun, Mar 29, 2026 at 08:56:37PM -0400, Paul Moore wrote:
Good morning, hopefully the week is going well for everyone.
> On Sun, Mar 29, 2026 at 12:09???PM Dr. Greg <greg@enjellic.com> wrote:
> > On Tue, Mar 24, 2026 at 05:31:09PM -0400, Paul Moore wrote:
> > > On Tue, Mar 3, 2026 at 11:46???AM Paul Moore <paul@paul-moore.com> wrote:
> > > >
> > > > I'd really like to hear from some of the other LSMs before we start
> > > > diving into the code. It may sound funny, but from my perspective
> > > > doing the work to get the API definition "right" is far more important
> > > > than implementing it.
> >
> > > It's been three weeks now, and I haven't seen any strong arguments for
> > > supporting the clone() API at this time, so we can leave that out for
> > > now and stick with just the unshare() API for an initial attempt. We
> > > can always add a clone() API at a later date if needed; going small
> > > and expanding over time is usually a better decision anyway.
> > >
> > > So to quickly summarize, here is where I think the discussion landed:
> > >
> > > * Implement the lsm_unshare() syscall
> > >
> > > I expect it would look something like 'lsm_unshare(struct lsm_ctx
> > > *ctx, u32 size, u32 flags)' with @ctx specifying the particular LSM
> > > being unshared, and @flags being 0/unused at this point in time
> > > (unless we can think of something we want to specify here). Like
> > > lsm_set_self_attr(), only one @ctx can be specified at a time, so you
> > > can only unshare one LSM at a time.
> >
> > Unless we miss something, it would seem that there needs to be
> > additional thought as to how a process moves, atomically, from one
> > effective security configuration to the next.
> >
> > At a minimum, if we restrict ourselves to the model of simply changing
> > the namespace for a single LSM, there would seem to be a need to have
> > a 2-step process in order to atomically transition from one security
> > model/policy to the next.
> That depends on the individual LSMs, they are free to interpret the
> unshare request and handle it however they like.
No argument there.
An LSM will obviously need to allocate an LSM namespace specific
security 'blob' in order to hold the security context for the new
namespace.
Christian had proposed patches for a generic mechanism to create
LSM security namespace blobs, is implementation of that in scope for
this effort?
> > The interim between the first and second steps would allow an
> > orchestrator to configure the new namespace and load new namespace
> > specific policy into the security namespace ...
> As discussed previously, the LSM policy load syscalls might include
> some LSM namespace options. However, I first want to focus on
> finalizing the most basic namespace API, which on Linux is arguably
> the unshare() syscall concept.
Unfortunately, without considering all the implications and
requirements of various LSM's we may end up with lsm_share2() and
beyond.
See below.
> > It would seem that the flags variable might be a good option to use to
> > handle this 2-stage transition, for example LSM_NS_INIT and
> > LSM_NS_CHANGE, respectively, to specify the initialization and
> > execution phases of the transition.
> No. The lsm_unshare() syscall is intended to mimic the existing
> unshare() syscall as a single step process from a user's
> perspective. If it returns successfully the caller will be in a new
> LSM namespace as defined by the individual LSM specified in the
> syscall.
OK, we can reason forward with that paradigm.
An orchestrator issues the unshare call for an LSM namespace and upon
return from the system call the calling task is in a new namespace for
that particular LSM, the goal of which is presumably to implement a
security policy/model different than what had been in force
previously.
So the process is in a new LSM specific namespace, but still
implementing the policy from the previous namespace, until the
orchestrator can load the new policy and then trigger the LSM to
change from its previous policy to the newly loaded policy.
Is this consistent with your vision as to how all of this will work?
> > The other unanswered issue, or perhaps we missed it, are the security
> > controls that should be associated with the unshare call.
> Each LSM is free to implement whatever access controls it deems
> necessary in its lsm_unshare() callback.
Just to be clear.
When you refer to 'lsm_unshare() callback' are you referring to a new
LSM security hook to be be implemented that will allow all of the
active LSM's to pass judgement on whether or not the unshare should be
allowed to complete successfully?
See below.
> > Will there be a new LSM hook that allows other LSM's to veto the
> > creation of a namespace either for itself or for another LSM?
>
> I would expect the lsm_unshare() syscall to operate similarly to the
> lsm_set_self_attr() syscall in this regard.
The reference to handling this like lsm_set_self_attr() is unclear.
With lsm_set_self_attr() there is no reason for another LSM to deny
setting what is an LSM specific attribute, as you note above, each LSM
gets to decide if the request to set an attribute for the LSM should
be accepted or denied.
Since lsm_unshare() is changing the overall platform security state,
it seems consistent with the design of the LSM for other LSM's to be
able to veto this action.
Once again, this seems like an action that would be consistent with
the notion of the lockdown LSM,
> > Is there a need to have yet another kernel command-line parameter that
> > would completely deny the ability to create security namespaces?
> No, at least not at this point in time.
This would seem to reinforce issues in the previous discussion.
Given that distributions are 'kitchen sink' implementations it would
seem desirable that system security architects would want to use a
lockdown option to insure that the platform security configuration
cannot be changed.
> Individual LSMs can decide how they want to gate their own namespace
> functionality, if they implement namespaces at all.
>
> > Is CAP_MAC_ADMIN appropriate as the required capability to create a
> > new namespace or does there need to be, for security rigor, a specific
> > capability (CAP_LSM_NS?) that gates the ability to execute whatever
> > form of the system call is adopted?
> Once again, this is up to the individual LSMs, not the framework
> layer.
Fair enough.
That still leaves the question of whether or not CAP_MAC_ADMIN is
appropriate for gating the creation of a new security namespace.
> > Should there be an option to completely compile LSM namespaces out of
> > the kernel?
> That doesn't belong in the LSM framework layer, that is up to the
> individual LSMs.
You noted above the desire for lsm_unshare to be consistent with other
namespaces.
The current kernel paradigm is to allow classes of namespace
resources, ie. CONFIG_UTS_NS, CONFIG_TIME_NS et.al., to be compiled in
our out of the kernel.
It seems that CONFIG_LSM_NS would be consistent with that model.
> > > * Implement /proc/pid/ns/lsm and setns(CLONE_NEWLSM)
> > >
> > > As discussed previously, this allows us to move a process into an
> > > existing, established LSM namespace set. The caller cannot
> > > selectively choose which individual LSM namespaces they join from the
> > > given LSM namespace set, they receive the same LSM namespace
> > > configuration as the target process.
> >
> > As an initial aside. It would be assumed that a positive result of a
> > setns call would be to cause the calling process to atomically change
> > its security namespace set. This would further suggest the need to
> > have the security namespace creation process also execute atomically
> > in a multi-LSM namespace change environment.
> In the setns case no new LSM namespaces should be created, the process
> simply joins an existing set of LSM namespaces.
The issue isn't about new namespaces being created, the issue is
atomicity of a change to a new set of security policies.
With setns an atomic transition is implemented.
The proposed lsm_unshare() behavior results in a period of time when
multiple and varying security policies are active, depending on
various race issues in the orchestrator implementation.
This opens the door to a raft of potential security issues that we can
have a new acronym for, Time Of Implementation Time Of Use (TOITOU).
> > ... That is the concept of whether or not a setns
> > call, for any resource namespace, should also force a security
> > namespace change if the security namespace of the calling process
> > differs from that of the target process.
> That decision is left to the individual LSMs.
That is reasonable.
In order to support that model, there would seem to be a need to have
a new LSM call in the setns code that allows LSM's to determine
whether or not a change in the active security namespace set should be
forced, correct?
If so, is implementation of this in scope for the lsm_unshare()
infrastructure?
To close, at the risk of being the devils advocate.
Given that the sentiment is to force almost all of these
issues/decisions into the individual LSM's, what is the advantage of
having a common lsm_unshare() system call?
In the proposed model, a resource orchestrator is going to need to
have extensive knowledge over the mechanics of all the LSM's that
implement namespace functionality. At a very minimum, intrinsic to
the concept of security namespaces, there will be a need to load a new
policy or model into the namespace, an action that will be deeply LSM
specific.
At this point, the only common functionality may be the allocation of
a new LSM namespace 'blob'. An argument for not doing that in
lsm_unshare() is that it precludes the ability of an orchestrator to
implement an atomic policy change, as that would require an
orchestrator to somehow load a policy/model before lsm_unshare() is
called, which in turn would require a new security context to be
allocated prior to the unshare operation.
All of this tends to be an issue with integrity or measurement based
namespaces, which are important with respect to supporting
confidential computing initiatives. Without two stage namespace
transition, you stumble into subtle problems associated with
'Heisenberg dilemma' issues.
> paul-moore.com
Hopefully all of this will assist in defining the requirements for all
of this.
Have a good remainder of the week.
As always,
Dr. Greg
The Quixote Project - Flailing at the Travails of Cybersecurity
https://github.com/Quixote-Project
^ permalink raw reply
* Re: [PATCH v8 04/12] landlock: Control pathname UNIX domain socket resolution by path
From: Sebastian Andrzej Siewior @ 2026-04-02 9:51 UTC (permalink / raw)
To: Günther Noack
Cc: Mickaël Salaün, John Johansen, Tingmao Wang,
Justin Suess, Kuniyuki Iwashima, Jann Horn, linux-security-module,
Samasth Norway Ananda, Matthieu Buffet, Mikhail Ivanov,
konstantin.meskhidze, Demi Marie Obenour, Alyssa Ross,
Tahera Fahimi, Georgia Garcia
In-Reply-To: <20260327164838.38231-5-gnoack3000@gmail.com>
On 2026-03-27 17:48:29 [+0100], Günther Noack wrote:
> * Add a new access right LANDLOCK_ACCESS_FS_RESOLVE_UNIX, which
> controls the lookup operations for named UNIX domain sockets. The
> resolution happens during connect() and sendmsg() (depending on
> socket type).
…
> Cc: Tingmao Wang <m@maowtm.org>
> Cc: Justin Suess <utilityemal77@gmail.com>
> Cc: Mickaël Salaün <mic@digikod.net>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Kuniyuki Iwashima <kuniyu@google.com>
> Suggested-by: Jann Horn <jannh@google.com>
> Link[1]: https://github.com/landlock-lsm/linux/issues/36
> Link[2]: https://lore.kernel.org/all/20260205.8531e4005118@gnoack.org/
> Signed-off-by: Günther Noack <gnoack3000@gmail.com>
The unix bits look okay to me,
Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Sebastian
^ permalink raw reply
* Re: [PATCH v6.1] apparmor: fix unprivileged local user can do privileged policy management
From: Keerthana Kalyanasundaram @ 2026-04-02 8:03 UTC (permalink / raw)
To: Greg KH
Cc: stable, john.johansen, paul, jmorris, serge, georgia.garcia,
cengiz.can, sashal, apparmor, linux-security-module, linux-kernel,
ajay.kaher, alexey.makhalov, vamsi-krishna.brahmajosyula,
yin.ding, tapas.kundu, Qualys Security Advisory,
Salvatore Bonaccorso
In-Reply-To: <2026040249-fable-sasquatch-4864@gregkh>
[-- Attachment #1.1: Type: text/plain, Size: 942 bytes --]
On Thu, Apr 2, 2026 at 11:31 AM Greg KH <gregkh@linuxfoundation.org> wrote:
> On Thu, Apr 02, 2026 at 05:47:00AM +0000, Keerthana K wrote:
> > From: John Johansen <john.johansen@canonical.com>
> >
> > commit 6601e13e82841879406bf9f369032656f441a425 upstream.
>
> <snip>
>
> Does your group/company/whatever actually use apparmor? If so, this
> isn't the only commit that needs to be backported. I'm waiting on a
> "correct" set of 6.1.y patches from John before applying all of them to
> 6.1.y and then I can take the patch series that he gave me for 5.10.y
> and 5.15.y and will queue them up.
>
> So thanks for this backport, but it's not going to help resolve all of
> the recent fixes that went in as part of this series by just applying
> one of them.
>
> Thanks for the update, Greg. We will wait for John to queue and apply the
complete series of patches to the stable branches.
thanks,
>
> greg k-h
>
[-- Attachment #1.2: Type: text/html, Size: 1696 bytes --]
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 5459 bytes --]
^ permalink raw reply
* Re: [PATCH v6.1] apparmor: fix unprivileged local user can do privileged policy management
From: Greg KH @ 2026-04-02 6:01 UTC (permalink / raw)
To: Keerthana K
Cc: stable, john.johansen, paul, jmorris, serge, georgia.garcia,
cengiz.can, sashal, apparmor, linux-security-module, linux-kernel,
ajay.kaher, alexey.makhalov, vamsi-krishna.brahmajosyula,
yin.ding, tapas.kundu, Qualys Security Advisory,
Salvatore Bonaccorso
In-Reply-To: <20260402054700.2798707-1-keerthana.kalyanasundaram@broadcom.com>
On Thu, Apr 02, 2026 at 05:47:00AM +0000, Keerthana K wrote:
> From: John Johansen <john.johansen@canonical.com>
>
> commit 6601e13e82841879406bf9f369032656f441a425 upstream.
<snip>
Does your group/company/whatever actually use apparmor? If so, this
isn't the only commit that needs to be backported. I'm waiting on a
"correct" set of 6.1.y patches from John before applying all of them to
6.1.y and then I can take the patch series that he gave me for 5.10.y
and 5.15.y and will queue them up.
So thanks for this backport, but it's not going to help resolve all of
the recent fixes that went in as part of this series by just applying
one of them.
thanks,
greg k-h
^ permalink raw reply
* [PATCH v5.10-v5.15] apparmor: fix unprivileged local user can do privileged policy management
From: Keerthana K @ 2026-04-02 5:47 UTC (permalink / raw)
To: stable, gregkh
Cc: john.johansen, paul, jmorris, serge, georgia.garcia, cengiz.can,
sashal, apparmor, linux-security-module, linux-kernel, ajay.kaher,
alexey.makhalov, vamsi-krishna.brahmajosyula, yin.ding,
tapas.kundu, Qualys Security Advisory, Salvatore Bonaccorso,
Keerthana K
From: John Johansen <john.johansen@canonical.com>
commit 6601e13e82841879406bf9f369032656f441a425 upstream.
An unprivileged local user can load, replace, and remove profiles by
opening the apparmorfs interfaces, via a confused deputy attack, by
passing the opened fd to a privileged process, and getting the
privileged process to write to the interface.
This does require a privileged target that can be manipulated to do
the write for the unprivileged process, but once such access is
achieved full policy management is possible and all the possible
implications that implies: removing confinement, DoS of system or
target applications by denying all execution, by-passing the
unprivileged user namespace restriction, to exploiting kernel bugs for
a local privilege escalation.
The policy management interface can not have its permissions simply
changed from 0666 to 0600 because non-root processes need to be able
to load policy to different policy namespaces.
Instead ensure the task writing the interface has privileges that
are a subset of the task that opened the interface. This is already
done via policy for confined processes, but unconfined can delegate
access to the opened fd, by-passing the usual policy check.
Fixes: b7fd2c0340eac ("apparmor: add per policy ns .load, .replace, .remove interface files")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[Keerthana: aa_may_manage_policy() does not take a subj_cred
parameter (added in 90c436a64a6e, merged in v6.7). Pass current_cred()
directly to is_subset_of_obj_privilege() in place of subj_cred, which
is equivalent since all call sites pass current_cred() as subj_cred.]
Signed-off-by: Keerthana K <keerthana.kalyanasundaram@broadcom.com>
---
security/apparmor/apparmorfs.c | 16 ++++++++------
security/apparmor/include/policy.h | 2 +-
security/apparmor/policy.c | 35 +++++++++++++++++++++++++++++-
3 files changed, 44 insertions(+), 9 deletions(-)
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index e736936f4f0b..3053e5731b02 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -409,7 +409,8 @@ static struct aa_loaddata *aa_simple_write_to_buffer(const char __user *userbuf,
}
static ssize_t policy_update(u32 mask, const char __user *buf, size_t size,
- loff_t *pos, struct aa_ns *ns)
+ loff_t *pos, struct aa_ns *ns,
+ const struct cred *ocred)
{
struct aa_loaddata *data;
struct aa_label *label;
@@ -420,7 +421,7 @@ static ssize_t policy_update(u32 mask, const char __user *buf, size_t size,
/* high level check about policy management - fine grained in
* below after unpack
*/
- error = aa_may_manage_policy(label, ns, mask);
+ error = aa_may_manage_policy(label, ns, ocred, mask);
if (error)
goto end_section;
@@ -441,7 +442,8 @@ static ssize_t profile_load(struct file *f, const char __user *buf, size_t size,
loff_t *pos)
{
struct aa_ns *ns = aa_get_ns(f->f_inode->i_private);
- int error = policy_update(AA_MAY_LOAD_POLICY, buf, size, pos, ns);
+ int error = policy_update(AA_MAY_LOAD_POLICY, buf, size, pos, ns,
+ f->f_cred);
aa_put_ns(ns);
@@ -459,7 +461,7 @@ static ssize_t profile_replace(struct file *f, const char __user *buf,
{
struct aa_ns *ns = aa_get_ns(f->f_inode->i_private);
int error = policy_update(AA_MAY_LOAD_POLICY | AA_MAY_REPLACE_POLICY,
- buf, size, pos, ns);
+ buf, size, pos, ns, f->f_cred);
aa_put_ns(ns);
return error;
@@ -483,7 +485,7 @@ static ssize_t profile_remove(struct file *f, const char __user *buf,
/* high level check about policy management - fine grained in
* below after unpack
*/
- error = aa_may_manage_policy(label, ns, AA_MAY_REMOVE_POLICY);
+ error = aa_may_manage_policy(label, ns, f->f_cred, AA_MAY_REMOVE_POLICY);
if (error)
goto out;
@@ -1796,7 +1798,7 @@ static int ns_mkdir_op(struct inode *dir, struct dentry *dentry, umode_t mode)
int error;
label = begin_current_label_crit_section();
- error = aa_may_manage_policy(label, NULL, AA_MAY_LOAD_POLICY);
+ error = aa_may_manage_policy(label, NULL, NULL, AA_MAY_LOAD_POLICY);
end_current_label_crit_section(label);
if (error)
return error;
@@ -1845,7 +1847,7 @@ static int ns_rmdir_op(struct inode *dir, struct dentry *dentry)
int error;
label = begin_current_label_crit_section();
- error = aa_may_manage_policy(label, NULL, AA_MAY_LOAD_POLICY);
+ error = aa_may_manage_policy(label, NULL, NULL, AA_MAY_LOAD_POLICY);
end_current_label_crit_section(label);
if (error)
return error;
diff --git a/security/apparmor/include/policy.h b/security/apparmor/include/policy.h
index b5aa4231af68..f6682a31df23 100644
--- a/security/apparmor/include/policy.h
+++ b/security/apparmor/include/policy.h
@@ -304,6 +304,6 @@ static inline int AUDIT_MODE(struct aa_profile *profile)
bool policy_view_capable(struct aa_ns *ns);
bool policy_admin_capable(struct aa_ns *ns);
int aa_may_manage_policy(struct aa_label *label, struct aa_ns *ns,
- u32 mask);
+ const struct cred *ocred, u32 mask);
#endif /* __AA_POLICY_H */
diff --git a/security/apparmor/policy.c b/security/apparmor/policy.c
index e59bdb750ef0..f2bc865bc7b6 100644
--- a/security/apparmor/policy.c
+++ b/security/apparmor/policy.c
@@ -671,14 +671,42 @@ bool policy_admin_capable(struct aa_ns *ns)
return policy_view_capable(ns) && capable && !aa_g_lock_policy;
}
+static bool is_subset_of_obj_privilege(const struct cred *cred,
+ struct aa_label *label,
+ const struct cred *ocred)
+{
+ if (cred == ocred)
+ return true;
+
+ if (!aa_label_is_subset(label, cred_label(ocred)))
+ return false;
+ /* don't allow crossing userns for now */
+ if (cred->user_ns != ocred->user_ns)
+ return false;
+ if (!cap_issubset(cred->cap_inheritable, ocred->cap_inheritable))
+ return false;
+ if (!cap_issubset(cred->cap_permitted, ocred->cap_permitted))
+ return false;
+ if (!cap_issubset(cred->cap_effective, ocred->cap_effective))
+ return false;
+ if (!cap_issubset(cred->cap_bset, ocred->cap_bset))
+ return false;
+ if (!cap_issubset(cred->cap_ambient, ocred->cap_ambient))
+ return false;
+ return true;
+}
+
+
/**
* aa_may_manage_policy - can the current task manage policy
* @label: label to check if it can manage policy
+ * @ocred: object cred if request is coming from an open object
* @op: the policy manipulation operation being done
*
* Returns: 0 if the task is allowed to manipulate policy else error
*/
-int aa_may_manage_policy(struct aa_label *label, struct aa_ns *ns, u32 mask)
+int aa_may_manage_policy(struct aa_label *label, struct aa_ns *ns,
+ const struct cred *ocred, u32 mask)
{
const char *op;
@@ -694,6 +722,11 @@ int aa_may_manage_policy(struct aa_label *label, struct aa_ns *ns, u32 mask)
return audit_policy(label, op, NULL, NULL, "policy_locked",
-EACCES);
+ if (ocred && !is_subset_of_obj_privilege(current_cred(), label, ocred))
+ return audit_policy(label, op, NULL, NULL,
+ "not privileged for target profile",
+ -EACCES);
+
if (!policy_admin_capable(ns))
return audit_policy(label, op, NULL, NULL, "not policy admin",
-EACCES);
--
2.43.7
^ permalink raw reply related
* [PATCH v6.1] apparmor: fix unprivileged local user can do privileged policy management
From: Keerthana K @ 2026-04-02 5:47 UTC (permalink / raw)
To: stable, gregkh
Cc: john.johansen, paul, jmorris, serge, georgia.garcia, cengiz.can,
sashal, apparmor, linux-security-module, linux-kernel, ajay.kaher,
alexey.makhalov, vamsi-krishna.brahmajosyula, yin.ding,
tapas.kundu, Qualys Security Advisory, Salvatore Bonaccorso,
Keerthana K
From: John Johansen <john.johansen@canonical.com>
commit 6601e13e82841879406bf9f369032656f441a425 upstream.
An unprivileged local user can load, replace, and remove profiles by
opening the apparmorfs interfaces, via a confused deputy attack, by
passing the opened fd to a privileged process, and getting the
privileged process to write to the interface.
This does require a privileged target that can be manipulated to do
the write for the unprivileged process, but once such access is
achieved full policy management is possible and all the possible
implications that implies: removing confinement, DoS of system or
target applications by denying all execution, by-passing the
unprivileged user namespace restriction, to exploiting kernel bugs for
a local privilege escalation.
The policy management interface can not have its permissions simply
changed from 0666 to 0600 because non-root processes need to be able
to load policy to different policy namespaces.
Instead ensure the task writing the interface has privileges that
are a subset of the task that opened the interface. This is already
done via policy for confined processes, but unconfined can delegate
access to the opened fd, by-passing the usual policy check.
Fixes: b7fd2c0340eac ("apparmor: add per policy ns .load, .replace, .remove interface files")
Reported-by: Qualys Security Advisory <qsa@qualys.com>
Tested-by: Salvatore Bonaccorso <carnil@debian.org>
Reviewed-by: Georgia Garcia <georgia.garcia@canonical.com>
Reviewed-by: Cengiz Can <cengiz.can@canonical.com>
Signed-off-by: John Johansen <john.johansen@canonical.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
[Keerthana: aa_may_manage_policy() does not take a subj_cred
parameter (added in 90c436a64a6e, merged in v6.7). Pass current_cred()
directly to is_subset_of_obj_privilege() in place of subj_cred, which
is equivalent since all call sites pass current_cred() as subj_cred.]
Signed-off-by: Keerthana K <keerthana.kalyanasundaram@broadcom.com>
---
security/apparmor/apparmorfs.c | 16 ++++++++------
security/apparmor/include/policy.h | 2 +-
security/apparmor/policy.c | 35 +++++++++++++++++++++++++++++-
3 files changed, 44 insertions(+), 9 deletions(-)
diff --git a/security/apparmor/apparmorfs.c b/security/apparmor/apparmorfs.c
index fa518cd82366..fa4a6f20f58e 100644
--- a/security/apparmor/apparmorfs.c
+++ b/security/apparmor/apparmorfs.c
@@ -412,7 +412,8 @@ static struct aa_loaddata *aa_simple_write_to_buffer(const char __user *userbuf,
}
static ssize_t policy_update(u32 mask, const char __user *buf, size_t size,
- loff_t *pos, struct aa_ns *ns)
+ loff_t *pos, struct aa_ns *ns,
+ const struct cred *ocred)
{
struct aa_loaddata *data;
struct aa_label *label;
@@ -423,7 +424,7 @@ static ssize_t policy_update(u32 mask, const char __user *buf, size_t size,
/* high level check about policy management - fine grained in
* below after unpack
*/
- error = aa_may_manage_policy(label, ns, mask);
+ error = aa_may_manage_policy(label, ns, ocred, mask);
if (error)
goto end_section;
@@ -444,7 +445,8 @@ static ssize_t profile_load(struct file *f, const char __user *buf, size_t size,
loff_t *pos)
{
struct aa_ns *ns = aa_get_ns(f->f_inode->i_private);
- int error = policy_update(AA_MAY_LOAD_POLICY, buf, size, pos, ns);
+ int error = policy_update(AA_MAY_LOAD_POLICY, buf, size, pos, ns,
+ f->f_cred);
aa_put_ns(ns);
@@ -462,7 +464,7 @@ static ssize_t profile_replace(struct file *f, const char __user *buf,
{
struct aa_ns *ns = aa_get_ns(f->f_inode->i_private);
int error = policy_update(AA_MAY_LOAD_POLICY | AA_MAY_REPLACE_POLICY,
- buf, size, pos, ns);
+ buf, size, pos, ns, f->f_cred);
aa_put_ns(ns);
return error;
@@ -486,7 +488,7 @@ static ssize_t profile_remove(struct file *f, const char __user *buf,
/* high level check about policy management - fine grained in
* below after unpack
*/
- error = aa_may_manage_policy(label, ns, AA_MAY_REMOVE_POLICY);
+ error = aa_may_manage_policy(label, ns, f->f_cred, AA_MAY_REMOVE_POLICY);
if (error)
goto out;
@@ -1808,7 +1810,7 @@ static int ns_mkdir_op(struct user_namespace *mnt_userns, struct inode *dir,
int error;
label = begin_current_label_crit_section();
- error = aa_may_manage_policy(label, NULL, AA_MAY_LOAD_POLICY);
+ error = aa_may_manage_policy(label, NULL, NULL, AA_MAY_LOAD_POLICY);
end_current_label_crit_section(label);
if (error)
return error;
@@ -1857,7 +1859,7 @@ static int ns_rmdir_op(struct inode *dir, struct dentry *dentry)
int error;
label = begin_current_label_crit_section();
- error = aa_may_manage_policy(label, NULL, AA_MAY_LOAD_POLICY);
+ error = aa_may_manage_policy(label, NULL, NULL, AA_MAY_LOAD_POLICY);
end_current_label_crit_section(label);
if (error)
return error;
diff --git a/security/apparmor/include/policy.h b/security/apparmor/include/policy.h
index 639b5b248e63..3f776f5e8de4 100644
--- a/security/apparmor/include/policy.h
+++ b/security/apparmor/include/policy.h
@@ -308,7 +308,7 @@ static inline int AUDIT_MODE(struct aa_profile *profile)
bool aa_policy_view_capable(struct aa_label *label, struct aa_ns *ns);
bool aa_policy_admin_capable(struct aa_label *label, struct aa_ns *ns);
int aa_may_manage_policy(struct aa_label *label, struct aa_ns *ns,
- u32 mask);
+ const struct cred *ocred, u32 mask);
bool aa_current_policy_view_capable(struct aa_ns *ns);
bool aa_current_policy_admin_capable(struct aa_ns *ns);
diff --git a/security/apparmor/policy.c b/security/apparmor/policy.c
index 4ee5a450d118..e7412a221551 100644
--- a/security/apparmor/policy.c
+++ b/security/apparmor/policy.c
@@ -712,14 +712,42 @@ bool aa_current_policy_admin_capable(struct aa_ns *ns)
return res;
}
+static bool is_subset_of_obj_privilege(const struct cred *cred,
+ struct aa_label *label,
+ const struct cred *ocred)
+{
+ if (cred == ocred)
+ return true;
+
+ if (!aa_label_is_subset(label, cred_label(ocred)))
+ return false;
+ /* don't allow crossing userns for now */
+ if (cred->user_ns != ocred->user_ns)
+ return false;
+ if (!cap_issubset(cred->cap_inheritable, ocred->cap_inheritable))
+ return false;
+ if (!cap_issubset(cred->cap_permitted, ocred->cap_permitted))
+ return false;
+ if (!cap_issubset(cred->cap_effective, ocred->cap_effective))
+ return false;
+ if (!cap_issubset(cred->cap_bset, ocred->cap_bset))
+ return false;
+ if (!cap_issubset(cred->cap_ambient, ocred->cap_ambient))
+ return false;
+ return true;
+}
+
+
/**
* aa_may_manage_policy - can the current task manage policy
* @label: label to check if it can manage policy
+ * @ocred: object cred if request is coming from an open object
* @op: the policy manipulation operation being done
*
* Returns: 0 if the task is allowed to manipulate policy else error
*/
-int aa_may_manage_policy(struct aa_label *label, struct aa_ns *ns, u32 mask)
+int aa_may_manage_policy(struct aa_label *label, struct aa_ns *ns,
+ const struct cred *ocred, u32 mask)
{
const char *op;
@@ -735,6 +763,11 @@ int aa_may_manage_policy(struct aa_label *label, struct aa_ns *ns, u32 mask)
return audit_policy(label, op, NULL, NULL, "policy_locked",
-EACCES);
+ if (ocred && !is_subset_of_obj_privilege(current_cred(), label, ocred))
+ return audit_policy(label, op, NULL, NULL,
+ "not privileged for target profile",
+ -EACCES);
+
if (!aa_policy_admin_capable(label, ns))
return audit_policy(label, op, NULL, NULL, "not policy admin",
-EACCES);
--
2.43.7
^ permalink raw reply related
* Re: [RFC PATCH v1 02/11] security: Add LSM_AUDIT_DATA_NS for namespace audit records
From: Mickaël Salaün @ 2026-04-01 18:48 UTC (permalink / raw)
To: Christian Brauner
Cc: Günther Noack, Paul Moore, Serge E . Hallyn, Justin Suess,
Lennart Poettering, Mikhail Ivanov, Nicolas Bouchinet,
Shervin Oloumi, Tingmao Wang, kernel-team, linux-fsdevel,
linux-kernel, linux-security-module
In-Reply-To: <20260401.XohYeim7tah8@digikod.net>
On Wed, Apr 01, 2026 at 06:38:34PM +0200, Mickaël Salaün wrote:
> On Wed, Mar 25, 2026 at 01:32:42PM +0100, Christian Brauner wrote:
> > On Thu, Mar 12, 2026 at 11:04:35AM +0100, Mickaël Salaün wrote:
> > > Add a new LSM audit data type LSM_AUDIT_DATA_NS that logs namespace
> > > information in audit records. Two fields are provided, matching the
> > > field names of struct ns_common:
> > >
> > > - ns_type: the CLONE_NEW* flag identifying the namespace type, logged in
> > > hexadecimal.
> > >
> > > - inum: the proc inode number identifying a specific namespace instance.
> > > Namespace inode numbers are allocated by proc_alloc_inum() via
> > > ida_alloc_max() bounded to UINT_MAX, so the value always fits in 32
> > > bits.
> > >
> > > A new audit data type is needed because no existing LSM_AUDIT_DATA_*
> > > type carries namespace information. The closest alternatives (e.g.
> > > LSM_AUDIT_DATA_TASK or LSM_AUDIT_DATA_NONE with custom strings) would
> > > either lose the namespace type or require ad-hoc formatting that
> > > bypasses the structured audit data union.
> > >
> > > Cc: Christian Brauner <brauner@kernel.org>
> > > Cc: Günther Noack <gnoack@google.com>
> > > Cc: Paul Moore <paul@paul-moore.com>
> > > Signed-off-by: Mickaël Salaün <mic@digikod.net>
> > > ---
> > > include/linux/lsm_audit.h | 5 +++++
> > > security/lsm_audit.c | 4 ++++
> > > 2 files changed, 9 insertions(+)
> > >
> > > diff --git a/include/linux/lsm_audit.h b/include/linux/lsm_audit.h
> > > index 382c56a97bba..6e20a56b8c22 100644
> > > --- a/include/linux/lsm_audit.h
> > > +++ b/include/linux/lsm_audit.h
> > > @@ -78,6 +78,7 @@ struct common_audit_data {
> > > #define LSM_AUDIT_DATA_NOTIFICATION 16
> > > #define LSM_AUDIT_DATA_ANONINODE 17
> > > #define LSM_AUDIT_DATA_NLMSGTYPE 18
> > > +#define LSM_AUDIT_DATA_NS 19
> > > union {
> > > struct path path;
> > > struct dentry *dentry;
> > > @@ -100,6 +101,10 @@ struct common_audit_data {
> > > int reason;
> > > const char *anonclass;
> > > u16 nlmsg_type;
> > > + struct {
> > > + u32 ns_type;
> > > + unsigned int inum;
> >
> > fwiw, you might want to start the 64-bit namespace id as well.
> > But either way:
>
> Right now these numbers are generated by ida_alloc_max(), which return
> an int. Is there an ongoing patch series for this change?
OK, we should not use the inode's number (32-bit) but the namespace ID
(64-bit) which is readable with the NS_GET_ID IOCTL on the namespace
FDs. I'll use that with ns_id instead of inum. I'll also update the
Landlock code and tests accordingly.
^ permalink raw reply
* Re: [PATCH v8 03/12] landlock: Replace union access_masks_all with helper functions
From: Mickaël Salaün @ 2026-04-01 17:57 UTC (permalink / raw)
To: Günther Noack
Cc: John Johansen, kernel test robot, linux-security-module,
Tingmao Wang, Justin Suess, Samasth Norway Ananda,
Matthieu Buffet, Mikhail Ivanov, konstantin.meskhidze,
Demi Marie Obenour, Alyssa Ross, Jann Horn, Tahera Fahimi,
Sebastian Andrzej Siewior, Kuniyuki Iwashima, Georgia Garcia
In-Reply-To: <20260330.6229e57c9563@gnoack.org>
On Mon, Mar 30, 2026 at 09:00:31PM +0200, Günther Noack wrote:
> On Mon, Mar 30, 2026 at 12:53:21PM +0200, Mickaël Salaün wrote:
> > On Mon, Mar 30, 2026 at 11:56:40AM +0200, Mickaël Salaün wrote:
> > > On Fri, Mar 27, 2026 at 05:48:28PM +0100, Günther Noack wrote:
> > > > * Stop using a union for access_masks_all.
> > > > * Expose helper functions for intersection checks and union operations.
> > > >
> > > > The memory layout of bitfields is only loosely defined by the C
> > > > standard, so our static assertion that expects a fixed size was
> > > > brittle, and it broke on some compilers when we attempted to add a
> > > > 17th file system access right.
> > > >
> > > > Reported-by: kernel test robot <lkp@intel.com>
> > > > Closes: https://lore.kernel.org/oe-kbuild-all/202603261438.jBx2DGNe-lkp@intel.com/
> > > > Signed-off-by: Günther Noack <gnoack3000@gmail.com>
> > > > ---
> > > > security/landlock/access.h | 21 ++++++++++++++-------
> > > > security/landlock/cred.h | 10 ++--------
> > > > security/landlock/ruleset.h | 13 ++++---------
> > > > 3 files changed, 20 insertions(+), 24 deletions(-)
> > >
> > > I'd prefer this approach:
> > >
> > > diff --git a/security/landlock/access.h b/security/landlock/access.h
> > > index 89dc8e7b93da..bc9efbb5c900 100644
> > > --- a/security/landlock/access.h
> > > +++ b/security/landlock/access.h
> > > @@ -50,7 +50,7 @@ struct access_masks {
> > > access_mask_t fs : LANDLOCK_NUM_ACCESS_FS;
> > > access_mask_t net : LANDLOCK_NUM_ACCESS_NET;
> > > access_mask_t scope : LANDLOCK_NUM_SCOPE;
> > > -};
> > > +} __packed;
> >
> > Actually, we can just use '__packed __aligned(sizeof(u32))' and avoid
> > the static_assert change. That would have no impact on x86, but pack it
> > on m68k.
>
> Thanks, good catch (and thanks for pushing it to mic-next).
> Fingers crossed that this works on m68k.
So, this works! I did some experiments with m68k and this architecture
is very special: it packs bitfields at byte granularity, not at
storage-unit granularity, except when the size of a bitfield is a
multiple of 8, in which case it aligns on this size.
I also look at the past versions of Landlock (in the stable branches),
and they are good because struct access_masks (and the related assert)
was introduced in v6.11 and fs was exactly 16 bits, which makes m68k
aligns on 2 bytes and then the size of the struct was 4 bytes.
Switching fs to 17 bits removes this optimization (I guess) and pack
(back) to 3 bytes, so recording more bits can take less space!
That's why we don't need a standalone fix to backport...
^ permalink raw reply
* Re: [PATCH v4 09/13] ima: Add support for staging measurements with prompt
From: steven chen @ 2026-04-01 17:52 UTC (permalink / raw)
To: Roberto Sassu, corbet, skhan, zohar, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu, steven chen
In-Reply-To: <19a1815a1222bd78f6bfde30f60b60ebfacb65aa.camel@huaweicloud.com>
On 3/27/2026 9:45 AM, Roberto Sassu wrote:
> On Thu, 2026-03-26 at 15:44 -0700, steven chen wrote:
>> On 3/26/2026 10:30 AM, Roberto Sassu wrote:
>>> From: Roberto Sassu <roberto.sassu@huawei.com>
>>>
>>> Introduce the ability of staging the IMA measurement list and deleting them
>>> with a prompt.
>>>
>>> Staging means moving the current content of the measurement list to a
>>> separate location, and allowing users to read and delete it. This causes
>>> the measurement list to be atomically truncated before new measurements can
>>> be added. Staging can be done only once at a time. In the event of kexec(),
>>> staging is reverted and staged entries will be carried over to the new
>>> kernel.
>>>
>>> Introduce ascii_runtime_measurements_<algo>_staged and
>>> binary_runtime_measurements_<algo>_staged interfaces to stage and delete
>>> the measurements. Use 'echo A > <IMA interface>' and
>>> 'echo D > <IMA interface>' to respectively stage and delete the entire
>>> measurements list. Locking of these interfaces is also mediated with a call
>>> to _ima_measurements_open() and with ima_measurements_release().
>>>
>>> Implement the staging functionality by introducing the new global
>>> measurements list ima_measurements_staged, and ima_queue_stage() and
>>> ima_queue_delete_staged_all() to respectively move measurements from the
>>> current measurements list to the staged one, and to move staged
>>> measurements to the ima_measurements_trim list for deletion. Introduce
>>> ima_queue_delete() to delete the measurements.
>>>
>>> Finally, introduce the BINARY_STAGED AND BINARY_FULL binary measurements
>>> list types, to maintain the counters and the binary size of staged
>>> measurements and the full measurements list (including entries that were
>>> staged). BINARY still represents the current binary measurements list.
>>>
>>> Use the binary size for the BINARY + BINARY_STAGED types in
>>> ima_add_kexec_buffer(), since both measurements list types are copied to
>>> the secondary kernel during kexec. Use BINARY_FULL in
>>> ima_measure_kexec_event(), to generate a critical data record.
>>>
>>> It should be noted that the BINARY_FULL counter is not passed through
>>> kexec. Thus, the number of entries included in the kexec critical data
>>> records refers to the entries since the previous kexec records.
>>>
>>> Note: This code derives from the Alt-IMA Huawei project, whose license is
>>> GPL-2.0 OR MIT.
>>>
>>> Link: https://github.com/linux-integrity/linux/issues/1
>>> Suggested-by: Gregory Lumen <gregorylumen@linux.microsoft.com> (staging revert)
>>> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
>>> ---
>>> security/integrity/ima/Kconfig | 13 +++
>>> security/integrity/ima/ima.h | 8 +-
>>> security/integrity/ima/ima_fs.c | 167 ++++++++++++++++++++++++++---
>>> security/integrity/ima/ima_kexec.c | 22 +++-
>>> security/integrity/ima/ima_queue.c | 97 ++++++++++++++++-
>>> 5 files changed, 286 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
>>> index 976e75f9b9ba..e714726f3384 100644
>>> --- a/security/integrity/ima/Kconfig
>>> +++ b/security/integrity/ima/Kconfig
>>> @@ -332,4 +332,17 @@ config IMA_KEXEC_EXTRA_MEMORY_KB
>>> If set to the default value of 0, an extra half page of memory for those
>>> additional measurements will be allocated.
>>>
>>> +config IMA_STAGING
>>> + bool "Support for staging the measurements list"
>>> + default y
>>> + help
>>> + Add support for staging the measurements list.
>>> +
>>> + It allows user space to stage the measurements list for deletion and
>>> + to delete the staged measurements after confirmation.
>>> +
>>> + On kexec, staging is reverted and staged measurements are prepended
>>> + to the current measurements list when measurements are copied to the
>>> + secondary kernel.
>>> +
>>> endif
>>> diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
>>> index 97b7d6024b5d..65db152a0a24 100644
>>> --- a/security/integrity/ima/ima.h
>>> +++ b/security/integrity/ima/ima.h
>>> @@ -30,9 +30,11 @@ enum tpm_pcrs { TPM_PCR0 = 0, TPM_PCR8 = 8, TPM_PCR10 = 10 };
>>>
>>> /*
>>> * BINARY: current binary measurements list
>>> + * BINARY_STAGED: staged binary measurements list
>>> + * BINARY_FULL: binary measurements list since IMA init (lost after kexec)
>>> */
>>> enum binary_lists {
>>> - BINARY, BINARY__LAST
>>> + BINARY, BINARY_STAGED, BINARY_FULL, BINARY__LAST
>>> };
>>>
>>> /* digest size for IMA, fits SHA1 or MD5 */
>>> @@ -125,6 +127,7 @@ struct ima_queue_entry {
>>> struct ima_template_entry *entry;
>>> };
>>> extern struct list_head ima_measurements; /* list of all measurements */
>>> +extern struct list_head ima_measurements_staged; /* list of staged meas. */
>>>
>>> /* Some details preceding the binary serialized measurement list */
>>> struct ima_kexec_hdr {
>>> @@ -314,6 +317,8 @@ struct ima_template_desc *ima_template_desc_current(void);
>>> struct ima_template_desc *ima_template_desc_buf(void);
>>> struct ima_template_desc *lookup_template_desc(const char *name);
>>> bool ima_template_has_modsig(const struct ima_template_desc *ima_template);
>>> +int ima_queue_stage(void);
>>> +int ima_queue_staged_delete_all(void);
>>> int ima_restore_measurement_entry(struct ima_template_entry *entry);
>>> int ima_restore_measurement_list(loff_t bufsize, void *buf);
>>> int ima_measurements_show(struct seq_file *m, void *v);
>>> @@ -334,6 +339,7 @@ extern spinlock_t ima_queue_lock;
>>> extern atomic_long_t ima_num_entries[BINARY__LAST];
>>> extern atomic_long_t ima_num_violations;
>>> extern struct hlist_head __rcu *ima_htable;
>>> +extern struct mutex ima_extend_list_mutex;
>>>
>>> static inline unsigned int ima_hash_key(u8 *digest)
>>> {
>>> diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
>>> index 7709a4576322..39d9128e9f22 100644
>>> --- a/security/integrity/ima/ima_fs.c
>>> +++ b/security/integrity/ima/ima_fs.c
>>> @@ -24,6 +24,13 @@
>>>
>>> #include "ima.h"
>>>
>>> +/*
>>> + * Requests:
>>> + * 'A\n': stage the entire measurements list
>>> + * 'D\n': delete all staged measurements
>>> + */
>>> +#define STAGED_REQ_LENGTH 21
>>> +
>>> static DEFINE_MUTEX(ima_write_mutex);
>>> static DEFINE_MUTEX(ima_measure_mutex);
>>> static long ima_measure_users;
>>> @@ -97,6 +104,11 @@ static void *ima_measurements_start(struct seq_file *m, loff_t *pos)
>>> return _ima_measurements_start(m, pos, &ima_measurements);
>>> }
>>>
>>> +static void *ima_measurements_staged_start(struct seq_file *m, loff_t *pos)
>>> +{
>>> + return _ima_measurements_start(m, pos, &ima_measurements_staged);
>>> +}
>>> +
>>> static void *_ima_measurements_next(struct seq_file *m, void *v, loff_t *pos,
>>> struct list_head *head)
>>> {
>>> @@ -118,6 +130,12 @@ static void *ima_measurements_next(struct seq_file *m, void *v, loff_t *pos)
>>> return _ima_measurements_next(m, v, pos, &ima_measurements);
>>> }
>>>
>>> +static void *ima_measurements_staged_next(struct seq_file *m, void *v,
>>> + loff_t *pos)
>>> +{
>>> + return _ima_measurements_next(m, v, pos, &ima_measurements_staged);
>>> +}
>>> +
>>> static void ima_measurements_stop(struct seq_file *m, void *v)
>>> {
>>> }
>>> @@ -283,6 +301,68 @@ static const struct file_operations ima_measurements_ops = {
>>> .release = ima_measurements_release,
>>> };
>>>
>>> +static const struct seq_operations ima_measurments_staged_seqops = {
>>> + .start = ima_measurements_staged_start,
>>> + .next = ima_measurements_staged_next,
>>> + .stop = ima_measurements_stop,
>>> + .show = ima_measurements_show
>>> +};
>>> +
>>> +static int ima_measurements_staged_open(struct inode *inode, struct file *file)
>>> +{
>>> + return _ima_measurements_open(inode, file,
>>> + &ima_measurments_staged_seqops);
>>> +}
>>> +
>>> +static ssize_t ima_measurements_staged_write(struct file *file,
>>> + const char __user *buf,
>>> + size_t datalen, loff_t *ppos)
>>> +{
>>> + char req[STAGED_REQ_LENGTH];
>>> + int ret;
>>> +
>>> + if (*ppos > 0 || datalen < 2 || datalen > STAGED_REQ_LENGTH)
>>> + return -EINVAL;
>>> +
>>> + if (copy_from_user(req, buf, datalen) != 0)
>>> + return -EFAULT;
>>> +
>>> + if (req[datalen - 1] != '\n')
>>> + return -EINVAL;
>>> +
>>> + req[datalen - 1] = '\0';
>>> +
>>> + switch (req[0]) {
>>> + case 'A':
>>> + if (datalen != 2)
>>> + return -EINVAL;
>>> +
>>> + ret = ima_queue_stage();
>>> + break;
>>> + case 'D':
>>> + if (datalen != 2)
>>> + return -EINVAL;
>>> +
>>> + ret = ima_queue_staged_delete_all();
>>> + break;
>> I think the following two steps may not work because of race condition:
>>
>> step1: ret = ima_queue_stage(); //this will put all logs in active list into staged list;
>> step2: ret = ima_queue_staged_delete_all(); //this will delete all logs in staged list;
>>
>> The following is the step of race condition:
>> 1. current active log list LA1;
>> 2. user agent read the TPM quote QA1 match list LA1;
>> 3. new event NewLog is added into active log list LA1+NewLog
>> 4. user agent call ima_queue_stage() and generated staged list
>> including LA1+NewLog.
>> 5. user agent call ima_queue_staged_delete_all();
>> The new log NewLog in step 3 is also deleted
> Please refer to the documentation patch which explains the intended
> workflow of this approach (Remote Attestation Agent Workflow).
>
> Roberto
So the user agent needs to deeply involve measurement list management:
in user space, user agent need to maintain two lists
user agent is more complicated in this case
Steven
>> Next time the attestation will fail if using the active log list in the
>> kernel.
>>
>> Thanks,
>>
>> Steven
>>
>>> + default:
>>> + ret = -EINVAL;
>>> + }
>>> +
>>> + if (ret < 0)
>>> + return ret;
>>> +
>>> + return datalen;
>>> +}
>>> +
>>> +static const struct file_operations ima_measurements_staged_ops = {
>>> + .open = ima_measurements_staged_open,
>>> + .read = seq_read,
>>> + .write = ima_measurements_staged_write,
>>> + .llseek = seq_lseek,
>>> + .release = ima_measurements_release,
>>> +};
>>> +
>>> void ima_print_digest(struct seq_file *m, u8 *digest, u32 size)
>>> {
>>> u32 i;
>>> @@ -356,6 +436,28 @@ static const struct file_operations ima_ascii_measurements_ops = {
>>> .release = ima_measurements_release,
>>> };
>>>
>>> +static const struct seq_operations ima_ascii_measurements_staged_seqops = {
>>> + .start = ima_measurements_staged_start,
>>> + .next = ima_measurements_staged_next,
>>> + .stop = ima_measurements_stop,
>>> + .show = ima_ascii_measurements_show
>>> +};
>>> +
>>> +static int ima_ascii_measurements_staged_open(struct inode *inode,
>>> + struct file *file)
>>> +{
>>> + return _ima_measurements_open(inode, file,
>>> + &ima_ascii_measurements_staged_seqops);
>>> +}
>>> +
>>> +static const struct file_operations ima_ascii_measurements_staged_ops = {
>>> + .open = ima_ascii_measurements_staged_open,
>>> + .read = seq_read,
>>> + .write = ima_measurements_staged_write,
>>> + .llseek = seq_lseek,
>>> + .release = ima_measurements_release,
>>> +};
>>> +
>>> static ssize_t ima_read_policy(char *path)
>>> {
>>> void *data = NULL;
>>> @@ -459,10 +561,21 @@ static const struct seq_operations ima_policy_seqops = {
>>> };
>>> #endif
>>>
>>> -static int __init create_securityfs_measurement_lists(void)
>>> +static int __init create_securityfs_measurement_lists(bool staging)
>>> {
>>> + const struct file_operations *ascii_ops = &ima_ascii_measurements_ops;
>>> + const struct file_operations *binary_ops = &ima_measurements_ops;
>>> + mode_t permissions = S_IRUSR | S_IRGRP;
>>> + const char *file_suffix = "";
>>> int count = NR_BANKS(ima_tpm_chip);
>>>
>>> + if (staging) {
>>> + ascii_ops = &ima_ascii_measurements_staged_ops;
>>> + binary_ops = &ima_measurements_staged_ops;
>>> + file_suffix = "_staged";
>>> + permissions |= (S_IWUSR | S_IWGRP);
>>> + }
>>> +
>>> if (ima_sha1_idx >= NR_BANKS(ima_tpm_chip))
>>> count++;
>>>
>>> @@ -473,29 +586,32 @@ static int __init create_securityfs_measurement_lists(void)
>>>
>>> if (algo == HASH_ALGO__LAST)
>>> snprintf(file_name, sizeof(file_name),
>>> - "ascii_runtime_measurements_tpm_alg_%x",
>>> - ima_tpm_chip->allocated_banks[i].alg_id);
>>> + "ascii_runtime_measurements_tpm_alg_%x%s",
>>> + ima_tpm_chip->allocated_banks[i].alg_id,
>>> + file_suffix);
>>> else
>>> snprintf(file_name, sizeof(file_name),
>>> - "ascii_runtime_measurements_%s",
>>> - hash_algo_name[algo]);
>>> - dentry = securityfs_create_file(file_name, S_IRUSR | S_IRGRP,
>>> + "ascii_runtime_measurements_%s%s",
>>> + hash_algo_name[algo], file_suffix);
>>> + dentry = securityfs_create_file(file_name, permissions,
>>> ima_dir, (void *)(uintptr_t)i,
>>> - &ima_ascii_measurements_ops);
>>> + ascii_ops);
>>> if (IS_ERR(dentry))
>>> return PTR_ERR(dentry);
>>>
>>> if (algo == HASH_ALGO__LAST)
>>> snprintf(file_name, sizeof(file_name),
>>> - "binary_runtime_measurements_tpm_alg_%x",
>>> - ima_tpm_chip->allocated_banks[i].alg_id);
>>> + "binary_runtime_measurements_tpm_alg_%x%s",
>>> + ima_tpm_chip->allocated_banks[i].alg_id,
>>> + file_suffix);
>>> else
>>> snprintf(file_name, sizeof(file_name),
>>> - "binary_runtime_measurements_%s",
>>> - hash_algo_name[algo]);
>>> - dentry = securityfs_create_file(file_name, S_IRUSR | S_IRGRP,
>>> + "binary_runtime_measurements_%s%s",
>>> + hash_algo_name[algo], file_suffix);
>>> +
>>> + dentry = securityfs_create_file(file_name, permissions,
>>> ima_dir, (void *)(uintptr_t)i,
>>> - &ima_measurements_ops);
>>> + binary_ops);
>>> if (IS_ERR(dentry))
>>> return PTR_ERR(dentry);
>>> }
>>> @@ -503,6 +619,23 @@ static int __init create_securityfs_measurement_lists(void)
>>> return 0;
>>> }
>>>
>>> +static int __init create_securityfs_staging_links(void)
>>> +{
>>> + struct dentry *dentry;
>>> +
>>> + dentry = securityfs_create_symlink("binary_runtime_measurements_staged",
>>> + ima_dir, "binary_runtime_measurements_sha1_staged", NULL);
>>> + if (IS_ERR(dentry))
>>> + return PTR_ERR(dentry);
>>> +
>>> + dentry = securityfs_create_symlink("ascii_runtime_measurements_staged",
>>> + ima_dir, "ascii_runtime_measurements_sha1_staged", NULL);
>>> + if (IS_ERR(dentry))
>>> + return PTR_ERR(dentry);
>>> +
>>> + return 0;
>>> +}
>>> +
>>> /*
>>> * ima_open_policy: sequentialize access to the policy file
>>> */
>>> @@ -595,7 +728,13 @@ int __init ima_fs_init(void)
>>> goto out;
>>> }
>>>
>>> - ret = create_securityfs_measurement_lists();
>>> + ret = create_securityfs_measurement_lists(false);
>>> + if (ret == 0 && IS_ENABLED(CONFIG_IMA_STAGING)) {
>>> + ret = create_securityfs_measurement_lists(true);
>>> + if (ret == 0)
>>> + ret = create_securityfs_staging_links();
>>> + }
>>> +
>>> if (ret != 0)
>>> goto out;
>>>
>>> diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
>>> index d7d0fb639d99..d5503dd5cc9b 100644
>>> --- a/security/integrity/ima/ima_kexec.c
>>> +++ b/security/integrity/ima/ima_kexec.c
>>> @@ -42,8 +42,8 @@ void ima_measure_kexec_event(const char *event_name)
>>> long len;
>>> int n;
>>>
>>> - buf_size = ima_get_binary_runtime_size(BINARY);
>>> - len = atomic_long_read(&ima_num_entries[BINARY]);
>>> + buf_size = ima_get_binary_runtime_size(BINARY_FULL);
>>> + len = atomic_long_read(&ima_num_entries[BINARY_FULL]);
>>>
>>> n = scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
>>> "kexec_segment_size=%lu;ima_binary_runtime_size=%lu;"
>>> @@ -106,13 +106,26 @@ static int ima_dump_measurement_list(unsigned long *buffer_size, void **buffer,
>>>
>>> memset(&khdr, 0, sizeof(khdr));
>>> khdr.version = 1;
>>> - /* This is an append-only list, no need to hold the RCU read lock */
>>> - list_for_each_entry_rcu(qe, &ima_measurements, later, true) {
>>> + /* It can race with ima_queue_stage() and ima_queue_delete_staged(). */
>>> + mutex_lock(&ima_extend_list_mutex);
>>> +
>>> + list_for_each_entry_rcu(qe, &ima_measurements_staged, later,
>>> + lockdep_is_held(&ima_extend_list_mutex)) {
>>> ret = ima_dump_measurement(&khdr, qe);
>>> if (ret < 0)
>>> break;
>>> }
>>>
>>> + list_for_each_entry_rcu(qe, &ima_measurements, later,
>>> + lockdep_is_held(&ima_extend_list_mutex)) {
>>> + if (!ret)
>>> + ret = ima_dump_measurement(&khdr, qe);
>>> + if (ret < 0)
>>> + break;
>>> + }
>>> +
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> +
>>> /*
>>> * fill in reserved space with some buffer details
>>> * (eg. version, buffer size, number of measurements)
>>> @@ -167,6 +180,7 @@ void ima_add_kexec_buffer(struct kimage *image)
>>> extra_memory = CONFIG_IMA_KEXEC_EXTRA_MEMORY_KB * 1024;
>>>
>>> binary_runtime_size = ima_get_binary_runtime_size(BINARY) +
>>> + ima_get_binary_runtime_size(BINARY_STAGED) +
>>> extra_memory;
>>>
>>> if (binary_runtime_size >= ULONG_MAX - PAGE_SIZE)
>>> diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
>>> index b6d10dceb669..50519ed837d4 100644
>>> --- a/security/integrity/ima/ima_queue.c
>>> +++ b/security/integrity/ima/ima_queue.c
>>> @@ -26,6 +26,7 @@
>>> static struct tpm_digest *digests;
>>>
>>> LIST_HEAD(ima_measurements); /* list of all measurements */
>>> +LIST_HEAD(ima_measurements_staged); /* list of staged measurements */
>>> #ifdef CONFIG_IMA_KEXEC
>>> static unsigned long binary_runtime_size[BINARY__LAST];
>>> #else
>>> @@ -45,11 +46,11 @@ atomic_long_t ima_num_violations = ATOMIC_LONG_INIT(0);
>>> /* key: inode (before secure-hashing a file) */
>>> struct hlist_head __rcu *ima_htable;
>>>
>>> -/* mutex protects atomicity of extending measurement list
>>> +/* mutex protects atomicity of extending and staging measurement list
>>> * and extending the TPM PCR aggregate. Since tpm_extend can take
>>> * long (and the tpm driver uses a mutex), we can't use the spinlock.
>>> */
>>> -static DEFINE_MUTEX(ima_extend_list_mutex);
>>> +DEFINE_MUTEX(ima_extend_list_mutex);
>>>
>>> /*
>>> * Used internally by the kernel to suspend measurements.
>>> @@ -174,12 +175,16 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
>>> lockdep_is_held(&ima_extend_list_mutex));
>>>
>>> atomic_long_inc(&ima_num_entries[BINARY]);
>>> + atomic_long_inc(&ima_num_entries[BINARY_FULL]);
>>> +
>>> if (update_htable) {
>>> key = ima_hash_key(entry->digests[ima_hash_algo_idx].digest);
>>> hlist_add_head_rcu(&qe->hnext, &htable[key]);
>>> }
>>>
>>> ima_update_binary_runtime_size(entry, BINARY);
>>> + ima_update_binary_runtime_size(entry, BINARY_FULL);
>>> +
>>> return 0;
>>> }
>>>
>>> @@ -280,6 +285,94 @@ int ima_add_template_entry(struct ima_template_entry *entry, int violation,
>>> return result;
>>> }
>>>
>>> +int ima_queue_stage(void)
>>> +{
>>> + int ret = 0;
>>> +
>>> + mutex_lock(&ima_extend_list_mutex);
>>> + if (!list_empty(&ima_measurements_staged)) {
>>> + ret = -EEXIST;
>>> + goto out_unlock;
>>> + }
>>> +
>>> + if (list_empty(&ima_measurements)) {
>>> + ret = -ENOENT;
>>> + goto out_unlock;
>>> + }
>>> +
>>> + list_replace(&ima_measurements, &ima_measurements_staged);
>>> + INIT_LIST_HEAD(&ima_measurements);
>>> +
>>> + atomic_long_set(&ima_num_entries[BINARY_STAGED],
>>> + atomic_long_read(&ima_num_entries[BINARY]));
>>> + atomic_long_set(&ima_num_entries[BINARY], 0);
>>> +
>>> + if (IS_ENABLED(CONFIG_IMA_KEXEC)) {
>>> + binary_runtime_size[BINARY_STAGED] =
>>> + binary_runtime_size[BINARY];
>>> + binary_runtime_size[BINARY] = 0;
>>> + }
>>> +out_unlock:
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> + return ret;
>>> +}
>>> +
>>> +static void ima_queue_delete(struct list_head *head);
>>> +
>>> +int ima_queue_staged_delete_all(void)
>>> +{
>>> + LIST_HEAD(ima_measurements_trim);
>>> +
>>> + mutex_lock(&ima_extend_list_mutex);
>>> + if (list_empty(&ima_measurements_staged)) {
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> + return -ENOENT;
>>> + }
>>> +
>>> + list_replace(&ima_measurements_staged, &ima_measurements_trim);
>>> + INIT_LIST_HEAD(&ima_measurements_staged);
>>> +
>>> + atomic_long_set(&ima_num_entries[BINARY_STAGED], 0);
>>> +
>>> + if (IS_ENABLED(CONFIG_IMA_KEXEC))
>>> + binary_runtime_size[BINARY_STAGED] = 0;
>>> +
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> +
>>> + ima_queue_delete(&ima_measurements_trim);
>>> + return 0;
>>> +}
>>> +
>>> +static void ima_queue_delete(struct list_head *head)
>>> +{
>>> + struct ima_queue_entry *qe, *qe_tmp;
>>> + unsigned int i;
>>> +
>>> + list_for_each_entry_safe(qe, qe_tmp, head, later) {
>>> + /*
>>> + * Safe to free template_data here without synchronize_rcu()
>>> + * because the only htable reader, ima_lookup_digest_entry(),
>>> + * accesses only entry->digests, not template_data. If new
>>> + * htable readers are added that access template_data, a
>>> + * synchronize_rcu() is required here.
>>> + */
>>> + for (i = 0; i < qe->entry->template_desc->num_fields; i++) {
>>> + kfree(qe->entry->template_data[i].data);
>>> + qe->entry->template_data[i].data = NULL;
>>> + qe->entry->template_data[i].len = 0;
>>> + }
>>> +
>>> + list_del(&qe->later);
>>> +
>>> + /* No leak if condition is false, referenced by ima_htable. */
>>> + if (IS_ENABLED(CONFIG_IMA_DISABLE_HTABLE)) {
>>> + kfree(qe->entry->digests);
>>> + kfree(qe->entry);
>>> + kfree(qe);
>>> + }
>>> + }
>>> +}
>>> +
>>> int ima_restore_measurement_entry(struct ima_template_entry *entry)
>>> {
>>> int result = 0;
^ permalink raw reply
* Re: [PATCH v4 11/13] ima: Support staging and deleting N measurements entries
From: steven chen @ 2026-04-01 17:48 UTC (permalink / raw)
To: Roberto Sassu, corbet, skhan, zohar, dmitry.kasatkin,
eric.snowberg, paul, jmorris, serge
Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
gregorylumen, nramas, Roberto Sassu, steven chen
In-Reply-To: <af6aa732b85af36e07e4a82b29170e80b13dc7c4.camel@huaweicloud.com>
On 3/27/2026 10:02 AM, Roberto Sassu wrote:
> On Thu, 2026-03-26 at 16:19 -0700, steven chen wrote:
>> On 3/26/2026 10:30 AM, Roberto Sassu wrote:
>>> From: Roberto Sassu <roberto.sassu@huawei.com>
>>>
>>> Add support for sending a value N between 1 and ULONG_MAX to the staging
>>> interface. This value represents the number of measurements that should be
>>> deleted from the current measurements list.
>>>
>>> This staging method allows the remote attestation agents to easily separate
>>> the measurements that were verified (staged and deleted) from those that
>>> weren't due to the race between taking a TPM quote and reading the
>>> measurements list.
>>>
>>> In order to minimize the locking time of ima_extend_list_mutex, deleting
>>> N entries is realized by staging the entire current measurements list
>>> (with the lock), by determining the N-th staged entry (without the lock),
>>> and by splicing the entries in excess back to the current measurements list
>>> (with the lock). Finally, the N entries are deleted (without the lock).
>>>
>>> Flushing the hash table is not supported for N entries, since it would
>>> require removing the N entries one by one from the hash table under the
>>> ima_extend_list_mutex lock, which would increase the locking time.
>>>
>>> The ima_extend_list_mutex lock is necessary in ima_dump_measurement_list()
>>> because ima_queue_staged_delete_partial() uses __list_cut_position() to
>>> modify ima_measurements_staged, for which no RCU-safe variant exists. For
>>> the staging with prompt flavor alone, list_replace_rcu() could have been
>>> used instead, but since both flavors share the same kexec serialization
>>> path, the mutex is required regardless.
>>>
>>> Link: https://github.com/linux-integrity/linux/issues/1
>>> Suggested-by: Steven Chen <chenste@linux.microsoft.com>
>>> Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
>>> ---
>>> security/integrity/ima/Kconfig | 3 ++
>>> security/integrity/ima/ima.h | 1 +
>>> security/integrity/ima/ima_fs.c | 22 +++++++++-
>>> security/integrity/ima/ima_queue.c | 70 ++++++++++++++++++++++++++++++
>>> 4 files changed, 95 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
>>> index e714726f3384..6ddb4e77bff5 100644
>>> --- a/security/integrity/ima/Kconfig
>>> +++ b/security/integrity/ima/Kconfig
>>> @@ -341,6 +341,9 @@ config IMA_STAGING
>>> It allows user space to stage the measurements list for deletion and
>>> to delete the staged measurements after confirmation.
>>>
>>> + Or, alternatively, it allows user space to specify N measurements
>>> + entries to be deleted.
>>> +
>>> On kexec, staging is reverted and staged measurements are prepended
>>> to the current measurements list when measurements are copied to the
>>> secondary kernel.
>>> diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
>>> index 699b735dec7d..de0693fce53c 100644
>>> --- a/security/integrity/ima/ima.h
>>> +++ b/security/integrity/ima/ima.h
>>> @@ -319,6 +319,7 @@ struct ima_template_desc *lookup_template_desc(const char *name);
>>> bool ima_template_has_modsig(const struct ima_template_desc *ima_template);
>>> int ima_queue_stage(void);
>>> int ima_queue_staged_delete_all(void);
>>> +int ima_queue_staged_delete_partial(unsigned long req_value);
>>> int ima_restore_measurement_entry(struct ima_template_entry *entry);
>>> int ima_restore_measurement_list(loff_t bufsize, void *buf);
>>> int ima_measurements_show(struct seq_file *m, void *v);
>>> diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
>>> index 39d9128e9f22..eb3f343c1138 100644
>>> --- a/security/integrity/ima/ima_fs.c
>>> +++ b/security/integrity/ima/ima_fs.c
>>> @@ -28,6 +28,7 @@
>>> * Requests:
>>> * 'A\n': stage the entire measurements list
>>> * 'D\n': delete all staged measurements
>>> + * '[1, ULONG_MAX]\n' delete N measurements entries
>>> */
>>> #define STAGED_REQ_LENGTH 21
>>>
>>> @@ -319,6 +320,7 @@ static ssize_t ima_measurements_staged_write(struct file *file,
>>> size_t datalen, loff_t *ppos)
>>> {
>>> char req[STAGED_REQ_LENGTH];
>>> + unsigned long req_value;
>>> int ret;
>>>
>>> if (*ppos > 0 || datalen < 2 || datalen > STAGED_REQ_LENGTH)
>>> @@ -346,7 +348,25 @@ static ssize_t ima_measurements_staged_write(struct file *file,
>>> ret = ima_queue_staged_delete_all();
>>> break;
>>> default:
>>> - ret = -EINVAL;
>>> + if (ima_flush_htable) {
>>> + pr_debug("Deleting staged N measurements not supported when flushing the hash table is requested\n");
>>> + return -EINVAL;
>>> + }
>>> +
>>> + ret = kstrtoul(req, 10, &req_value);
>>> + if (ret < 0)
>>> + return ret;
>>> +
>>> + if (req_value == 0) {
>>> + pr_debug("Must delete at least one entry\n");
>>> + return -EINVAL;
>>> + }
>>> +
>>> + ret = ima_queue_stage();
>>> + if (ret < 0)
>>> + return ret;
>>> +
>>> + ret = ima_queue_staged_delete_partial(req_value);
>> The default processing is "Trim N" idea plus performance improvement.
>>
>> Here do everything in one time. And this is what I said in v3.
>>
>> [PATCH v3 1/3] ima: Remove ima_h_table structure
>> <https://lore.kernel.org/linux-integrity/c61aeaa79929a98cb3a6d30835972891fac3570f.camel@linux.ibm.com/T/#t>
> In your approach you do:
>
> lock ima_extend_measure_list_mutex
> scan entries until N
> cut list staged -> trim
> unlock ima_extend_measure_list_mutex
>
>
> In my approach I do:
> lock ima_extend_measure_list_mutex
> list replace active -> staged
> unlock ima_extend_measure_list_mutex
>
> scan entries until N
>
> lock ima_extend_measure_list_mutex
> cut list staged -> trim
> splice staged ->active
> unlock ima_extend_measure_list_mutex
>
> So, I guess if you refer to less user space locking time, you mean one
> lock/unlock and one list replace + list splice less in your solution.
>
> In exchange, you propose to hold the lock in the kernel while scanning
> N. I think it is a significant increase of kernel locking time vs a
> negligible increase of user space locking time (in the kernel, all
> processes need to wait for the ima_extend_measure_list_mutex to be
> released, in user space it is just the agent waiting).
>
> Roberto
Please the version 5:
[PATCH v5 0/3] Trim N entries of IMA event logs
<https://lore.kernel.org/linux-integrity/20260401172956.4581-1-chenste@linux.microsoft.com/T/#t>
Scanning N is moved out of the lock period.
"Trim N" proposal has less lock time than "Staging and deleting" proposal.
"Trim N" proposal has much less code than "Staging and deleting" proposal.
"Trim N" proposal user space operation is more simple than "Staging and
deleting".
Steven
>> The important two parts of trimming is "trim N" and performance improvement.
>>
>> The performance improvement include two parts:
>> hash table staging
>> active log list staging
>>
>> And I think "Trim N" plus performance improvement is the right direction
>> to go.
>> Lots of code for two steps "stage and trim" "stage" part can be removed.
>>
>> Also race condition may happen if not holding the list all time in user
>> space
>> during attestation period: from stage, read list, attestation and trimming.
>>
>> So in order to improve the above user space lock time, "Trim T:N" can be
>> used
>> not to hold list long in user space during attestation.
>>
>> For Trim T:N, T represent total log trimmed since system boot up. Please
>> refer to
>> https://lore.kernel.org/linux-integrity/20260205235849.7086-1-chenste@linux.microsoft.com/T/#t
>>
>> Thanks,
>>
>> Steven
>>> }
>>>
>>> if (ret < 0)
>>> diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
>>> index f5c18acfbc43..4fb557d61a88 100644
>>> --- a/security/integrity/ima/ima_queue.c
>>> +++ b/security/integrity/ima/ima_queue.c
>>> @@ -371,6 +371,76 @@ int ima_queue_staged_delete_all(void)
>>> return 0;
>>> }
>>>
>>> +int ima_queue_staged_delete_partial(unsigned long req_value)
>>> +{
>>> + unsigned long req_value_copy = req_value;
>>> + unsigned long size_to_remove = 0, num_to_remove = 0;
>>> + struct list_head *cut_pos = NULL;
>>> + LIST_HEAD(ima_measurements_trim);
>>> + struct ima_queue_entry *qe;
>>> + int ret = 0;
>>> +
>>> + /*
>>> + * Safe walk (no concurrent write), not under ima_extend_list_mutex
>>> + * for performance reasons.
>>> + */
>>> + list_for_each_entry(qe, &ima_measurements_staged, later) {
>>> + size_to_remove += get_binary_runtime_size(qe->entry);
>>> + num_to_remove++;
>>> +
>>> + if (--req_value_copy == 0) {
>>> + /* qe->later always points to a valid list entry. */
>>> + cut_pos = &qe->later;
>>> + break;
>>> + }
>>> + }
>>> +
>>> + /* Nothing to remove, undoing staging. */
>>> + if (req_value_copy > 0) {
>>> + size_to_remove = 0;
>>> + num_to_remove = 0;
>>> + ret = -ENOENT;
>>> + }
>>> +
>>> + mutex_lock(&ima_extend_list_mutex);
>>> + if (list_empty(&ima_measurements_staged)) {
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> + return -ENOENT;
>>> + }
>>> +
>>> + if (cut_pos != NULL)
>>> + /*
>>> + * ima_dump_measurement_list() does not modify the list,
>>> + * cut_pos remains the same even if it was computed before
>>> + * the lock.
>>> + */
>>> + __list_cut_position(&ima_measurements_trim,
>>> + &ima_measurements_staged, cut_pos);
>>> +
>>> + atomic_long_sub(num_to_remove, &ima_num_entries[BINARY_STAGED]);
>>> + atomic_long_add(atomic_long_read(&ima_num_entries[BINARY_STAGED]),
>>> + &ima_num_entries[BINARY]);
>>> + atomic_long_set(&ima_num_entries[BINARY_STAGED], 0);
>>> +
>>> + if (IS_ENABLED(CONFIG_IMA_KEXEC)) {
>>> + binary_runtime_size[BINARY_STAGED] -= size_to_remove;
>>> + binary_runtime_size[BINARY] +=
>>> + binary_runtime_size[BINARY_STAGED];
>>> + binary_runtime_size[BINARY_STAGED] = 0;
>>> + }
>>> +
>>> + /*
>>> + * Splice (prepend) any remaining non-deleted staged entries to the
>>> + * active list (RCU not needed, there cannot be concurrent readers).
>>> + */
>>> + list_splice(&ima_measurements_staged, &ima_measurements);
>>> + INIT_LIST_HEAD(&ima_measurements_staged);
>>> + mutex_unlock(&ima_extend_list_mutex);
>>> +
>>> + ima_queue_delete(&ima_measurements_trim, false);
>>> + return ret;
>>> +}
>>> +
>>> static void ima_queue_delete(struct list_head *head, bool flush_htable)
>>> {
>>> struct ima_queue_entry *qe, *qe_tmp;
^ permalink raw reply
* [PATCH v5 3/3] ima: add new critical data record to measure log trim
From: steven chen @ 2026-04-01 17:29 UTC (permalink / raw)
To: linux-integrity
Cc: zohar, roberto.sassu, dmitry.kasatkin, eric.snowberg, corbet,
serge, paul, jmorris, linux-security-module, anirudhve, chenste,
gregorylumen, nramas, sushring, linux-doc
In-Reply-To: <20260401172956.4581-1-chenste@linux.microsoft.com>
Add a new critical data record to measure the trimming event when
ima event records are deleted since system boot up.
If all IMA event logs are saved in the userspace, use this log to get total
numbers of records deleted since system boot up at that point.
Signed-off-by: steven chen <chenste@linux.microsoft.com>
---
security/integrity/ima/ima_fs.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index 8e26e0f34311..38d0a49b587f 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -43,6 +43,7 @@ static int valid_policy = 1;
#define IMA_LOG_TRIM_REQ_NUM_LENGTH 15
#define IMA_LOG_TRIM_REQ_TOTAL_LENGTH 32
+#define IMA_LOG_TRIM_EVENT_LEN 256
atomic_long_t ima_number_entries = ATOMIC_LONG_INIT(0);
static long trimcount;
/* mutex protects atomicity of trimming measurement list
@@ -52,6 +53,22 @@ static long trimcount;
static DEFINE_MUTEX(ima_measure_lock);
static long ima_measure_users;
+static void ima_measure_trim_event(void)
+{
+ char ima_log_trim_event[IMA_LOG_TRIM_EVENT_LEN];
+ struct timespec64 ts;
+ u64 time_ns;
+ int n;
+
+ ktime_get_real_ts64(&ts);
+ time_ns = (u64)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
+ n = scnprintf(ima_log_trim_event, IMA_LOG_TRIM_EVENT_LEN,
+ "time= %llu; number= %lu;", time_ns, trimcount);
+
+ ima_measure_critical_data("ima_log_trim", "trim ima event logs",
+ ima_log_trim_event, n, false, NULL, 0);
+}
+
static ssize_t ima_show_htable_value(char __user *buf, size_t count,
loff_t *ppos, atomic_long_t *val)
{
@@ -436,6 +453,9 @@ static ssize_t ima_log_trim_write(struct file *file,
if (ret < 0)
goto out;
+ if (ret > 0)
+ ima_measure_trim_event();
+
trimcount += ret;
ret = datalen;
--
2.43.0
^ permalink raw reply related
* [PATCH v5 2/3] ima: trim N IMA event log records
From: steven chen @ 2026-04-01 17:29 UTC (permalink / raw)
To: linux-integrity
Cc: zohar, roberto.sassu, dmitry.kasatkin, eric.snowberg, corbet,
serge, paul, jmorris, linux-security-module, anirudhve, chenste,
gregorylumen, nramas, sushring, linux-doc
In-Reply-To: <20260401172956.4581-1-chenste@linux.microsoft.com>
Trim N entries of the IMA event logs. Do not clean the hash table.
The values saved in hash table were already used.
Provide a userspace interface ima_trim_log:
When read this interface, it returns total number T of entries trimmed
since system boot up.
When write to this interface need to provide two numbers T:N to let
kernel to trim N entries of IMA event logs.
Kernel measurement list lock time performance improvement by not
clean the hash table.
when kernel get log trim request T:N
- Get the T, compare with the total trimmed number
- if equal, then do trim N and change T to T+N
- else return error
Signed-off-by: steven chen <chenste@linux.microsoft.com>
---
.../admin-guide/kernel-parameters.txt | 4 +
security/integrity/ima/ima.h | 4 +-
security/integrity/ima/ima_fs.c | 198 +++++++++++++++++-
security/integrity/ima/ima_kexec.c | 2 +-
security/integrity/ima/ima_queue.c | 96 +++++++++
5 files changed, 296 insertions(+), 8 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index e92c0056e4e0..cd1a1d0bf0e2 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2197,6 +2197,10 @@
Use the canonical format for the binary runtime
measurements, instead of host native format.
+ ima_flush_htable [IMA]
+ Flush the measurement list hash table when trim all
+ or a part of it for deletion.
+
ima_hash= [IMA]
Format: { md5 | sha1 | rmd160 | sha256 | sha384
| sha512 | ... }
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index e3d71d8d56e3..5cbee3a295a0 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -243,11 +243,13 @@ void ima_post_key_create_or_update(struct key *keyring, struct key *key,
const void *payload, size_t plen,
unsigned long flags, bool create);
#endif
-
+extern atomic_long_t ima_number_entries;
#ifdef CONFIG_IMA_KEXEC
void ima_measure_kexec_event(const char *event_name);
+long ima_delete_event_log(long req_val);
#else
static inline void ima_measure_kexec_event(const char *event_name) {}
+static inline long ima_delete_event_log(long req_val) { return 0; }
#endif
/*
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index 87045b09f120..8e26e0f34311 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -21,6 +21,9 @@
#include <linux/rcupdate.h>
#include <linux/parser.h>
#include <linux/vmalloc.h>
+#include <linux/ktime.h>
+#include <linux/timekeeping.h>
+#include <linux/ima.h>
#include "ima.h"
@@ -38,6 +41,17 @@ __setup("ima_canonical_fmt", default_canonical_fmt_setup);
static int valid_policy = 1;
+#define IMA_LOG_TRIM_REQ_NUM_LENGTH 15
+#define IMA_LOG_TRIM_REQ_TOTAL_LENGTH 32
+atomic_long_t ima_number_entries = ATOMIC_LONG_INIT(0);
+static long trimcount;
+/* mutex protects atomicity of trimming measurement list
+ * and also protects atomicity the measurement list read
+ * write operation.
+ */
+static DEFINE_MUTEX(ima_measure_lock);
+static long ima_measure_users;
+
static ssize_t ima_show_htable_value(char __user *buf, size_t count,
loff_t *ppos, atomic_long_t *val)
{
@@ -64,8 +78,7 @@ static ssize_t ima_show_measurements_count(struct file *filp,
char __user *buf,
size_t count, loff_t *ppos)
{
- return ima_show_htable_value(buf, count, ppos, &ima_htable.len);
-
+ return ima_show_htable_value(buf, count, ppos, &ima_number_entries);
}
static const struct file_operations ima_measurements_count_ops = {
@@ -202,16 +215,77 @@ static const struct seq_operations ima_measurments_seqops = {
.show = ima_measurements_show
};
+/*
+ * _ima_measurements_open - open the IMA measurements file
+ * @inode: inode of the file being opened
+ * @file: file being opened
+ * @seq_ops: sequence operations for the file
+ *
+ * Returns 0 on success, or negative error code.
+ * Implements mutual exclusion between readers and writer
+ * of the measurements file. Multiple readers are allowed,
+ * but writer get exclusive access only no other readers/writers.
+ * Readers is not allowed when there is a writer.
+ */
+static int _ima_measurements_open(struct inode *inode, struct file *file,
+ const struct seq_operations *seq_ops)
+{
+ bool write = !!(file->f_mode & FMODE_WRITE);
+ int ret;
+
+ if (write && !capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ mutex_lock(&ima_measure_lock);
+ if ((write && ima_measure_users != 0) ||
+ (!write && ima_measure_users < 0)) {
+ mutex_unlock(&ima_measure_lock);
+ return -EBUSY;
+ }
+
+ ret = seq_open(file, seq_ops);
+ if (ret < 0) {
+ mutex_unlock(&ima_measure_lock);
+ return ret;
+ }
+
+ if (write)
+ ima_measure_users--;
+ else
+ ima_measure_users++;
+
+ mutex_unlock(&ima_measure_lock);
+ return ret;
+}
+
static int ima_measurements_open(struct inode *inode, struct file *file)
{
- return seq_open(file, &ima_measurments_seqops);
+ return _ima_measurements_open(inode, file, &ima_measurments_seqops);
+}
+
+static int ima_measurements_release(struct inode *inode, struct file *file)
+{
+ bool write = !!(file->f_mode & FMODE_WRITE);
+ int ret;
+
+ mutex_lock(&ima_measure_lock);
+ ret = seq_release(inode, file);
+ if (!ret) {
+ if (!write)
+ ima_measure_users--;
+ else
+ ima_measure_users++;
+ }
+
+ mutex_unlock(&ima_measure_lock);
+ return ret;
}
static const struct file_operations ima_measurements_ops = {
.open = ima_measurements_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = seq_release,
+ .release = ima_measurements_release,
};
void ima_print_digest(struct seq_file *m, u8 *digest, u32 size)
@@ -279,14 +353,114 @@ static const struct seq_operations ima_ascii_measurements_seqops = {
static int ima_ascii_measurements_open(struct inode *inode, struct file *file)
{
- return seq_open(file, &ima_ascii_measurements_seqops);
+ return _ima_measurements_open(inode, file, &ima_ascii_measurements_seqops);
}
static const struct file_operations ima_ascii_measurements_ops = {
.open = ima_ascii_measurements_open,
.read = seq_read,
.llseek = seq_lseek,
- .release = seq_release,
+ .release = ima_measurements_release,
+};
+
+static int ima_log_trim_open(struct inode *inode, struct file *file)
+{
+ bool write = !!(file->f_mode & FMODE_WRITE);
+
+ if (!write && capable(CAP_SYS_ADMIN))
+ return 0;
+ else if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ return _ima_measurements_open(inode, file, &ima_measurments_seqops);
+}
+
+static ssize_t ima_log_trim_read(struct file *file, char __user *buf, size_t size, loff_t *ppos)
+{
+ char tmpbuf[IMA_LOG_TRIM_REQ_NUM_LENGTH];
+ ssize_t len;
+
+ len = scnprintf(tmpbuf, sizeof(tmpbuf), "%li\n", trimcount);
+ return simple_read_from_buffer(buf, size, ppos, tmpbuf, len);
+}
+
+static ssize_t ima_log_trim_write(struct file *file,
+ const char __user *buf, size_t datalen, loff_t *ppos)
+{
+ char tmpbuf[IMA_LOG_TRIM_REQ_TOTAL_LENGTH];
+ char *p = tmpbuf;
+ long count, ret, val = 0, max = LONG_MAX;
+
+ if (*ppos > 0 || datalen > IMA_LOG_TRIM_REQ_TOTAL_LENGTH || datalen < 2) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ if (copy_from_user(tmpbuf, buf, datalen) != 0) {
+ ret = -EFAULT;
+ goto out;
+ }
+
+ p = tmpbuf;
+
+ while (*p && *p != ':') {
+ if (!isdigit((unsigned char)*p))
+ return -EINVAL;
+
+ /* digit value */
+ int d = *p - '0';
+
+ /* overflow check: val * 10 + d > max -> (val > (max - d) / 10) */
+ if (val > (max - d) / 10)
+ return -ERANGE;
+
+ val = val * 10 + d;
+ p++;
+ }
+
+ if (*p != ':')
+ return -EINVAL;
+
+ /* verify trim count matches */
+ if (val != trimcount)
+ return -EINVAL;
+
+ p++; /* skip ':' */
+ ret = kstrtoul(p, 0, &count);
+
+ if (ret < 0)
+ goto out;
+
+ ret = ima_delete_event_log(count);
+
+ if (ret < 0)
+ goto out;
+
+ trimcount += ret;
+
+ ret = datalen;
+out:
+ return ret;
+}
+
+static int ima_log_trim_release(struct inode *inode, struct file *file)
+{
+ bool write = !!(file->f_mode & FMODE_WRITE);
+
+ if (!write && capable(CAP_SYS_ADMIN))
+ return 0;
+ else if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+
+ return ima_measurements_release(inode, file);
+}
+
+static const struct file_operations ima_log_trim_ops = {
+ .open = ima_log_trim_open,
+ .read = ima_log_trim_read,
+ .write = ima_log_trim_write,
+ .llseek = generic_file_llseek,
+ .release = ima_log_trim_release
};
static ssize_t ima_read_policy(char *path)
@@ -528,6 +702,18 @@ int __init ima_fs_init(void)
goto out;
}
+ if (IS_ENABLED(CONFIG_IMA_LOG_TRIMMING)) {
+ dentry = securityfs_create_file("ima_trim_log",
+ S_IRUSR | S_IRGRP | S_IWUSR | S_IWGRP,
+ ima_dir, NULL, &ima_log_trim_ops);
+ if (IS_ERR(dentry)) {
+ ret = PTR_ERR(dentry);
+ goto out;
+ }
+ }
+
+ trimcount = 0;
+
dentry = securityfs_create_file("runtime_measurements_count",
S_IRUSR | S_IRGRP, ima_dir, NULL,
&ima_measurements_count_ops);
diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
index 7362f68f2d8b..bee997683e03 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -41,7 +41,7 @@ void ima_measure_kexec_event(const char *event_name)
int n;
buf_size = ima_get_binary_runtime_size();
- len = atomic_long_read(&ima_htable.len);
+ len = atomic_long_read(&ima_number_entries);
n = scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
"kexec_segment_size=%lu;ima_binary_runtime_size=%lu;"
diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
index 590637e81ad1..07225e19b9b5 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -22,6 +22,14 @@
#define AUDIT_CAUSE_LEN_MAX 32
+bool ima_flush_htable;
+static int __init ima_flush_htable_setup(char *str)
+{
+ ima_flush_htable = true;
+ return 1;
+}
+__setup("ima_flush_htable", ima_flush_htable_setup);
+
/* pre-allocated array of tpm_digest structures to extend a PCR */
static struct tpm_digest *digests;
@@ -114,6 +122,7 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
list_add_tail_rcu(&qe->later, &ima_measurements);
atomic_long_inc(&ima_htable.len);
+ atomic_long_inc(&ima_number_entries);
if (update_htable) {
key = ima_hash_key(entry->digests[ima_hash_algo_idx].digest);
hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
@@ -220,6 +229,93 @@ int ima_add_template_entry(struct ima_template_entry *entry, int violation,
return result;
}
+/**
+ * ima_delete_event_log - delete IMA event entry
+ * @num_records: number of records to delete
+ *
+ * delete num_records entries off the measurement list.
+ * Returns num_records, or negative error code.
+ */
+long ima_delete_event_log(long num_records)
+{
+ long len, cur = num_records, tmp_len = 0;
+ struct ima_queue_entry *qe, *qe_tmp;
+ LIST_HEAD(ima_measurements_to_delete);
+ struct list_head *list_ptr;
+
+ if (!IS_ENABLED(CONFIG_IMA_LOG_TRIMMING))
+ return -EOPNOTSUPP;
+
+ if (num_records <= 0)
+ return num_records;
+
+ list_ptr = &ima_measurements;
+
+ len = atomic_long_read(&ima_number_entries);
+
+ if (num_records <= len) {
+ list_for_each_entry(qe, list_ptr, later) {
+ if (cur > 0) {
+ tmp_len += get_binary_runtime_size(qe->entry);
+ --cur;
+ }
+ if (cur == 0) {
+ qe_tmp = qe;
+ break;
+ }
+ }
+ }
+ else {
+ return -ENOENT;
+ }
+
+
+ mutex_lock(&ima_extend_list_mutex);
+ len = atomic_long_read(&ima_number_entries);
+
+ if (num_records == len) {
+ list_replace(&ima_measurements, &ima_measurements_to_delete);
+ INIT_LIST_HEAD(&ima_measurements);
+ atomic_long_set(&ima_number_entries, 0);
+ list_ptr = &ima_measurements_to_delete;
+ }
+ else {
+ __list_cut_position(&ima_measurements_to_delete, &ima_measurements,
+ &qe_tmp->later);
+ atomic_long_sub(num_records, &ima_number_entries);
+ if (IS_ENABLED(CONFIG_IMA_KEXEC))
+ binary_runtime_size -= tmp_len;
+ }
+
+ mutex_unlock(&ima_extend_list_mutex);
+
+ if (ima_flush_htable)
+ synchronize_rcu();
+
+ list_for_each_entry_safe(qe, qe_tmp, &ima_measurements_to_delete, later) {
+ /*
+ * Ok because after list delete qe is only accessed by
+ * ima_lookup_digest_entry().
+ */
+ for (int i = 0; i < qe->entry->template_desc->num_fields; i++) {
+ kfree(qe->entry->template_data[i].data);
+ qe->entry->template_data[i].data = NULL;
+ qe->entry->template_data[i].len = 0;
+ }
+
+ list_del(&qe->later);
+
+ /* No leak if !ima_flush_htable, referenced by ima_htable. */
+ if (ima_flush_htable) {
+ kfree(qe->entry->digests);
+ kfree(qe->entry);
+ kfree(qe);
+ }
+ }
+
+ return num_records;
+}
+
int ima_restore_measurement_entry(struct ima_template_entry *entry)
{
int result = 0;
--
2.43.0
^ permalink raw reply related
* [PATCH v5 1/3] ima: make ima event log trimming configurable
From: steven chen @ 2026-04-01 17:29 UTC (permalink / raw)
To: linux-integrity
Cc: zohar, roberto.sassu, dmitry.kasatkin, eric.snowberg, corbet,
serge, paul, jmorris, linux-security-module, anirudhve, chenste,
gregorylumen, nramas, sushring, linux-doc
In-Reply-To: <20260401172956.4581-1-chenste@linux.microsoft.com>
Make ima event log trimming function configurable.
Suggested-by: Mimi Zohar <zohar@linux.ibm.com>
Signed-off-by: steven chen <chenste@linux.microsoft.com>
---
security/integrity/ima/Kconfig | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/security/integrity/ima/Kconfig b/security/integrity/ima/Kconfig
index 976e75f9b9ba..322964ae4772 100644
--- a/security/integrity/ima/Kconfig
+++ b/security/integrity/ima/Kconfig
@@ -332,4 +332,16 @@ config IMA_KEXEC_EXTRA_MEMORY_KB
If set to the default value of 0, an extra half page of memory for those
additional measurements will be allocated.
+config IMA_LOG_TRIMMING
+ bool "IMA Event Log Trimming"
+ default n
+ help
+ Say Y here if you want support for IMA Event Log Trimming.
+ This creates the file /sys/kernel/security/integrity/ima/ima_trim_log.
+ Userspace
+ - writes to this file to trigger IMA event log trimming
+ - reads this file to get number of entried trimming last time
+
+ If unsure, say N.
+
endif
--
2.43.0
^ permalink raw reply related
* [PATCH v5 0/3] Trim N entries of IMA event logs
From: steven chen @ 2026-04-01 17:29 UTC (permalink / raw)
To: linux-integrity
Cc: zohar, roberto.sassu, dmitry.kasatkin, eric.snowberg, corbet,
serge, paul, jmorris, linux-security-module, anirudhve, chenste,
gregorylumen, nramas, sushring, linux-doc
The Integrity Measurement Architecture (IMA) maintains a measurement list
—a record of system events used for integrity verification. The IMA event
logs are the entries within this measurement list, each representing a
specific event or measurement that contributes to the system's integrity
assessment.
This update introduces the ability to trim, or remove, N entries from the
current measurement list. Trimming involves deleting N entries from the
list. This action atomically truncates the measurement list, ensuring that
no new measurements can be added until the operation is complete.
Importantly, only one writer can initiate this trimming process at a time,
maintaining consistency and preventing race conditions.
A userspace interface, ima_trim_log, has been provided for this purpose.
When this interface is read, it returns the total number T of entries
trimmed since system boot up. This value T need to be preserved across
kexec soft reboots. By writing two number T:N to this interface, userspace
can request the kernel to trim N entries from the IMA event logs.
To maintain a complete record, userspace is responsible for concatenating
and storing the logs before initiating trimming. Userspace can then send
the collected data to remote verifiers for validation. After receiving
confirmation from the remote verifiers, userspace may instruct the kernel
to proceed with trimming the IMA event logs accordingly.
The primary benefit of this solution is the ability to free valuable
kernel memory by delegating the task of reconstructing the full
measurement list from log chunks to userspace. Trust is not required in
userspace for the integrity of the measurement list, as its integrity is
cryptographically protected by the Trusted Platform Module (TPM).
Multiple readers are allowed to access the ima_trim_log interface
concurrently, while only one writer can trigger log trimming at any time.
During trimming, readers do not see the list and cannot access it while
deletion is in progress, ensuring atomicity.
Introduce the new kernel option ima_flush_htable to decide whether or not
the digests of measurement entries are flushed from the hash table (from
reference [2]).
The ima_measure_users counter (protected by the ima_measure_lock mutex) has
been introduced to protect access to the measurement list part. The open
method of all the measurement interfaces has been extended to allow only
one writer at a time or, in alternative, multiple readers. The write
permission is used to stage/delete the measurements, the read permission
to read them. Write requires also the CAP_SYS_ADMIN capability (from
reference [2]). This ima_measure_users needs to be preserved across kexec
soft reboots
The total trimmed number T and the ima_measure_users both need to be
preserved across kexec soft reboot and new patch will be added for this
purpose in next version.
New IMA log trim event is added when trimming finish.
The time required for trimming is minimal, and IMA event logs are briefly
on hold during this process, preventing read or add operations. This short
interruption has no impact on the overall functionality of IMA.
A new critical data record "ima_log_trim" is added in this solution. This
record logs the trim event with number of entries deleted total T since
system start and time when this happened. User space can get the total
number T of entries trimmed by checking "ima_log_trim" event in the
measurement list.
The following are how user space to use the measurement list and
ima_log_trim interface
1. get the total numer trimmed T through "ima_log_trim" interface
2. get the PCR quote
3. read the measurement list file, close the file, send for verification
4. wait for response from verifier, until get the good response from
verifier with number N that matched the PCR quote got in step 2
5. get the number N from the above message
6. write the T:N to the ima_log_trim interface when no conflict
when kernel get log trim request T:N
Get the T, compare with the total trimmed number
if equal, then do trim N and change T to T+N
else return error
Using above way to trim the log, the time for user space to hold the list
will be trimming T:N operation itself at the step 6. User space agent
race condition is solved too in this way.
References:
-----------
[1] [PATCH 0/1] Trim N entries of IMA event logs
https://lore.kernel.org/linux-integrity/20251202232857.8211-1-chenste@linux.microsoft.com/T/#t
[2] [RFC][PATCH] ima: Add support for staging measurements for deletion
https://lore.kernel.org/linux-integrity/207fd6d7-53c-57bb-36d8-13a0902052d1@linux.microsoft.com/T/#t
[3] [PATCH v2 0/1] Trim N entries of IMA event logs
https://lore.kernel.org/linux-integrity/20251210235314.3341-1-chenste@linux.microsoft.com/T/#t
[4] [PATCH v3 0/3] Trim N entries of IMA event logs
https://lore.kernel.org/linux-integrity/20260106020713.3994-1-chenste@linux.microsoft.com/T/#t
[5] [PATCH v4 0/3] Trim N entries of IMA event logs
https://lore.kernel.org/linux-integrity/20260205235849.7086-1-chenste@linux.microsoft.com/T/#t
Change Log v5:
- lock time performance improvement
- Keep hash table unchanged because log already use the hash value
- Updated patch descriptions as necessary.
Change Log v4:
- Incorporated feedback from Roberto on v3 series.
- Update "ima_log_trim" interface definition
When read this interface, return total number of records trimmed T
need to write T:N to this interface to trim N records
- Update user space use case on how to trim IMA event logs
- Updated patch descriptions as necessary.
Change Log v3:
- Incorporated feedback from Mimi on v2 series.
- split patch into multiple patches
- lock time performance improvement
- Updated patch descriptions as necessary.
Change Log v2:
- Incorporated feedback from the Roberto on v1 series.
- Adapted code from Roberto's RFC [Reference 2]
- Add IMA log trim event log to record trim event
- Updated patch descriptions as necessary
steven chen (3):
ima: make ima event log trimming configurable
ima: trim N IMA event log records
ima: add new critical data record to measure log trim
.../admin-guide/kernel-parameters.txt | 4 +
security/integrity/ima/Kconfig | 12 +
security/integrity/ima/ima.h | 4 +-
security/integrity/ima/ima_fs.c | 218 +++++++++++++++++-
security/integrity/ima/ima_kexec.c | 2 +-
security/integrity/ima/ima_queue.c | 96 ++++++++
6 files changed, 328 insertions(+), 8 deletions(-)
--
2.43.0
^ permalink raw reply
* Re: [PATCH] landlock: Document fallocate(2) as another truncation corner case
From: Günther Noack @ 2026-04-01 17:13 UTC (permalink / raw)
To: Mickaël Salaün; +Cc: linux-security-module
In-Reply-To: <20260401.oor1chahp2oF@digikod.net>
On Wed, Apr 01, 2026 at 06:30:28PM +0200, Mickaël Salaün wrote:
> On Wed, Apr 01, 2026 at 05:09:10PM +0200, Günther Noack wrote:
> > Reinforce the already stated policy that LANDLOCK_ACCESS_FS_TRUNCATE should
> > always go hand in hand with LANDLOCK_ACCESS_FS_WRITE_FILE, as their
> > meanings and enforcement overlap in counterintuitive ways.
> >
> > On many common file systems, fallocate(2) offers a way to shorten files as
> > long as the file is opened for writing, side-stepping the
> > LANDLOCK_ACCESS_FS_TRUNCATE right.
> >
> > Assisted-by: Gemini-CLI:gemini-3.1
> > Signed-off-by: Günther Noack <gnoack@google.com>
> > ---
> > Documentation/userspace-api/landlock.rst | 8 ++++++--
> > 1 file changed, 6 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst
> > index 7f86d7a37dc2..d5691ec136cc 100644
> > --- a/Documentation/userspace-api/landlock.rst
> > +++ b/Documentation/userspace-api/landlock.rst
> > @@ -378,8 +378,8 @@ Truncating files
> >
> > The operations covered by ``LANDLOCK_ACCESS_FS_WRITE_FILE`` and
> > ``LANDLOCK_ACCESS_FS_TRUNCATE`` both change the contents of a file and sometimes
> > -overlap in non-intuitive ways. It is recommended to always specify both of
> > -these together.
> > +overlap in non-intuitive ways. It is strongly recommended to always specify
> > +both of these together (either granting both, or granting none).
> >
> > A particularly surprising example is :manpage:`creat(2)`. The name suggests
> > that this system call requires the rights to create and write files. However,
> > @@ -391,6 +391,10 @@ It should also be noted that truncating files does not require the
> > system call, this can also be done through :manpage:`open(2)` with the flags
> > ``O_RDONLY | O_TRUNC``.
> >
> > +At the same time, on some filesystems, :manpage:`fallocate(2)` offers a way to
> > +shorten file contents with ``FALLOC_FL_COLLAPSE_RANGE`` when the file is opened
> > +for writing, sidestepping the ``LANDLOCK_ACCESS_FS_TRUNCATE`` right.
>
> Interesting, which filesystems? Shouldn't it be fixed in the code
> instead?
It works on ext4, and I also see mentions of FALLOC_FL_COLLAPSE_RANGE
in XFS, F2FS, SMB and NTFS3.
I should mention, it is not *exactly* the same as a truncation, but
you can remove a chunk of the file from the middle, which also leads
to a shorter file. For example, assuming a block size of 1024:
1. Make a file with 2*1024 bytes: 1024*'A', then 1024*'B'
2. fallocate(collapse range, 0, 1024)
Resulting file is 1024*'B', and the file is shortened to 1024 bytes.
So this is not *exactly* a truncation. (The man page says that an
attempt to remove the end of a file results in EINVAL, so you have to
take it from the middle, and it needs to align with block boundaries.)
But it's quite similar, also shortens the file, and it does not
require the Landlock truncation access right.
I agree, another way would potentially be to call the LSM ftruncate
hook. I suspect this would stay compatible with other LSMs, because
the LSM ftruncate hook is a relatively recent addition (but have not
checked in detail).
The implementation of fallocate is vfs_fallocate() in fs/open.c - I
only had a tentative look now; it checks that the file->f_mode is open
for writing and calls security_file_permission() with MAY_WRITE.
I always saw LANDLOCK_ACCESS_FS_WRITE_FILE and
LANDLOCK_ACCESS_FS_TRUNCATE as rights that should always go together,
so I suspect that it does not make a big difference in practice, and
that is why I am suggesting to just document it more clearly for now.
—Günther
^ permalink raw reply
* Re: [RFC PATCH v1 02/11] security: Add LSM_AUDIT_DATA_NS for namespace audit records
From: Mickaël Salaün @ 2026-04-01 16:38 UTC (permalink / raw)
To: Christian Brauner
Cc: Günther Noack, Paul Moore, Serge E . Hallyn, Justin Suess,
Lennart Poettering, Mikhail Ivanov, Nicolas Bouchinet,
Shervin Oloumi, Tingmao Wang, kernel-team, linux-fsdevel,
linux-kernel, linux-security-module
In-Reply-To: <20260325-fachgebiet-parzelle-8ae15e6f305f@brauner>
On Wed, Mar 25, 2026 at 01:32:42PM +0100, Christian Brauner wrote:
> On Thu, Mar 12, 2026 at 11:04:35AM +0100, Mickaël Salaün wrote:
> > Add a new LSM audit data type LSM_AUDIT_DATA_NS that logs namespace
> > information in audit records. Two fields are provided, matching the
> > field names of struct ns_common:
> >
> > - ns_type: the CLONE_NEW* flag identifying the namespace type, logged in
> > hexadecimal.
> >
> > - inum: the proc inode number identifying a specific namespace instance.
> > Namespace inode numbers are allocated by proc_alloc_inum() via
> > ida_alloc_max() bounded to UINT_MAX, so the value always fits in 32
> > bits.
> >
> > A new audit data type is needed because no existing LSM_AUDIT_DATA_*
> > type carries namespace information. The closest alternatives (e.g.
> > LSM_AUDIT_DATA_TASK or LSM_AUDIT_DATA_NONE with custom strings) would
> > either lose the namespace type or require ad-hoc formatting that
> > bypasses the structured audit data union.
> >
> > Cc: Christian Brauner <brauner@kernel.org>
> > Cc: Günther Noack <gnoack@google.com>
> > Cc: Paul Moore <paul@paul-moore.com>
> > Signed-off-by: Mickaël Salaün <mic@digikod.net>
> > ---
> > include/linux/lsm_audit.h | 5 +++++
> > security/lsm_audit.c | 4 ++++
> > 2 files changed, 9 insertions(+)
> >
> > diff --git a/include/linux/lsm_audit.h b/include/linux/lsm_audit.h
> > index 382c56a97bba..6e20a56b8c22 100644
> > --- a/include/linux/lsm_audit.h
> > +++ b/include/linux/lsm_audit.h
> > @@ -78,6 +78,7 @@ struct common_audit_data {
> > #define LSM_AUDIT_DATA_NOTIFICATION 16
> > #define LSM_AUDIT_DATA_ANONINODE 17
> > #define LSM_AUDIT_DATA_NLMSGTYPE 18
> > +#define LSM_AUDIT_DATA_NS 19
> > union {
> > struct path path;
> > struct dentry *dentry;
> > @@ -100,6 +101,10 @@ struct common_audit_data {
> > int reason;
> > const char *anonclass;
> > u16 nlmsg_type;
> > + struct {
> > + u32 ns_type;
> > + unsigned int inum;
>
> fwiw, you might want to start the 64-bit namespace id as well.
> But either way:
Right now these numbers are generated by ida_alloc_max(), which return
an int. Is there an ongoing patch series for this change?
>
> Reviewed-by: Christian Brauner <brauner@kernel.org>
>
^ permalink raw reply
* Re: [PATCH] landlock: Document fallocate(2) as another truncation corner case
From: Mickaël Salaün @ 2026-04-01 16:30 UTC (permalink / raw)
To: Günther Noack; +Cc: linux-security-module
In-Reply-To: <20260401150911.1038072-1-gnoack@google.com>
On Wed, Apr 01, 2026 at 05:09:10PM +0200, Günther Noack wrote:
> Reinforce the already stated policy that LANDLOCK_ACCESS_FS_TRUNCATE should
> always go hand in hand with LANDLOCK_ACCESS_FS_WRITE_FILE, as their
> meanings and enforcement overlap in counterintuitive ways.
>
> On many common file systems, fallocate(2) offers a way to shorten files as
> long as the file is opened for writing, side-stepping the
> LANDLOCK_ACCESS_FS_TRUNCATE right.
>
> Assisted-by: Gemini-CLI:gemini-3.1
> Signed-off-by: Günther Noack <gnoack@google.com>
> ---
> Documentation/userspace-api/landlock.rst | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst
> index 7f86d7a37dc2..d5691ec136cc 100644
> --- a/Documentation/userspace-api/landlock.rst
> +++ b/Documentation/userspace-api/landlock.rst
> @@ -378,8 +378,8 @@ Truncating files
>
> The operations covered by ``LANDLOCK_ACCESS_FS_WRITE_FILE`` and
> ``LANDLOCK_ACCESS_FS_TRUNCATE`` both change the contents of a file and sometimes
> -overlap in non-intuitive ways. It is recommended to always specify both of
> -these together.
> +overlap in non-intuitive ways. It is strongly recommended to always specify
> +both of these together (either granting both, or granting none).
>
> A particularly surprising example is :manpage:`creat(2)`. The name suggests
> that this system call requires the rights to create and write files. However,
> @@ -391,6 +391,10 @@ It should also be noted that truncating files does not require the
> system call, this can also be done through :manpage:`open(2)` with the flags
> ``O_RDONLY | O_TRUNC``.
>
> +At the same time, on some filesystems, :manpage:`fallocate(2)` offers a way to
> +shorten file contents with ``FALLOC_FL_COLLAPSE_RANGE`` when the file is opened
> +for writing, sidestepping the ``LANDLOCK_ACCESS_FS_TRUNCATE`` right.
Interesting, which filesystems? Shouldn't it be fixed in the code
instead?
> +
> The truncate right is associated with the opened file (see below).
>
> Rights associated with file descriptors
> --
> 2.53.0.1185.g05d4b7b318-goog
>
>
^ permalink raw reply
* [PATCH v2 1/4] selftests/landlock: Fix snprintf truncation checks in audit helpers
From: Mickaël Salaün @ 2026-04-01 16:14 UTC (permalink / raw)
To: Günther Noack
Cc: Mickaël Salaün, linux-security-module, Justin Suess,
Tingmao Wang
In-Reply-To: <20260401161503.1136946-1-mic@digikod.net>
snprintf() returns the number of characters that would have been
written, excluding the terminating NUL byte. When the output is
truncated, this return value equals or exceeds the buffer size. Fix
matches_log_domain_allocated() and matches_log_domain_deallocated() to
detect truncation with ">=" instead of ">".
Cc: Günther Noack <gnoack@google.com>
Fixes: 6a500b22971c ("selftests/landlock: Add tests for audit flags and domain IDs")
Reviewed-by: Günther Noack <gnoack@google.com>
Link: https://lore.kernel.org/r/20260312100444.2609563-8-mic@digikod.net
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
Changes since v1:
https://lore.kernel.org/r/20260312100444.2609563-8-mic@digikod.net
- New patch (split from the drain fix).
---
tools/testing/selftests/landlock/audit.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/landlock/audit.h b/tools/testing/selftests/landlock/audit.h
index 44eb433e9666..1049a0582af5 100644
--- a/tools/testing/selftests/landlock/audit.h
+++ b/tools/testing/selftests/landlock/audit.h
@@ -309,7 +309,7 @@ static int __maybe_unused matches_log_domain_allocated(int audit_fd, pid_t pid,
log_match_len =
snprintf(log_match, sizeof(log_match), log_template, pid);
- if (log_match_len > sizeof(log_match))
+ if (log_match_len >= sizeof(log_match))
return -E2BIG;
return audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
@@ -326,7 +326,7 @@ static int __maybe_unused matches_log_domain_deallocated(
log_match_len = snprintf(log_match, sizeof(log_match), log_template,
num_denials);
- if (log_match_len > sizeof(log_match))
+ if (log_match_len >= sizeof(log_match))
return -E2BIG;
return audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
--
2.53.0
^ permalink raw reply related
* [PATCH v2 4/4] selftests/landlock: Skip stale records in audit_match_record()
From: Mickaël Salaün @ 2026-04-01 16:14 UTC (permalink / raw)
To: Günther Noack
Cc: Mickaël Salaün, linux-security-module, Justin Suess,
Tingmao Wang
In-Reply-To: <20260401161503.1136946-1-mic@digikod.net>
Domain deallocation records are emitted asynchronously from kworker
threads (via free_ruleset_work()). Stale deallocation records from a
previous test can arrive during the current test's deallocation read
loop and be picked up by audit_match_record() instead of the expected
record, causing a domain ID mismatch. The audit.layers test (which
creates 16 nested domains) is particularly vulnerable because it reads
16 deallocation records in sequence, providing a large window for stale
records to interleave.
The same issue affects audit_flags.signal, where deallocation records
from a previous test (audit.layers) can leak into the next test and be
picked up by audit_match_record() instead of the expected record.
Fix this by continuing to read records when the type matches but the
content pattern does not. Stale records are silently consumed, and the
loop only stops when both type and pattern match (or the socket times
out with -EAGAIN).
Additionally, extend matches_log_domain_deallocated() with an
expected_domain_id parameter. When set, the regex pattern includes the
specific domain ID as a literal hex value, so that deallocation records
for a different domain do not match the pattern at all. This handles
the case where the stale record has the same denial count as the
expected one (e.g. both have denials=1), which the type+pattern loop
alone cannot distinguish. Callers that already know the expected domain
ID (from a prior denial or allocation record) now pass it to filter
precisely.
When expected_domain_id is set, matches_log_domain_deallocated() also
temporarily increases the socket timeout to audit_tv_dom_drop (1 second)
to wait for the asynchronous kworker deallocation, and restores
audit_tv_default afterward. This removes the need for callers to manage
the timeout switch manually.
Cc: Günther Noack <gnoack@google.com>
Fixes: 6a500b22971c ("selftests/landlock: Add tests for audit flags and domain IDs")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
Changes since v1:
- New patch.
---
tools/testing/selftests/landlock/audit.h | 81 ++++++++++++++-----
tools/testing/selftests/landlock/audit_test.c | 32 ++++----
2 files changed, 75 insertions(+), 38 deletions(-)
diff --git a/tools/testing/selftests/landlock/audit.h b/tools/testing/selftests/landlock/audit.h
index 74e1c3d763be..b439a9b00f0e 100644
--- a/tools/testing/selftests/landlock/audit.h
+++ b/tools/testing/selftests/landlock/audit.h
@@ -249,9 +249,9 @@ static __maybe_unused char *regex_escape(const char *const src, char *dst,
static int audit_match_record(int audit_fd, const __u16 type,
const char *const pattern, __u64 *domain_id)
{
- struct audit_message msg;
+ struct audit_message msg, last_mismatch = {};
int ret, err = 0;
- bool matches_record = !type;
+ int num_type_match = 0;
regmatch_t matches[2];
regex_t regex;
@@ -259,21 +259,35 @@ static int audit_match_record(int audit_fd, const __u16 type,
if (ret)
return -EINVAL;
- do {
+ /*
+ * Reads records until one matches both the expected type and the
+ * pattern. Type-matching records with non-matching content are
+ * silently consumed, which handles stale domain deallocation records
+ * from a previous test emitted asynchronously by kworker threads.
+ */
+ while (true) {
memset(&msg, 0, sizeof(msg));
err = audit_recv(audit_fd, &msg);
- if (err)
+ if (err) {
+ if (num_type_match) {
+ printf("DATA: %s\n", last_mismatch.data);
+ printf("ERROR: %d record(s) matched type %u"
+ " but not pattern: %s\n",
+ num_type_match, type, pattern);
+ }
goto out;
+ }
- if (msg.header.nlmsg_type == type)
- matches_record = true;
- } while (!matches_record);
+ if (type && msg.header.nlmsg_type != type)
+ continue;
- ret = regexec(®ex, msg.data, ARRAY_SIZE(matches), matches, 0);
- if (ret) {
- printf("DATA: %s\n", msg.data);
- printf("ERROR: no match for pattern: %s\n", pattern);
- err = -ENOENT;
+ ret = regexec(®ex, msg.data, ARRAY_SIZE(matches), matches,
+ 0);
+ if (!ret)
+ break;
+
+ num_type_match++;
+ last_mismatch = msg;
}
if (domain_id) {
@@ -316,21 +330,48 @@ static int __maybe_unused matches_log_domain_allocated(int audit_fd, pid_t pid,
domain_id);
}
-static int __maybe_unused matches_log_domain_deallocated(
- int audit_fd, unsigned int num_denials, __u64 *domain_id)
+/*
+ * Matches a domain deallocation record. When expected_domain_id is non-zero,
+ * the pattern includes the specific domain ID so that stale deallocation
+ * records from a previous test (with a different domain ID) are skipped by
+ * audit_match_record(), and the socket timeout is temporarily increased to
+ * audit_tv_dom_drop to wait for the asynchronous kworker deallocation.
+ */
+static int __maybe_unused
+matches_log_domain_deallocated(int audit_fd, unsigned int num_denials,
+ __u64 expected_domain_id, __u64 *domain_id)
{
static const char log_template[] = REGEX_LANDLOCK_PREFIX
" status=deallocated denials=%u$";
- char log_match[sizeof(log_template) + 10];
- int log_match_len;
+ static const char log_template_with_id[] =
+ "^audit([0-9.:]\\+): domain=\\(%llx\\)"
+ " status=deallocated denials=%u$";
+ char log_match[sizeof(log_template_with_id) + 32];
+ int log_match_len, err;
+
+ if (expected_domain_id)
+ log_match_len = snprintf(log_match, sizeof(log_match),
+ log_template_with_id,
+ expected_domain_id, num_denials);
+ else
+ log_match_len = snprintf(log_match, sizeof(log_match),
+ log_template, num_denials);
- log_match_len = snprintf(log_match, sizeof(log_match), log_template,
- num_denials);
if (log_match_len >= sizeof(log_match))
return -E2BIG;
- return audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
- domain_id);
+ if (expected_domain_id)
+ setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO,
+ &audit_tv_dom_drop, sizeof(audit_tv_dom_drop));
+
+ err = audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
+ domain_id);
+
+ if (expected_domain_id)
+ setsockopt(audit_fd, SOL_SOCKET, SO_RCVTIMEO, &audit_tv_default,
+ sizeof(audit_tv_default));
+
+ return err;
}
struct audit_records {
diff --git a/tools/testing/selftests/landlock/audit_test.c b/tools/testing/selftests/landlock/audit_test.c
index f92ba6774faa..a76c3686a0fe 100644
--- a/tools/testing/selftests/landlock/audit_test.c
+++ b/tools/testing/selftests/landlock/audit_test.c
@@ -139,13 +139,16 @@ TEST_F(audit, layers)
WEXITSTATUS(status) != EXIT_SUCCESS)
_metadata->exit_code = KSFT_FAIL;
- /* Purges log from deallocated domains. */
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_dom_drop, sizeof(audit_tv_dom_drop)));
+ /*
+ * Purges log from deallocated domains. Records arrive in LIFO order
+ * (innermost domain first) because landlock_put_hierarchy() walks the
+ * chain sequentially in a single kworker context.
+ */
for (i = ARRAY_SIZE(*domain_stack) - 1; i >= 0; i--) {
__u64 deallocated_dom = 2;
EXPECT_EQ(0, matches_log_domain_deallocated(self->audit_fd, 1,
+ (*domain_stack)[i],
&deallocated_dom));
EXPECT_EQ((*domain_stack)[i], deallocated_dom)
{
@@ -154,8 +157,6 @@ TEST_F(audit, layers)
}
}
EXPECT_EQ(0, munmap(domain_stack, sizeof(*domain_stack)));
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_default, sizeof(audit_tv_default)));
EXPECT_EQ(0, close(ruleset_fd));
}
@@ -270,13 +271,9 @@ TEST_F(audit, thread)
EXPECT_EQ(0, close(pipe_parent[1]));
ASSERT_EQ(0, pthread_join(thread, NULL));
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_dom_drop, sizeof(audit_tv_dom_drop)));
- EXPECT_EQ(0, matches_log_domain_deallocated(self->audit_fd, 1,
- &deallocated_dom));
+ EXPECT_EQ(0, matches_log_domain_deallocated(
+ self->audit_fd, 1, denial_dom, &deallocated_dom));
EXPECT_EQ(denial_dom, deallocated_dom);
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_default, sizeof(audit_tv_default)));
}
FIXTURE(audit_flags)
@@ -432,22 +429,21 @@ TEST_F(audit_flags, signal)
if (variant->restrict_flags &
LANDLOCK_RESTRICT_SELF_LOG_SAME_EXEC_OFF) {
+ /*
+ * No deallocation record: denials=0 never matches a real
+ * record.
+ */
EXPECT_EQ(-EAGAIN,
- matches_log_domain_deallocated(self->audit_fd, 0,
+ matches_log_domain_deallocated(self->audit_fd, 0, 0,
&deallocated_dom));
EXPECT_EQ(deallocated_dom, 2);
} else {
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_dom_drop,
- sizeof(audit_tv_dom_drop)));
EXPECT_EQ(0, matches_log_domain_deallocated(self->audit_fd, 2,
+ *self->domain_id,
&deallocated_dom));
EXPECT_NE(deallocated_dom, 2);
EXPECT_NE(deallocated_dom, 0);
EXPECT_EQ(deallocated_dom, *self->domain_id);
- EXPECT_EQ(0, setsockopt(self->audit_fd, SOL_SOCKET, SO_RCVTIMEO,
- &audit_tv_default,
- sizeof(audit_tv_default)));
}
}
--
2.53.0
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox