Linux Security Modules development
* Re: [PATCH v6] lsm: Add LSM hook security_unix_find
From: Paul Moore @ 2026-03-11 16:08 UTC
  To: Justin Suess
  Cc: Günther Noack, brauner, demiobenour, fahimitahera, hi, horms,
	ivanov.mikhail1, jannh, jmorris, john.johansen,
	konstantin.meskhidze, linux-security-module, m, matthieu, mic,
	netdev, samasth.norway.ananda, serge, viro
In-Reply-To: <abFhawSTjNoa-KaH@suesslenovo>

On Wed, Mar 11, 2026 at 8:34 AM Justin Suess <utilityemal77@gmail.com> wrote:
>
> On Tue, Mar 10, 2026 at 06:39:12PM -0400, Paul Moore wrote:
> > On Thu, Feb 19, 2026 at 3:26 PM Günther Noack <gnoack3000@gmail.com> wrote:
> > > On Thu, Feb 19, 2026 at 03:04:59PM -0500, Justin Suess wrote:
> > > > Add an LSM hook security_unix_find.
> > > >
> > > > This hook is called to check the path of a named unix socket before a
> > > > connection is initiated. The peer socket may be inspected as well.
> > > >
> > > > Why existing hooks are unsuitable:
> > > >
> > > > Existing socket hooks, security_unix_stream_connect(),
> > > > security_unix_may_send(), and security_socket_connect() don't provide
> > > > TOCTOU-free / namespace independent access to the paths of sockets.
> > > >
> > > > (1) We cannot resolve the path from the struct sockaddr in existing hooks.
> > > > This requires another path lookup. A change in the path between the
> > > > two lookups will cause a TOCTOU bug.
> > > >
> > > > (2) We cannot use the struct path from the listening socket, because it
> > > > may be bound to a path in a different namespace than the caller,
> > > > resulting in a path that cannot be referenced at policy creation time.
> > > >
> > > > Cc: Günther Noack <gnoack3000@gmail.com>
> > > > Cc: Tingmao Wang <m@maowtm.org>
> > > > Signed-off-by: Justin Suess <utilityemal77@gmail.com>
> > > > ---
> > > >  include/linux/lsm_hook_defs.h |  5 +++++
> > > >  include/linux/security.h      | 11 +++++++++++
> > > >  net/unix/af_unix.c            | 13 ++++++++++---
> > > >  security/security.c           | 20 ++++++++++++++++++++
> > > >  4 files changed, 46 insertions(+), 3 deletions(-)
> >
> > ...
> >
> > > Reviewed-by: Günther Noack <gnoack3000@gmail.com>
> > >
> > > Thank you, this looks good. I'll include it in the next version of the
> > > Unix connect patch set again.
> >
> > I'm looking for this patchset to review/ACK the new hook in context,
> > but I'm not seeing it in my inbox or lore.  Did I simply miss the
> > patchset or is it still a work in progress?  No worries if it hasn't
> > been posted yet, I just wanted to make sure I wasn't holding this up
> > any more than I already may have :)
>
> Good Morning Paul,
>
> Can't speak to the rest of the patch, but I sent this LSM hook for
> review purposes before inclusion with the rest of the V6 of this patch.
>
> Günther added his review tag, but I was asked to make some minor comment / commit
> message updates. I sent the same patch, with updated comments/commit to him
> in a follow up, off-list email to avoid spamming the list. No code changes were
> made, just comments.
>
> I don't think this particular patch will change substantially, unless we find
> something unexpected. But the way we use the hook may change (especially with
> regard to locking and the SOCK_DEAD state), which is important for your review.
>
> So you may want to hold off your review until the full V6 series gets sent so
> you can review the hook in context. There were some questions about
> locking that needed proper digging into. [1]

Great, thanks for the update, that was helpful.  As you recommend,
I'll hold off on reviewing this further until we have the full context
of the other patchset; we've already talked about this hook addition a
few times anyway, and based on a quick look yesterday, nothing
particularly evil jumped out at me.

-- 
paul-moore.com


* Re: LSM namespacing API
From: Casey Schaufler @ 2026-03-11 16:37 UTC
  To: Stephen Smalley
  Cc: Dr. Greg, Paul Moore, Ondrej Mosnacek, linux-security-module,
	selinux, John Johansen, Casey Schaufler
In-Reply-To: <CAEjxPJ7yuJ6sAZ-ViqT04M5WPC9O39m5UUGw2f3+GDR87tvbsA@mail.gmail.com>

On 3/9/2026 11:15 AM, Stephen Smalley wrote:
> On Fri, Mar 6, 2026 at 4:01 PM Casey Schaufler <casey@schaufler-ca.com> wrote:
>> On 3/6/2026 9:48 AM, Dr. Greg wrote:
>>> On Tue, Mar 03, 2026 at 11:46:53AM -0500, Paul Moore wrote:
>>>
>>> Good morning, I hope the week is winding down well for everyone.
>>>
>>>> On Tue, Mar 3, 2026 at 8:30 AM Stephen Smalley
>>>>> I think my only caveat here is that your proposal is quite a bit more
>>>>> complex than what I implemented here:
>>>>> [1] https://lore.kernel.org/selinux/20251003190959.3288-2-stephen.smalley.work@gmail.com/
>>>>> [2] https://lore.kernel.org/selinux/20251003191328.3605-1-stephen.smalley.work@gmail.com/
>>>>> and I'm not sure the extra complexity is worth it.
>>>>>
>>>>> In particular:
>>>>> 1. Immediately unsharing the namespace upon lsm_set_self_attr() allows
>>>>> the caller to immediately and unambiguously know if the operation is
>>>>> supported and allowed ...
>>>> Performing the unshare operation immediately looks much less like a
>>>> LSM attribute and more like its own syscall.  That isn't a problem
>>>> in my eyes, it just means if this is the direction we want to go we
>>>> should implement a lsm_unshare(2) API, or something similar.
>>> Stephen's take on this is correct, the least complicated path forward
>>> is a simple call, presumably lsm_unshare(2), that instructs the LSM(s)
>>> to carry out whatever is needed to create a new security namespace.
>>>
>>> There are only two public implementations of what can be referred to
>>> as major security namespacing efforts: Stephen's work with SELinux and
>>> our TSEM implementation.
>> Please be just a tiny bit careful before you make this sort of assertion:
>>
>>         https://lwn.net/Articles/645403/
> I believe both AppArmor and TOMOYO also have namespacing
> implementations already upstream, so SELinux is certainly not the only
> one. Looks like the Smack implementation you cited above was based on
> extending user namespaces rather than purely Smack-internal like the
> others; is that why it wasn't ultimately merged?

Less sophisticated solutions to the problem Smack namespaces were
intended to address became available. The effort stalled for lack of
a use case that required it. Much of what you want a namespace for
can be accomplished using process specific rules.




* [PATCH v3 1/3] ima: Remove ima_h_table structure
From: Roberto Sassu @ 2026-03-11 17:19 UTC
  To: corbet, skhan, zohar, dmitry.kasatkin, eric.snowberg, paul,
	jmorris, serge
  Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
	gregorylumen, chenste, nramas, Roberto Sassu

From: Roberto Sassu <roberto.sassu@huawei.com>

With the upcoming change to dynamically allocate and replace the hash
table, the counters for the number of measurement entries and violations
would have to be preserved across each replacement.

Since those counters do not belong in the hash table structure anyway,
remove the ima_h_table structure and turn the counters and the hash table
into separate variables.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
---
Changelog:
v2:
 - Not present in this version

v1:
 - Not present in this version
---
 security/integrity/ima/ima.h       |  9 +++------
 security/integrity/ima/ima_api.c   |  2 +-
 security/integrity/ima/ima_fs.c    | 19 +++++++++----------
 security/integrity/ima/ima_kexec.c |  2 +-
 security/integrity/ima/ima_queue.c | 17 ++++++++++-------
 5 files changed, 24 insertions(+), 25 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index c38a9eb945b6..1f2c81ec0fba 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -298,12 +298,9 @@ int ima_lsm_policy_change(struct notifier_block *nb, unsigned long event,
  */
 extern spinlock_t ima_queue_lock;
 
-struct ima_h_table {
-	atomic_long_t len;	/* number of stored measurements in the list */
-	atomic_long_t violations;
-	struct hlist_head queue[IMA_MEASURE_HTABLE_SIZE];
-};
-extern struct ima_h_table ima_htable;
+extern atomic_long_t ima_num_entries;
+extern atomic_long_t ima_num_violations;
+extern struct hlist_head ima_htable[IMA_MEASURE_HTABLE_SIZE];
 
 static inline unsigned int ima_hash_key(u8 *digest)
 {
diff --git a/security/integrity/ima/ima_api.c b/security/integrity/ima/ima_api.c
index 0916f24f005f..122d127e108d 100644
--- a/security/integrity/ima/ima_api.c
+++ b/security/integrity/ima/ima_api.c
@@ -146,7 +146,7 @@ void ima_add_violation(struct file *file, const unsigned char *filename,
 	int result;
 
 	/* can overflow, only indicator */
-	atomic_long_inc(&ima_htable.violations);
+	atomic_long_inc(&ima_num_violations);
 
 	result = ima_alloc_init_template(&event_data, &entry, NULL);
 	if (result < 0) {
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index ca4931a95098..aaa460d70ff7 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -38,8 +38,8 @@ __setup("ima_canonical_fmt", default_canonical_fmt_setup);
 
 static int valid_policy = 1;
 
-static ssize_t ima_show_htable_value(char __user *buf, size_t count,
-				     loff_t *ppos, atomic_long_t *val)
+static ssize_t ima_show_counter(char __user *buf, size_t count, loff_t *ppos,
+				atomic_long_t *val)
 {
 	char tmpbuf[32];	/* greater than largest 'long' string value */
 	ssize_t len;
@@ -48,15 +48,14 @@ static ssize_t ima_show_htable_value(char __user *buf, size_t count,
 	return simple_read_from_buffer(buf, count, ppos, tmpbuf, len);
 }
 
-static ssize_t ima_show_htable_violations(struct file *filp,
-					  char __user *buf,
-					  size_t count, loff_t *ppos)
+static ssize_t ima_show_num_violations(struct file *filp, char __user *buf,
+				       size_t count, loff_t *ppos)
 {
-	return ima_show_htable_value(buf, count, ppos, &ima_htable.violations);
+	return ima_show_counter(buf, count, ppos, &ima_num_violations);
 }
 
-static const struct file_operations ima_htable_violations_ops = {
-	.read = ima_show_htable_violations,
+static const struct file_operations ima_num_violations_ops = {
+	.read = ima_show_num_violations,
 	.llseek = generic_file_llseek,
 };
 
@@ -64,7 +63,7 @@ static ssize_t ima_show_measurements_count(struct file *filp,
 					   char __user *buf,
 					   size_t count, loff_t *ppos)
 {
-	return ima_show_htable_value(buf, count, ppos, &ima_htable.len);
+	return ima_show_counter(buf, count, ppos, &ima_num_entries);
 
 }
 
@@ -545,7 +544,7 @@ int __init ima_fs_init(void)
 	}
 
 	dentry = securityfs_create_file("violations", S_IRUSR | S_IRGRP,
-				   ima_dir, NULL, &ima_htable_violations_ops);
+				   ima_dir, NULL, &ima_num_violations_ops);
 	if (IS_ERR(dentry)) {
 		ret = PTR_ERR(dentry);
 		goto out;
diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
index 36a34c54de58..5801649fbbef 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -43,7 +43,7 @@ void ima_measure_kexec_event(const char *event_name)
 	int n;
 
 	buf_size = ima_get_binary_runtime_size();
-	len = atomic_long_read(&ima_htable.len);
+	len = atomic_long_read(&ima_num_entries);
 
 	n = scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
 		      "kexec_segment_size=%lu;ima_binary_runtime_size=%lu;"
diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
index 319522450854..4837fc6d9ada 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -32,11 +32,14 @@ static unsigned long binary_runtime_size;
 static unsigned long binary_runtime_size = ULONG_MAX;
 #endif
 
+/* num of stored meas. in the list */
+atomic_long_t ima_num_entries = ATOMIC_LONG_INIT(0);
+/* num of violations in the list */
+atomic_long_t ima_num_violations = ATOMIC_LONG_INIT(0);
+
 /* key: inode (before secure-hashing a file) */
-struct ima_h_table ima_htable = {
-	.len = ATOMIC_LONG_INIT(0),
-	.violations = ATOMIC_LONG_INIT(0),
-	.queue[0 ... IMA_MEASURE_HTABLE_SIZE - 1] = HLIST_HEAD_INIT
+struct hlist_head ima_htable[IMA_MEASURE_HTABLE_SIZE] = {
+	[0 ... IMA_MEASURE_HTABLE_SIZE - 1] = HLIST_HEAD_INIT
 };
 
 /* mutex protects atomicity of extending measurement list
@@ -61,7 +64,7 @@ static struct ima_queue_entry *ima_lookup_digest_entry(u8 *digest_value,
 
 	key = ima_hash_key(digest_value);
 	rcu_read_lock();
-	hlist_for_each_entry_rcu(qe, &ima_htable.queue[key], hnext) {
+	hlist_for_each_entry_rcu(qe, &ima_htable[key], hnext) {
 		rc = memcmp(qe->entry->digests[ima_hash_algo_idx].digest,
 			    digest_value, hash_digest_size[ima_hash_algo]);
 		if ((rc == 0) && (qe->entry->pcr == pcr)) {
@@ -113,10 +116,10 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
 	INIT_LIST_HEAD(&qe->later);
 	list_add_tail_rcu(&qe->later, &ima_measurements);
 
-	atomic_long_inc(&ima_htable.len);
+	atomic_long_inc(&ima_num_entries);
 	if (update_htable) {
 		key = ima_hash_key(entry->digests[ima_hash_algo_idx].digest);
-		hlist_add_head_rcu(&qe->hnext, &ima_htable.queue[key]);
+		hlist_add_head_rcu(&qe->hnext, &ima_htable[key]);
 	}
 
 	if (binary_runtime_size != ULONG_MAX) {
-- 
2.43.0



* [PATCH v3 2/3] ima: Replace static htable queue with dynamically allocated array
From: Roberto Sassu @ 2026-03-11 17:19 UTC
  To: corbet, skhan, zohar, dmitry.kasatkin, eric.snowberg, paul,
	jmorris, serge
  Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
	gregorylumen, chenste, nramas, Roberto Sassu
In-Reply-To: <20260311171956.2317781-1-roberto.sassu@huaweicloud.com>

From: Roberto Sassu <roberto.sassu@huawei.com>

The IMA hash table is a fixed-size array of hlist_head buckets:

    struct hlist_head ima_htable[IMA_MEASURE_HTABLE_SIZE];

IMA_MEASURE_HTABLE_SIZE is (1 << IMA_HASH_BITS) = 1024 buckets, each a
struct hlist_head (one pointer, 8 bytes on 64-bit). That is 8 KiB allocated
in BSS for every kernel, regardless of whether IMA is ever used, and
regardless of how many measurements are actually made.
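
As a quick check of the figures above, the following C sketch recomputes
the footprint of the static table. IMA_HASH_BITS is taken to be 10, as
implied by the 1024-bucket figure; struct hlist_head_sketch and
ima_htable_bytes() are illustrative stand-ins (one pointer per bucket),
not kernel definitions.

```c
#include <stddef.h>

/* Assumed from the commit message: 1 << IMA_HASH_BITS == 1024. */
#define IMA_HASH_BITS 10
#define IMA_MEASURE_HTABLE_SIZE (1UL << IMA_HASH_BITS)

/* Stand-in for struct hlist_head: a single pointer to the first node. */
struct hlist_head_sketch {
	void *first;
};

/* Bytes occupied by the statically allocated bucket array. */
static size_t ima_htable_bytes(void)
{
	return IMA_MEASURE_HTABLE_SIZE * sizeof(struct hlist_head_sketch);
}
```

With 8-byte pointers this evaluates to 1024 * 8 = 8192 bytes, i.e. 8 KiB.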

Replace the fixed-size array with an RCU-protected pointer to a dynamically
allocated array that is initialized in ima_init_htable(), which is called
from ima_init() during early boot. ima_init_htable() calls the static
function ima_alloc_replace_htable() which, other than initializing the hash
table the first time, can also hot-swap the existing hash table with a
blank one.

The allocation in ima_alloc_replace_htable() uses kcalloc() so the buckets
are zero-initialised (equivalent to HLIST_HEAD_INIT { .first = NULL }).
Callers of ima_alloc_replace_htable() must call synchronize_rcu() and free
the returned hash table.

Finally, access the hash table with rcu_dereference() in
ima_lookup_digest_entry() (reader side) and with
rcu_dereference_protected() in ima_add_digest_entry() (writer side).
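
The allocate-and-swap pattern described above can be illustrated outside
the kernel. The sketch below is a userspace analogue using C11 atomics,
not the patch itself: htable_replace() and htable_bucket() are
hypothetical names, calloc() stands in for kcalloc(), and
atomic_exchange() stands in for rcu_replace_pointer(). There is no grace
period here, so freeing the old table is only safe once no reader can
still hold it (in the kernel, after synchronize_rcu()).

```c
#include <stdatomic.h>
#include <stdlib.h>

#define HTABLE_SIZE 1024	/* mirrors IMA_MEASURE_HTABLE_SIZE */

/* Stand-in for struct hlist_head. */
struct bucket {
	void *first;
};

/* Published table pointer: readers load it, the writer swaps it. */
static _Atomic(struct bucket *) htable;

/*
 * Allocate a zeroed table and publish it. On success, store the previous
 * table (NULL on first use) in *old and return 0; the caller owns *old.
 * On allocation failure, return -1 and leave the current table in place.
 */
static int htable_replace(struct bucket **old)
{
	struct bucket *new_table = calloc(HTABLE_SIZE, sizeof(*new_table));

	if (!new_table)
		return -1;

	*old = atomic_exchange(&htable, new_table);
	return 0;
}

/* Reader side: load the current table once, then index into it. */
static struct bucket *htable_bucket(unsigned int key)
{
	struct bucket *table = atomic_load(&htable);

	return table ? &table[key % HTABLE_SIZE] : NULL;
}
```

Loading the pointer once per lookup matters: a reader that indexed the
global twice could see two different tables across a swap.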

No functional change: bucket count, hash function, and all locking remain
identical.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
---
Changelog:
v2:
 - Not present in this version

v1:
 - Not present in this version
---
 security/integrity/ima/ima.h       |  3 +-
 security/integrity/ima/ima_init.c  |  5 +++
 security/integrity/ima/ima_queue.c | 49 +++++++++++++++++++++++++++---
 3 files changed, 51 insertions(+), 6 deletions(-)

diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index 1f2c81ec0fba..ccd037d49de7 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -285,6 +285,7 @@ bool ima_template_has_modsig(const struct ima_template_desc *ima_template);
 int ima_restore_measurement_entry(struct ima_template_entry *entry);
 int ima_restore_measurement_list(loff_t bufsize, void *buf);
 int ima_measurements_show(struct seq_file *m, void *v);
+int __init ima_init_htable(void);
 unsigned long ima_get_binary_runtime_size(void);
 int ima_init_template(void);
 void ima_init_template_list(void);
@@ -300,7 +301,7 @@ extern spinlock_t ima_queue_lock;
 
 extern atomic_long_t ima_num_entries;
 extern atomic_long_t ima_num_violations;
-extern struct hlist_head ima_htable[IMA_MEASURE_HTABLE_SIZE];
+extern struct hlist_head __rcu *ima_htable;
 
 static inline unsigned int ima_hash_key(u8 *digest)
 {
diff --git a/security/integrity/ima/ima_init.c b/security/integrity/ima/ima_init.c
index a2f34f2d8ad7..7e0aa09a12e6 100644
--- a/security/integrity/ima/ima_init.c
+++ b/security/integrity/ima/ima_init.c
@@ -140,6 +140,11 @@ int __init ima_init(void)
 	rc = ima_init_digests();
 	if (rc != 0)
 		return rc;
+
+	rc = ima_init_htable();
+	if (rc != 0)
+		return rc;
+
 	rc = ima_add_boot_aggregate();	/* boot aggregate must be first entry */
 	if (rc != 0)
 		return rc;
diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
index 4837fc6d9ada..2050b9d21e70 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -38,9 +38,7 @@ atomic_long_t ima_num_entries = ATOMIC_LONG_INIT(0);
 atomic_long_t ima_num_violations = ATOMIC_LONG_INIT(0);
 
 /* key: inode (before secure-hashing a file) */
-struct hlist_head ima_htable[IMA_MEASURE_HTABLE_SIZE] = {
-	[0 ... IMA_MEASURE_HTABLE_SIZE - 1] = HLIST_HEAD_INIT
-};
+struct hlist_head __rcu *ima_htable;
 
 /* mutex protects atomicity of extending measurement list
  * and extending the TPM PCR aggregate. Since tpm_extend can take
@@ -54,17 +52,54 @@ static DEFINE_MUTEX(ima_extend_list_mutex);
  */
 static bool ima_measurements_suspended;
 
+/* Callers must call synchronize_rcu() and free the hash table. */
+static struct hlist_head *ima_alloc_replace_htable(void)
+{
+	struct hlist_head *old_htable, *new_htable;
+
+	/* Initializing to zeros is equivalent to call HLIST_HEAD_INIT. */
+	new_htable = kcalloc(IMA_MEASURE_HTABLE_SIZE, sizeof(struct hlist_head),
+			     GFP_KERNEL);
+	if (!new_htable)
+		return ERR_PTR(-ENOMEM);
+
+	old_htable = rcu_replace_pointer(ima_htable, new_htable,
+				lockdep_is_held(&ima_extend_list_mutex));
+
+	return old_htable;
+}
+
+int __init ima_init_htable(void)
+{
+	struct hlist_head *old_htable;
+
+	mutex_lock(&ima_extend_list_mutex);
+	old_htable = ima_alloc_replace_htable();
+	mutex_unlock(&ima_extend_list_mutex);
+
+	/* Synchronize_rcu() and kfree() not necessary, only for robustness. */
+	synchronize_rcu();
+
+	if (IS_ERR(old_htable))
+		return PTR_ERR(old_htable);
+
+	kfree(old_htable);
+	return 0;
+}
+
 /* lookup up the digest value in the hash table, and return the entry */
 static struct ima_queue_entry *ima_lookup_digest_entry(u8 *digest_value,
 						       int pcr)
 {
 	struct ima_queue_entry *qe, *ret = NULL;
+	struct hlist_head *htable;
 	unsigned int key;
 	int rc;
 
 	key = ima_hash_key(digest_value);
 	rcu_read_lock();
-	hlist_for_each_entry_rcu(qe, &ima_htable[key], hnext) {
+	htable = rcu_dereference(ima_htable);
+	hlist_for_each_entry_rcu(qe, &htable[key], hnext) {
 		rc = memcmp(qe->entry->digests[ima_hash_algo_idx].digest,
 			    digest_value, hash_digest_size[ima_hash_algo]);
 		if ((rc == 0) && (qe->entry->pcr == pcr)) {
@@ -104,6 +139,7 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
 				bool update_htable)
 {
 	struct ima_queue_entry *qe;
+	struct hlist_head *htable;
 	unsigned int key;
 
 	qe = kmalloc_obj(*qe);
@@ -116,10 +152,13 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
 	INIT_LIST_HEAD(&qe->later);
 	list_add_tail_rcu(&qe->later, &ima_measurements);
 
+	htable = rcu_dereference_protected(ima_htable,
+				lockdep_is_held(&ima_extend_list_mutex));
+
 	atomic_long_inc(&ima_num_entries);
 	if (update_htable) {
 		key = ima_hash_key(entry->digests[ima_hash_algo_idx].digest);
-		hlist_add_head_rcu(&qe->hnext, &ima_htable[key]);
+		hlist_add_head_rcu(&qe->hnext, &htable[key]);
 	}
 
 	if (binary_runtime_size != ULONG_MAX) {
-- 
2.43.0



* [PATCH v3 3/3] ima: Add support for staging measurements for deletion
From: Roberto Sassu @ 2026-03-11 17:19 UTC
  To: corbet, skhan, zohar, dmitry.kasatkin, eric.snowberg, paul,
	jmorris, serge
  Cc: linux-doc, linux-kernel, linux-integrity, linux-security-module,
	gregorylumen, chenste, nramas, Roberto Sassu
In-Reply-To: <20260311171956.2317781-1-roberto.sassu@huaweicloud.com>

From: Roberto Sassu <roberto.sassu@huawei.com>

Introduce the ability to stage the IMA measurement list for deletion.
Staging means moving the current content of the measurement list to a
separate location, and allowing users to read and delete it. This causes
the measurement list to be atomically truncated before new measurements can
be added. Staging can be done only once at a time. In the event of kexec(),
staging is reverted and staged entries will be carried over to the new
kernel.

Staged measurements can be deleted entirely, or partially, with the
non-deleted ones added back to the IMA measurements list. This allows the
remote attestation agents to easily separate the measurements that were
verified (staged and deleted) from those that weren't due to the race
between taking a TPM quote and reading the measurements list.

User space is responsible for concatenating the staged IMA measurements list
portions (excluding the measurements added back to the IMA measurements
list) following the temporal order in which the operations were done,
together with the current measurement list. Then, it can send the collected
data to the remote verifiers.

The benefit of staging and deleting is the ability to free precious kernel
memory, in exchange for delegating to user space the reconstruction of the
full measurement list from the chunks. No trust needs to be given to user
space, since the integrity of the measurement list is protected by the TPM.

By default, staging the measurements list does not alter the hash table.
When staging and deleting are done, IMA is still able to detect collisions
on the staged and later deleted measurement entries, by keeping the entry
digests (only template data are freed).

However, since during the measurements list serialization only the SHA1
digest is passed, and since there are no template data to recalculate the
other digests from, the hash table is currently not populated with digests
from staged/deleted entries after kexec().

Introduce the new kernel option ima_flush_htable to decide whether or not
the digests of staged measurement entries are flushed from the hash table,
when they are deleted. Flushing the hash table is supported only when
deleting all the staged measurements, since in that case the old hash table
can be quickly swapped with a blank one (otherwise entries would have to be
removed one by one for partial deletion).

Then, introduce ascii_runtime_measurements_<algo>_staged and
binary_runtime_measurements_<algo>_staged interfaces to stage and delete
the measurements. Use 'echo A > <IMA interface>' and
'echo D > <IMA interface>' to respectively stage and delete the entire
measurements list. Use 'echo N > <IMA interface>', with N between 1 and
ULONG_MAX - 1, to delete the selected staged portion of the measurements
list.
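
To make the request grammar concrete, here is a userspace sketch of a
parser for the three request forms ('A', 'D', or N in
[1, ULONG_MAX - 1], each optionally followed by the newline that echo
appends). The names enum staged_req and parse_staged_req() are
hypothetical, and the validation details are an assumption about
reasonable behaviour, not a copy of the write handler in this patch.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

enum staged_req { REQ_INVALID, REQ_STAGE, REQ_DELETE_ALL, REQ_DELETE_N };

/* True if s is empty or a single trailing newline. */
static int at_end(const char *s)
{
	return s[0] == '\0' || (s[0] == '\n' && s[1] == '\0');
}

/* Classify a request string; on REQ_DELETE_N, store the count in *n. */
static enum staged_req parse_staged_req(const char *req, unsigned long *n)
{
	unsigned long val;
	char *end;

	if (req[0] == 'A' && at_end(req + 1))
		return REQ_STAGE;
	if (req[0] == 'D' && at_end(req + 1))
		return REQ_DELETE_ALL;

	/* strtoul() accepts signs and leading blanks; insist on a digit. */
	if (!isdigit((unsigned char)req[0]))
		return REQ_INVALID;

	errno = 0;
	val = strtoul(req, &end, 10);
	if (errno || !at_end(end) || val < 1 || val > ULONG_MAX - 1)
		return REQ_INVALID;

	*n = val;
	return REQ_DELETE_N;
}
```

Excluding 0 and ULONG_MAX keeps both values free for other uses; a
string that overflows strtoul() is rejected through errno.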

The ima_measure_users counter (protected by the ima_measure_mutex mutex)
has been introduced to protect access to the measurements list and the
staged part. The open method of all the measurement interfaces has been
extended to allow either a single writer or, alternatively, multiple
readers at a time. The write permission is used to stage and delete the
measurements, the read permission to read them. Writing also requires the
CAP_SYS_ADMIN capability.
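
The single-writer-or-multiple-readers rule can be modelled in user space
as follows. This is a sketch under the assumption that a pthread mutex is
an adequate stand-in for the kernel mutex; measure_lock() and
measure_unlock() are illustrative names. The counter is positive while
readers hold the list, -1 while a writer holds it, and 0 when idle.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t measure_mutex = PTHREAD_MUTEX_INITIALIZER;
static long measure_users;	/* > 0: readers, -1: one writer, 0: idle */

/* Try to register as reader or writer; -EBUSY if the mode conflicts. */
static int measure_lock(bool write)
{
	int ret = 0;

	pthread_mutex_lock(&measure_mutex);
	if ((write && measure_users != 0) || (!write && measure_users < 0))
		ret = -EBUSY;
	else if (write)
		measure_users--;
	else
		measure_users++;
	pthread_mutex_unlock(&measure_mutex);

	return ret;
}

/* Undo a successful measure_lock() made with the same mode. */
static void measure_unlock(bool write)
{
	pthread_mutex_lock(&measure_mutex);
	if (write)
		measure_users++;
	else
		measure_users--;
	pthread_mutex_unlock(&measure_mutex);
}
```

The mutex is held only while the counter is updated, not for the lifetime
of the open file, so a conflicting open fails immediately with -EBUSY
instead of blocking.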

Finally, introduce the binary_lists enum and turn binary_runtime_size
and ima_num_entries into arrays, to keep track of their values for the
current IMA measurements list (BINARY), current list plus staged
measurements (BINARY_STAGED) and the cumulative list since IMA
initialization (BINARY_FULL).
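
One way to picture the three views is the toy model below. The enum
matches the one added by this patch; the update rules in on_add(),
on_stage() and on_delete_all_staged() are assumptions derived from the
definitions above (partial deletion is left out), not the actual code in
ima_queue.c.

```c
enum binary_lists { BINARY, BINARY_STAGED, BINARY_FULL, BINARY__LAST };

/* Plain counters stand in for the atomic_long_t array in the patch. */
static unsigned long num_entries[BINARY__LAST];

/* Assumed: a new measurement is visible in all three views. */
static void on_add(void)
{
	for (int i = 0; i < BINARY__LAST; i++)
		num_entries[i]++;
}

/* Assumed: staging moves the active list aside, so BINARY restarts. */
static void on_stage(void)
{
	num_entries[BINARY] = 0;
}

/* Assumed: deleting all staged entries leaves only the active list. */
static void on_delete_all_staged(void)
{
	num_entries[BINARY_STAGED] = num_entries[BINARY];
}
```

For example, adding three entries, staging, and adding two more would give
BINARY = 2 and BINARY_STAGED = BINARY_FULL = 5 under these rules; deleting
the staged entries then drops BINARY_STAGED to 2 while BINARY_FULL stays
at 5.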

Use BINARY in ima_show_measurements_count(), BINARY_STAGED in
ima_add_kexec_buffer() and BINARY_FULL in ima_measure_kexec_event().

It should be noted that the BINARY_FULL counter is not passed through
kexec. Thus, the number of entries included in the kexec critical data
records refers to the entries since the previous kexec records.

Note: This code derives from the Alt-IMA Huawei project, whose license is
      GPL-2.0 OR MIT.

Link: https://github.com/linux-integrity/linux/issues/1
Signed-off-by: Roberto Sassu <roberto.sassu@huawei.com>
---
Changelog
v2:
 - Forbid partial deletion when flushing hash table (suggested by Mimi)
 - Ignore ima_flush_htable if CONFIG_IMA_DISABLE_HTABLE is enabled
 - BINARY_SIZE_* renamed to BINARY_* for better clarity
 - Removed ima_measurements_staged_exist and testing list empty instead
 - ima_queue_stage_trim() and ima_queue_delete_staged_trimmed() renamed to
   ima_queue_stage() and ima_queue_delete_staged()
 - New delete interval [1, ULONG_MAX - 1]
 - Rename ima_measure_lock to ima_measure_mutex
 - Move seq_open() and seq_release() outside the ima_measure_mutex lock
 - Drop ima_measurements_staged_read() and use seq_read() instead
 - Optimize create_securityfs_measurement_lists() changes
 - New file name format with _staged suffix at the end of the file name
 - Use _rcu list variant in ima_dump_measurement_list()
 - Remove support for direct trimming and splice the remaining entries to
   the active list (suggested by Mimi)
 - Hot swap the hash table if flushing is requested

v1:
 - Support for direct trimming without staging
 - Support unstaging on kexec (requested by Gregory Lumen)
---
 .../admin-guide/kernel-parameters.txt         |   4 +
 security/integrity/ima/ima.h                  |  17 +-
 security/integrity/ima/ima_fs.c               | 266 ++++++++++++++++--
 security/integrity/ima/ima_kexec.c            |  43 ++-
 security/integrity/ima/ima_queue.c            | 205 +++++++++++++-
 5 files changed, 484 insertions(+), 51 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index cb850e5290c2..7a377812aa0a 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2345,6 +2345,10 @@ Kernel parameters
 			Use the canonical format for the binary runtime
 			measurements, instead of host native format.
 
+	ima_flush_htable  [IMA]
+			Flush the IMA hash table when deleting all the
+			staged measurement entries.
+
 	ima_hash=	[IMA]
 			Format: { md5 | sha1 | rmd160 | sha256 | sha384
 				   | sha512 | ... }
diff --git a/security/integrity/ima/ima.h b/security/integrity/ima/ima.h
index ccd037d49de7..e8aaf1e62139 100644
--- a/security/integrity/ima/ima.h
+++ b/security/integrity/ima/ima.h
@@ -28,6 +28,15 @@ enum ima_show_type { IMA_SHOW_BINARY, IMA_SHOW_BINARY_NO_FIELD_LEN,
 		     IMA_SHOW_BINARY_OLD_STRING_FMT, IMA_SHOW_ASCII };
 enum tpm_pcrs { TPM_PCR0 = 0, TPM_PCR8 = 8, TPM_PCR10 = 10 };
 
+/*
+ * BINARY: current binary measurements list
+ * BINARY_STAGED: current binary measurements list + staged entries
+ * BINARY_FULL: binary measurements list since IMA init (lost after kexec)
+ */
+enum binary_lists {
+	BINARY, BINARY_STAGED, BINARY_FULL, BINARY__LAST
+};
+
 /* digest size for IMA, fits SHA1 or MD5 */
 #define IMA_DIGEST_SIZE		SHA1_DIGEST_SIZE
 #define IMA_EVENT_NAME_LEN_MAX	255
@@ -118,6 +127,7 @@ struct ima_queue_entry {
 	struct ima_template_entry *entry;
 };
 extern struct list_head ima_measurements;	/* list of all measurements */
+extern struct list_head ima_measurements_staged; /* list of staged meas. */
 
 /* Some details preceding the binary serialized measurement list */
 struct ima_kexec_hdr {
@@ -282,11 +292,13 @@ struct ima_template_desc *ima_template_desc_current(void);
 struct ima_template_desc *ima_template_desc_buf(void);
 struct ima_template_desc *lookup_template_desc(const char *name);
 bool ima_template_has_modsig(const struct ima_template_desc *ima_template);
+int ima_queue_stage(void);
+int ima_queue_delete_staged(unsigned long req_value);
 int ima_restore_measurement_entry(struct ima_template_entry *entry);
 int ima_restore_measurement_list(loff_t bufsize, void *buf);
 int ima_measurements_show(struct seq_file *m, void *v);
 int __init ima_init_htable(void);
-unsigned long ima_get_binary_runtime_size(void);
+unsigned long ima_get_binary_runtime_size(enum binary_lists binary_list);
 int ima_init_template(void);
 void ima_init_template_list(void);
 int __init ima_init_digests(void);
@@ -299,9 +311,10 @@ int ima_lsm_policy_change(struct notifier_block *nb, unsigned long event,
  */
 extern spinlock_t ima_queue_lock;
 
-extern atomic_long_t ima_num_entries;
+extern atomic_long_t ima_num_entries[BINARY__LAST];
 extern atomic_long_t ima_num_violations;
 extern struct hlist_head __rcu *ima_htable;
+extern struct mutex ima_extend_list_mutex;
 
 static inline unsigned int ima_hash_key(u8 *digest)
 {
diff --git a/security/integrity/ima/ima_fs.c b/security/integrity/ima/ima_fs.c
index aaa460d70ff7..cf85b0892275 100644
--- a/security/integrity/ima/ima_fs.c
+++ b/security/integrity/ima/ima_fs.c
@@ -24,7 +24,17 @@
 
 #include "ima.h"
 
+/*
+ * Requests:
+ * 'A\n': stage the entire measurements list
+ * 'D\n': delete all staged measurements
+ * '[1, ULONG_MAX - 1]\n' delete N measurements entries and unstage the rest
+ */
+#define STAGED_REQ_LENGTH 21
+
 static DEFINE_MUTEX(ima_write_mutex);
+static DEFINE_MUTEX(ima_measure_mutex);
+static long ima_measure_users;
 
 bool ima_canonical_fmt;
 static int __init default_canonical_fmt_setup(char *str)
@@ -63,7 +73,7 @@ static ssize_t ima_show_measurements_count(struct file *filp,
 					   char __user *buf,
 					   size_t count, loff_t *ppos)
 {
-	return ima_show_counter(buf, count, ppos, &ima_num_entries);
+	return ima_show_counter(buf, count, ppos, &ima_num_entries[BINARY]);
 
 }
 
@@ -73,14 +83,15 @@ static const struct file_operations ima_measurements_count_ops = {
 };
 
 /* returns pointer to hlist_node */
-static void *ima_measurements_start(struct seq_file *m, loff_t *pos)
+static void *_ima_measurements_start(struct seq_file *m, loff_t *pos,
+				     struct list_head *head)
 {
 	loff_t l = *pos;
 	struct ima_queue_entry *qe;
 
 	/* we need a lock since pos could point beyond last element */
 	rcu_read_lock();
-	list_for_each_entry_rcu(qe, &ima_measurements, later) {
+	list_for_each_entry_rcu(qe, head, later) {
 		if (!l--) {
 			rcu_read_unlock();
 			return qe;
@@ -90,7 +101,18 @@ static void *ima_measurements_start(struct seq_file *m, loff_t *pos)
 	return NULL;
 }
 
-static void *ima_measurements_next(struct seq_file *m, void *v, loff_t *pos)
+static void *ima_measurements_start(struct seq_file *m, loff_t *pos)
+{
+	return _ima_measurements_start(m, pos, &ima_measurements);
+}
+
+static void *ima_measurements_staged_start(struct seq_file *m, loff_t *pos)
+{
+	return _ima_measurements_start(m, pos, &ima_measurements_staged);
+}
+
+static void *_ima_measurements_next(struct seq_file *m, void *v, loff_t *pos,
+				    struct list_head *head)
 {
 	struct ima_queue_entry *qe = v;
 
@@ -102,7 +124,18 @@ static void *ima_measurements_next(struct seq_file *m, void *v, loff_t *pos)
 	rcu_read_unlock();
 	(*pos)++;
 
-	return (&qe->later == &ima_measurements) ? NULL : qe;
+	return (&qe->later == head) ? NULL : qe;
+}
+
+static void *ima_measurements_next(struct seq_file *m, void *v, loff_t *pos)
+{
+	return _ima_measurements_next(m, v, pos, &ima_measurements);
+}
+
+static void *ima_measurements_staged_next(struct seq_file *m, void *v,
+					  loff_t *pos)
+{
+	return _ima_measurements_next(m, v, pos, &ima_measurements_staged);
 }
 
 static void ima_measurements_stop(struct seq_file *m, void *v)
@@ -198,16 +231,145 @@ static const struct seq_operations ima_measurments_seqops = {
 	.show = ima_measurements_show
 };
 
+static int ima_measure_lock(bool write)
+{
+	mutex_lock(&ima_measure_mutex);
+	if ((write && ima_measure_users != 0) ||
+	    (!write && ima_measure_users < 0)) {
+		mutex_unlock(&ima_measure_mutex);
+		return -EBUSY;
+	}
+
+	if (write)
+		ima_measure_users--;
+	else
+		ima_measure_users++;
+	mutex_unlock(&ima_measure_mutex);
+	return 0;
+}
+
+static void ima_measure_unlock(bool write)
+{
+	mutex_lock(&ima_measure_mutex);
+	if (write)
+		ima_measure_users++;
+	else
+		ima_measure_users--;
+	mutex_unlock(&ima_measure_mutex);
+}
+
+static int _ima_measurements_open(struct inode *inode, struct file *file,
+				  const struct seq_operations *seq_ops)
+{
+	bool write = (file->f_mode & FMODE_WRITE);
+	int ret;
+
+	if (write && !capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	ret = ima_measure_lock(write);
+	if (ret < 0)
+		return ret;
+
+	ret = seq_open(file, seq_ops);
+	if (ret < 0)
+		ima_measure_unlock(write);
+
+	return ret;
+}
+
 static int ima_measurements_open(struct inode *inode, struct file *file)
 {
-	return seq_open(file, &ima_measurments_seqops);
+	return _ima_measurements_open(inode, file, &ima_measurments_seqops);
+}
+
+static int ima_measurements_release(struct inode *inode, struct file *file)
+{
+	bool write = (file->f_mode & FMODE_WRITE);
+	int ret;
+
+	ret = seq_release(inode, file);
+
+	ima_measure_unlock(write);
+
+	return ret;
 }
 
 static const struct file_operations ima_measurements_ops = {
 	.open = ima_measurements_open,
 	.read = seq_read,
 	.llseek = seq_lseek,
-	.release = seq_release,
+	.release = ima_measurements_release,
+};
+
+static const struct seq_operations ima_measurments_staged_seqops = {
+	.start = ima_measurements_staged_start,
+	.next = ima_measurements_staged_next,
+	.stop = ima_measurements_stop,
+	.show = ima_measurements_show
+};
+
+static int ima_measurements_staged_open(struct inode *inode, struct file *file)
+{
+	return _ima_measurements_open(inode, file,
+				      &ima_measurments_staged_seqops);
+}
+
+static ssize_t ima_measurements_staged_write(struct file *file,
+					     const char __user *buf,
+					     size_t datalen, loff_t *ppos)
+{
+	char req[STAGED_REQ_LENGTH];
+	unsigned long req_value;
+	int ret;
+
+	if (*ppos > 0 || datalen < 2 || datalen > STAGED_REQ_LENGTH)
+		return -EINVAL;
+
+	if (copy_from_user(req, buf, datalen) != 0)
+		return -EFAULT;
+
+	if (req[datalen - 1] != '\n')
+		return -EINVAL;
+
+	req[datalen - 1] = '\0';
+
+	switch (req[0]) {
+	case 'A':
+		if (datalen != 2)
+			return -EINVAL;
+
+		ret = ima_queue_stage();
+		break;
+	case 'D':
+		if (datalen != 2)
+			return -EINVAL;
+
+		ret = ima_queue_delete_staged(ULONG_MAX);
+		break;
+	default:
+		ret = kstrtoul(req, 10, &req_value);
+		if (ret < 0)
+			return ret;
+
+		if (req_value == ULONG_MAX)
+			return -ERANGE;
+
+		ret = ima_queue_delete_staged(req_value);
+	}
+
+	if (ret < 0)
+		return ret;
+
+	return datalen;
+}
+
+static const struct file_operations ima_measurements_staged_ops = {
+	.open = ima_measurements_staged_open,
+	.read = seq_read,
+	.write = ima_measurements_staged_write,
+	.llseek = seq_lseek,
+	.release = ima_measurements_release,
 };
 
 void ima_print_digest(struct seq_file *m, u8 *digest, u32 size)
@@ -272,14 +434,37 @@ static const struct seq_operations ima_ascii_measurements_seqops = {
 
 static int ima_ascii_measurements_open(struct inode *inode, struct file *file)
 {
-	return seq_open(file, &ima_ascii_measurements_seqops);
+	return _ima_measurements_open(inode, file,
+				      &ima_ascii_measurements_seqops);
 }
 
 static const struct file_operations ima_ascii_measurements_ops = {
 	.open = ima_ascii_measurements_open,
 	.read = seq_read,
 	.llseek = seq_lseek,
-	.release = seq_release,
+	.release = ima_measurements_release,
+};
+
+static const struct seq_operations ima_ascii_measurements_staged_seqops = {
+	.start = ima_measurements_staged_start,
+	.next = ima_measurements_staged_next,
+	.stop = ima_measurements_stop,
+	.show = ima_ascii_measurements_show
+};
+
+static int ima_ascii_measurements_staged_open(struct inode *inode,
+					      struct file *file)
+{
+	return _ima_measurements_open(inode, file,
+				      &ima_ascii_measurements_staged_seqops);
+}
+
+static const struct file_operations ima_ascii_measurements_staged_ops = {
+	.open = ima_ascii_measurements_staged_open,
+	.read = seq_read,
+	.write = ima_measurements_staged_write,
+	.llseek = seq_lseek,
+	.release = ima_measurements_release,
 };
 
 static ssize_t ima_read_policy(char *path)
@@ -385,10 +570,21 @@ static const struct seq_operations ima_policy_seqops = {
 };
 #endif
 
-static int __init create_securityfs_measurement_lists(void)
+static int __init create_securityfs_measurement_lists(bool staging)
 {
+	const struct file_operations *ascii_ops = &ima_ascii_measurements_ops;
+	const struct file_operations *binary_ops = &ima_measurements_ops;
+	mode_t permissions = S_IRUSR | S_IRGRP;
+	const char *file_suffix = "";
 	int count = NR_BANKS(ima_tpm_chip);
 
+	if (staging) {
+		ascii_ops = &ima_ascii_measurements_staged_ops;
+		binary_ops = &ima_measurements_staged_ops;
+		file_suffix = "_staged";
+		permissions |= S_IWUSR | S_IWGRP;
+	}
+
 	if (ima_sha1_idx >= NR_BANKS(ima_tpm_chip))
 		count++;
 
@@ -398,26 +594,33 @@ static int __init create_securityfs_measurement_lists(void)
 		struct dentry *dentry;
 
 		if (algo == HASH_ALGO__LAST)
-			sprintf(file_name, "ascii_runtime_measurements_tpm_alg_%x",
-				ima_tpm_chip->allocated_banks[i].alg_id);
+			snprintf(file_name, sizeof(file_name),
+				 "ascii_runtime_measurements_tpm_alg_%x%s",
+				 ima_tpm_chip->allocated_banks[i].alg_id,
+				 file_suffix);
 		else
-			sprintf(file_name, "ascii_runtime_measurements_%s",
-				hash_algo_name[algo]);
-		dentry = securityfs_create_file(file_name, S_IRUSR | S_IRGRP,
+			snprintf(file_name, sizeof(file_name),
+				 "ascii_runtime_measurements_%s%s",
+				 hash_algo_name[algo], file_suffix);
+		dentry = securityfs_create_file(file_name, permissions,
 						ima_dir, (void *)(uintptr_t)i,
-						&ima_ascii_measurements_ops);
+						ascii_ops);
 		if (IS_ERR(dentry))
 			return PTR_ERR(dentry);
 
 		if (algo == HASH_ALGO__LAST)
-			sprintf(file_name, "binary_runtime_measurements_tpm_alg_%x",
-				ima_tpm_chip->allocated_banks[i].alg_id);
+			snprintf(file_name, sizeof(file_name),
+				 "binary_runtime_measurements_tpm_alg_%x%s",
+				 ima_tpm_chip->allocated_banks[i].alg_id,
+				 file_suffix);
 		else
-			sprintf(file_name, "binary_runtime_measurements_%s",
-				hash_algo_name[algo]);
-		dentry = securityfs_create_file(file_name, S_IRUSR | S_IRGRP,
+			snprintf(file_name, sizeof(file_name),
+				 "binary_runtime_measurements_%s%s",
+				 hash_algo_name[algo], file_suffix);
+
+		dentry = securityfs_create_file(file_name, permissions,
 						ima_dir, (void *)(uintptr_t)i,
-						&ima_measurements_ops);
+						binary_ops);
 		if (IS_ERR(dentry))
 			return PTR_ERR(dentry);
 	}
@@ -517,7 +720,10 @@ int __init ima_fs_init(void)
 		goto out;
 	}
 
-	ret = create_securityfs_measurement_lists();
+	ret = create_securityfs_measurement_lists(false);
+	if (ret == 0)
+		ret = create_securityfs_measurement_lists(true);
+
 	if (ret != 0)
 		goto out;
 
@@ -535,6 +741,20 @@ int __init ima_fs_init(void)
 		goto out;
 	}
 
+	dentry = securityfs_create_symlink("binary_runtime_measurements_staged",
+		ima_dir, "binary_runtime_measurements_sha1_staged", NULL);
+	if (IS_ERR(dentry)) {
+		ret = PTR_ERR(dentry);
+		goto out;
+	}
+
+	dentry = securityfs_create_symlink("ascii_runtime_measurements_staged",
+		ima_dir, "ascii_runtime_measurements_sha1_staged", NULL);
+	if (IS_ERR(dentry)) {
+		ret = PTR_ERR(dentry);
+		goto out;
+	}
+
 	dentry = securityfs_create_file("runtime_measurements_count",
 				   S_IRUSR | S_IRGRP, ima_dir, NULL,
 				   &ima_measurements_count_ops);
diff --git a/security/integrity/ima/ima_kexec.c b/security/integrity/ima/ima_kexec.c
index 5801649fbbef..70ee3a039df2 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -42,8 +42,8 @@ void ima_measure_kexec_event(const char *event_name)
 	long len;
 	int n;
 
-	buf_size = ima_get_binary_runtime_size();
-	len = atomic_long_read(&ima_num_entries);
+	buf_size = ima_get_binary_runtime_size(BINARY_FULL);
+	len = atomic_long_read(&ima_num_entries[BINARY_FULL]);
 
 	n = scnprintf(ima_kexec_event, IMA_KEXEC_EVENT_LEN,
 		      "kexec_segment_size=%lu;ima_binary_runtime_size=%lu;"
@@ -80,6 +80,17 @@ static int ima_alloc_kexec_file_buf(size_t segment_size)
 	return 0;
 }
 
+static int ima_dump_measurement(struct ima_kexec_hdr *khdr,
+				struct ima_queue_entry *qe)
+{
+	if (ima_kexec_file.count >= ima_kexec_file.size)
+		return -EINVAL;
+
+	khdr->count++;
+	ima_measurements_show(&ima_kexec_file, qe);
+	return 0;
+}
+
 static int ima_dump_measurement_list(unsigned long *buffer_size, void **buffer,
 				     unsigned long segment_size)
 {
@@ -95,17 +106,26 @@ static int ima_dump_measurement_list(unsigned long *buffer_size, void **buffer,
 
 	memset(&khdr, 0, sizeof(khdr));
 	khdr.version = 1;
-	/* This is an append-only list, no need to hold the RCU read lock */
-	list_for_each_entry_rcu(qe, &ima_measurements, later, true) {
-		if (ima_kexec_file.count < ima_kexec_file.size) {
-			khdr.count++;
-			ima_measurements_show(&ima_kexec_file, qe);
-		} else {
-			ret = -EINVAL;
+	/* It can race with ima_queue_stage() and ima_queue_delete_staged(). */
+	mutex_lock(&ima_extend_list_mutex);
+
+	list_for_each_entry_rcu(qe, &ima_measurements_staged, later,
+				lockdep_is_held(&ima_extend_list_mutex)) {
+		ret = ima_dump_measurement(&khdr, qe);
+		if (ret < 0)
 			break;
-		}
 	}
 
+	list_for_each_entry_rcu(qe, &ima_measurements, later,
+				lockdep_is_held(&ima_extend_list_mutex)) {
+		if (!ret)
+			ret = ima_dump_measurement(&khdr, qe);
+		if (ret < 0)
+			break;
+	}
+
+	mutex_unlock(&ima_extend_list_mutex);
+
 	/*
 	 * fill in reserved space with some buffer details
 	 * (eg. version, buffer size, number of measurements)
@@ -159,7 +179,8 @@ void ima_add_kexec_buffer(struct kimage *image)
 	else
 		extra_memory = CONFIG_IMA_KEXEC_EXTRA_MEMORY_KB * 1024;
 
-	binary_runtime_size = ima_get_binary_runtime_size() + extra_memory;
+	binary_runtime_size = ima_get_binary_runtime_size(BINARY_STAGED) +
+			      extra_memory;
 
 	if (binary_runtime_size >= ULONG_MAX - PAGE_SIZE)
 		kexec_segment_size = ULONG_MAX;
diff --git a/security/integrity/ima/ima_queue.c b/security/integrity/ima/ima_queue.c
index 2050b9d21e70..08cd60fa959e 100644
--- a/security/integrity/ima/ima_queue.c
+++ b/security/integrity/ima/ima_queue.c
@@ -22,29 +22,48 @@
 
 #define AUDIT_CAUSE_LEN_MAX 32
 
+bool ima_flush_htable;
+static int __init ima_flush_htable_setup(char *str)
+{
+	if (IS_ENABLED(CONFIG_IMA_DISABLE_HTABLE)) {
+		pr_warn("Hash table not enabled, ignoring request to flush\n");
+		return 1;
+	}
+
+	ima_flush_htable = true;
+	return 1;
+}
+__setup("ima_flush_htable", ima_flush_htable_setup);
+
 /* pre-allocated array of tpm_digest structures to extend a PCR */
 static struct tpm_digest *digests;
 
 LIST_HEAD(ima_measurements);	/* list of all measurements */
+LIST_HEAD(ima_measurements_staged); /* list of staged measurements */
 #ifdef CONFIG_IMA_KEXEC
-static unsigned long binary_runtime_size;
+static unsigned long binary_runtime_size[BINARY__LAST];
 #else
-static unsigned long binary_runtime_size = ULONG_MAX;
+static unsigned long binary_runtime_size[BINARY__LAST] = {
+	[0 ... BINARY__LAST - 1] = ULONG_MAX
+};
 #endif
 
 /* num of stored meas. in the list */
-atomic_long_t ima_num_entries = ATOMIC_LONG_INIT(0);
+atomic_long_t ima_num_entries[BINARY__LAST] = {
+	[0 ... BINARY__LAST - 1] = ATOMIC_LONG_INIT(0)
+};
+
 /* num of violations in the list */
 atomic_long_t ima_num_violations = ATOMIC_LONG_INIT(0);
 
 /* key: inode (before secure-hashing a file) */
 struct hlist_head __rcu *ima_htable;
 
-/* mutex protects atomicity of extending measurement list
+/* mutex protects atomicity of extending and staging measurement list
  * and extending the TPM PCR aggregate. Since tpm_extend can take
  * long (and the tpm driver uses a mutex), we can't use the spinlock.
  */
-static DEFINE_MUTEX(ima_extend_list_mutex);
+DEFINE_MUTEX(ima_extend_list_mutex);
 
 /*
  * Used internally by the kernel to suspend measurements.
@@ -140,7 +159,7 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
 {
 	struct ima_queue_entry *qe;
 	struct hlist_head *htable;
-	unsigned int key;
+	unsigned int key, i;
 
 	qe = kmalloc_obj(*qe);
 	if (qe == NULL) {
@@ -155,19 +174,25 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
 	htable = rcu_dereference_protected(ima_htable,
 				lockdep_is_held(&ima_extend_list_mutex));
 
-	atomic_long_inc(&ima_num_entries);
+	for (i = 0; i < BINARY__LAST; i++)
+		atomic_long_inc(&ima_num_entries[i]);
+
 	if (update_htable) {
 		key = ima_hash_key(entry->digests[ima_hash_algo_idx].digest);
 		hlist_add_head_rcu(&qe->hnext, &htable[key]);
 	}
 
-	if (binary_runtime_size != ULONG_MAX) {
-		int size;
+	for (i = 0; i < BINARY__LAST; i++) {
+		if (binary_runtime_size[i] != ULONG_MAX) {
+			int size;
 
-		size = get_binary_runtime_size(entry);
-		binary_runtime_size = (binary_runtime_size < ULONG_MAX - size) ?
-		     binary_runtime_size + size : ULONG_MAX;
+			size = get_binary_runtime_size(entry);
+			binary_runtime_size[i] =
+				(binary_runtime_size[i] < ULONG_MAX - size) ?
+				binary_runtime_size[i] + size : ULONG_MAX;
+		}
 	}
+
 	return 0;
 }
 
@@ -176,12 +201,18 @@ static int ima_add_digest_entry(struct ima_template_entry *entry,
  * entire binary_runtime_measurement list, including the ima_kexec_hdr
  * structure.
  */
-unsigned long ima_get_binary_runtime_size(void)
+unsigned long ima_get_binary_runtime_size(enum binary_lists binary_list)
 {
-	if (binary_runtime_size >= (ULONG_MAX - sizeof(struct ima_kexec_hdr)))
+	unsigned long val;
+
+	mutex_lock(&ima_extend_list_mutex);
+	val = binary_runtime_size[binary_list];
+	mutex_unlock(&ima_extend_list_mutex);
+
+	if (val >= (ULONG_MAX - sizeof(struct ima_kexec_hdr)))
 		return ULONG_MAX;
 	else
-		return binary_runtime_size + sizeof(struct ima_kexec_hdr);
+		return val + sizeof(struct ima_kexec_hdr);
 }
 
 static int ima_pcr_extend(struct tpm_digest *digests_arg, int pcr)
@@ -262,6 +293,150 @@ int ima_add_template_entry(struct ima_template_entry *entry, int violation,
 	return result;
 }
 
+int ima_queue_stage(void)
+{
+	int ret = 0;
+
+	mutex_lock(&ima_extend_list_mutex);
+	if (!list_empty(&ima_measurements_staged)) {
+		ret = -EEXIST;
+		goto out_unlock;
+	}
+
+	if (list_empty(&ima_measurements)) {
+		ret = -ENOENT;
+		goto out_unlock;
+	}
+
+	list_replace(&ima_measurements, &ima_measurements_staged);
+	INIT_LIST_HEAD(&ima_measurements);
+	atomic_long_set(&ima_num_entries[BINARY], 0);
+	if (IS_ENABLED(CONFIG_IMA_KEXEC))
+		binary_runtime_size[BINARY] = 0;
+out_unlock:
+	mutex_unlock(&ima_extend_list_mutex);
+	return ret;
+}
+
+int ima_queue_delete_staged(unsigned long req_value)
+{
+	unsigned long req_value_copy = req_value;
+	unsigned long size_to_remove = 0, num_to_remove = 0;
+	struct ima_queue_entry *qe, *qe_tmp;
+	struct list_head *cut_pos = NULL;
+	LIST_HEAD(ima_measurements_trim);
+	struct hlist_head *old_queue = NULL;
+	unsigned int i;
+
+	if (req_value == 0) {
+		pr_err("Must delete at least one entry\n");
+		return -EINVAL;
+	}
+
+	if (req_value < ULONG_MAX && ima_flush_htable) {
+		pr_err("Deleting staged N measurements not supported when flushing the hash table is requested\n");
+		return -EINVAL;
+	}
+
+	/*
+	 * Safe walk (no concurrent write), not under ima_extend_list_mutex
+	 * for performance reasons.
+	 */
+	list_for_each_entry(qe, &ima_measurements_staged, later) {
+		size_to_remove += get_binary_runtime_size(qe->entry);
+		num_to_remove++;
+
+		if (req_value < ULONG_MAX && --req_value_copy == 0) {
+			/* qe->later always points to a valid list entry. */
+			cut_pos = &qe->later;
+			break;
+		}
+	}
+
+	if (req_value < ULONG_MAX && req_value_copy > 0)
+		return -ENOENT;
+
+	mutex_lock(&ima_extend_list_mutex);
+	if (list_empty(&ima_measurements_staged)) {
+		mutex_unlock(&ima_extend_list_mutex);
+		return -ENOENT;
+	}
+
+	if (req_value < ULONG_MAX) {
+		/*
+		 * ima_dump_measurement_list() does not modify the list,
+		 * cut_pos remains the same even if it was computed before
+		 * the lock.
+		 */
+		__list_cut_position(&ima_measurements_trim,
+				    &ima_measurements_staged, cut_pos);
+	} else {
+		list_replace(&ima_measurements_staged, &ima_measurements_trim);
+		INIT_LIST_HEAD(&ima_measurements_staged);
+	}
+
+	atomic_long_sub(num_to_remove, &ima_num_entries[BINARY_STAGED]);
+	atomic_long_add(atomic_long_read(&ima_num_entries[BINARY_STAGED]),
+			&ima_num_entries[BINARY]);
+	atomic_long_set(&ima_num_entries[BINARY_STAGED],
+			atomic_long_read(&ima_num_entries[BINARY]));
+
+	if (IS_ENABLED(CONFIG_IMA_KEXEC)) {
+		binary_runtime_size[BINARY_STAGED] -= size_to_remove;
+		binary_runtime_size[BINARY] +=
+					binary_runtime_size[BINARY_STAGED];
+		binary_runtime_size[BINARY_STAGED] =
+					binary_runtime_size[BINARY];
+	}
+
+	if (ima_flush_htable) {
+		old_queue = ima_alloc_replace_htable();
+		if (IS_ERR(old_queue)) {
+			mutex_unlock(&ima_extend_list_mutex);
+			return PTR_ERR(old_queue);
+		}
+	}
+
+	/*
+	 * Splice (prepend) any remaining non-deleted staged entries to the
+	 * active list (RCU not needed, there cannot be concurrent readers).
+	 */
+	list_splice(&ima_measurements_staged, &ima_measurements);
+	INIT_LIST_HEAD(&ima_measurements_staged);
+	mutex_unlock(&ima_extend_list_mutex);
+
+	if (ima_flush_htable) {
+		synchronize_rcu();
+		kfree(old_queue);
+	}
+
+	list_for_each_entry_safe(qe, qe_tmp, &ima_measurements_trim, later) {
+		/*
+		 * Safe to free template_data here without synchronize_rcu()
+		 * because the only htable reader, ima_lookup_digest_entry(),
+		 * accesses only entry->digests, not template_data. If new
+		 * htable readers are added that access template_data, a
+		 * synchronize_rcu() is required here.
+		 */
+		for (i = 0; i < qe->entry->template_desc->num_fields; i++) {
+			kfree(qe->entry->template_data[i].data);
+			qe->entry->template_data[i].data = NULL;
+			qe->entry->template_data[i].len = 0;
+		}
+
+		list_del(&qe->later);
+
+		/* No leak if !ima_flush_htable, referenced by ima_htable. */
+		if (ima_flush_htable) {
+			kfree(qe->entry->digests);
+			kfree(qe->entry);
+			kfree(qe);
+		}
+	}
+
+	return 0;
+}
+
 int ima_restore_measurement_entry(struct ima_template_entry *entry)
 {
 	int result = 0;
-- 
2.43.0
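
For readers following the request grammar accepted by `ima_measurements_staged_write()` in the patch above, here is a minimal userspace sketch of the same classification: `A\n` stages the list, `D\n` deletes everything staged, and a decimal `N\n` deletes the first N staged entries. This is an illustration, not the kernel code. The names `parse_staged_req`, `struct staged_req` and the 32-byte buffer (a stand-in for `STAGED_REQ_LENGTH`, whose value is not shown in this excerpt) are assumptions, and `strtoul()` is only an approximation of `kstrtoul()`.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical names; they only mirror the three request kinds in the patch. */
enum staged_req_kind { REQ_STAGE, REQ_DELETE_ALL, REQ_DELETE_N };

struct staged_req {
	enum staged_req_kind kind;
	unsigned long count;	/* only meaningful for REQ_DELETE_N */
};

/*
 * Classify a request of len bytes. Returns 0 on success or a negative
 * errno value, following the checks made by the write handler.
 */
static int parse_staged_req(const char *buf, size_t len,
			    struct staged_req *out)
{
	char req[32];	/* assumed stand-in for STAGED_REQ_LENGTH */
	char *end;
	unsigned long val;

	/* At least one character plus the trailing newline. */
	if (len < 2 || len > sizeof(req))
		return -EINVAL;
	if (buf[len - 1] != '\n')
		return -EINVAL;

	memcpy(req, buf, len);
	req[len - 1] = '\0';

	if (req[0] == 'A' || req[0] == 'D') {
		/* Single-letter commands must be exactly "X\n". */
		if (len != 2)
			return -EINVAL;
		out->kind = (req[0] == 'A') ? REQ_STAGE : REQ_DELETE_ALL;
		out->count = 0;
		return 0;
	}

	/* Simplification: this sketch accepts plain decimal digits only. */
	if (req[0] < '0' || req[0] > '9')
		return -EINVAL;

	errno = 0;
	val = strtoul(req, &end, 10);
	if (*end != '\0')
		return -EINVAL;
	/* ULONG_MAX is reserved to mean "delete all" internally. */
	if (errno == ERANGE || val == ULONG_MAX)
		return -ERANGE;
	/* ima_queue_delete_staged() rejects a count of zero. */
	if (val == 0)
		return -EINVAL;

	out->kind = REQ_DELETE_N;
	out->count = val;
	return 0;
}
```

A caller would pass the exact bytes it intends to write to one of the `*_staged` files, for example `parse_staged_req("10\n", 3, &req)`, and check the result before issuing the write.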


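The open/release accounting done by `ima_measure_lock()` and `ima_measure_unlock()` in the patch above can be modelled in userspace as a single signed counter: positive values count concurrent readers, a negative value marks one exclusive writer, and zero means idle. The sketch below assumes a `long` counter and a pthread mutex in place of the kernel mutex; the names `measure_lock`, `measure_unlock` and `users` are hypothetical.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t users_mutex = PTHREAD_MUTEX_INITIALIZER;
static long users;	/* > 0: reader count, < 0: one writer, 0: idle */

/* Returns 0 if access is granted, -EBUSY if it conflicts. */
static int measure_lock(bool write)
{
	int ret = 0;

	pthread_mutex_lock(&users_mutex);
	if ((write && users != 0) || (!write && users < 0))
		ret = -EBUSY;	/* writer needs idle; reader needs no writer */
	else if (write)
		users--;	/* idle -> exclusive writer */
	else
		users++;	/* one more shared reader */
	pthread_mutex_unlock(&users_mutex);

	return ret;
}

/* Must be called exactly once per successful measure_lock(). */
static void measure_unlock(bool write)
{
	pthread_mutex_lock(&users_mutex);
	if (write)
		users++;
	else
		users--;
	pthread_mutex_unlock(&users_mutex);
}
```

The design never blocks: a conflicting opener gets `-EBUSY` immediately instead of waiting, which matches how the patch fails `open()` rather than sleeping.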
^ permalink raw reply related

* Re: [PATCH 0/3] Firmware LSM hook
From: Leon Romanovsky @ 2026-03-11 19:16 UTC (permalink / raw)
  To: Paul Moore
  Cc: James Morris, Serge E. Hallyn, Jason Gunthorpe, Saeed Mahameed,
	Itay Avraham, Dave Jiang, Jonathan Cameron, linux-security-module,
	linux-kernel, linux-rdma, Chiara Meiohas, Maher Sanalla,
	Edward Srouji
In-Reply-To: <CAHC9VhR0iuzYRpi3vPdKAbsOJ-DoMvWV-c7TXVcAmb3u8J4JwA@mail.gmail.com>

On Wed, Mar 11, 2026 at 12:06:09PM -0400, Paul Moore wrote:
> On Wed, Mar 11, 2026 at 4:20 AM Leon Romanovsky <leon@kernel.org> wrote:
> > On Tue, Mar 10, 2026 at 05:40:02PM -0400, Paul Moore wrote:
> > > On Tue, Mar 10, 2026 at 3:30 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > On Tue, Mar 10, 2026 at 02:24:40PM -0400, Paul Moore wrote:
> > > > > On Tue, Mar 10, 2026 at 5:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > On Mon, Mar 09, 2026 at 07:10:25PM -0400, Paul Moore wrote:
> > > > > > > On Mon, Mar 9, 2026 at 3:37 PM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > > > > On Mon, Mar 09, 2026 at 02:32:39PM -0400, Paul Moore wrote:
> > > > > > > > > On Mon, Mar 9, 2026 at 7:15 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > ...
> > > > >
> > > > > > > > > Hi Leon,
> > > > > > > > >
> > > > > > > > > At the link below, you'll find guidance on submitting new LSM hooks.
> > > > > > > > > Please take a look and let me know if you have any questions.
> > > > > > > > >
> > > > > > > > > https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsm-hooks
> > > > > > > >
> > > > > > > > I assume that you are referring to this part:
> > > > > > >
> > > > > > > I'm referring to all of the guidance, but yes, at the very least that
> > > > > > > is something that I think we need to see in a future revision of this
> > > > > > > patchset.
> > > > > > >
> > > > > > > >  * New LSM hooks must demonstrate their usefulness by providing a meaningful
> > > > > > > >    implementation for at least one in-kernel LSM. The goal is to demonstrate
> > > > > > > >    the purpose and expected semantics of the hooks. Out of tree kernel code,
> > > > > > > >    and pass through implementations, such as the BPF LSM, are not eligible
> > > > > > > >    for LSM hook reference implementations.
> > > > > > > >
> > > > > > > > The point is that we are not inspecting a kernel call, but the FW mailbox,
> > > > > > > > which has very little meaning to the kernel. From the kernel's perspective,
> > > > > > > > all relevant checks have already been performed, but the existing capability
> > > > > > > > granularity does not allow us to distinguish between FW_CMD1 and FW_CMD2.
> > > > > > >
> > > > > > > It might help if you could phrase this differently, as I'm not
> > > > > > > entirely clear on your argument.  LSMs are not limited to enforcing
> > > > > > > access controls on requests the kernel understands (see the SELinux
> > > > > > > userspace object manager concept), and the idea of access controls
> > > > > > > with greater granularity than capabilities is one of the main reasons
> > > > > > > people look to LSMs for access control (SELinux, AppArmor, Smack,
> > > > > > > etc.).
> > > > > >
> > > > > > I should note that my understanding of LSM is limited, so some parts of my
> > > > > > answers may be inaccurate.
> > > > > >
> > > > > > What I am referring to is a different level of granularity — specifically,
> > > > > > the internals of the firmware commands. In the proposed approach, BPF
> > > > > > programs would make decisions based on data passed through the mailbox.
> > > > > > That mailbox format varies across vendors, and may even differ between
> > > > > > firmware versions from the same vendor.
> > > > >
> > > > > That helps, thank you.
> > > > >
> > > > > > > > Here we propose a generic interface that can be applied to all FWCTL
> > > > > > > > devices without out-of-tree kernel code at all.
> > > > > > >
> > > > > > > I expected to see a patch implementing some meaningful support for
> > > > > > > access controls using these hooks in one of the existing LSMs, I did
> > > > > > > not see that in this patchset.
> > > > > >
> > > > > > In some cases, the mailbox is forwarded from user space unchanged, but
> > > > > > in others the kernel modifies it before submitting it to the FW.
> > > > >
> > > > > Without a standard format, opcode definitions, etc. I suspect
> > > > > integrating this into an LSM will present a number of challenges.
> > > >
> > > > The opcode is relatively easy to extract from the mailbox and pass to the LSM.
> > > > All drivers implement some variant of mlx5ctl_validate_rpc()/devx_is_general_cmd()
> > > > to validate the opcode. The problem is that this check alone is not sufficient.
> > > >
> > > > > Instead of performing an LSM access control check before submitting
> > > > > the firmware command, it might be easier from an LSM perspective to
> > > > > have the firmware call into the kernel/LSM for an access control
> > > > > decision before performing a security-relevant action.
> > > >
> > > > Ultimately, the LSM must make a decision for each executed firmware
> > > > command. This will need to be handled one way or another, and will
> > > > likely require parsing the mailbox again.
> > >
> > > As it's unlikely that parsing the mailbox is something that a LSM will
> > > want to handle,
> >
> > I believe this approach offers the cleanest and most natural way to support
> > all mailbox‑based devices.
> >
> > > my suggestion was to leverage the existing mailbox parsing in the firmware
> > > and require the firmware to call into the LSM when authorization is needed.
> > >
> > > > > This removes the challenge of parsing/interpreting the arbitrary firmware commands,
> > > > > but it does add some additional complexity of having to generically
> > > > > represent the security relevant actions the firmware might request
> > > >
> > > > The difference here is that the proposed LSM hook is intended to disable
> > > > certain functionality provided by the firmware, effectively depending on
> > > > the operator’s preferences.
> > >
> > > My suggestion would also allow a LSM hook to disable certain firmware
> > > functionality; however, the firmware itself would need to call the LSM
> > > to check if the functionality is authorized.
> >
> > This suggestion adds an extra call from the FW to the LSM for every command, even
> > for systems which don't have an LSM at all.
> 
> If latency is a concern, I imagine we could create an LSM hook to
> report whether any LSMs provided firmware access controls.  The
> firmware could then use that hook, potentially caching the result, to
> limit its calls into the LSM.
> 
> > The FW must pass the already parsed data
> > back to the LSM; otherwise, the LSM has no basis to decide whether to accept or
> > reject the request.
> >
> > For example, consider the MLX5_CMD_OP_QUERY_DCT command handled in
> > mlx5ctl_validate_rpc(). DCT in RDMA refers to Dynamically Connected
> > Transport, a Mellanox-specific extension that effectively introduces a new
> > QP‑type family on top of the standard RC/UC/UD transports. This type does not
> > exist for other vendors, each of whom provides its own vendor‑specific
> > extensions. All parameters here are tightly coupled to those specific
> > commands.
> >
> > It is unrealistic to expect different firmware implementations to supply
> > their data in a common format that would allow the LSM to make a generic
> > decision.
> 
> That's unfortunate as that would be the easiest path forward.
> Regardless, you are welcome to work on whatever implementation you
> think makes sense for any of the in-tree LSMs, with that in place we
> can take another look at the firmware command hooks.
> 
> Good luck.

I'll take advantage of the upcoming weekend and look into what can be done
here.

Thanks

> 
> -- 
> paul-moore.com

^ permalink raw reply

* Re: [PATCH v2 7/10] security: Hornet LSM
From: Paul Moore @ 2026-03-11 20:50 UTC (permalink / raw)
  To: Blaise Boscaccy, Blaise Boscaccy, Jonathan Corbet, James Morris,
	Serge E. Hallyn, Mickaël Salaün, Günther Noack,
	Dr. David Alan Gilbert, Andrew Morton, James.Bottomley, dhowells,
	Fan Wu, Ryan Foster, linux-security-module, linux-doc,
	linux-kernel, bpf
In-Reply-To: <20260227233930.2418522-8-bboscaccy@linux.microsoft.com>

On Feb 27, 2026 Blaise Boscaccy <bboscaccy@linux.microsoft.com> wrote:
> 
> This adds the Hornet Linux Security Module which provides enhanced
> signature verification and data validation for eBPF programs. This
> allows users to continue to maintain an invariant that all code
> running inside of the kernel has actually been signed and verified by
> the kernel.
> 
> This effort builds upon the currently accepted upstream solution. It
> further hardens it by providing deterministic, in-kernel checking of
> map hashes to solidify auditing along with preventing TOCTOU attacks
> against lskel map hashes.
> 
> Target map hashes are passed in via PKCS#7 signed attributes. Hornet
> determines the extent to which the eBPF program is signed and defers to
> other LSMs for policy decisions.
> 
> Signed-off-by: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> Nacked-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> ---
>  Documentation/admin-guide/LSM/Hornet.rst | 310 ++++++++++++++++++++++
>  Documentation/admin-guide/LSM/index.rst  |   1 +
>  MAINTAINERS                              |   9 +
>  include/linux/oid_registry.h             |   3 +
>  include/uapi/linux/lsm.h                 |   1 +
>  security/Kconfig                         |   3 +-
>  security/Makefile                        |   1 +
>  security/hornet/Kconfig                  |  11 +
>  security/hornet/Makefile                 |   7 +
>  security/hornet/hornet.asn1              |  13 +
>  security/hornet/hornet_lsm.c             | 323 +++++++++++++++++++++++
>  11 files changed, 681 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/admin-guide/LSM/Hornet.rst
>  create mode 100644 security/hornet/Kconfig
>  create mode 100644 security/hornet/Makefile
>  create mode 100644 security/hornet/hornet.asn1
>  create mode 100644 security/hornet/hornet_lsm.c
> 
> diff --git a/Documentation/admin-guide/LSM/Hornet.rst b/Documentation/admin-guide/LSM/Hornet.rst
> new file mode 100644
> index 000000000000..0dd4c03b8a7e
> --- /dev/null
> +++ b/Documentation/admin-guide/LSM/Hornet.rst
> @@ -0,0 +1,310 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +======
> +Hornet
> +======
> +
> +Hornet is a Linux Security Module that provides extensible signature
> +verification for eBPF programs. This is selectable at build-time with
> +``CONFIG_SECURITY_HORNET``.
> +
> +Overview
> +========
> +
> +Hornet addresses concerns from users who require strict audit trails and
> +verification guarantees for eBPF programs, especially in
> +security-sensitive environments. Many production systems need assurance
> +that only authorized, unmodified eBPF programs are loaded into the
> +kernel. Hornet provides this assurance through cryptographic signature
> +verification.
> +
> +When an eBPF program is loaded via the ``bpf()`` syscall, Hornet
> +verifies a PKCS#7 signature attached to the program instructions. The
> +signature is checked against the kernel's secondary keyring using the
> +existing kernel cryptographic infrastructure. In addition to signing the
> +program bytecode, Hornet supports signing SHA-256 hashes of associated
> +BPF maps, enabling integrity verification of map contents at load time
> +and at runtime.
> +
> +After verification, Hornet classifies the program into one of the
> +following integrity states and passes the result to a downstream LSM hook
> +(``bpf_prog_load_post_integrity``), allowing other security modules to
> +make policy decisions based on the verification outcome:
> +
> +``LSM_INT_VERDICT_OK``
> +  The program signature and all map hashes verified successfully.
> +
> +``LSM_INT_VERDICT_UNSIGNED``
> +  No signature was provided with the program.
> +
> +``LSM_INT_VERDICT_PARTIALSIG``
> +  The program signature verified, but the signing certificate is not
> +  trusted in the secondary keyring ...

Do you think there is value in separating this case out from _PARTIALSIG?
Maybe a LSM_INT_VERDICT_UNKNOWNKEY?

> +  ... or the signature did not contain
> +  hornet map hash data.
> +
> +``LSM_INT_VERDICT_BADSIG``
> +  The signature or a map hash failed verification.
> +
> +Hornet itself does not enforce a policy on whether unsigned or partially
> +signed programs should be rejected. It delegates that decision to
> +downstream LSMs via the ``bpf_prog_load_post_integrity`` hook, making it
> +a composable building block in a larger security architecture.
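
As a rough illustration of the delegation described above, a downstream policy could map each verdict to an allow/deny decision. The sketch below is not taken from the patch: the enum is a local stand-in whose numeric values are arbitrary, and `sketch_policy_check` and its `allow_unsigned` knob are hypothetical names for one possible policy.

```c
#include <errno.h>
#include <stdbool.h>

/* Local stand-in for the LSM_INT_VERDICT_* states; values are arbitrary. */
enum sketch_verdict {
	SKETCH_VERDICT_OK,
	SKETCH_VERDICT_UNSIGNED,
	SKETCH_VERDICT_PARTIALSIG,
	SKETCH_VERDICT_BADSIG,
};

/*
 * One possible policy: always accept fully verified programs, optionally
 * accept unsigned ones, and reject everything else.
 * Returns 0 to allow the load or -EPERM to deny it.
 */
static int sketch_policy_check(enum sketch_verdict verdict,
			       bool allow_unsigned)
{
	switch (verdict) {
	case SKETCH_VERDICT_OK:
		return 0;
	case SKETCH_VERDICT_UNSIGNED:
		return allow_unsigned ? 0 : -EPERM;
	case SKETCH_VERDICT_PARTIALSIG:
	case SKETCH_VERDICT_BADSIG:
	default:
		return -EPERM;
	}
}
```

Whether a partial signature should be denied, as it is here, is exactly the kind of decision the document leaves to the downstream LSM.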
> +
> +Use Cases
> +=========
> +
> +- **Locked-down production environments**: Ensure only eBPF programs
> +  signed by a trusted authority can be loaded, preventing unauthorized
> +  or tampered programs from running in the kernel.
> +
> +- **Audit and compliance**: Provide cryptographic evidence that loaded
> +  eBPF programs match their expected build artifacts, supporting
> +  compliance requirements in regulated industries.
> +
> +- **Supply chain integrity**: Verify that eBPF programs and their
> +  associated map data have not been modified since they were built and
> +  signed, protecting against supply chain attacks.
> +
> +Threat Model
> +============
> +
> +Hornet protects against the following threats:
> +
> +- **Unauthorized eBPF program loading**: Programs that have not been
> +  signed by a trusted key will be reported as unsigned or badly signed.
> +
> +- **Tampering with program instructions**: Any modification to the eBPF
> +  bytecode after signing will cause signature verification to fail.
> +
> +- **Tampering with map data**: When map hashes are included in the
> +  signature, Hornet verifies that frozen BPF maps match their expected
> +  SHA-256 hashes at load time. Maps are also re-verified before program
> +  execution via ``BPF_PROG_RUN``.
> +
> +Hornet does **not** protect against:
> +
> +- Compromise of the signing key itself.
> +- Attacks that occur after a program has been loaded and verified.
> +- Programs loaded by the kernel itself (kernel-internal loads bypass
> +  the ``BPF_PROG_RUN`` map check).
> +
> +Known Limitations
> +=================
> +
> +- Hornet requires programs to use :doc:`light skeletons
> +  </bpf/libbpf/libbpf_naming_convention>` (lskels) for the signing
> +  workflow, as the tooling operates on lskel-generated headers.
> +
> +- A maximum of 64 maps per program can be tracked for hash
> +  verification.
> +
> +- Map hash verification requires the maps to be frozen before loading.
> +  Maps that are not frozen at load time will cause verification to fail
> +  when their hashes are included in the signature.
> +
> +- Hornet relies on the kernel's secondary keyring
> +  (``VERIFY_USE_SECONDARY_KEYRING``) for certificate trust. Keys must
> +  be provisioned into this keyring before programs can be verified.

I would add a bullet point describing the SHA256 limitation.  If I
understand things correctly this restriction comes from the core BPF
code and not Hornet itself, so it would be nice to have this documented
as it isn't immediately clear when looking only at the Hornet code.

> +Configuration
> +=============
> +
> +Build Configuration
> +-------------------
> +
> +Enable Hornet by setting the following kernel configuration option::
> +
> +  CONFIG_SECURITY_HORNET=y
> +
> +This option is found under :menuselection:`Security options --> Hornet
> +support` and depends on ``CONFIG_SECURITY``.
> +
> +When enabled, Hornet is included in the default LSM initialization order
> +and will appear in ``/sys/kernel/security/lsm``.
> +
> +Architecture
> +============
> +
> +Signature Verification Flow
> +---------------------------
> +
> +The following describes what happens when a userspace program calls
> +``bpf(BPF_PROG_LOAD, ...)`` with a signature attached:
> +
> +1. The ``bpf_prog_load_integrity`` LSM hook is invoked.
> +
> +2. Hornet reads the signature from the userspace buffer specified by
> +   ``attr->signature`` (with length ``attr->signature_size``).
> +
> +3. The PKCS#7 signature is verified against the program instructions
> +   using ``verify_pkcs7_signature()`` with the kernel's secondary
> +   keyring.
> +
> +4. The PKCS#7 message is parsed and its trust chain is validated via
> +   ``validate_pkcs7_trust()``.
> +
> +5. Hornet extracts the authenticated attribute identified by
> +   ``OID_hornet_data`` (OID ``2.25.316487325684022475439036912669789383960``)
> +   from the PKCS#7 message. This attribute contains an ASN.1-encoded set
> +   of map index/hash pairs.
> +
> +6. For each map hash entry, Hornet retrieves the corresponding BPF map
> +   via its file descriptor, confirms it is frozen, computes its SHA-256
> +   hash, and compares it against the signed hash.
> +
> +7. The resulting integrity verdict is passed to the
> +   ``bpf_prog_load_post_integrity`` hook so that downstream LSMs can
> +   enforce policy.
> +
> +Runtime Map Verification
> +------------------------
> +
> +When ``bpf(BPF_PROG_RUN, ...)`` is called from userspace, Hornet
> +re-verifies the hashes of all maps associated with the program. This
> +ensures that map contents have not been modified between program load
> +and execution. If any map hash no longer matches, the ``BPF_PROG_RUN``
> +command is denied.
> +
> +Userspace Interface
> +-------------------
> +
> +Signatures are passed to the kernel through fields in ``union bpf_attr``
> +when using the ``BPF_PROG_LOAD`` command:
> +
> +``signature``
> +  A pointer to a userspace buffer containing the PKCS#7 signature.
> +
> +``signature_size``
> +  The size of the signature buffer in bytes.
> +
> +ASN.1 Schema
> +------------
> +
> +Map hashes are encoded as a signed attribute in the PKCS#7 message using
> +the following ASN.1 schema::
> +
> +  HornetData ::= SET OF Map
> +
> +  Map ::= SEQUENCE {
> +      index   INTEGER,
> +      sha     OCTET STRING
> +  }
> +
> +Each ``Map`` entry contains the index of the map in the program's
> +``fd_array`` and its expected SHA-256 hash. A zero-length ``sha`` field
> +indicates that the map at that index should be skipped during
> +verification.
> +
> +Tooling
> +=======
> +
> +Helper scripts and a signature generation tool are provided in
> +``scripts/hornet/`` to support the development of signed eBPF light
> +skeletons.
> +
> +gen_sig
> +-------
> +
> +``gen_sig`` is a C program (using OpenSSL) that creates a PKCS#7
> +signature over eBPF program instructions and optionally includes
> +SHA-256 hashes of BPF maps as signed attributes.
> +
> +Usage::
> +
> +  gen_sig --data <instructions.bin> \
> +          --cert <signer.crt> \
> +          --key <signer.key> \
> +          [--pass <passphrase>] \
> +          --out <signature.p7b> \
> +          [--add <mapfile.bin>:<index> ...]
> +
> +``--data``
> +  Path to the binary file containing eBPF program instructions to sign.
> +
> +``--cert``
> +  Path to the signing certificate (PEM or DER format).
> +
> +``--key``
> +  Path to the private key (PEM or DER format).
> +
> +``--pass``
> +  Optional passphrase for the private key.
> +
> +``--out``
> +  Path to write the output PKCS#7 signature.
> +
> +``--add``
> +  Attach a map hash as a signed attribute. The argument is a path to a
> +  binary map file followed by a colon and the map's index in the
> +  ``fd_array``. This option may be specified multiple times.
> +
> +extract-skel.sh
> +---------------
> +
> +Extracts a named field from an autogenerated eBPF lskel header file.
> +Used internally by other helper scripts.
> +
> +extract-insn.sh
> +---------------
> +
> +Extracts the eBPF program instructions (``opts_insn``) from an lskel
> +header into a binary file suitable for signing with ``gen_sig``.
> +
> +extract-map.sh
> +--------------
> +
> +Extracts the map data (``opts_data``) from an lskel header into a
> +binary file suitable for hashing with ``gen_sig``.
> +
> +write-sig.sh
> +------------
> +
> +Replaces the signature data in an lskel header with a new signature
> +from a binary file. This is used to embed a freshly generated signature
> +back into the header after signing.
> +
> +Signing Workflow
> +================
> +
> +A typical workflow for building and signing an eBPF light skeleton is:
> +
> +1. **Compile the eBPF program**::
> +
> +     clang -O2 -target bpf -c program.bpf.c -o program.bpf.o
> +
> +2. **Generate the light skeleton header** using ``bpftool``::
> +
> +     bpftool gen skeleton -S program.bpf.o > loader.h
> +
> +3. **Extract instructions and map data** from the generated header::
> +
> +     scripts/hornet/extract-insn.sh loader.h > insn.bin
> +     scripts/hornet/extract-map.sh loader.h > map.bin
> +
> +4. **Generate the signature** with ``gen_sig``::
> +
> +     scripts/hornet/gen_sig \
> +       --key signing_key.pem \
> +       --cert signing_key.x509 \
> +       --data insn.bin \
> +       --add map.bin:0 \
> +       --out sig.bin
> +
> +5. **Embed the signature** back into the header::
> +
> +     scripts/hornet/write-sig.sh loader.h sig.bin > signed_loader.h
> +
> +6. **Build the loader program** using the signed header::
> +
> +     cc -o loader loader.c -lbpf
> +
> +The resulting loader program will pass the embedded signature to the
> +kernel when loading the eBPF program, enabling Hornet to verify it.
> +
> +Testing
> +=======
> +
> +Self-tests are provided in ``tools/testing/selftests/hornet/``. The test
> +suite builds a minimal eBPF program (``trivial.bpf.c``), signs it using
> +the workflow described above, and verifies that the signed program loads
> +successfully.
> diff --git a/Documentation/admin-guide/LSM/index.rst b/Documentation/admin-guide/LSM/index.rst
> index b44ef68f6e4d..57f6e9fbe5fd 100644
> --- a/Documentation/admin-guide/LSM/index.rst
> +++ b/Documentation/admin-guide/LSM/index.rst
> @@ -49,3 +49,4 @@ subdirectories.
>     SafeSetID
>     ipe
>     landlock
> +   Hornet
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 55af015174a5..6e91234a9ba4 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -11682,6 +11682,15 @@ S:	Maintained
>  F:	Documentation/devicetree/bindings/iio/pressure/honeywell,mprls0025pa.yaml
>  F:	drivers/iio/pressure/mprls0025pa*
>  
> +HORNET SECURITY MODULE
> +M:	Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> +L:	linux-security-module@vger.kernel.org
> +S:	Supported
> +T:	git https://github.com/blaiseboscaccy/hornet.git
> +F:	Documentation/admin-guide/LSM/Hornet.rst
> +F:	scripts/hornet/
> +F:	security/hornet/
> +
>  HP BIOSCFG DRIVER
>  M:	Jorge Lopez <jorge.lopez2@hp.com>
>  L:	platform-driver-x86@vger.kernel.org
> diff --git a/include/linux/oid_registry.h b/include/linux/oid_registry.h
> index ebce402854de..bf852715aaea 100644
> --- a/include/linux/oid_registry.h
> +++ b/include/linux/oid_registry.h
> @@ -150,6 +150,9 @@ enum OID {
>  	OID_id_ml_dsa_65,			/* 2.16.840.1.101.3.4.3.18 */
>  	OID_id_ml_dsa_87,			/* 2.16.840.1.101.3.4.3.19 */
>  
> +	/* Hornet LSM */
> +	OID_hornet_data,	  /* 2.25.316487325684022475439036912669789383960 */
> +
>  	OID__NR
>  };
>  
> diff --git a/include/uapi/linux/lsm.h b/include/uapi/linux/lsm.h
> index 938593dfd5da..2ff9bcdd551e 100644
> --- a/include/uapi/linux/lsm.h
> +++ b/include/uapi/linux/lsm.h
> @@ -65,6 +65,7 @@ struct lsm_ctx {
>  #define LSM_ID_IMA		111
>  #define LSM_ID_EVM		112
>  #define LSM_ID_IPE		113
> +#define LSM_ID_HORNET		114
>  
>  /*
>   * LSM_ATTR_XXX definitions identify different LSM attributes
> diff --git a/security/Kconfig b/security/Kconfig
> index 6a4393fce9a1..283c4a103209 100644
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -230,6 +230,7 @@ source "security/safesetid/Kconfig"
>  source "security/lockdown/Kconfig"
>  source "security/landlock/Kconfig"
>  source "security/ipe/Kconfig"
> +source "security/hornet/Kconfig"
>  
>  source "security/integrity/Kconfig"
>  
> @@ -274,7 +275,7 @@ config LSM
>  	default "landlock,lockdown,yama,loadpin,safesetid,apparmor,selinux,smack,tomoyo,ipe,bpf" if DEFAULT_SECURITY_APPARMOR
>  	default "landlock,lockdown,yama,loadpin,safesetid,tomoyo,ipe,bpf" if DEFAULT_SECURITY_TOMOYO
>  	default "landlock,lockdown,yama,loadpin,safesetid,ipe,bpf" if DEFAULT_SECURITY_DAC
> -	default "landlock,lockdown,yama,loadpin,safesetid,selinux,smack,tomoyo,apparmor,ipe,bpf"
> +	default "landlock,lockdown,yama,loadpin,safesetid,selinux,smack,tomoyo,apparmor,ipe,hornet,bpf"
>  	help
>  	  A comma-separated list of LSMs, in initialization order.
>  	  Any LSMs left off this list, except for those with order
> diff --git a/security/Makefile b/security/Makefile
> index 4601230ba442..b68cb56e419b 100644
> --- a/security/Makefile
> +++ b/security/Makefile
> @@ -26,6 +26,7 @@ obj-$(CONFIG_CGROUPS)			+= device_cgroup.o
>  obj-$(CONFIG_BPF_LSM)			+= bpf/
>  obj-$(CONFIG_SECURITY_LANDLOCK)		+= landlock/
>  obj-$(CONFIG_SECURITY_IPE)		+= ipe/
> +obj-$(CONFIG_SECURITY_HORNET)		+= hornet/
>  
>  # Object integrity file lists
>  obj-$(CONFIG_INTEGRITY)			+= integrity/
> diff --git a/security/hornet/Kconfig b/security/hornet/Kconfig
> new file mode 100644
> index 000000000000..19406aa237ac
> --- /dev/null
> +++ b/security/hornet/Kconfig
> @@ -0,0 +1,11 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +config SECURITY_HORNET
> +	bool "Hornet support"
> +	depends on SECURITY
> +	default n
> +	help
> +	  This selects Hornet.
> +	  Further information can be found in
> +	  Documentation/admin-guide/LSM/Hornet.rst.
> +
> +	  If you are unsure how to answer this question, answer N.
> diff --git a/security/hornet/Makefile b/security/hornet/Makefile
> new file mode 100644
> index 000000000000..26b6f954f762
> --- /dev/null
> +++ b/security/hornet/Makefile
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: GPL-2.0-only
> +obj-$(CONFIG_SECURITY_HORNET) := hornet.o
> +
> +hornet-y := hornet.asn1.o \
> +	hornet_lsm.o \
> +
> +$(obj)/hornet.asn1.o: $(obj)/hornet.asn1.c $(obj)/hornet.asn1.h
> diff --git a/security/hornet/hornet.asn1 b/security/hornet/hornet.asn1
> new file mode 100644
> index 000000000000..c8d47b16b65d
> --- /dev/null
> +++ b/security/hornet/hornet.asn1
> @@ -0,0 +1,13 @@
> +-- SPDX-License-Identifier: BSD-3-Clause
> +--
> +-- Copyright (C) 2009 IETF Trust and the persons identified as authors
> +-- of the code
> +--
> +-- https://www.rfc-editor.org/rfc/rfc5652#section-3
> +
> +HornetData ::= SET OF Map
> +
> +Map ::= SEQUENCE {
> +	index			INTEGER ({ hornet_map_index }),
> +	sha			OCTET STRING ({ hornet_map_hash })
> +} ({ hornet_next_map })
> diff --git a/security/hornet/hornet_lsm.c b/security/hornet/hornet_lsm.c
> new file mode 100644
> index 000000000000..6c821d6441fb
> --- /dev/null
> +++ b/security/hornet/hornet_lsm.c
> @@ -0,0 +1,323 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Hornet Linux Security Module
> + *
> + * Author: Blaise Boscaccy <bboscaccy@linux.microsoft.com>
> + *
> + * Copyright (C) 2026 Microsoft Corporation
> + */
> +
> +#include <linux/lsm_hooks.h>
> +#include <uapi/linux/lsm.h>
> +#include <linux/bpf.h>
> +#include <linux/verification.h>
> +#include <crypto/public_key.h>
> +#include <linux/module_signature.h>
> +#include <crypto/pkcs7.h>
> +#include <linux/sort.h>
> +#include <linux/asn1_decoder.h>
> +#include <linux/oid_registry.h>
> +#include "hornet.asn1.h"
> +
> +#define MAX_USED_MAPS 64
> +
> +struct hornet_maps {
> +	bpfptr_t fd_array;
> +};
> +
> +struct hornet_parse_context {
> +	int indexes[MAX_USED_MAPS];
> +	bool skips[MAX_USED_MAPS];
> +	unsigned char hashes[SHA256_DIGEST_SIZE * MAX_USED_MAPS];
> +	int hash_count;
> +};

I might include a brief comment near the top of this file referencing
the hash algorithm limitation in the Hornet docs, otherwise someone is
surely going to advocate for hash agility improvements at some point.

> +struct hornet_prog_security_struct {
> +	bool checked[MAX_USED_MAPS];
> +	unsigned char hashes[SHA256_DIGEST_SIZE * MAX_USED_MAPS];
> +};
> +
> +struct hornet_map_security_struct {
> +	bool checked;
> +	int index;
> +};
> +
> +struct lsm_blob_sizes hornet_blob_sizes __ro_after_init = {
> +	.lbs_bpf_map = sizeof(struct hornet_map_security_struct),
> +	.lbs_bpf_prog = sizeof(struct hornet_prog_security_struct),
> +};
> +
> +static inline struct hornet_prog_security_struct *
> +hornet_bpf_prog_security(struct bpf_prog *prog)
> +{
> +	return prog->aux->security + hornet_blob_sizes.lbs_bpf_prog;
> +}
> +
> +static inline struct hornet_map_security_struct *
> +hornet_bpf_map_security(struct bpf_map *map)
> +{
> +	return map->security + hornet_blob_sizes.lbs_bpf_map;
> +}
> +
> +static int hornet_verify_hashes(struct hornet_maps *maps,
> +				struct hornet_parse_context *ctx,
> +				struct bpf_prog *prog)
> +{
> +	int map_fd;
> +	u32 i;
> +	struct bpf_map *map;
> +	int err = 0;
> +	unsigned char hash[SHA256_DIGEST_SIZE];
> +	struct hornet_prog_security_struct *security = hornet_bpf_prog_security(prog);
> +	struct hornet_map_security_struct *map_security;
> +
> +	for (i = 0; i < ctx->hash_count; i++) {
> +		if (ctx->skips[i]) {
> +			security->checked[i] = false;

I'm not going to argue against an explicit false assignment here, but
as an FYI, when the LSM framework allocates the various object blobs it
(re)sets the blob memory to zero via kzalloc().  Even if/when the LSM
framework moves to some other allocation scheme we will still need to keep
that reset-to-zero behavior.

The same applies to the BPF map blobs.

> +			continue;
> +		}
> +
> +		err = copy_from_bpfptr_offset(&map_fd, maps->fd_array,
> +					      ctx->indexes[i] * sizeof(map_fd),
> +					      sizeof(map_fd));
> +		if (err < 0)
> +			return LSM_INT_VERDICT_BADSIG;
> +
> +		CLASS(fd, f)(map_fd);
> +		if (fd_empty(f))
> +			return LSM_INT_VERDICT_BADSIG;
> +		if (unlikely(fd_file(f)->f_op != &bpf_map_fops))
> +			return LSM_INT_VERDICT_BADSIG;

I'm wondering if it is worth defining a generic LSM_INT_VERDICT_FAULT
verdict to indicate a system error when verifying the integrity rather
than a bad signature.  Yes, the enforcement action will likely be the
same, but it might help when debugging or chasing forensic data.
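
As a rough, untested illustration of the idea (a userspace sketch, not
kernel code): LSM_INT_VERDICT_FAULT is the hypothetical new value, the
enum type name and struct fake_map are stand-ins invented for this
example, and a NULL map pointer models the fd lookup failures above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define DIGEST_SIZE 32	/* SHA-256 digest length in bytes */

/* Verdict names follow the series; FAULT is the hypothetical addition. */
enum lsm_int_verdict {
	LSM_INT_VERDICT_OK,
	LSM_INT_VERDICT_UNSIGNED,
	LSM_INT_VERDICT_PARTIALSIG,
	LSM_INT_VERDICT_BADSIG,
	LSM_INT_VERDICT_FAULT,	/* system error while verifying */
};

/* Stand-in for the parts of struct bpf_map this check cares about. */
struct fake_map {
	bool frozen;
	unsigned char hash[DIGEST_SIZE];
};

/*
 * Separate "we could not perform the check" (FAULT) from "the check ran
 * and the data did not match what was signed" (BADSIG).
 */
enum lsm_int_verdict verify_one_map(const struct fake_map *map,
				    const unsigned char *expected)
{
	if (!map)		/* bad fd, failed copy, wrong file type */
		return LSM_INT_VERDICT_FAULT;
	if (!map->frozen)	/* map can still change: treat as bad */
		return LSM_INT_VERDICT_BADSIG;
	if (memcmp(map->hash, expected, DIGEST_SIZE) != 0)
		return LSM_INT_VERDICT_BADSIG;
	return LSM_INT_VERDICT_OK;
}
```

A downstream LSM could still deny on both verdicts; the distinction would
only show up in logs and audit records.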

> +		map = fd_file(f)->private_data;
> +		if (!map->frozen)
> +			return LSM_INT_VERDICT_BADSIG;
> +
> +		map->ops->map_get_hash(map, SHA256_DIGEST_SIZE, hash);
> +
> +		err = memcmp(hash, &ctx->hashes[i * SHA256_DIGEST_SIZE],
> +			      SHA256_DIGEST_SIZE);
> +		if (err)
> +			return LSM_INT_VERDICT_BADSIG;
> +
> +		security->checked[i] = true;
> +		memcpy(&security->hashes[i * SHA256_DIGEST_SIZE], hash, SHA256_DIGEST_SIZE);
> +		map_security = hornet_bpf_map_security(map);
> +		map_security->checked = true;
> +		map_security->index = i;
> +	}
> +	return LSM_INT_VERDICT_OK;
> +}
> +
> +int hornet_next_map(void *context, size_t hdrlen,
> +		     unsigned char tag,
> +		     const void *value, size_t vlen)
> +{
> +	struct hornet_parse_context *ctx = (struct hornet_parse_context *)context;
> +
> +	ctx->hash_count++;

Do we need a check here to ensure that ctx->hash_count doesn't exceed
MAX_USED_MAPS?  If not here, where do we ensure we don't blow past
MAX_USED_MAPS?

What does Hornet do if the number of hashed maps is greater than
MAX_USED_MAPS?  I'm guessing we would want it to return an error and
fail the load?
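
One possible shape for such a check, as an untested userspace sketch.  It
assumes the decoder invokes the index callback first for each Map entry
(as the field order in hornet.asn1 suggests); the -E2BIG return value and
the trimmed-down context struct are choices made for this example only,
and the vlen test is slightly stricter than the one in the patch.

```c
#include <errno.h>
#include <stddef.h>

#define MAX_USED_MAPS 64

/* Trimmed-down stand-in for the patch's struct hornet_parse_context. */
struct hornet_parse_context {
	int indexes[MAX_USED_MAPS];
	int hash_count;
};

/*
 * Called when the decoder reaches a Map's index field, i.e. before
 * anything is written to slot hash_count.  Rejecting a full table here
 * keeps the later per-slot writes in bounds.
 */
int hornet_map_index(void *context, size_t hdrlen, unsigned char tag,
		     const void *value, size_t vlen)
{
	struct hornet_parse_context *ctx = context;

	(void)hdrlen;
	(void)tag;

	if (vlen != 1)		/* exactly one byte of index expected */
		return -EINVAL;
	if (ctx->hash_count >= MAX_USED_MAPS)
		return -E2BIG;	/* hypothetical errno choice */

	ctx->indexes[ctx->hash_count] = *(const unsigned char *)value;
	return 0;
}

/* Called once a complete Map SEQUENCE has been decoded. */
int hornet_next_map(void *context, size_t hdrlen, unsigned char tag,
		    const void *value, size_t vlen)
{
	struct hornet_parse_context *ctx = context;

	(void)hdrlen;
	(void)tag;
	(void)value;
	(void)vlen;

	ctx->hash_count++;
	return 0;
}
```

With the check in the first per-entry callback, a signature carrying a
65th map entry makes the decode fail instead of writing past the arrays.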

> +	return 0;
> +}
> +
> +int hornet_map_index(void *context, size_t hdrlen,
> +		     unsigned char tag,
> +		     const void *value, size_t vlen)
> +{
> +	struct hornet_parse_context *ctx = (struct hornet_parse_context *)context;
> +
> +	if (vlen > 1)
> +		return -EINVAL;
> +
> +	ctx->indexes[ctx->hash_count] = *(u8 *)value;
> +	return 0;
> +}
> +
> +int hornet_map_hash(void *context, size_t hdrlen,
> +		    unsigned char tag,
> +		    const void *value, size_t vlen)
> +
> +{
> +	struct hornet_parse_context *ctx = (struct hornet_parse_context *)context;
> +
> +	if (vlen != SHA256_DIGEST_SIZE && vlen != 0)
> +		return -EINVAL;
> +
> +	if (vlen) {
> +		ctx->skips[ctx->hash_count] = false;
> +		memcpy(&ctx->hashes[ctx->hash_count * SHA256_DIGEST_SIZE], value, vlen);
> +	} else
> +		ctx->skips[ctx->hash_count] = true;
> +
> +	return 0;
> +}
> +
> +static int hornet_check_program(struct bpf_prog *prog, union bpf_attr *attr,
> +				struct bpf_token *token, bool is_kernel)
> +{
> +	struct hornet_maps maps = {0};
> +	bpfptr_t usig = make_bpfptr(attr->signature, is_kernel);
> +	struct pkcs7_message *msg;
> +	struct hornet_parse_context *ctx;
> +	void *sig;
> +	int err;
> +	const void *authattrs;
> +	size_t authattrs_len;
> +
> +	if (!attr->signature)
> +		return LSM_INT_VERDICT_UNSIGNED;
> +
> +	ctx = kzalloc(sizeof(struct hornet_parse_context), GFP_KERNEL);
> +	if (!ctx)
> +		return -ENOMEM;

I think I mentioned this previously, but let me repeat myself in case I
didn't ... we don't want to mix LSM_INT_VERDICT enums and errno values
in the return value.  Yes, you can probably get away with it in the
majority of cases, but I worry it is a problem waiting to happen.  I
count only four parameters right now, so adding a verdict enum pointer
shouldn't be too difficult.
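
To make the suggestion concrete, here is a small userspace sketch of the
calling convention, not the actual Hornet code.  check_program_sketch(),
the enum type name, and the sig_ok flag (standing in for the result of
the PKCS#7 verification) are all made up for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Verdict names follow the series; the enum type name is hypothetical. */
enum lsm_int_verdict {
	LSM_INT_VERDICT_OK,
	LSM_INT_VERDICT_UNSIGNED,
	LSM_INT_VERDICT_PARTIALSIG,
	LSM_INT_VERDICT_BADSIG,
};

/*
 * Return value: 0 on success or a negative errno for system failures.
 * *verdict:     written only when the function returns 0.
 */
int check_program_sketch(const void *usig, size_t usig_len, bool sig_ok,
			 enum lsm_int_verdict *verdict)
{
	void *sig;

	if (!usig) {
		*verdict = LSM_INT_VERDICT_UNSIGNED;
		return 0;
	}
	if (usig_len == 0)
		return -EINVAL;

	/* Mirrors the kzalloc()/copy step; failure is an errno, not a verdict. */
	sig = malloc(usig_len);
	if (!sig)
		return -ENOMEM;
	memcpy(sig, usig, usig_len);

	*verdict = sig_ok ? LSM_INT_VERDICT_OK : LSM_INT_VERDICT_BADSIG;
	free(sig);
	return 0;
}
```

The caller then tests the return value for errors and only afterwards
looks at the verdict, so the two value spaces can never be confused.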

> +	maps.fd_array = make_bpfptr(attr->fd_array, is_kernel);
> +	sig = kzalloc(attr->signature_size, GFP_KERNEL);
> +	if (!sig) {
> +		err = -ENOMEM;
> +		goto out;
> +	}
> +	err = copy_from_bpfptr(sig, usig, attr->signature_size);
> +	if (err != 0)
> +		goto cleanup_sig;
> +
> +	err = verify_pkcs7_signature(prog->insnsi, prog->len * sizeof(struct bpf_insn),
> +				     sig, attr->signature_size, VERIFY_USE_SECONDARY_KEYRING,
> +				     VERIFYING_BPF_SIGNATURE, NULL, NULL);
> +	if (err < 0) {
> +		err = LSM_INT_VERDICT_BADSIG;
> +		goto cleanup_sig;
> +	}
> +
> +	msg = pkcs7_parse_message(sig, attr->signature_size);
> +	if (IS_ERR(msg)) {
> +		err = LSM_INT_VERDICT_BADSIG;
> +		goto cleanup_sig;
> +	}
> +
> +	if (validate_pkcs7_trust(msg, VERIFY_USE_SECONDARY_KEYRING)) {
> +		err = LSM_INT_VERDICT_PARTIALSIG;
> +		goto cleanup_msg;
> +	}
> +	if (pkcs7_get_authattr(msg, OID_hornet_data,
> +			       &authattrs, &authattrs_len) == -ENODATA) {
> +		err = LSM_INT_VERDICT_PARTIALSIG;
> +		goto cleanup_msg;
> +	}
> +
> +	err = asn1_ber_decoder(&hornet_decoder, ctx, authattrs, authattrs_len);
> +	if (err < 0 || authattrs == NULL) {
> +		err = LSM_INT_VERDICT_PARTIALSIG;
> +		goto cleanup_msg;
> +	}
> +	err = hornet_verify_hashes(&maps, ctx, prog);
> +
> +cleanup_msg:
> +	pkcs7_free_message(msg);
> +cleanup_sig:
> +	kfree(sig);
> +out:
> +	kfree(ctx);
> +	return err;
> +}
> +
> +static const struct lsm_id hornet_lsmid = {
> +	.name = "hornet",
> +	.id = LSM_ID_HORNET,
> +};
> +
> +static int hornet_bpf_prog_load_integrity(struct bpf_prog *prog, union bpf_attr *attr,
> +					  struct bpf_token *token, bool is_kernel)
> +{
> +	int result = hornet_check_program(prog, attr, token, is_kernel);

Can you explain a bit why we check for the kernel flag in hornet_bpf(),
but not here?  It may be that a brief comment in hornet_bpf() explaining
the kernel flag exception would be helpful.

> +	if (result < 0)
> +		return result;
> +
> +	return security_bpf_prog_load_post_integrity(prog, attr, token, is_kernel,
> +						     &hornet_lsmid, result);
> +}
> +
> +static int hornet_verify_map(struct bpf_prog *prog, int index)
> +{
> +	unsigned char hash[SHA256_DIGEST_SIZE];
> +	int i;
> +	struct bpf_map *map;
> +	struct hornet_prog_security_struct *security = hornet_bpf_prog_security(prog);
> +	struct hornet_map_security_struct *map_security;
> +
> +	if (!security->checked[index])
> +		return 0;
> +
> +	for (i = 0; i < prog->aux->used_map_cnt; i++) {
> +		map = prog->aux->used_maps[i];
> +		map_security = hornet_bpf_map_security(map);
> +		if (map_security->index != index)
> +			continue;
> +
> +		if (!map->frozen)
> +			return -EINVAL;

Unless there is serious tampering going on we should never see an
unfrozen map here, yes?

We probably also want to use a return value other than -EINVAL as this
is an access/permission denial.  I would think -EACCES or -EPERM would be
more appropriate.

> +		map->ops->map_get_hash(map, SHA256_DIGEST_SIZE, hash);
> +		if (memcmp(hash, &security->hashes[index * SHA256_DIGEST_SIZE],
> +			   SHA256_DIGEST_SIZE) != 0)

Presumably this is just being extra careful?

> +			return -EINVAL;

See above, -EACCES or -EPERM is likely a better choice here.

> +		else
> +			return 0;
> +	}
> +	return -EINVAL;

See above.

> +}
> +
> +static int hornet_check_prog_maps(u32 ufd)
> +{
> +	CLASS(fd, f)(ufd);
> +	struct bpf_prog *prog;
> +	int i, result = 0;
> +
> +	if (fd_empty(f))
> +		return -EBADF;
> +	if (fd_file(f)->f_op != &bpf_prog_fops)
> +		return -EINVAL;
> +
> +	prog = fd_file(f)->private_data;
> +
> +	mutex_lock(&prog->aux->used_maps_mutex);
> +	if (!prog->aux->used_map_cnt)
> +		goto out;
> +
> +	for (i = 0; i < prog->aux->used_map_cnt; i++) {
> +		result = hornet_verify_map(prog, i);
> +		if (result)
> +			goto out;
> +	}
> +out:
> +	mutex_unlock(&prog->aux->used_maps_mutex);
> +	return result;
> +}
> +
> +static int hornet_bpf(int cmd, union bpf_attr *attr, unsigned int size, bool kernel)
> +{
> +	if (cmd != BPF_PROG_RUN)
> +		return 0;
> +	if (kernel)
> +		return 0;
> +
> +	return hornet_check_prog_maps(attr->test.prog_fd);
> +}
> +
> +static struct security_hook_list hornet_hooks[] __ro_after_init = {
> +	LSM_HOOK_INIT(bpf_prog_load_integrity, hornet_bpf_prog_load_integrity),
> +	LSM_HOOK_INIT(bpf, hornet_bpf),
> +};
> +
> +static int __init hornet_init(void)
> +{
> +	pr_info("Hornet: eBPF signature verification enabled\n");
> +	security_add_hooks(hornet_hooks, ARRAY_SIZE(hornet_hooks), &hornet_lsmid);
> +	return 0;
> +}
> +
> +DEFINE_LSM(hornet) = {
> +	.id = &hornet_lsmid,
> +	.blobs = &hornet_blob_sizes,
> +	.init = hornet_init,
> +};
> -- 
> 2.52.0

--
paul-moore.com


* Re: [PATCH v4 15/17] module: Introduce hash-based integrity checking
From: Eric Biggers @ 2026-03-11 21:14 UTC (permalink / raw)
  To: Thomas Weißschuh
  Cc: Nathan Chancellor, Arnd Bergmann, Luis Chamberlain, Petr Pavlu,
	Sami Tolvanen, Daniel Gomez, Paul Moore, James Morris,
	Serge E. Hallyn, Jonathan Corbet, Madhavan Srinivasan,
	Michael Ellerman, Nicholas Piggin, Naveen N Rao, Mimi Zohar,
	Roberto Sassu, Dmitry Kasatkin, Eric Snowberg, Nicolas Schier,
	Daniel Gomez, Aaron Tomlin, Christophe Leroy (CS GROUP),
	Nicolas Schier, Nicolas Bouchinet, Xiu Jianfeng,
	Fabian Grünbichler, Arnout Engelen, Mattia Rizzolo, kpcyrd,
	Christian Heusel, Câju Mihai-Drosi,
	Sebastian Andrzej Siewior, linux-kbuild, linux-kernel, linux-arch,
	linux-modules, linux-security-module, linux-doc, linuxppc-dev,
	linux-integrity
In-Reply-To: <5726fc65-7d24-4353-b341-81b785f2575c@t-8ch.de>

On Wed, Mar 11, 2026 at 02:19:02PM +0100, Thomas Weißschuh wrote:
> > > diff --git a/include/linux/module_signature.h b/include/linux/module_signature.h
> > > index a45ce3b24403..3b510651830d 100644
> > > --- a/include/linux/module_signature.h
> > > +++ b/include/linux/module_signature.h
> > > @@ -18,6 +18,7 @@ enum pkey_id_type {
> > >  	PKEY_ID_PGP,		/* OpenPGP generated key ID */
> > >  	PKEY_ID_X509,		/* X.509 arbitrary subjectKeyIdentifier */
> > >  	PKEY_ID_PKCS7,		/* Signature in PKCS#7 message */
> > > +	PKEY_ID_MERKLE,		/* Merkle proof for modules */
> > 
> > I recommend making the hash algorithm explicit:
> > 
> >         PKEY_ID_MERKLE_SHA256,	/* SHA-256 merkle proof for modules */
> > 
> > While I wouldn't encourage the addition of another hash algorithm
> > (specifying one good algorithm for now is absolutely the right choice),
> > if someone ever does need to add another one, we'd want them to be
> > guided to simply introduce a new value of this enum rather than hack it
> > in some other way.
> 
> The idea here was that this will only ever be used for modules built as
> part of the kernel build. So the actual implementation could change freely
> without affecting anything.
> 
> But I don't have hard feelings about it.

Ah, okay.  That's even better then: if someone adds another algorithm it
would simply be a kconfig option.

It seems 'struct module_signature' itself is intended to be a stable
ABI, though.  So I think there's an opportunity for confusion here.  It
might be worth leaving a note somewhere that the format of the
PKEY_ID_MERKLE portion of the struct does not need to be kept stable and
can freely change in each kernel build.

- Eric


* Re: [PATCH v2 0/5] rust: lsm: introduce safe Rust abstractions for the LSM framework
From: Paul Moore @ 2026-03-11 21:16 UTC (permalink / raw)
  To: Jamie Lindsey
  Cc: rust-for-linux, Alice Ryhl, linux-security-module, ojeda, jmorris,
	serge
In-Reply-To: <CAH5fLgiQm=2YYvmG54o-MEt2m8x5V5xZrtmsqEUtuB9OZ=FPOw@mail.gmail.com>

On Wed, Mar 11, 2026 at 2:49 AM Alice Ryhl <aliceryhl@google.com> wrote:
> On Wed, Mar 11, 2026 at 6:09 AM Jamie Lindsey <jamie@matrixforgelabs.com> wrote:
> >
> > v2: add missing Signed-off-by tags, fix short commit hash in patch 4.
> > No code changes from v1.
> >
> > This series introduces the first safe Rust abstractions for the Linux
> > Security Module (LSM) framework.  It allows a complete, policy-enforcing
> > LSM to be written entirely in Rust with no C boilerplate required from
> > the LSM author.
> >
> > --- Motivation ---
> >
> > The LSM framework is a natural target for Rust: hook registration is
> > unsafe by nature (raw function pointers, C ABI, __randomize_layout on
> > the hook list struct), and the trait system can enforce correct
> > implementation at compile time.
>
> Hi Jamie,
>
> What is the intended end-user of these abstractions?

Building on Alice's question, I wanted to mention that we don't
accept/merge example LSMs into the upstream Linux kernel.  I'm
supportive of using Rust to develop new LSMs, and I recognize that
developing a meaningful LSM in Rust will require significant
shim/plumbing work, but that shim work needs to be done in conjunction
with a real LSM.

In case it may be helpful, I wanted to point out some previous work on
developing a LSM in Rust:

https://lore.kernel.org/linux-security-module/20250416213206.26060-2-kernel@o1oo11oo.de

... and if you are serious about developing a proper LSM in Rust, here
is some guidance for developing and submitting new LSMs upstream:

https://github.com/LinuxSecurityModule/kernel/blob/main/README.md#new-lsms

-- 
paul-moore.com


* [PATCH RFC bpf-next 0/4] audit: Expose audit subsystem to BPF LSM programs via BPF kfuncs
From: Frederick Lawler @ 2026-03-11 21:31 UTC (permalink / raw)
  To: Paul Moore, James Morris, Serge E. Hallyn, Eric Paris,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Mickaël Salaün, Günther Noack
  Cc: linux-kernel, linux-security-module, audit, bpf, linux-kselftest,
	kernel-team, Frederick Lawler

The motivation behind the change is to give BPF LSM developers the
ability to report accesses via the audit subsystem much like how LSMs
operate today.

Series:

Patch 1: Introduces bpf_audit_*() kfuncs
Patch 2: Enables bpf_audit_*() kfuncs
Patch 3: Prepares audit helpers used for testing
Patch 4: Adds self tests

Documentation will be added when this becomes a versioned series.

Key features:

1. Audit logs include type=AUDIT_BPF_LSM_ACCESS, BPF program ID, and comm
that triggered the hook by default

We wanted audit log consumers to be able to track who and what created
the entry. prog-id=%d is already used for BPF LOAD/UNLOAD logs, thus
is reused here for this distinction. Though, it may be better to use
the tag instead to capture which _specific_ version of the program
made the log, since prog-id can be reused.

2. Leverages BPF KF_ACQUIRE/KF_RELEASE semantics to force use of
  bpf_audit_log_end().

One side effect of this decision: per the BPF documentation, these flags
allow the pointer to struct bpf_audit_context to be stored in a map and
then exchanged through bpf_kptr_xchg(). However, there is precedent in
net/netfilter/nf_conntrack_bpf.c, which neither exposes its struct as a
kptr nor supplies a dtor function. This series does the same, so the
verifier will not allow that use case. Ideally, we don't want the pointer
to be exchanged anyway, because it would become ambiguous which program
is reporting. I am sure there are other edge cases around leaving the
audit buffer in a strange state that I cannot think of at the moment.

3. All bpf_audit_log_*() functions are destructive

The audit subsystem allows for AUDIT_FAIL_PANIC to be set, which panics
when the subsystem detects missing events. Further, some call paths may
invoke a BUG_ON(). Therefore all the functions are marked destructive.

4. Functions are callable once per bpf_audit_context

The rationale for this was to prevent abuse. Logs with repeated fields
are not helpful, and may not be handled by user space audit coherently.

This is in the same vein as not providing an audit_format() wrapper.

Similarly, some functions such as bpf_audit_log_path() and
bpf_audit_log_file() report the same information, thus can be
interchangeable in use.
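
As an illustration only, the once-per-context rule could be modelled with
a bitmask kept in the context.  This is a userspace sketch and not the
code in this series: the struct, the sketch_*() helpers, the field
identifiers, and the -EEXIST return value are all hypothetical.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical field classes; path and file share one class. */
enum sketch_field {
	SKETCH_FIELD_PATH,
	SKETCH_FIELD_NET,
};

/* Stand-in for struct bpf_audit_context: tracks what was logged. */
struct sketch_audit_ctx {
	uint32_t logged;	/* one bit per sketch_field */
};

/* Claim a field class; fails if it was already logged in this context. */
static int sketch_claim(struct sketch_audit_ctx *ctx, enum sketch_field f)
{
	uint32_t bit = UINT32_C(1) << f;

	if (ctx->logged & bit)
		return -EEXIST;	/* hypothetical errno choice */
	ctx->logged |= bit;
	return 0;
}

/* Path and file report the same information, so they share a bit. */
int sketch_log_path(struct sketch_audit_ctx *ctx)
{
	return sketch_claim(ctx, SKETCH_FIELD_PATH);
}

int sketch_log_file(struct sketch_audit_ctx *ctx)
{
	return sketch_claim(ctx, SKETCH_FIELD_PATH);
}

int sketch_log_net(struct sketch_audit_ctx *ctx)
{
	return sketch_claim(ctx, SKETCH_FIELD_NET);
}
```

In this model, logging a path and then a file on the same context fails
on the second call, while an unrelated field class still succeeds.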

5. API wraps security/lsm_audit.c

lsm_audit.c functions are multiplexed and not handled very well by the
BPF verifier, so each wrapped function is isolated to a single purpose
for use within hooks.

Key considerations:

1. Audit field ordering

AFAIK, user space audit is particular about what fields are
present and their order. This patch series does not address ordering.

My assumption is that the first four fields (type, prog-id, pid, comm)
are well known, and user space can assume that the other
fields after those can appear in any order.

If that is not acceptable, I would propose that we leverage the struct
common_audit_data type order to be the order--much like how the type is
used for log_once() functionality.

I am open to other ideas.
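
For illustration only, here is a sketch of how a user space consumer
could look fields up by key so that it does not depend on field order.
The helper name audit_field() and the sample records are hypothetical;
the records are merely modeled on the patterns the selftests match, and
this is not how any existing audit tool is known to parse records.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Find "key=" at a field boundary in a space-separated record and copy
 * its value (including surrounding quotes, if any) into out.
 * Returns 0 on success, -1 if the key is missing or the value is too long.
 */
static int audit_field(const char *rec, const char *key, char *out,
		       size_t outlen)
{
	size_t klen = strlen(key);
	const char *p = rec;

	while ((p = strstr(p, key)) != NULL) {
		int at_boundary = (p == rec) || (p[-1] == ' ');

		if (at_boundary && p[klen] == '=') {
			const char *v = p + klen + 1;
			size_t n;

			if (*v == '"') {
				/* Quoted value: take everything up to the closing quote. */
				const char *end = strchr(v + 1, '"');

				n = end ? (size_t)(end - v + 1) : strlen(v);
			} else {
				/* Bare value: runs until the next space. */
				n = strcspn(v, " ");
			}
			if (n >= outlen)
				return -1;
			memcpy(out, v, n);
			out[n] = '\0';
			return 0;
		}
		p += klen;
	}
	return -1;
}

/* The same lookup succeeds regardless of where the field appears. */
static void audit_field_demo(void)
{
	char buf[64];

	assert(audit_field("pid=42 cause=\"file\" ino=4", "ino", buf,
			   sizeof(buf)) == 0);
	assert(strcmp(buf, "4") == 0);
	assert(audit_field("pid=42 ino=4 cause=\"file\"", "ino", buf,
			   sizeof(buf)) == 0);
	assert(strcmp(buf, "4") == 0);
}
```

A consumer written this way only needs the leading fields to be stable;
it does not handle hex-encoded untrusted strings, which a real parser
would also have to decode.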

Signed-off-by: Frederick Lawler <fred@cloudflare.com>
---
Frederick Lawler (4):
      audit: Implement bpf_audit_log_*() wrappers
      audit/security: Enable audit BPF kfuncs
      selftests/bpf: Add audit helpers for BPF tests
      selftests/bpf: Add lsm_audit_kfuncs tests

 include/linux/lsm_audit.h                          |   1 +
 include/uapi/linux/audit.h                         |   1 +
 security/Makefile                                  |   2 +
 security/lsm_audit_kfuncs.c                        | 306 +++++++++++
 tools/testing/selftests/bpf/Makefile               |   3 +-
 tools/testing/selftests/bpf/audit_helpers.c        | 281 ++++++++++
 tools/testing/selftests/bpf/audit_helpers.h        |  55 ++
 .../selftests/bpf/prog_tests/lsm_audit_kfuncs.c    | 598 +++++++++++++++++++++
 .../selftests/bpf/progs/test_lsm_audit_kfuncs.c    | 263 +++++++++
 9 files changed, 1509 insertions(+), 1 deletion(-)
---
base-commit: ca0f39a369c5f927c3d004e63a5a778b08a9df94
change-id: 20260105-bpf-auditd-send-message-4a883067aab8

Best regards,
-- 
Frederick Lawler <fred@cloudflare.com>



* [PATCH RFC bpf-next 1/4] audit: Implement bpf_audit_log_*() wrappers
From: Frederick Lawler @ 2026-03-11 21:31 UTC (permalink / raw)
  To: Paul Moore, James Morris, Serge E. Hallyn, Eric Paris,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Mickaël Salaün, Günther Noack
  Cc: linux-kernel, linux-security-module, audit, bpf, linux-kselftest,
	kernel-team, Frederick Lawler
In-Reply-To: <20260311-bpf-auditd-send-message-v1-0-10a62db5c92f@cloudflare.com>

The primary use case is to provide LSM designers a direct API to report
access allows/denials through the audit subsystem, similar to how LSMs
traditionally log their accesses.

Left out from this API are functions that are potentially abusable, such as
audit_log_format(), where users may fill any field=value pair. Instead, the
API mostly follows what is exposed through security/lsm_audit.c for
consistency with user space audit expectations. Further, each function
reports at most once per record to avoid repeated-call abuse.

Lastly, each audit record includes the loaded BPF program's ID to
track which program reported the log entry. This helps remove
ambiguity in the event multiple programs are registered to the same
security hook.

Exposed functions:

	bpf_audit_log_start()
	bpf_audit_log_end()
	bpf_audit_log_cause()
	bpf_audit_log_cap()
	bpf_audit_log_path()
	bpf_audit_log_file()
	bpf_audit_log_ioctl_op()
	bpf_audit_log_dentry()
	bpf_audit_log_inode()
	bpf_audit_log_task()
	bpf_audit_log_net_sock()
	bpf_audit_log_net_sockaddr()

Signed-off-by: Frederick Lawler <fred@cloudflare.com>
---
 include/linux/lsm_audit.h   |   1 +
 include/uapi/linux/audit.h  |   1 +
 security/lsm_audit_kfuncs.c | 306 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 308 insertions(+)

diff --git a/include/linux/lsm_audit.h b/include/linux/lsm_audit.h
index 382c56a97bba1d0e5efe082553338229d541e267..859f51590de417ac246309eb75a760b8632224be 100644
--- a/include/linux/lsm_audit.h
+++ b/include/linux/lsm_audit.h
@@ -78,6 +78,7 @@ struct common_audit_data {
 #define LSM_AUDIT_DATA_NOTIFICATION 16
 #define LSM_AUDIT_DATA_ANONINODE	17
 #define LSM_AUDIT_DATA_NLMSGTYPE	18
+#define LSM_AUDIT_DATA_CAUSE 19 /* unused */
 	union 	{
 		struct path path;
 		struct dentry *dentry;
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 14a1c1fe013acecb12ea6bf81690965421baa7ff..7a22e214fe3e421decfc4109d2e6a3cee996fe51 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -150,6 +150,7 @@
 #define AUDIT_LANDLOCK_DOMAIN	1424	/* Landlock domain status */
 #define AUDIT_MAC_TASK_CONTEXTS	1425	/* Multiple LSM task contexts */
 #define AUDIT_MAC_OBJ_CONTEXTS	1426	/* Multiple LSM objext contexts */
+#define AUDIT_BPF_LSM_ACCESS		1427	/* LSM BPF MAC events */
 
 #define AUDIT_FIRST_KERN_ANOM_MSG   1700
 #define AUDIT_LAST_KERN_ANOM_MSG    1799
diff --git a/security/lsm_audit_kfuncs.c b/security/lsm_audit_kfuncs.c
new file mode 100644
index 0000000000000000000000000000000000000000..0d4fb20be34a61db29aa2c48d2aefc39131e73bf
--- /dev/null
+++ b/security/lsm_audit_kfuncs.c
@@ -0,0 +1,306 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright (c) 2026 Cloudflare */
+
+#include <linux/audit.h>
+#include <linux/bpf_mem_alloc.h>
+#include <linux/gfp_types.h>
+#include <linux/in6.h>
+#include <linux/lsm_audit.h>
+#include <linux/socket.h>
+#include <linux/types.h>
+
+struct bpf_audit_context {
+	struct audit_buffer *ab;
+	u64 log_once_mask;
+};
+
+static struct bpf_mem_alloc bpf_audit_context_ma;
+
+static inline u64 log_once(struct bpf_audit_context *ac, u64 mask)
+{
+	u64 set = (ac->log_once_mask & mask);
+
+	ac->log_once_mask |= mask;
+	return set;
+}
+
+static inline int __audit_log_lsm_data(struct bpf_audit_context *ac,
+				       struct common_audit_data *ad)
+{
+	if (log_once(ac, BIT_ULL(ad->type)))
+		return -EINVAL;
+
+	audit_log_lsm_data(ac->ab, ad);
+	return 0;
+}
+
+__bpf_kfunc_start_defs();
+
+__bpf_kfunc
+struct bpf_audit_context *bpf_audit_log_start(struct bpf_prog_aux *aux)
+{
+	char comm[sizeof(current->comm)];
+	struct bpf_audit_context *ac;
+
+	ac = bpf_mem_cache_alloc(&bpf_audit_context_ma);
+	if (!ac)
+		return NULL;
+
+	memset(ac, 0, sizeof(*ac));
+	ac->ab = audit_log_start(audit_context(),
+				 (aux->might_sleep) ? GFP_KERNEL : GFP_ATOMIC,
+				 AUDIT_BPF_LSM_ACCESS);
+	if (!ac->ab) {
+		bpf_mem_cache_free(&bpf_audit_context_ma, ac);
+		return NULL;
+	}
+
+	audit_log_format(ac->ab, "prog-id=%d", aux->id);
+
+	/* Audit may not have a filter configured for syscalls. Include
+	 * potentially redundant pid & comm information
+	 */
+	audit_log_format(ac->ab, " pid=%d comm=", task_tgid_nr(current));
+	audit_log_untrustedstring(ac->ab, get_task_comm(comm, current));
+
+	return ac;
+}
+
+__bpf_kfunc void bpf_audit_log_end(struct bpf_audit_context *ac)
+{
+	audit_log_end(ac->ab);
+	bpf_mem_cache_free(&bpf_audit_context_ma, ac);
+}
+
+__bpf_kfunc int bpf_audit_log_cause(struct bpf_audit_context *ac,
+				    const char *cause__str)
+{
+	if (log_once(ac, BIT_ULL(LSM_AUDIT_DATA_CAUSE)))
+		return -EINVAL;
+
+	audit_log_format(ac->ab, " cause=");
+	audit_log_untrustedstring(ac->ab, cause__str);
+	return 0;
+}
+
+__bpf_kfunc int bpf_audit_log_cap(struct bpf_audit_context *ac, int cap)
+{
+	struct common_audit_data ad;
+
+	ad.type = LSM_AUDIT_DATA_CAP;
+	ad.u.cap = cap;
+	return __audit_log_lsm_data(ac, &ad);
+}
+
+__bpf_kfunc int bpf_audit_log_path(struct bpf_audit_context *ac,
+				   const struct path *path)
+{
+	struct common_audit_data ad;
+
+	/* DATA_PATH prints similar to DATA_FILE */
+	if (log_once(ac, BIT_ULL(LSM_AUDIT_DATA_FILE)))
+		return -EINVAL;
+
+	ad.type = LSM_AUDIT_DATA_PATH;
+	ad.u.path = *path;
+	return __audit_log_lsm_data(ac, &ad);
+}
+
+__bpf_kfunc int bpf_audit_log_file(struct bpf_audit_context *ac,
+				   struct file *file)
+{
+	struct common_audit_data ad;
+
+	/* DATA_PATH prints similar to DATA_FILE */
+	if (log_once(ac, BIT_ULL(LSM_AUDIT_DATA_PATH)))
+		return -EINVAL;
+
+	ad.type = LSM_AUDIT_DATA_FILE;
+	ad.u.file = file;
+	return __audit_log_lsm_data(ac, &ad);
+}
+
+__bpf_kfunc int bpf_audit_log_ioctl_op(struct bpf_audit_context *ac,
+				       struct file *file, u16 cmd)
+{
+	struct lsm_ioctlop_audit op = { .path = file->f_path, .cmd = cmd };
+	struct common_audit_data ad;
+
+	ad.type = LSM_AUDIT_DATA_IOCTL_OP;
+	ad.u.op = &op;
+	return __audit_log_lsm_data(ac, &ad);
+}
+
+__bpf_kfunc int bpf_audit_log_dentry(struct bpf_audit_context *ac,
+				     struct dentry *dentry)
+{
+	struct common_audit_data ad;
+
+	/* DATA_DENTRY prints similar to DATA_INODE */
+	if (log_once(ac, BIT_ULL(LSM_AUDIT_DATA_INODE)))
+		return -EINVAL;
+
+	ad.type = LSM_AUDIT_DATA_DENTRY;
+	ad.u.dentry = dentry;
+	return __audit_log_lsm_data(ac, &ad);
+}
+
+__bpf_kfunc int bpf_audit_log_inode(struct bpf_audit_context *ac,
+				    struct inode *inode)
+{
+	struct common_audit_data ad;
+
+	/* DATA_DENTRY prints similar to DATA_INODE */
+	if (log_once(ac, BIT_ULL(LSM_AUDIT_DATA_DENTRY)))
+		return -EINVAL;
+
+	ad.type = LSM_AUDIT_DATA_INODE;
+	ad.u.inode = inode;
+	return __audit_log_lsm_data(ac, &ad);
+}
+
+__bpf_kfunc int bpf_audit_log_task(struct bpf_audit_context *ac,
+				   struct task_struct *tsk)
+{
+	struct common_audit_data ad;
+
+	ad.type = LSM_AUDIT_DATA_TASK;
+	ad.u.tsk = tsk;
+	return __audit_log_lsm_data(ac, &ad);
+}
+
+__bpf_kfunc int bpf_audit_log_net_sock(struct bpf_audit_context *ac, int netif,
+				       const struct socket *sock)
+{
+	struct lsm_network_audit net = { .sk = sock->sk, .netif = netif };
+	struct common_audit_data ad;
+
+	ad.type = LSM_AUDIT_DATA_NET;
+	ad.u.net = &net;
+	return __audit_log_lsm_data(ac, &ad);
+}
+
+__bpf_kfunc int
+bpf_audit_log_net_sockaddr(struct bpf_audit_context *ac, int netif,
+			   const struct sockaddr *saddr__nullable,
+			   const struct sockaddr *daddr__nullable, int addrlen)
+{
+	struct lsm_network_audit net;
+	struct common_audit_data ad;
+
+	net.netif = netif;
+
+	if (!saddr__nullable && !daddr__nullable)
+		return -EINVAL;
+
+	if (saddr__nullable && daddr__nullable &&
+	    saddr__nullable->sa_family != daddr__nullable->sa_family)
+		return -EINVAL;
+
+	if (saddr__nullable)
+		net.family = saddr__nullable->sa_family;
+	else
+		net.family = daddr__nullable->sa_family;
+
+	switch (net.family) {
+#if IS_ENABLED(CONFIG_IPV6)
+	case AF_INET6:
+		if (addrlen < SIN6_LEN_RFC2133)
+			return -EINVAL;
+
+		if (saddr__nullable) {
+			struct sockaddr_in6 *saddr =
+				(struct sockaddr_in6 *)saddr__nullable;
+			net.fam.v6.saddr = saddr->sin6_addr;
+			net.sport = saddr->sin6_port;
+		}
+
+		if (daddr__nullable) {
+			struct sockaddr_in6 *daddr =
+				(struct sockaddr_in6 *)daddr__nullable;
+			net.fam.v6.daddr = daddr->sin6_addr;
+			net.dport = daddr->sin6_port;
+		}
+		break;
+#endif
+	case AF_INET:
+		if (addrlen < sizeof(struct sockaddr_in))
+			return -EINVAL;
+
+		if (saddr__nullable) {
+			struct sockaddr_in *saddr =
+				(struct sockaddr_in *)saddr__nullable;
+			net.fam.v4.saddr = saddr->sin_addr.s_addr;
+			net.sport = saddr->sin_port;
+		}
+
+		if (daddr__nullable) {
+			struct sockaddr_in *daddr =
+				(struct sockaddr_in *)daddr__nullable;
+			net.fam.v4.daddr = daddr->sin_addr.s_addr;
+			net.dport = daddr->sin_port;
+		}
+		break;
+	default:
+		return -EAFNOSUPPORT;
+	}
+
+	ad.type = LSM_AUDIT_DATA_NET;
+	ad.u.net = &net;
+	return __audit_log_lsm_data(ac, &ad);
+}
+
+__bpf_kfunc_end_defs();
+
+BTF_KFUNCS_START(lsm_audit_set_ids)
+
+BTF_ID_FLAGS(func, bpf_audit_log_start,
+	     KF_ACQUIRE | KF_DESTRUCTIVE | KF_IMPLICIT_ARGS | KF_RET_NULL);
+
+BTF_ID_FLAGS(func, bpf_audit_log_end, KF_DESTRUCTIVE | KF_RELEASE);
+
+/* The following have a recursion opportunity if a LSM is attached to any of
+ * the following functions, and a bpf_audit_log_*() is called.
+ *  security_current_getlsmprop_subj,
+ *  security_lsmprop_to_secctx, or
+ *  security_release_secctx
+ */
+BTF_ID_FLAGS(func, bpf_audit_log_cause, KF_DESTRUCTIVE);
+BTF_ID_FLAGS(func, bpf_audit_log_cap, KF_DESTRUCTIVE);
+BTF_ID_FLAGS(func, bpf_audit_log_path, KF_DESTRUCTIVE);
+BTF_ID_FLAGS(func, bpf_audit_log_file, KF_DESTRUCTIVE);
+BTF_ID_FLAGS(func, bpf_audit_log_ioctl_op, KF_DESTRUCTIVE);
+BTF_ID_FLAGS(func, bpf_audit_log_dentry, KF_DESTRUCTIVE);
+BTF_ID_FLAGS(func, bpf_audit_log_inode, KF_DESTRUCTIVE);
+BTF_ID_FLAGS(func, bpf_audit_log_task, KF_DESTRUCTIVE);
+BTF_ID_FLAGS(func, bpf_audit_log_net_sock, KF_DESTRUCTIVE);
+BTF_ID_FLAGS(func, bpf_audit_log_net_sockaddr, KF_DESTRUCTIVE);
+
+BTF_KFUNCS_END(lsm_audit_set_ids)
+
+static int bpf_lsm_audit_kfuncs_filter(const struct bpf_prog *prog,
+				       u32 kfunc_id)
+{
+	if (!btf_id_set8_contains(&lsm_audit_set_ids, kfunc_id))
+		return 0;
+
+	return prog->type != BPF_PROG_TYPE_LSM ? -EACCES : 0;
+}
+
+static const struct btf_kfunc_id_set bpf_lsm_audit_set = {
+	.owner = THIS_MODULE,
+	.set = &lsm_audit_set_ids,
+	.filter = bpf_lsm_audit_kfuncs_filter,
+};
+
+static int lsm_audit_init_bpf(void)
+{
+	int ret;
+
+	ret = bpf_mem_alloc_init(&bpf_audit_context_ma,
+				 sizeof(struct bpf_audit_context), false);
+	return ret ?: register_btf_kfunc_id_set(BPF_PROG_TYPE_LSM,
+						 &bpf_lsm_audit_set);
+}
+
+late_initcall(lsm_audit_init_bpf)

-- 
2.43.0



* [PATCH RFC bpf-next 2/4] audit/security: Enable audit BPF kfuncs
From: Frederick Lawler @ 2026-03-11 21:31 UTC (permalink / raw)
  To: Paul Moore, James Morris, Serge E. Hallyn, Eric Paris,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Mickaël Salaün, Günther Noack
  Cc: linux-kernel, linux-security-module, audit, bpf, linux-kselftest,
	kernel-team, Frederick Lawler
In-Reply-To: <20260311-bpf-auditd-send-message-v1-0-10a62db5c92f@cloudflare.com>

Enable audit BPF kfuncs.

Signed-off-by: Frederick Lawler <fred@cloudflare.com>
---
 security/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/security/Makefile b/security/Makefile
index 4601230ba442a1bcedc3f999b74a7796ac72894d..de980b2797c1f8f8d0eaeb1be949c41e6ecb8fc1 100644
--- a/security/Makefile
+++ b/security/Makefile
@@ -16,6 +16,8 @@ obj-$(CONFIG_SECURITYFS)		+= inode.o
 obj-$(CONFIG_SECURITY_SELINUX)		+= selinux/
 obj-$(CONFIG_SECURITY_SMACK)		+= smack/
 obj-$(CONFIG_HAS_SECURITY_AUDIT)	+= lsm_audit.o
+lsm_audit-y += lsm_audit.o
+lsm_audit-$(CONFIG_BPF_LSM)	+= lsm_audit_kfuncs.o
 obj-$(CONFIG_SECURITY_TOMOYO)		+= tomoyo/
 obj-$(CONFIG_SECURITY_APPARMOR)		+= apparmor/
 obj-$(CONFIG_SECURITY_YAMA)		+= yama/

-- 
2.43.0



* [PATCH RFC bpf-next 3/4] selftests/bpf: Add audit helpers for BPF tests
From: Frederick Lawler @ 2026-03-11 21:31 UTC (permalink / raw)
  To: Paul Moore, James Morris, Serge E. Hallyn, Eric Paris,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Mickaël Salaün, Günther Noack
  Cc: linux-kernel, linux-security-module, audit, bpf, linux-kselftest,
	kernel-team, Frederick Lawler
In-Reply-To: <20260311-bpf-auditd-send-message-v1-0-10a62db5c92f@cloudflare.com>

Add audit helper utilities for reading and parsing audit messages
in BPF selftests.

Assisted-by: Claude:claude-4.5-opus
Signed-off-by: Frederick Lawler <fred@cloudflare.com>
---
 tools/testing/selftests/bpf/Makefile        |   3 +-
 tools/testing/selftests/bpf/audit_helpers.c | 281 ++++++++++++++++++++++++++++
 tools/testing/selftests/bpf/audit_helpers.h |  55 ++++++
 3 files changed, 338 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 869b582b1d1ff496fb07736597708487be3438ed..76a428539add5e03fe3811b41c55005c22f5cead 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -754,7 +754,8 @@ TRUNNER_EXTRA_SOURCES := test_progs.c		\
 			 flow_dissector_load.h	\
 			 ip_check_defrag_frags.h	\
 			 bpftool_helpers.c	\
-			 usdt_1.c usdt_2.c
+			 usdt_1.c usdt_2.c	\
+			 audit_helpers.c
 TRUNNER_LIB_SOURCES := find_bit.c
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read				\
 		       $(OUTPUT)/liburandom_read.so			\
diff --git a/tools/testing/selftests/bpf/audit_helpers.c b/tools/testing/selftests/bpf/audit_helpers.c
new file mode 100644
index 0000000000000000000000000000000000000000..a105136a581f92a1af73b9456b1e85dc88176678
--- /dev/null
+++ b/tools/testing/selftests/bpf/audit_helpers.c
@@ -0,0 +1,281 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * BPF audit helpers
+ *
+ * Borrowed code from tools/testing/selftests/landlock/audit.h
+ *
+ * Copyright (C) 2024-2025 Microsoft Corporation
+ * Copyright (c) 2026 Cloudflare
+ */
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <linux/audit.h>
+#include <linux/netlink.h>
+#include <netinet/in.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/un.h>
+
+#include "audit_helpers.h"
+
+static __u32 seq;
+
+int audit_init(void)
+{
+	int bufsize = 1024 * 1024; /* 1MB receive buffer */
+	struct audit_message msg;
+	int fd, err;
+
+	fd = socket(PF_NETLINK, SOCK_RAW, NETLINK_AUDIT);
+	if (fd < 0)
+		return -errno;
+
+	/*
+	 * Increase receive buffer to reduce kernel-side queueing.
+	 * When the socket buffer fills up, audit records get queued in
+	 * the kernel's hold/retry queues and delivered on subsequent runs.
+	 */
+	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bufsize, sizeof(bufsize));
+
+	seq = 0;
+	err = audit_send(fd, AUDIT_SET, AUDIT_STATUS_ENABLED, 1);
+	if (err)
+		goto out_close;
+
+	do {
+		err = audit_recv(fd, &msg, 0);
+		if (err < 0)
+			goto out_close;
+	} while (msg.nlh.nlmsg_type != NLMSG_ERROR);
+
+	if (msg.err.error)
+		goto out_close;
+
+	err = audit_send(fd, AUDIT_SET, AUDIT_STATUS_PID, getpid());
+	if (err)
+		goto out_close;
+
+	do {
+		err = audit_recv(fd, &msg, 0);
+		if (err < 0)
+			goto out_close;
+	} while (msg.nlh.nlmsg_type != NLMSG_ERROR);
+
+	if (msg.err.error)
+		goto out_close;
+
+	return fd;
+
+out_close:
+	close(fd);
+	return err;
+}
+
+void audit_cleanup(int fd)
+{
+	if (fd > 0)
+		close(fd);
+}
+
+int audit_send(int fd, __u16 type, __u32 key, __u32 val)
+{
+	struct audit_message msg = {
+		.nlh = {
+			.nlmsg_len = NLMSG_SPACE(sizeof(msg.status)),
+			.nlmsg_type = type,
+			.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK,
+			.nlmsg_seq = ++seq,
+		},
+		.status = {
+			.mask = key,
+			.enabled = key == AUDIT_STATUS_ENABLED ? val : 0,
+			.pid = key == AUDIT_STATUS_PID ? val : 0,
+		},
+	};
+	struct sockaddr_nl addr = { .nl_family = AF_NETLINK };
+	int ret;
+
+	do {
+		ret = sendto(fd, &msg, msg.nlh.nlmsg_len, 0,
+			     (struct sockaddr *)&addr, sizeof(addr));
+	} while (ret < 0 && errno == EINTR);
+
+	return ret == msg.nlh.nlmsg_len ? 0 : -errno;
+}
+
+/*
+ * Receive an audit message from the netlink socket.
+ * Returns:
+ *   > 0: message type on success
+ *   0: ACK received (NLMSG_ERROR with error=0)
+ *   < 0: negative errno on error
+ */
+int audit_recv(int fd, struct audit_message *msg, int flags)
+{
+	struct sockaddr_nl addr;
+	socklen_t addrlen = sizeof(addr);
+	int ret;
+
+	do {
+		ret = recvfrom(fd, msg, sizeof(*msg), flags,
+			       (struct sockaddr *)&addr, &addrlen);
+	} while (ret < 0 && errno == EINTR);
+
+	if (ret < 0)
+		return -errno;
+
+	/* Must be from kernel (pid 0) */
+	if (addrlen != sizeof(addr) || addr.nl_pid != 0)
+		return -EINVAL;
+
+	/*
+	 * NLMSG_ERROR with error=0 is an ACK. The kernel sends this in
+	 * response to messages with NLM_F_ACK flag set.
+	 */
+	if (msg->nlh.nlmsg_type == NLMSG_ERROR) {
+		if (msg->err.error == 0)
+			return 0; /* ACK */
+		return msg->err.error;
+	}
+
+	return msg->nlh.nlmsg_type;
+}
+
+__printf(2, 3) static inline void
+debug(struct audit_observer *obs, const char *fmt, ...)
+{
+	va_list args;
+
+	if (!obs || !obs->log)
+		return;
+
+	va_start(args, fmt);
+	vfprintf(obs->log, fmt, args);
+	va_end(args);
+}
+
+void audit_observer_init(struct audit_observer *obs, int audit_fd, FILE *log,
+			 int wait_timeout_ms)
+{
+	obs->audit_fd = audit_fd;
+	obs->wait_timeout = wait_timeout_ms;
+
+	if (log)
+		obs->log = log;
+
+	audit_observer_reset(obs);
+}
+
+void audit_observer_reset(struct audit_observer *obs)
+{
+	memset(obs->expects, 0, sizeof(obs->expects));
+	obs->num_expects = 0;
+}
+
+int audit_observer_expect(struct audit_observer *obs, int audit_type,
+			  const char *pattern, int count)
+{
+	struct audit_expectation *exp;
+
+	if (obs->num_expects >= AUDIT_EXPECT_MAX)
+		return -EINVAL;
+
+	exp = &obs->expects[obs->num_expects++];
+	exp->type = audit_type;
+	exp->pattern = pattern;
+	exp->expected_count = count;
+	exp->matched_count = 0;
+	return 0;
+}
+
+/*
+ * Check if a message matches any pending expectation.
+ * Returns 1 if all expectations are satisfied, 0 otherwise.
+ */
+static int audit_observer_match(struct audit_observer *obs,
+				struct audit_message *msg)
+{
+	int all_satisfied = 1;
+
+	for (int i = 0; i < obs->num_expects; i++) {
+		struct audit_expectation *exp = &obs->expects[i];
+
+		if (exp->matched_count >= exp->expected_count)
+			continue;
+
+		/* Check if this message matches */
+		if (exp->type && msg->nlh.nlmsg_type != exp->type)
+			goto check_satisfied;
+
+		if (strstr(msg->data, exp->pattern)) {
+			exp->matched_count++;
+			debug(obs, "%s: matched [%d/%d] %s\n", __func__,
+			      exp->matched_count, exp->expected_count,
+			      exp->pattern);
+		}
+
+check_satisfied:
+		if (exp->matched_count < exp->expected_count)
+			all_satisfied = 0;
+	}
+
+	return all_satisfied;
+}
+
+/*
+ * Wait for all expected audit messages to arrive.
+ * Returns 0 on success (all expectations met), -ETIMEDOUT on timeout.
+ */
+int audit_observer_wait(struct audit_observer *obs)
+{
+	struct pollfd pfd = { .fd = obs->audit_fd, .events = POLLIN };
+	struct audit_message msg;
+	int ret;
+
+	while (1) {
+		ret = poll(&pfd, 1, obs->wait_timeout);
+		if (ret < 0)
+			return -errno;
+		if (ret == 0)
+			return -ETIMEDOUT;
+
+		memset(&msg, 0, sizeof(msg));
+		ret = audit_recv(obs->audit_fd, &msg, MSG_DONTWAIT);
+
+		if (ret == -EAGAIN || ret == -EWOULDBLOCK)
+			continue;
+
+		if (ret <= 0)
+			continue;
+
+		debug(obs, "%s: recv type=%d %s\n", __func__,
+		      msg.nlh.nlmsg_type, msg.data);
+
+		if (audit_observer_match(obs, &msg))
+			return 0;
+	}
+}
+
+int audit_observer_check_satisfied(struct audit_observer *obs)
+{
+	for (int i = 0; i < obs->num_expects; i++) {
+		struct audit_expectation *exp = &obs->expects[i];
+
+		if (exp->matched_count < exp->expected_count) {
+			debug(obs, "%s: FAILED pattern '%s' got %d/%d\n",
+			      __func__, exp->pattern, exp->matched_count,
+			      exp->expected_count);
+			return 0;
+		}
+	}
+
+	return 1;
+}
diff --git a/tools/testing/selftests/bpf/audit_helpers.h b/tools/testing/selftests/bpf/audit_helpers.h
new file mode 100644
index 0000000000000000000000000000000000000000..40f3d20635bb25c305067756897593f34d54531e
--- /dev/null
+++ b/tools/testing/selftests/bpf/audit_helpers.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (c) 2026 Cloudflare */
+#pragma once
+
+#include <linux/audit.h>
+#include <linux/netlink.h>
+#include <stdio.h>
+
+#define MAX_AUDIT_MESSAGE_LENGTH 8970
+
+struct audit_message {
+	struct nlmsghdr nlh;
+	union {
+		struct audit_status status;
+		struct nlmsgerr err;
+		char data[MAX_AUDIT_MESSAGE_LENGTH];
+	};
+};
+
+/*
+ * Observer-based audit message matching.
+ * Tests register expected patterns before triggering events, then
+ * wait for matches. Messages that don't match any pattern are skipped.
+ */
+#define AUDIT_EXPECT_MAX 32
+
+struct audit_expectation {
+	__u16 type;
+	const char *pattern;
+	int expected_count;
+	int matched_count;
+};
+
+struct audit_observer {
+	struct audit_expectation expects[AUDIT_EXPECT_MAX];
+	int num_expects;
+	FILE *log;
+	int wait_timeout;
+	int audit_fd;
+};
+
+int audit_init(void);
+void audit_cleanup(int fd);
+int audit_wait_ack(int fd);
+int audit_send(int fd, __u16 type, __u32 key, __u32 val);
+int audit_recv(int fd, struct audit_message *msg, int flags);
+int audit_wait_ack(int fd);
+
+void audit_observer_init(struct audit_observer *obs, int audit_fd, FILE *log,
+			 int wait_timeout);
+void audit_observer_reset(struct audit_observer *obs);
+int audit_observer_expect(struct audit_observer *obs, int audit_type,
+			  const char *pattern, int count);
+int audit_observer_wait(struct audit_observer *obs);
+int audit_observer_check_satisfied(struct audit_observer *obs);

-- 
2.43.0



* [PATCH RFC bpf-next 4/4] selftests/bpf: Add lsm_audit_kfuncs tests
From: Frederick Lawler @ 2026-03-11 21:31 UTC (permalink / raw)
  To: Paul Moore, James Morris, Serge E. Hallyn, Eric Paris,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Shuah Khan, Mickaël Salaün, Günther Noack
  Cc: linux-kernel, linux-security-module, audit, bpf, linux-kselftest,
	kernel-team, Frederick Lawler
In-Reply-To: <20260311-bpf-auditd-send-message-v1-0-10a62db5c92f@cloudflare.com>

Add selftests for the audit kfunc BPF LSM functionality including
both the test program and BPF progs.

Assisted-by: Claude:claude-4.5-opus
Signed-off-by: Frederick Lawler <fred@cloudflare.com>
---
 .../selftests/bpf/prog_tests/lsm_audit_kfuncs.c    | 598 +++++++++++++++++++++
 .../selftests/bpf/progs/test_lsm_audit_kfuncs.c    | 263 +++++++++
 2 files changed, 861 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/lsm_audit_kfuncs.c b/tools/testing/selftests/bpf/prog_tests/lsm_audit_kfuncs.c
new file mode 100644
index 0000000000000000000000000000000000000000..de18e1a3c79578d4151a12a029f2a9e6cc7648e3
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/lsm_audit_kfuncs.c
@@ -0,0 +1,598 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Cloudflare */
+#define _GNU_SOURCE
+
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <string.h>
+#include <unistd.h>
+#include <linux/audit.h>
+#include <linux/netlink.h>
+#include <netinet/in.h>
+#include <sys/ioctl.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/un.h>
+
+#include "audit_helpers.h"
+#include "test_lsm_audit_kfuncs.skel.h"
+#include "test_progs.h"
+
+#ifndef AUDIT_BPF_LSM_ACCESS
+#define AUDIT_BPF_LSM_ACCESS 1427
+#endif
+
+static inline struct sockaddr_in addr4(void)
+{
+	return (struct sockaddr_in){
+		.sin_family = AF_INET,
+		.sin_port = htons(1234),
+		.sin_addr.s_addr = htonl(INADDR_LOOPBACK),
+	};
+}
+
+static inline struct sockaddr_in6 addr6(void)
+{
+	return (struct sockaddr_in6){
+		.sin6_family = AF_INET6,
+		.sin6_port = htons(1234),
+		.sin6_addr = in6addr_loopback,
+	};
+}
+
+static int bind_connect(const struct sockaddr *addr, int addrlen)
+{
+	int err;
+	int sock;
+	int opt = 1;
+	socklen_t optlen = sizeof(opt);
+
+	sock = socket(addr->sa_family, SOCK_STREAM, 0);
+	if (!ASSERT_OK_FD(sock, "socket"))
+		return 1;
+
+	err = setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &opt, optlen);
+	if (!ASSERT_OK(err, "setsockopt"))
+		goto done;
+
+	err = bind(sock, addr, addrlen);
+	if (!ASSERT_OK(err, "bind"))
+		goto done;
+
+	err = connect(sock, addr, addrlen);
+	ASSERT_OK(err, "connect");
+
+	err = getsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &opt, &optlen);
+	ASSERT_OK(err, "getsockopt");
+
+done:
+	close(sock);
+	return err;
+}
+
+static void test_audit_log_sockaddr_src(struct audit_observer *obs,
+					struct test_lsm_audit_kfuncs *skel)
+{
+	struct sockaddr_in sin = addr4();
+	struct sockaddr_in6 sin6 = addr6();
+	struct bpf_link *link;
+
+	link = bpf_program__attach_lsm(skel->progs.test_sockaddr_src);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"bind4\" saddr=127.0.0.1 src=1234 netif=lo",
+			      1);
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"bind6\" saddr=::1 src=1234 netif=lo", 1);
+
+	if (bind_connect((const struct sockaddr *)&sin, sizeof(sin)))
+		goto done;
+
+	if (bind_connect((const struct sockaddr *)&sin6, sizeof(sin6)))
+		goto done;
+
+	ASSERT_OK(audit_observer_wait(obs), "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+done:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_sockaddr_dest(struct audit_observer *obs,
+					 struct test_lsm_audit_kfuncs *skel)
+{
+	struct sockaddr_in sin = addr4();
+	struct sockaddr_in6 sin6 = addr6();
+	struct bpf_link *link;
+
+	link = bpf_program__attach_lsm(skel->progs.test_sockaddr_dest);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"connect4\" daddr=127.0.0.1 dest=1234 netif=lo",
+			      1);
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"connect6\" daddr=::1 dest=1234 netif=lo",
+			      1);
+
+	if (bind_connect((const struct sockaddr *)&sin, sizeof(sin)))
+		goto out;
+
+	if (bind_connect((const struct sockaddr *)&sin6, sizeof(sin6)))
+		goto out;
+
+	ASSERT_OK(audit_observer_wait(obs), "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_sock(struct audit_observer *obs,
+				struct test_lsm_audit_kfuncs *skel)
+{
+	struct sockaddr_in sin = addr4();
+	struct sockaddr_in6 sin6 = addr6();
+	struct bpf_link *link;
+
+	link = bpf_program__attach_lsm(skel->progs.test_sock);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"sock4\" laddr=127.0.0.1 lport=1234 faddr=127.0.0.1 fport=1234 netif=lo",
+			1);
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"sock6\" laddr=::1 lport=1234 faddr=::1 fport=1234 netif=lo",
+			1);
+
+	if (bind_connect((const struct sockaddr *)&sin, sizeof(sin)))
+		goto out;
+
+	if (bind_connect((const struct sockaddr *)&sin6, sizeof(sin6)))
+		goto out;
+
+	ASSERT_OK(audit_observer_wait(obs), "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_sock_unix(struct audit_observer *obs,
+				     struct test_lsm_audit_kfuncs *skel)
+{
+	struct sockaddr_un addr;
+	struct bpf_link *link;
+	char expected[256];
+	char sun_path[108];
+	int server_fd = -1;
+	int opt = 1;
+	socklen_t optlen = sizeof(opt);
+	int err;
+
+	snprintf(sun_path, sizeof(sun_path), "/root/tmp/bpf_audit_test_%d.sock",
+		 getpid());
+
+	/* Ensure directory exists */
+	mkdir("/root/tmp", 0755);
+	unlink(sun_path);
+
+	link = bpf_program__attach_lsm(skel->progs.test_sock_unix);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	snprintf(expected, sizeof(expected), "cause=\"sock_unix\" path=\"%s\"",
+		 sun_path);
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS, expected, 1);
+
+	memset(&addr, 0, sizeof(addr));
+	addr.sun_family = AF_UNIX;
+	strncpy(addr.sun_path, sun_path, sizeof(addr.sun_path) - 1);
+
+	server_fd = socket(AF_UNIX, SOCK_STREAM, 0);
+	if (!ASSERT_OK_FD(server_fd, "socket"))
+		goto out;
+
+	err = bind(server_fd, (struct sockaddr *)&addr, sizeof(addr));
+	if (!ASSERT_OK(err, "bind"))
+		goto out;
+
+	err = getsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR, &opt, &optlen);
+	ASSERT_OK(err, "getsockopt");
+
+	ASSERT_OK(audit_observer_wait(obs), "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	if (server_fd >= 0)
+		close(server_fd);
+	unlink(sun_path);
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_file(struct audit_observer *obs,
+				struct test_lsm_audit_kfuncs *skel)
+{
+	struct bpf_link *link;
+	int err;
+	int fd;
+
+	link = bpf_program__attach_lsm(skel->progs.test_file);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"file\" path=\"/dev/null\" dev=\"devtmpfs\" ino=4",
+			      1);
+
+	fd = open("/dev/null", O_RDONLY);
+	if (!ASSERT_OK_FD(fd, "open(/dev/null)"))
+		goto out;
+	close(fd);
+
+	err = audit_observer_wait(obs);
+	ASSERT_OK(err, "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_path(struct audit_observer *obs,
+				struct test_lsm_audit_kfuncs *skel)
+{
+	struct bpf_link *link;
+	int err;
+	int fd;
+
+	link = bpf_program__attach_lsm(skel->progs.test_file_path);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"path\" path=\"/dev/null\" dev=\"devtmpfs\" ino=4",
+			      1);
+
+	fd = open("/dev/null", O_RDONLY);
+	if (!ASSERT_OK_FD(fd, "open(/dev/null)"))
+		goto out;
+	close(fd);
+
+	err = audit_observer_wait(obs);
+	ASSERT_OK(err, "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_dentry(struct audit_observer *obs,
+				  struct test_lsm_audit_kfuncs *skel)
+{
+	struct bpf_link *link;
+	char expected[128];
+	char buf[64];
+	int err;
+
+	link = bpf_program__attach_lsm(skel->progs.test_dentry);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	snprintf(expected, sizeof(expected),
+		 "cause=\"dentry\" name=\"exe\" dev=");
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS, expected, 1);
+
+	/* readlink triggers inode_readlink hook */
+	err = readlink("/proc/self/exe", buf, sizeof(buf));
+	if (!ASSERT_GT(err, 0, "readlink(/proc/self/exe)"))
+		goto out;
+
+	err = audit_observer_wait(obs);
+	ASSERT_OK(err, "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_inode(struct audit_observer *obs,
+				 struct test_lsm_audit_kfuncs *skel)
+{
+	struct bpf_link *link;
+	char expected[128];
+	struct stat st;
+	int err;
+	int fd;
+
+	if (!ASSERT_OK(stat("/dev/null", &st), "stat(/dev/null)"))
+		return;
+
+	link = bpf_program__attach_lsm(skel->progs.test_inode);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	snprintf(expected, sizeof(expected),
+		 "cause=\"inode\" name=\"null\" dev=\"devtmpfs\" ino=%lu",
+		 st.st_ino);
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS, expected, 1);
+
+	fd = open("/dev/null", O_RDONLY);
+	if (!ASSERT_OK_FD(fd, "open(/dev/null)"))
+		goto out;
+	close(fd);
+
+	err = audit_observer_wait(obs);
+	ASSERT_OK(err, "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_task(struct audit_observer *obs,
+				struct test_lsm_audit_kfuncs *skel)
+{
+	struct bpf_link *link;
+	char expected[128];
+	pid_t pid;
+	int err;
+
+	pid = getpid();
+
+	link = bpf_program__attach_lsm(skel->progs.test_task);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	snprintf(expected, sizeof(expected),
+		 "cause=\"task\" opid=%d ocomm=\"test_progs\"", pid);
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS, expected, 1);
+
+	err = getpgid(pid);
+	if (!ASSERT_GT(err, -1, "pid pgid match"))
+		goto out;
+
+	err = audit_observer_wait(obs);
+	ASSERT_OK(err, "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_cap(struct audit_observer *obs,
+			       struct test_lsm_audit_kfuncs *skel)
+{
+	struct bpf_link *link;
+	int err;
+	int fd;
+
+	link = bpf_program__attach_lsm(skel->progs.test_cap);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"cap\" capability=", 1);
+
+	fd = open("/proc/kallsyms", O_RDONLY);
+	if (!ASSERT_OK_FD(fd, "open(/proc/kallsyms)"))
+		goto out;
+	close(fd);
+
+	err = audit_observer_wait(obs);
+	ASSERT_OK(err, "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_ioctl_op(struct audit_observer *obs,
+				    struct test_lsm_audit_kfuncs *skel)
+{
+	struct bpf_link *link;
+	char expected[128];
+	struct stat st;
+	int err;
+	int fd;
+
+	if (!ASSERT_OK(stat("/dev/null", &st), "stat(/dev/null)"))
+		return;
+
+	link = bpf_program__attach_lsm(skel->progs.test_ioctl_op);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	snprintf(expected, sizeof(expected),
+		 "cause=\"ioctl_op\" path=\"/dev/null\" dev=\"devtmpfs\" ino=%lu ioctlcmd=0x%x",
+		 st.st_ino, TCGETS);
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS, expected, 1);
+
+	fd = open("/dev/null", O_RDONLY);
+	if (!ASSERT_OK_FD(fd, "open(/dev/null)"))
+		goto out;
+
+	/* ioctl will fail with ENOTTY but the LSM hook fires regardless */
+	ioctl(fd, TCGETS, NULL);
+	close(fd);
+
+	err = audit_observer_wait(obs);
+	ASSERT_OK(err, "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void test_audit_log_sleepable(struct audit_observer *obs,
+				     struct test_lsm_audit_kfuncs *skel)
+{
+	struct bpf_link *link;
+	int err;
+	int fd;
+
+	link = bpf_program__attach_lsm(skel->progs.test_sleepable);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"sleepable\" path=\"/dev/null\" dev=\"devtmpfs\" ino=4",
+			      1);
+
+	fd = open("/dev/null", O_RDONLY);
+	if (!ASSERT_OK_FD(fd, "open(/dev/null)"))
+		goto out;
+	close(fd);
+
+	err = audit_observer_wait(obs);
+	ASSERT_OK(err, "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+out:
+	bpf_link__destroy(link);
+}
+
+static void
+test_audit_log_sockaddr_both_null(struct audit_observer *obs,
+				  struct test_lsm_audit_kfuncs *skel)
+{
+	struct sockaddr_in sin = addr4();
+	struct bpf_link *link;
+
+	link = bpf_program__attach_lsm(skel->progs.test_sockaddr_both_null);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	/* Should see cause but no saddr/daddr since both were NULL */
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"sockaddr_both_null\"", 1);
+
+	bind_connect((const struct sockaddr *)&sin, sizeof(sin));
+
+	ASSERT_OK(audit_observer_wait(obs), "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+	bpf_link__destroy(link);
+}
+
+static void
+test_audit_log_sockaddr_small_addrlen(struct audit_observer *obs,
+				      struct test_lsm_audit_kfuncs *skel)
+{
+	struct sockaddr_in sin = addr4();
+	struct bpf_link *link;
+
+	link = bpf_program__attach_lsm(skel->progs.test_sockaddr_small_addrlen);
+	if (!ASSERT_OK_PTR(link, "attach"))
+		return;
+
+	audit_observer_reset(obs);
+
+	/* Should see cause but no saddr since addrlen was too small */
+	audit_observer_expect(obs, AUDIT_BPF_LSM_ACCESS,
+			      "cause=\"sockaddr_small_addrlen\"", 1);
+
+	bind_connect((const struct sockaddr *)&sin, sizeof(sin));
+
+	ASSERT_OK(audit_observer_wait(obs), "audit_observer_wait");
+	ASSERT_TRUE(audit_observer_check_satisfied(obs),
+		    "all expectations met");
+
+	bpf_link__destroy(link);
+}
+
+void test_lsm_audit_kfuncs(void)
+{
+	struct test_lsm_audit_kfuncs *skel = NULL;
+	struct audit_observer obs;
+	FILE *log = NULL;
+	int audit_fd;
+
+	audit_fd = audit_init();
+	if (!ASSERT_GE(audit_fd, 0, "audit_init"))
+		return;
+
+	if (env.verbosity > VERBOSE_NONE)
+		log = env.stdout_saved;
+
+	audit_observer_init(&obs, audit_fd, log, 500);
+
+	skel = test_lsm_audit_kfuncs__open_and_load();
+	if (!ASSERT_OK_PTR(skel, "skel load"))
+		goto close_prog;
+
+	if (test__start_subtest("net")) {
+		test_audit_log_sockaddr_src(&obs, skel);
+		test_audit_log_sockaddr_dest(&obs, skel);
+		test_audit_log_sockaddr_both_null(&obs, skel);
+		test_audit_log_sockaddr_small_addrlen(&obs, skel);
+		test_audit_log_sock(&obs, skel);
+		test_audit_log_sock_unix(&obs, skel);
+	}
+
+	if (test__start_subtest("file")) {
+		test_audit_log_file(&obs, skel);
+		test_audit_log_path(&obs, skel);
+		test_audit_log_dentry(&obs, skel);
+		test_audit_log_inode(&obs, skel);
+	}
+
+	if (test__start_subtest("task")) {
+		test_audit_log_task(&obs, skel);
+		test_audit_log_cap(&obs, skel);
+	}
+
+	if (test__start_subtest("ioctl"))
+		test_audit_log_ioctl_op(&obs, skel);
+
+	if (test__start_subtest("sleepable"))
+		test_audit_log_sleepable(&obs, skel);
+
+close_prog:
+	test_lsm_audit_kfuncs__destroy(skel);
+	audit_cleanup(audit_fd);
+}
diff --git a/tools/testing/selftests/bpf/progs/test_lsm_audit_kfuncs.c b/tools/testing/selftests/bpf/progs/test_lsm_audit_kfuncs.c
new file mode 100644
index 0000000000000000000000000000000000000000..952ba09fce638f3bd14c18060a5baa3ccaec19ca
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/test_lsm_audit_kfuncs.c
@@ -0,0 +1,263 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (c) 2026 Cloudflare */
+
+#include <vmlinux.h>
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_tracing.h>
+#include <bpf/bpf_core_read.h>
+#include <errno.h>
+
+#define AF_UNIX 1
+#define AF_INET 2
+#define AF_INET6 10
+
+char _license[] SEC("license") = "GPL";
+
+SEC("lsm/socket_bind")
+int BPF_PROG(test_sockaddr_src, struct socket *sock, struct sockaddr *address,
+	     int addrlen)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	switch (address->sa_family) {
+	case AF_INET:
+		bpf_audit_log_cause(ac, "bind4");
+		break;
+	case AF_INET6:
+		bpf_audit_log_cause(ac, "bind6");
+	}
+
+	bpf_audit_log_net_sockaddr(ac, 1, address, NULL, addrlen);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/socket_connect")
+int BPF_PROG(test_sockaddr_dest, struct socket *sock, struct sockaddr *address,
+	     int addrlen)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	switch (address->sa_family) {
+	case AF_INET:
+		bpf_audit_log_cause(ac, "connect4");
+		break;
+	case AF_INET6:
+		bpf_audit_log_cause(ac, "connect6");
+	}
+
+	bpf_audit_log_net_sockaddr(ac, 1, NULL, address, addrlen);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/socket_bind")
+int BPF_PROG(test_sockaddr_both_null, struct socket *sock,
+	     struct sockaddr *address, int addrlen)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "sockaddr_both_null");
+	bpf_audit_log_net_sockaddr(ac, 1, NULL, NULL, addrlen);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/socket_bind")
+int BPF_PROG(test_sockaddr_small_addrlen, struct socket *sock,
+	     struct sockaddr *address, int addrlen)
+{
+	struct bpf_audit_context *ac;
+
+	if (address->sa_family != AF_INET)
+		return -EINVAL;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "sockaddr_small_addrlen");
+	bpf_audit_log_net_sockaddr(ac, 1, address, NULL, 1);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/socket_getsockopt")
+int BPF_PROG(test_sock, struct socket *sock, int level, int optname)
+{
+	struct bpf_audit_context *ac;
+	struct sock *sk = sock->sk;
+
+	if (!sk)
+		return -EINVAL;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	switch (sk->__sk_common.skc_family) {
+	case AF_INET:
+		bpf_audit_log_cause(ac, "sock4");
+		break;
+	case AF_INET6:
+		bpf_audit_log_cause(ac, "sock6");
+	}
+
+	bpf_audit_log_net_sock(ac, 1, sock);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/socket_getsockopt")
+int BPF_PROG(test_sock_unix, struct socket *sock, int level, int optname)
+{
+	struct bpf_audit_context *ac;
+	struct sock *sk = sock->sk;
+
+	if (!sk || sk->__sk_common.skc_family != AF_UNIX)
+		return -EINVAL;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "sock_unix");
+	bpf_audit_log_net_sock(ac, 0, sock);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/file_open")
+int BPF_PROG(test_file, struct file *file)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "file");
+	bpf_audit_log_file(ac, file);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/file_open")
+int BPF_PROG(test_file_path, struct file *file)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "path");
+	bpf_audit_log_path(ac, &file->f_path);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/inode_readlink")
+int BPF_PROG(test_dentry, struct dentry *dentry)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "dentry");
+	bpf_audit_log_dentry(ac, dentry);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/file_open")
+int BPF_PROG(test_inode, struct file *file)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "inode");
+	bpf_audit_log_inode(ac, file->f_inode);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/task_getpgid")
+int BPF_PROG(test_task, struct task_struct *task)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "task");
+	bpf_audit_log_task(ac, task);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/capable")
+int BPF_PROG(test_cap, const struct cred *cred, struct user_namespace *ns,
+	     int cap, unsigned int opts)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "cap");
+	bpf_audit_log_cap(ac, cap);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm/file_ioctl")
+int BPF_PROG(test_ioctl_op, struct file *file, unsigned int cmd,
+	     unsigned long arg)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "ioctl_op");
+	bpf_audit_log_ioctl_op(ac, file, cmd);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+
+SEC("lsm.s/file_open")
+int BPF_PROG(test_sleepable, struct file *file)
+{
+	struct bpf_audit_context *ac;
+
+	ac = bpf_audit_log_start();
+	if (!ac)
+		return -ENOMEM;
+
+	bpf_audit_log_cause(ac, "sleepable");
+	bpf_audit_log_file(ac, file);
+	bpf_audit_log_end(ac);
+	return 0;
+}
+

-- 
2.43.0


* Re: [PATCH RFC bpf-next 1/4] audit: Implement bpf_audit_log_*() wrappers
From: bot+bpf-ci @ 2026-03-11 22:08 UTC (permalink / raw)
  To: fred, paul, jmorris, serge, eparis, ast, daniel, andrii,
	martin.lau, eddyz87, song, yonghong.song, john.fastabend, kpsingh,
	sdf, haoluo, jolsa, shuah, mic, gnoack
  Cc: linux-kernel, linux-security-module, audit, bpf, linux-kselftest,
	kernel-team, fred, ast, andrii, daniel, martin.lau, eddyz87,
	yonghong.song, clm, ihor.solodrai
In-Reply-To: <20260311-bpf-auditd-send-message-v1-1-10a62db5c92f@cloudflare.com>

> diff --git a/security/lsm_audit_kfuncs.c b/security/lsm_audit_kfuncs.c
> --- /dev/null
> +++ b/security/lsm_audit_kfuncs.c

[ ... ]

> +__bpf_kfunc
> +struct bpf_audit_context *bpf_audit_log_start(struct bpf_prog_aux *aux)
> +{
> +	char comm[sizeof(current->comm)];
> +	struct bpf_audit_context *ac;
> +
> +	ac = bpf_mem_cache_alloc(&bpf_audit_context_ma);
> +	if (!ac)
> +		return NULL;
> +
> +	memset(ac, 0, sizeof(*ac));
> +	ac->ab = audit_log_start(audit_context(),
> +				 (aux->might_sleep) ? GFP_KERNEL : GFP_ATOMIC,
> +				 AUDIT_BPF_LSM_ACCESS);
> +	if (!ac->ab) {
> +		bpf_mem_cache_free(&bpf_audit_context_ma, ac);
> +		return NULL;
> +	}
> +
> +	audit_log_format(ac->ab, "prog-id=%d", aux->id);

aux->id is u32 -- should this be %u?

> +
> +	/* Audit may not have a filter configured for syscalls. Include
> +	 * potentionally redundant pid & comm information

Minor typo: "potentionally" -> "potentially".

> +	 */

[ ... ]

> +__bpf_kfunc int bpf_audit_log_net_sock(struct bpf_audit_context *ac, int netif,
> +				       const struct socket *sock)
> +{
> +	struct lsm_network_audit net = { .sk = sock->sk, .netif = netif };
> +	struct common_audit_data ad;
> +
> +	ad.type = LSM_AUDIT_DATA_NET;
> +	ad.u.net = &net;
> +	return __audit_log_lsm_data(ac, &ad);
> +}
> +
> +__bpf_kfunc int
> +bpf_audit_log_net_sockaddr(struct bpf_audit_context *ac, int netif,
> +			   const struct sockaddr *saddr__nullable,
> +			   const struct sockaddr *daddr__nullable, int addrlen)
> +{
> +	struct lsm_network_audit net;
                                    ^^^

Unlike bpf_audit_log_net_sock() above which uses a designated
initializer that zero-fills the struct, this declaration leaves
net uninitialized.

In particular, net.sk is never assigned anywhere in this function.
When audit_log_lsm_data() handles LSM_AUDIT_DATA_NET, it does:

    if (a->u.net->sk) {
        const struct sock *sk = a->u.net->sk;
        switch (sk->sk_family) {

Since net.sk contains stack garbage, this will likely dereference
an invalid pointer and crash.

Additionally, when only one of saddr__nullable or daddr__nullable
is provided, the corresponding sport/dport and address fields for
the missing side remain uninitialized, and audit_log_lsm_data()
unconditionally prints both sides:

    print_ipv4_addr(ab, a->u.net->v4info.saddr, a->u.net->sport, ...);
    print_ipv4_addr(ab, a->u.net->v4info.daddr, a->u.net->dport, ...);

Would initializing with something like
struct lsm_network_audit net = { .netif = netif }; address both
issues?
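
For illustration only, a minimal userspace sketch of the zero-fill
behaviour referred to above. The struct and function names are
hypothetical stand-ins, not the kernel's struct lsm_network_audit; the
only thing assumed is standard C semantics, where members not named in
a designated initializer are zero-initialized.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in used for illustration; it only mimics the shape
 * of a struct with a pointer member and a few scalar members.
 */
struct net_audit_sketch {
	int netif;
	const void *sk;
	unsigned short sport;
	unsigned short dport;
};

/*
 * Members that are not named in a designated initializer are
 * zero-initialized, so sk, sport and dport start out as NULL/0 here.
 * A plain "struct net_audit_sketch net;" would leave them indeterminate.
 */
static struct net_audit_sketch make_net_audit(int netif)
{
	struct net_audit_sketch net = { .netif = netif };

	return net;
}
```

With this form, a later "if (net.sk)" style check sees NULL rather than
whatever happened to be on the stack.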

> +	struct common_audit_data ad;
> +
> +	net.netif = netif;

[ ... ]


---
AI reviewed your patch. Please fix the bug or reply by email explaining why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/22976062351

* Re: [PATCH v2 3/5] rust: kernel: add LSM abstraction layer
From: kernel test robot @ 2026-03-11 22:58 UTC (permalink / raw)
  To: Jamie Lindsey, rust-for-linux, linux-security-module
  Cc: llvm, oe-kbuild-all, ojeda, paul, aliceryhl, jmorris, serge,
	jamie
In-Reply-To: <0102019cdb4c705e-7d46b4f3-5cbb-4a6a-b315-e10f182fa987-000000@eu-west-1.amazonses.com>

Hi Jamie,

kernel test robot noticed the following build errors:

[auto build test ERROR on rust/rust-next]
[also build test ERROR on linus/master v7.0-rc3 next-20260311]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jamie-Lindsey/rust-helpers-add-C-shims-for-LSM-hook-initialisation/20260311-131258
base:   https://github.com/Rust-for-Linux/linux rust-next
patch link:    https://lore.kernel.org/r/0102019cdb4c705e-7d46b4f3-5cbb-4a6a-b315-e10f182fa987-000000%40eu-west-1.amazonses.com
patch subject: [PATCH v2 3/5] rust: kernel: add LSM abstraction layer
config: x86_64-rhel-9.4-rust (https://download.01.org/0day-ci/archive/20260312/202603120654.DWidINbR-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260312/202603120654.DWidINbR-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603120654.DWidINbR-lkp@intel.com/

All errors (new ones prefixed by >>):

>> error[E0433]: failed to resolve: use of unresolved module or unlinked crate `bindings`
   --> rust/doctests_kernel_generated.rs:8845:40
   |
   8845 | kernel::define_lsm!(MyLsm, "my_lsm\0", bindings::LSM_ID_UNDEF as u64);
   |                                        ^^^^^^^^ use of unresolved module or unlinked crate `bindings`
   |
   = help: you might be missing a crate named `bindings`
   = help: consider importing this crate:
   kernel::bindings
--
>> error[E0412]: cannot find type `MyLsm` in this scope
   --> rust/doctests_kernel_generated.rs:8845:21
   |
   8845 | kernel::define_lsm!(MyLsm, "my_lsm\0", bindings::LSM_ID_UNDEF as u64);
   |                     ^^^^^ not found in this scope
--
>> error[E0433]: failed to resolve: use of unresolved module or unlinked crate `bindings`
   --> rust/doctests_kernel_generated.rs:8898:44
   |
   8898 | kernel::define_lsm!(MyLsmType, "my_lsm\0", bindings::LSM_ID_UNDEF as u64);
   |                                            ^^^^^^^^ use of unresolved module or unlinked crate `bindings`
   |
   = help: you might be missing a crate named `bindings`
   = help: consider importing this crate:
   kernel::bindings
--
>> error[E0412]: cannot find type `MyLsmType` in this scope
   --> rust/doctests_kernel_generated.rs:8898:21
   |
   8898 | kernel::define_lsm!(MyLsmType, "my_lsm\0", bindings::LSM_ID_UNDEF as u64);
   |                     ^^^^^^^^^ not found in this scope

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH 49/61] media: Prefer IS_ERR_OR_NULL over manual NULL check
From: Kieran Bingham @ 2026-03-11 23:03 UTC (permalink / raw)
  To: Philipp Hahn, amd-gfx, apparmor, bpf, ceph-devel, cocci, dm-devel,
	dri-devel, gfs2, intel-gfx, intel-wired-lan, iommu, kvm,
	linux-arm-kernel, linux-block, linux-bluetooth, linux-btrfs,
	linux-cifs, linux-clk, linux-erofs, linux-ext4, linux-fsdevel,
	linux-gpio, linux-hyperv, linux-input, linux-kernel, linux-leds,
	linux-media, linux-mips, linux-mm, linux-modules, linux-mtd,
	linux-nfs, linux-omap, linux-phy, lin 
  Cc: Shuah Khan, Mauro Carvalho Chehab
In-Reply-To: <20260310-b4-is_err_or_null-v1-49-bd63b656022d@avm.de>

Quoting Philipp Hahn (2026-03-10 11:49:15)
> Prefer using IS_ERR_OR_NULL() over using IS_ERR() and a manual NULL
> check.
> 
> Change generated with coccinelle.
> 
> To: Shuah Khan <skhan@linuxfoundation.org>
> To: Kieran Bingham <kieran.bingham@ideasonboard.com>
> To: Mauro Carvalho Chehab <mchehab@kernel.org>
> Cc: linux-media@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
> ---
>  drivers/media/test-drivers/vimc/vimc-streamer.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/media/test-drivers/vimc/vimc-streamer.c b/drivers/media/test-drivers/vimc/vimc-streamer.c
> index 15d863f97cbf96b7ca7fbf3d7b6b6ec39fcc8ae3..da5aca50bcb4990c06f28e5a883eb398606991e9 100644
> --- a/drivers/media/test-drivers/vimc/vimc-streamer.c
> +++ b/drivers/media/test-drivers/vimc/vimc-streamer.c
> @@ -167,7 +167,7 @@ static int vimc_streamer_thread(void *data)
>                 for (i = stream->pipe_size - 1; i >= 0; i--) {
>                         frame = stream->ved_pipeline[i]->process_frame(
>                                         stream->ved_pipeline[i], frame);
> -                       if (!frame || IS_ERR(frame))
> +                       if (IS_ERR_OR_NULL(frame))

Reviewed-by: Kieran Bingham <kieran.bingham@ideasonboard.com>

>                                 break;
>                 }
>                 //wait for 60hz
> 
> -- 
> 2.43.0
>
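
For readers unfamiliar with the helper, a rough userspace sketch of why
the two conditions are equivalent. The sketch_* names and
SKETCH_MAX_ERRNO are hypothetical stand-ins modelled on the error-pointer
convention (small negative errno values encoded in the top of the
address range); this is not the kernel's own definition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound on errno values encoded in a pointer. */
#define SKETCH_MAX_ERRNO 4095

/* Encodes a negative errno value as a pointer. */
static inline void *sketch_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* True if ptr lies in the last SKETCH_MAX_ERRNO addresses. */
static inline bool sketch_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Same result as "!ptr || sketch_is_err(ptr)", expressed as one call. */
static inline bool sketch_is_err_or_null(const void *ptr)
{
	return !ptr || sketch_is_err(ptr);
}
```

NULL is not itself an error pointer, which is why the combined helper
has to test for it explicitly.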

* Re: [PATCH v2 0/5] rust: lsm: introduce safe Rust abstractions for the LSM framework
From: Jamie Lindsey @ 2026-03-11 23:08 UTC (permalink / raw)
  To: aliceryhl
  Cc: paul, jamie, jmorris, linux-security-module, ojeda,
	rust-for-linux, serge
In-Reply-To: <CAH5fLgiQm=2YYvmG54o-MEt2m8x5V5xZrtmsqEUtuB9OZ=FPOw@mail.gmail.com>

On Wed, Mar 11, 2026 at 07:48:57AM +0100, Alice Ryhl wrote:
> What is the intended end-user of these abstractions?

The intended end-user is a real, policy-enforcing LSM for autonomous
agent workloads -- not the sample module included in this series.

I'm building an agent-native security module that enforces capability
manifests at the kernel level: per-agent file access policy, network
destination restrictions, process spawn depth limits, and pre-exec
threat detection. The agent identity is tracked via the LSM security
blob on struct cred, and policy decisions are made per-hook based on
compiled manifest rules.

The sample LSM in patch 4 exists as a boot-test vehicle for the
abstractions, not as the target consumer. I should have made that
clearer in the cover letter -- that's on me.

Regarding Paul's point about example LSMs: understood completely.
I'll rework the series to present the abstractions alongside the
real LSM rather than the sample. I'll review the prior work Paul
linked and the new-LSM guidance before resubmitting.

Thanks to both of you for the fast feedback.

Jamie

* Re: [PATCH v2 2/5] rust: helpers: add C shims for LSM hook initialisation
From: kernel test robot @ 2026-03-11 23:13 UTC (permalink / raw)
  To: Jamie Lindsey, rust-for-linux, linux-security-module
  Cc: llvm, oe-kbuild-all, ojeda, paul, aliceryhl, jmorris, serge,
	jamie
In-Reply-To: <0102019cdb4c6a42-a28bbebb-3664-4792-966f-4036c94ac19c-000000@eu-west-1.amazonses.com>

Hi Jamie,

kernel test robot noticed the following build warnings:

[auto build test WARNING on rust/rust-next]
[also build test WARNING on linus/master v7.0-rc3]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting a patch, we suggest using '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Jamie-Lindsey/rust-helpers-add-C-shims-for-LSM-hook-initialisation/20260311-131258
base:   https://github.com/Rust-for-Linux/linux rust-next
patch link:    https://lore.kernel.org/r/0102019cdb4c6a42-a28bbebb-3664-4792-966f-4036c94ac19c-000000%40eu-west-1.amazonses.com
patch subject: [PATCH v2 2/5] rust: helpers: add C shims for LSM hook initialisation
config: um-randconfig-002-20260311 (https://download.01.org/0day-ci/archive/20260312/202603120739.yWj1J5Hv-lkp@intel.com/config)
compiler: clang version 18.1.8 (https://github.com/llvm/llvm-project 3b5b5c1ec4a3095ab096dd780e84d7ab81f3d7ff)
rustc: rustc 1.88.0 (6b00bc388 2025-06-23)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260312/202603120739.yWj1J5Hv-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add the following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603120739.yWj1J5Hv-lkp@intel.com/

All warnings (new ones prefixed by >>, old ones prefixed by <<):

>> WARNING: modpost: vmlinux: rust_helper_security_add_hooks: EXPORT_SYMBOL used for init symbol. Remove __init or EXPORT_SYMBOL.

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* [PATCH] integrity: Fix spelling mistake TRUSTED_KEYRING
From: Philipp Hahn @ 2026-03-12  9:35 UTC (permalink / raw)
  To: Mimi Zohar, Roberto Sassu, Dmitry Kasatkin, Eric Snowberg
  Cc: Philipp Hahn, linux-integrity, linux-security-module,
	linux-kernel

Fix a minor spelling mistake: "kerned" should be "kernel".

Fixes: 9dc92c45177ab ("integrity: Define a trusted platform keyring")
Signed-off-by: Philipp Hahn <phahn-oss@avm.de>
---
 security/integrity/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/security/integrity/Kconfig b/security/integrity/Kconfig
index 916d4f2bfc441..328ea9f32035a 100644
--- a/security/integrity/Kconfig
+++ b/security/integrity/Kconfig
@@ -60,7 +60,7 @@ config INTEGRITY_PLATFORM_KEYRING
 	help
 	  Provide a separate, distinct keyring for platform trusted keys, which
 	  the kernel automatically populates during initialization from values
-	  provided by the platform for verifying the kexec'ed kerned image
+	  provided by the platform for verifying the kexec'ed kernel image
 	  and, possibly, the initramfs signature.
 
 config INTEGRITY_MACHINE_KEYRING
-- 
2.43.0


* [RFC PATCH v1 07/11] selftests/landlock: Drain stale audit records on init
From: Mickaël Salaün @ 2026-03-12 10:04 UTC (permalink / raw)
  To: Christian Brauner, Günther Noack, Paul Moore,
	Serge E . Hallyn
  Cc: Mickaël Salaün, Justin Suess, Lennart Poettering,
	Mikhail Ivanov, Nicolas Bouchinet, Shervin Oloumi, Tingmao Wang,
	kernel-team, linux-fsdevel, linux-kernel, linux-security-module
In-Reply-To: <20260312100444.2609563-1-mic@digikod.net>

Non-audit Landlock tests generate audit records as side effects when
audit_enabled is non-zero (e.g. from boot configuration).  These records
accumulate in the kernel audit backlog while no audit daemon socket is
open.  When the next test opens a new netlink socket and registers as
the audit daemon, the stale backlog is delivered, causing baseline
record count checks to fail spuriously.

Fix this by draining all pending records in audit_init() right after
setting the receive timeout.  The 1-usec SO_RCVTIMEO causes audit_recv()
to return -EAGAIN once the backlog is empty, naturally terminating the
drain loop.
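
The drain-until-timeout idea can be sketched in isolation as follows.
This is an illustration only: it uses an AF_UNIX datagram socketpair
rather than an audit netlink socket, and drain_socket() and demo_drain()
are hypothetical names, not helpers from audit.h.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Reads and discards queued datagrams until recv() times out.
 * Returns the number of datagrams drained, or -errno on an
 * unexpected error.
 */
static int drain_socket(int fd)
{
	/* A zero timeout would mean "block forever", so use 1 usec. */
	const struct timeval tv = { .tv_sec = 0, .tv_usec = 1 };
	char buf[64];
	int drained = 0;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)))
		return -errno;

	for (;;) {
		ssize_t len = recv(fd, buf, sizeof(buf), 0);

		if (len >= 0) {
			drained++;
			continue;
		}
		/* The receive timeout expiring ends the loop. */
		if (errno == EAGAIN || errno == EWOULDBLOCK)
			return drained;
		if (errno == EINTR)
			continue;
		return -errno;
	}
}

/* Queues count one-byte datagrams on a socketpair, then drains them. */
static int demo_drain(int count)
{
	int sv[2];
	int ret = 0;
	int i;

	if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv))
		return -errno;

	for (i = 0; i < count; i++) {
		if (send(sv[0], "x", 1, 0) != 1) {
			ret = -errno;
			break;
		}
	}
	if (!ret)
		ret = drain_socket(sv[1]);

	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

The loop needs no separate "is the queue empty?" query: the timeout
itself is the termination condition.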

Domain deallocation records are emitted asynchronously from a work
queue, so they may still arrive after the drain.  Remove records.domain
== 0 checks from tests where a stale deallocation record from a previous
test could cause spurious failures.

Also fix a socket file descriptor leak on error paths in audit_init():
if audit_set_status() or setsockopt() fails (e.g.  when another audit
daemon is already registered), close the socket before returning.

Fix off-by-one checks in matches_log_domain_allocated() and
matches_log_domain_deallocated() where snprintf() truncation was
detected with ">" instead of ">=" (snprintf() returns the length
excluding the NUL terminator, so equality means truncation).
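
As a standalone illustration of that boundary case (format_pid() is a
hypothetical helper written for this sketch, not one of the functions
touched by the patch):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Formats "pid=<pid>" into buf.  Returns 0 on success, -E2BIG if the
 * output did not fit, or -EINVAL if snprintf() itself failed.
 */
static int format_pid(char *buf, size_t size, int pid)
{
	int len = snprintf(buf, size, "pid=%d", pid);

	if (len < 0)
		return -EINVAL;

	/*
	 * len excludes the NUL terminator, so len == size already means
	 * the last character was dropped to make room for the NUL.
	 */
	if ((size_t)len >= size)
		return -E2BIG;

	return 0;
}
```

With an 8-byte buffer, "pid=123" (7 characters plus NUL) fits, while
"pid=1234" (8 characters) is truncated even though the returned length
equals the buffer size; a ">" comparison would miss that case.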

Cc: Günther Noack <gnoack@google.com>
Fixes: 6a500b22971c ("selftests/landlock: Add tests for audit flags and domain IDs")
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
 tools/testing/selftests/landlock/audit.h      | 29 +++++++++++++++----
 tools/testing/selftests/landlock/audit_test.c |  2 --
 2 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/landlock/audit.h b/tools/testing/selftests/landlock/audit.h
index 44eb433e9666..550acaafcc1e 100644
--- a/tools/testing/selftests/landlock/audit.h
+++ b/tools/testing/selftests/landlock/audit.h
@@ -309,7 +309,7 @@ static int __maybe_unused matches_log_domain_allocated(int audit_fd, pid_t pid,
 
 	log_match_len =
 		snprintf(log_match, sizeof(log_match), log_template, pid);
-	if (log_match_len > sizeof(log_match))
+	if (log_match_len >= sizeof(log_match))
 		return -E2BIG;
 
 	return audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
@@ -326,7 +326,7 @@ static int __maybe_unused matches_log_domain_deallocated(
 
 	log_match_len = snprintf(log_match, sizeof(log_match), log_template,
 				 num_denials);
-	if (log_match_len > sizeof(log_match))
+	if (log_match_len >= sizeof(log_match))
 		return -E2BIG;
 
 	return audit_match_record(audit_fd, AUDIT_LANDLOCK_DOMAIN, log_match,
@@ -379,19 +379,36 @@ static int audit_init(void)
 
 	err = audit_set_status(fd, AUDIT_STATUS_ENABLED, 1);
 	if (err)
-		return err;
+		goto err_close;
 
 	err = audit_set_status(fd, AUDIT_STATUS_PID, getpid());
 	if (err)
-		return err;
+		goto err_close;
 
 	/* Sets a timeout for negative tests. */
 	err = setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &audit_tv_default,
 			 sizeof(audit_tv_default));
-	if (err)
-		return -errno;
+	if (err) {
+		err = -errno;
+		goto err_close;
+	}
+
+	/*
+	 * Drains stale audit records that accumulated in the kernel backlog
+	 * while no audit daemon socket was open.  This happens when
+	 * non-audit Landlock tests create domains or trigger denials while
+	 * audit_enabled is non-zero (e.g. from boot configuration), or when
+	 * domain deallocation records arrive asynchronously after a
+	 * previous test's socket was closed.
+	 */
+	while (audit_recv(fd, NULL) == 0)
+		;
 
 	return fd;
+
+err_close:
+	close(fd);
+	return err;
 }
 
 static int audit_init_filter_exe(struct audit_filter *filter, const char *path)
diff --git a/tools/testing/selftests/landlock/audit_test.c b/tools/testing/selftests/landlock/audit_test.c
index 46d02d49835a..f92ba6774faa 100644
--- a/tools/testing/selftests/landlock/audit_test.c
+++ b/tools/testing/selftests/landlock/audit_test.c
@@ -412,7 +412,6 @@ TEST_F(audit_flags, signal)
 		} else {
 			EXPECT_EQ(1, records.access);
 		}
-		EXPECT_EQ(0, records.domain);
 
 		/* Updates filter rules to match the drop record. */
 		set_cap(_metadata, CAP_AUDIT_CONTROL);
@@ -601,7 +600,6 @@ TEST_F(audit_exec, signal_and_open)
 	/* Tests that there was no denial until now. */
 	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
 	EXPECT_EQ(0, records.access);
-	EXPECT_EQ(0, records.domain);
 
 	/*
 	 * Wait for the child to do a first denied action by layer1 and
-- 
2.53.0



* [RFC PATCH v1 08/11] selftests/landlock: Add namespace restriction tests
From: Mickaël Salaün @ 2026-03-12 10:04 UTC (permalink / raw)
  To: Christian Brauner, Günther Noack, Paul Moore,
	Serge E . Hallyn
  Cc: Mickaël Salaün, Justin Suess, Lennart Poettering,
	Mikhail Ivanov, Nicolas Bouchinet, Shervin Oloumi, Tingmao Wang,
	kernel-team, linux-fsdevel, linux-kernel, linux-security-module
In-Reply-To: <20260312100444.2609563-1-mic@digikod.net>

Add tests for LANDLOCK_PERM_NAMESPACE_ENTER, which covers namespace
creation (unshare/clone) and namespace entry (setns), and for its
interaction with LANDLOCK_PERM_CAPABILITY_USE.

Rule validation tests verify that the kernel correctly accepts known
CLONE_NEW* types, silently accepts unknown bits (including holes,
upper-range bits, and bit 63) for forward compatibility, and rejects an
empty namespace_types bitmask.  Invalid allowed_perm combinations and
non-zero flags are also covered.
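
The validation behavior exercised by these tests can be summarized with
a small userspace model.  This is only a sketch of what the tests
expect, not the kernel implementation: the model_* names and the
MODEL_PERM_* bit values are placeholders invented here, and the order
of the checks is an assumption.

```c
#include <errno.h>
#include <stdint.h>
#include <linux/sched.h>

/* Fallback for kernel headers that predate time namespaces. */
#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

/* Placeholder bit values for this model only, not the real UAPI values. */
#define MODEL_PERM_CAPABILITY_USE (1ULL << 0)
#define MODEL_PERM_NAMESPACE_ENTER (1ULL << 1)

static const uint64_t model_known_ns =
	CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS | CLONE_NEWIPC |
	CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWTIME;

struct model_ns_rule {
	uint64_t allowed_perm;
	uint64_t namespace_types;
};

/*
 * Returns 0 and writes the stored type mask to *stored on success, or
 * a negative errno value when the rule is rejected.
 */
static int model_add_ns_rule(uint64_t handled_perm,
			     const struct model_ns_rule *rule, uint32_t flags,
			     uint64_t *stored)
{
	/* Non-zero flags are rejected. */
	if (flags)
		return -EINVAL;

	/* An empty permission set would be a useless deny rule. */
	if (!rule->allowed_perm)
		return -ENOMSG;

	/* Only the namespace permission is valid, and only if handled. */
	if (rule->allowed_perm != MODEL_PERM_NAMESPACE_ENTER ||
	    !(handled_perm & MODEL_PERM_NAMESPACE_ENTER))
		return -EINVAL;

	/* An empty type bitmask is a useless rule as well. */
	if (!rule->namespace_types)
		return -ENOMSG;

	/* Unknown bits are accepted, but only known types are kept. */
	*stored = rule->namespace_types & model_known_ns;
	return 0;
}
```

In this model a rule carrying only unknown bits (e.g. bit 63) is
accepted and stores an empty mask, which matches the "silently
accepted" cases in the tests.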

Namespace creation tests use FIXTURE_VARIANT to exercise all eight
namespace types (user, UTS, IPC, mount, cgroup, PID, network, time)
across allowed/denied and privileged/unprivileged combinations.  This
verifies that security_namespace_alloc() is correctly called for every
type.  Layer stacking tests verify that any-layer-denies semantics work
correctly, including the allow-over-allow case.  A combined test
exercises both LANDLOCK_PERM_CAPABILITY_USE and
LANDLOCK_PERM_NAMESPACE_ENTER in a single domain.
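
The any-layer-denies semantics amount to intersecting the per-layer
allow masks.  A minimal userspace sketch, assuming each layer is
reduced to one bitmask of allowed CLONE_NEW* types
(model_layers_allow() is a hypothetical name, not a kernel function):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/sched.h>

/*
 * Returns true only if every layer's allowed mask contains all bits of
 * ns_type.  With zero layers nothing is restricted.
 */
static bool model_layers_allow(const uint64_t *layer_masks, size_t num_layers,
			       uint64_t ns_type)
{
	size_t i;

	for (i = 0; i < num_layers; i++) {
		/* A single layer without the bit is enough to deny. */
		if ((layer_masks[i] & ns_type) != ns_type)
			return false;
	}
	return true;
}
```

Two layers that both allow CLONE_NEWUSER yield true (the
allow-over-allow case), while an allowing layer followed by a layer
with an empty mask yields false.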

Namespace entry tests verify that setns is subject to the same
type-based LANDLOCK_PERM_NAMESPACE_ENTER check via
security_namespace_install(), including cross-process setns denial and
the two-permission interaction where both LANDLOCK_PERM_NAMESPACE_ENTER
and LANDLOCK_PERM_CAPABILITY_USE must allow the operation for non-user
namespaces.
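
The two-permission interaction can be modeled as follows.  This is a
simplified sketch under stated assumptions: the domain handles both
permissions, the process already holds CAP_SYS_ADMIN, and the operation
is namespace creation or a setns into a non-user namespace.  The struct
and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/capability.h>
#include <linux/sched.h>

struct model_domain {
	uint64_t allowed_ns;   /* Allowed CLONE_NEW* bits. */
	uint64_t allowed_caps; /* Allowed (1ULL << CAP_*) bits. */
};

static bool model_may_use_ns(const struct model_domain *dom, uint64_t ns_type)
{
	/* The namespace type itself must be allowed. */
	if ((dom->allowed_ns & ns_type) != ns_type)
		return false;

	/* Creating a user namespace involves no capability check. */
	if (ns_type == CLONE_NEWUSER)
		return true;

	/* Other types also need CAP_SYS_ADMIN to be usable in the domain. */
	return (dom->allowed_caps & (1ULL << CAP_SYS_ADMIN)) != 0;
}
```

A domain allowing CLONE_NEWUTS and CAP_SYS_ADMIN permits UTS but not
IPC, and a domain allowing the namespace type without the capability
denies every type except the user namespace.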

Audit tests verify that denied namespace creation, denied setns entry,
and allowed operations produce the expected audit records (or none).

Cc: Christian Brauner <brauner@kernel.org>
Cc: Günther Noack <gnoack@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
 tools/testing/selftests/landlock/common.h   |   23 +
 tools/testing/selftests/landlock/config     |    5 +
 tools/testing/selftests/landlock/ns_test.c  | 1379 +++++++++++++++++++
 tools/testing/selftests/landlock/wrappers.h |    6 +
 4 files changed, 1413 insertions(+)
 create mode 100644 tools/testing/selftests/landlock/ns_test.c

diff --git a/tools/testing/selftests/landlock/common.h b/tools/testing/selftests/landlock/common.h
index 90551650299c..e7d1d1e9df74 100644
--- a/tools/testing/selftests/landlock/common.h
+++ b/tools/testing/selftests/landlock/common.h
@@ -128,6 +128,29 @@ static void __maybe_unused clear_ambient_cap(
 	EXPECT_EQ(0, cap_get_ambient(cap));
 }
 
+/*
+ * Returns true if the current process is in the initial user namespace.
+ * Compares the readlink targets of /proc/self/ns/user and /proc/1/ns/user.
+ */
+static bool __maybe_unused is_in_init_user_ns(void)
+{
+	char self_buf[64], init_buf[64];
+	ssize_t self_len, init_len;
+
+	self_len = readlink("/proc/self/ns/user", self_buf, sizeof(self_buf));
+	if (self_len <= 0 || self_len >= (ssize_t)sizeof(self_buf))
+		return false;
+
+	init_len = readlink("/proc/1/ns/user", init_buf, sizeof(init_buf));
+	if (init_len <= 0 || init_len >= (ssize_t)sizeof(init_buf))
+		return false;
+
+	if (self_len != init_len)
+		return false;
+
+	return memcmp(self_buf, init_buf, self_len) == 0;
+}
+
 /* Receives an FD from a UNIX socket. Returns the received FD, or -errno. */
 static int __maybe_unused recv_fd(int usock)
 {
diff --git a/tools/testing/selftests/landlock/config b/tools/testing/selftests/landlock/config
index 8fe9b461b1fd..d09b637bf6ca 100644
--- a/tools/testing/selftests/landlock/config
+++ b/tools/testing/selftests/landlock/config
@@ -3,6 +3,7 @@ CONFIG_AUDIT=y
 CONFIG_CGROUPS=y
 CONFIG_CGROUP_SCHED=y
 CONFIG_INET=y
+CONFIG_IPC_NS=y
 CONFIG_IPV6=y
 CONFIG_KEYS=y
 CONFIG_MPTCP=y
@@ -10,10 +11,14 @@ CONFIG_MPTCP_IPV6=y
 CONFIG_NET=y
 CONFIG_NET_NS=y
 CONFIG_OVERLAY_FS=y
+CONFIG_PID_NS=y
 CONFIG_PROC_FS=y
 CONFIG_SECURITY=y
 CONFIG_SECURITY_LANDLOCK=y
 CONFIG_SHMEM=y
 CONFIG_SYSFS=y
+CONFIG_TIME_NS=y
 CONFIG_TMPFS=y
 CONFIG_TMPFS_XATTR=y
+CONFIG_USER_NS=y
+CONFIG_UTS_NS=y
diff --git a/tools/testing/selftests/landlock/ns_test.c b/tools/testing/selftests/landlock/ns_test.c
new file mode 100644
index 000000000000..5d968dd9f4f5
--- /dev/null
+++ b/tools/testing/selftests/landlock/ns_test.c
@@ -0,0 +1,1379 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Landlock tests - Namespace restriction
+ *
+ * Copyright © 2026 Cloudflare
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/capability.h>
+#include <linux/landlock.h>
+#include <sched.h>
+#include <stdio.h>
+#include <sys/prctl.h>
+#include <sys/wait.h>
+#include <syscall.h>
+#include <unistd.h>
+
+#include "audit.h"
+#include "common.h"
+
+/*
+ * Max length for /proc/self/ns/<name> paths (longest:
+ * "/proc/self/ns/cgroup").
+ */
+#define NS_PROC_PATH_MAX 32
+
+static int create_ns_ruleset(void)
+{
+	const struct landlock_ruleset_attr attr = {
+		.handled_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+	};
+
+	return landlock_create_ruleset(&attr, sizeof(attr), 0);
+}
+
+static int add_ns_rule(int ruleset_fd, __u64 ns_type)
+{
+	const struct landlock_namespace_attr attr = {
+		.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+		.namespace_types = ns_type,
+	};
+
+	return landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE, &attr, 0);
+}
+
+/*
+ * Returns the /proc/self/ns/ entry name for a given CLONE_NEW* type, or NULL
+ * if unknown.  Used to check kernel support without side effects.
+ */
+static const char *ns_proc_name(__u64 ns_type)
+{
+	switch (ns_type) {
+	case CLONE_NEWNS:
+		return "mnt";
+	case CLONE_NEWCGROUP:
+		return "cgroup";
+	case CLONE_NEWUTS:
+		return "uts";
+	case CLONE_NEWIPC:
+		return "ipc";
+	case CLONE_NEWUSER:
+		return "user";
+	case CLONE_NEWPID:
+		return "pid";
+	case CLONE_NEWNET:
+		return "net";
+	case CLONE_NEWTIME:
+		return "time";
+	default:
+		return NULL;
+	}
+}
+
+static bool ns_is_supported(__u64 ns_type, char *proc_path, size_t size)
+{
+	const char *ns_name;
+
+	ns_name = ns_proc_name(ns_type);
+	if (!ns_name)
+		return false;
+
+	snprintf(proc_path, size, "/proc/self/ns/%s", ns_name);
+	return access(proc_path, F_OK) == 0;
+}
+
+/* Rule validation tests */
+
+TEST(add_rule_bad_attr)
+{
+	const struct landlock_ruleset_attr cap_only_attr = {
+		.handled_perm = LANDLOCK_PERM_CAPABILITY_USE,
+	};
+	int ruleset_fd;
+	struct landlock_namespace_attr attr = {};
+
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+
+	/* Empty allowed_perm returns ENOMSG (useless deny rule). */
+	attr.allowed_perm = 0;
+	attr.namespace_types = CLONE_NEWUTS;
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+					&attr, 0));
+	ASSERT_EQ(ENOMSG, errno);
+
+	/* allowed_perm with unhandled bit. */
+	attr.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER |
+			    LANDLOCK_PERM_CAPABILITY_USE;
+	attr.namespace_types = CLONE_NEWUTS;
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+					&attr, 0));
+	ASSERT_EQ(EINVAL, errno);
+
+	/* allowed_perm with wrong type. */
+	attr.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE;
+	attr.namespace_types = CLONE_NEWUTS;
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+					&attr, 0));
+	ASSERT_EQ(EINVAL, errno);
+
+	/*
+	 * Unknown namespace bits (e.g. bit 63) are silently accepted
+	 * for forward compatibility.  Only known CLONE_NEW* bits are stored.
+	 */
+	attr.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER;
+	attr.namespace_types = 1ULL << 63;
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &attr, 0));
+
+	/* Useless rule: empty namespace_types bitmask. */
+	attr.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER;
+	attr.namespace_types = 0;
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+					&attr, 0));
+	ASSERT_EQ(ENOMSG, errno);
+
+	/*
+	 * Bit 1 is not a CLONE_NEW* value but is silently accepted
+	 * for forward compatibility (no hole rejection).
+	 */
+	attr.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER;
+	attr.namespace_types = (1ULL << 1);
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &attr, 0));
+
+	/* Multi-bit values are valid (bitmask allows multiple types). */
+	attr.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER;
+	attr.namespace_types = CLONE_NEWUTS | CLONE_NEWNET;
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &attr, 0));
+
+	/* Non-zero flags must be rejected. */
+	attr.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER;
+	attr.namespace_types = CLONE_NEWUTS;
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+					&attr, 1));
+	ASSERT_EQ(EINVAL, errno);
+
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/*
+	 * Ruleset handles PERM_CAPABILITY_USE but not PERM_NAMESPACE_ENTER:
+	 * adding a namespace rule must be rejected.
+	 */
+	ruleset_fd = landlock_create_ruleset(&cap_only_attr,
+					     sizeof(cap_only_attr), 0);
+	ASSERT_LE(0, ruleset_fd);
+	attr.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER;
+	attr.namespace_types = CLONE_NEWUTS;
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+					&attr, 0));
+	ASSERT_EQ(EINVAL, errno);
+	EXPECT_EQ(0, close(ruleset_fd));
+}
+
+/*
+ * Unknown namespace types in the upper range are silently accepted
+ * (allow-list: they have no effect since the kernel never checks them).
+ */
+TEST(add_rule_unknown)
+{
+	int ruleset_fd;
+	struct landlock_namespace_attr attr = {
+		.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+	};
+
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+
+	/*
+	 * Bit 31 is in the lower 32 bits but not a CLONE_NEW* value.
+	 * Silently accepted for forward compatibility (no hole rejection).
+	 */
+	attr.namespace_types = 1ULL << 31;
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &attr, 0));
+
+	/* Bit 32 is in the unknown upper range: silently accepted. */
+	attr.namespace_types = 1ULL << 32;
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &attr, 0));
+
+	EXPECT_EQ(0, close(ruleset_fd));
+}
+
+/* Namespace creation tests (variant-based positive/negative) */
+
+/* clang-format off */
+FIXTURE(ns_create) {
+	char proc_path[NS_PROC_PATH_MAX];
+};
+/* clang-format on */
+
+FIXTURE_VARIANT(ns_create)
+{
+	const __u64 namespace_types;
+	const bool is_sandboxed;
+	const bool has_rule;
+	const bool drop_all_caps;
+	const int expected_result;
+};
+
+/*
+ * Unsandboxed baseline: no Landlock domain is enforced.
+ * User namespace creation should succeed without any restriction.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, user_unsandboxed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWUSER,
+	.is_sandboxed = false,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = 0,
+};
+
+/*
+ * User namespace creation denied: handled by Landlock but no rule
+ * allows CLONE_NEWUSER.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, user_denied) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWUSER,
+	.is_sandboxed = true,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = EPERM,
+};
+
+/*
+ * User namespace creation allowed: Landlock rule permits CLONE_NEWUSER.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, user_allowed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWUSER,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = false,
+	.expected_result = 0,
+};
+
+/*
+ * User namespace creation while unprivileged: the process has no
+ * capabilities but unshare(CLONE_NEWUSER) is an unprivileged
+ * operation so it still succeeds.  The Landlock rule allows it.
+ * For setns, userns_install() rejects re-entry into the caller's
+ * own user namespace with EINVAL before checking capabilities.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, user_unprivileged) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWUSER,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = true,
+	.expected_result = 0,
+};
+
+/*
+ * Unsandboxed baseline for non-user namespace: no Landlock domain,
+ * process has CAP_SYS_ADMIN.  UTS creation should succeed.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, uts_unsandboxed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWUTS,
+	.is_sandboxed = false,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = 0,
+};
+
+/*
+ * Non-user namespace denied: process has CAP_SYS_ADMIN (passes
+ * ns_capable), but Landlock denies (no rule).
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, uts_denied) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWUTS,
+	.is_sandboxed = true,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = EPERM,
+};
+
+/*
+ * Non-user namespace allowed: process has CAP_SYS_ADMIN and Landlock
+ * rule permits CLONE_NEWUTS.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, uts_allowed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWUTS, .is_sandboxed = true, .has_rule = true,
+	.drop_all_caps = false,		 .expected_result = 0,
+};
+
+/*
+ * Unprivileged namespace creation: process lacks CAP_SYS_ADMIN, so the
+ * kernel denies creation regardless of Landlock rules.  Landlock cannot
+ * authorize what the kernel denied (LSM hooks are restriction-only).
+ * The rule is present to verify Landlock does not change the error code.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, uts_unprivileged) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWUTS,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = true,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, ipc_denied) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWIPC,
+	.is_sandboxed = true,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, ipc_allowed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWIPC, .is_sandboxed = true, .has_rule = true,
+	.drop_all_caps = false,		 .expected_result = 0,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, ipc_unprivileged) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWIPC,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = true,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, mnt_denied) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWNS,
+	.is_sandboxed = true,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, mnt_allowed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWNS, .is_sandboxed = true, .has_rule = true,
+	.drop_all_caps = false,		.expected_result = 0,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, mnt_unprivileged) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWNS,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = true,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, cgroup_denied) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWCGROUP,
+	.is_sandboxed = true,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, cgroup_allowed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWCGROUP,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = false,
+	.expected_result = 0,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, cgroup_unprivileged) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWCGROUP,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = true,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, pid_denied) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWPID,
+	.is_sandboxed = true,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, pid_allowed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWPID, .is_sandboxed = true, .has_rule = true,
+	.drop_all_caps = false,		 .expected_result = 0,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, pid_unprivileged) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWPID,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = true,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, net_denied) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWNET,
+	.is_sandboxed = true,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, net_allowed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWNET, .is_sandboxed = true, .has_rule = true,
+	.drop_all_caps = false,		 .expected_result = 0,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, net_unprivileged) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWNET,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = true,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, time_denied) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWTIME,
+	.is_sandboxed = true,
+	.has_rule = false,
+	.drop_all_caps = false,
+	.expected_result = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, time_allowed) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWTIME,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = false,
+	.expected_result = 0,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_create, time_unprivileged) {
+	/* clang-format on */
+	.namespace_types = CLONE_NEWTIME,
+	.is_sandboxed = true,
+	.has_rule = true,
+	.drop_all_caps = true,
+	.expected_result = EPERM,
+};
+
+FIXTURE_SETUP(ns_create)
+{
+	if (!ns_is_supported(variant->namespace_types, self->proc_path,
+			     sizeof(self->proc_path))) {
+		/* UML does not support the time namespace. */
+		if (variant->namespace_types == CLONE_NEWTIME)
+			SKIP(return, "CLONE_NEWTIME not supported");
+
+		ASSERT_TRUE(false)
+		{
+			TH_LOG("Namespace type 0x%llx not supported",
+			       (unsigned long long)variant->namespace_types);
+		}
+	}
+
+	if (variant->drop_all_caps)
+		drop_caps(_metadata);
+	else
+		disable_caps(_metadata);
+}
+
+FIXTURE_TEARDOWN(ns_create)
+{
+}
+
+TEST_F(ns_create, unshare)
+{
+	int ruleset_fd, err;
+
+	if (variant->is_sandboxed) {
+		ruleset_fd = create_ns_ruleset();
+		ASSERT_LE(0, ruleset_fd);
+
+		if (variant->has_rule)
+			ASSERT_EQ(0, add_ns_rule(ruleset_fd,
+						 variant->namespace_types));
+
+		enforce_ruleset(_metadata, ruleset_fd);
+		EXPECT_EQ(0, close(ruleset_fd));
+	}
+
+	/*
+	 * Non-user namespaces need CAP_SYS_ADMIN for the privileged path.
+	 * User namespaces and unprivileged tests skip this.
+	 */
+	if (!variant->drop_all_caps &&
+	    variant->namespace_types != CLONE_NEWUSER)
+		set_cap(_metadata, CAP_SYS_ADMIN);
+
+	err = unshare(variant->namespace_types);
+	if (variant->expected_result) {
+		EXPECT_EQ(-1, err);
+		EXPECT_EQ(variant->expected_result, errno);
+	} else {
+		EXPECT_EQ(0, err);
+	}
+
+	if (!variant->drop_all_caps &&
+	    variant->namespace_types != CLONE_NEWUSER)
+		clear_cap(_metadata, CAP_SYS_ADMIN);
+}
+
+/*
+ * clone3 exercises a different kernel entry point than unshare: it goes
+ * through kernel_clone() -> copy_process() -> copy_namespaces() ->
+ * create_new_namespaces().  Both paths converge at __ns_common_init() ->
+ * security_namespace_alloc(), but the entry point and argument handling
+ * differ.
+ */
+TEST_F(ns_create, clone3)
+{
+	int ruleset_fd, status;
+	pid_t pid;
+	struct clone_args args = {};
+
+	if (variant->is_sandboxed) {
+		ruleset_fd = create_ns_ruleset();
+		ASSERT_LE(0, ruleset_fd);
+
+		if (variant->has_rule)
+			ASSERT_EQ(0, add_ns_rule(ruleset_fd,
+						 variant->namespace_types));
+
+		enforce_ruleset(_metadata, ruleset_fd);
+		EXPECT_EQ(0, close(ruleset_fd));
+	}
+
+	if (!variant->drop_all_caps &&
+	    variant->namespace_types != CLONE_NEWUSER)
+		set_cap(_metadata, CAP_SYS_ADMIN);
+
+	args.flags = variant->namespace_types;
+	args.exit_signal = SIGCHLD;
+	pid = sys_clone3(&args, sizeof(args));
+	if (pid == 0)
+		_exit(EXIT_SUCCESS);
+
+	if (variant->expected_result) {
+		EXPECT_EQ(-1, pid);
+		EXPECT_EQ(variant->expected_result, errno);
+	} else {
+		EXPECT_LE(0, pid);
+		ASSERT_EQ(pid, waitpid(pid, &status, 0));
+		ASSERT_EQ(1, WIFEXITED(status));
+		ASSERT_EQ(EXIT_SUCCESS, WEXITSTATUS(status));
+	}
+
+	if (!variant->drop_all_caps &&
+	    variant->namespace_types != CLONE_NEWUSER)
+		clear_cap(_metadata, CAP_SYS_ADMIN);
+}
+
+/*
+ * setns exercises the namespace install path: validate_ns() ->
+ * security_namespace_install() -> hook_namespace_install().  This is a
+ * different LSM hook than creation, so it must be tested separately for
+ * each type.
+ *
+ * Mount namespace setns requires both CAP_SYS_ADMIN and CAP_SYS_CHROOT
+ * (checked by mntns_install), so the allowed variant sets both.
+ */
+TEST_F(ns_create, setns)
+{
+	int ruleset_fd, ns_fd, err, expected;
+
+	/*
+	 * setns into the process's own user NS always returns EINVAL:
+	 * userns_install() rejects re-entry before checking capabilities.
+	 */
+	if (variant->namespace_types == CLONE_NEWUSER) {
+		expected = EINVAL;
+	} else {
+		expected = variant->expected_result;
+	}
+
+	/* Open the NS FD before enforcing the domain. */
+	ns_fd = open(self->proc_path, O_RDONLY);
+	ASSERT_LE(0, ns_fd);
+
+	if (variant->is_sandboxed) {
+		ruleset_fd = create_ns_ruleset();
+		ASSERT_LE(0, ruleset_fd);
+
+		if (variant->has_rule)
+			ASSERT_EQ(0, add_ns_rule(ruleset_fd,
+						 variant->namespace_types));
+
+		enforce_ruleset(_metadata, ruleset_fd);
+		EXPECT_EQ(0, close(ruleset_fd));
+	}
+
+	if (!variant->drop_all_caps) {
+		set_cap(_metadata, CAP_SYS_ADMIN);
+		/*
+		 * mntns_install() requires CAP_SYS_CHROOT in addition to
+		 * CAP_SYS_ADMIN.
+		 */
+		if (variant->namespace_types == CLONE_NEWNS)
+			set_cap(_metadata, CAP_SYS_CHROOT);
+	}
+
+	err = setns(ns_fd, variant->namespace_types);
+	if (expected) {
+		EXPECT_EQ(-1, err);
+		EXPECT_EQ(expected, errno);
+	} else {
+		EXPECT_EQ(0, err);
+	}
+
+	if (!variant->drop_all_caps) {
+		clear_cap(_metadata, CAP_SYS_ADMIN);
+		if (variant->namespace_types == CLONE_NEWNS)
+			clear_cap(_metadata, CAP_SYS_CHROOT);
+	}
+
+	EXPECT_EQ(0, close(ns_fd));
+}
+
+/* Additional namespace creation tests */
+
+/*
+ * When LANDLOCK_PERM_NAMESPACE_ENTER is not handled by any domain, namespace
+ * creation must produce the same result as without Landlock.  Unlike the
+ * unsandboxed variants of ns_create (which have no domain at all), this test
+ * verifies that a domain handling only FS access does not interfere with
+ * namespace operations.
+ */
+TEST(ns_create_unhandled)
+{
+	const struct landlock_ruleset_attr attr = {
+		.handled_access_fs = LANDLOCK_ACCESS_FS_READ_FILE,
+	};
+	int ruleset_fd;
+
+	disable_caps(_metadata);
+
+	ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);
+	ASSERT_LE(0, ruleset_fd);
+
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/* User namespace creation should still work (unhandled). */
+	EXPECT_EQ(0, unshare(CLONE_NEWUSER));
+}
+
+/*
+ * Layer stacking: layer 1 always allows CLONE_NEWUSER.  Layer 2
+ * either allows (both layers agree -> success) or denies (any layer
+ * can deny -> failure).
+ */
+/* clang-format off */
+FIXTURE(ns_stacking) {};
+/* clang-format on */
+
+FIXTURE_VARIANT(ns_stacking)
+{
+	bool second_layer_allows;
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_stacking, deny) {
+	/* clang-format on */
+	.second_layer_allows = false,
+};
+
+/* Both layers allow CLONE_NEWUSER -> operation succeeds. */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(ns_stacking, allow) {
+	/* clang-format on */
+	.second_layer_allows = true,
+};
+
+FIXTURE_SETUP(ns_stacking)
+{
+	disable_caps(_metadata);
+}
+
+FIXTURE_TEARDOWN(ns_stacking)
+{
+}
+
+/*
+ * Verify that a second Landlock layer can restrict what the first layer
+ * allows.  Each layer stores its permission bitmask independently, and
+ * enforcement requires all layers to allow an operation.  This ensures
+ * the correct intersection: layer 1 allows CLONE_NEWUSER, but if layer
+ * 2 does not also allow it, the operation is denied.
+ */
+TEST_F(ns_stacking, two_layers)
+{
+	int ruleset_fd;
+
+	/* First layer: allow CLONE_NEWUSER. */
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+	ASSERT_EQ(0, add_ns_rule(ruleset_fd, CLONE_NEWUSER));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/* Second layer: allow or deny depending on variant. */
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+	if (variant->second_layer_allows)
+		ASSERT_EQ(0, add_ns_rule(ruleset_fd, CLONE_NEWUSER));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	if (variant->second_layer_allows) {
+		EXPECT_EQ(0, unshare(CLONE_NEWUSER));
+	} else {
+		EXPECT_EQ(-1, unshare(CLONE_NEWUSER));
+		EXPECT_EQ(EPERM, errno);
+	}
+}
+
+/*
+ * Combined capability and namespace permissions in a single domain.
+ * Verifies that both permission types can coexist and are enforced
+ * independently.
+ */
+TEST(combined_cap_ns)
+{
+	const struct landlock_ruleset_attr attr = {
+		.handled_perm = LANDLOCK_PERM_CAPABILITY_USE |
+				LANDLOCK_PERM_NAMESPACE_ENTER,
+	};
+	const struct landlock_capability_attr cap_attr = {
+		.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE,
+		.capabilities = (1ULL << CAP_SYS_ADMIN),
+	};
+	const struct landlock_namespace_attr ns_attr = {
+		.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+		.namespace_types = CLONE_NEWUSER,
+	};
+	int ruleset_fd;
+
+	/* Isolate hostname changes from other tests. */
+	ASSERT_EQ(0, unshare(CLONE_NEWUTS));
+
+	disable_caps(_metadata);
+
+	ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);
+	ASSERT_LE(0, ruleset_fd);
+
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+				       &cap_attr, 0));
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &ns_attr, 0));
+
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/* CAP_SYS_ADMIN use allowed by capability rule. */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(0, sethostname("test", 4));
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+
+	/* CAP_SYS_CHROOT denied (not in allowed capability rules). */
+	set_cap(_metadata, CAP_SYS_CHROOT);
+	EXPECT_EQ(-1, chroot("/"));
+	EXPECT_EQ(EPERM, errno);
+
+	/*
+	 * UTS namespace creation denied by Landlock (not in allowed namespace
+	 * rules).  CAP_SYS_ADMIN is needed for the kernel's ns_capable()
+	 * check to pass, so that Landlock's hook is actually reached.
+	 */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(-1, unshare(CLONE_NEWUTS));
+	EXPECT_EQ(EPERM, errno);
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+
+	/* User namespace creation allowed by namespace rule. */
+	EXPECT_EQ(0, unshare(CLONE_NEWUSER));
+}
+
+/*
+ * Partial allow: one namespace type is allowed, another is denied.
+ * Verifies that rules are per-type.
+ */
+TEST(ns_create_partial)
+{
+	int ruleset_fd;
+
+	disable_caps(_metadata);
+
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+
+	/* Only allow UTS namespace creation. */
+	ASSERT_EQ(0, add_ns_rule(ruleset_fd, CLONE_NEWUTS));
+
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/* UTS namespace should be allowed. */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(0, unshare(CLONE_NEWUTS));
+
+	/* User namespace should be denied (no rule). */
+	EXPECT_EQ(-1, unshare(CLONE_NEWUSER));
+	EXPECT_EQ(EPERM, errno);
+}
+
+/* clang-format off */
+FIXTURE(setns_cross_process) {};
+/* clang-format on */
+
+FIXTURE_VARIANT(setns_cross_process)
+{
+	bool is_sandboxed;
+	int expected_setns;
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(setns_cross_process, denied) {
+	/* clang-format on */
+	.is_sandboxed = true,
+	.expected_setns = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(setns_cross_process, allowed) {
+	/* clang-format on */
+	.is_sandboxed = false,
+	.expected_setns = 0,
+};
+
+FIXTURE_SETUP(setns_cross_process)
+{
+}
+
+FIXTURE_TEARDOWN(setns_cross_process)
+{
+}
+
+/*
+ * setns into a child's UTS namespace: when sandboxed with
+ * LANDLOCK_PERM_NAMESPACE_ENTER denying UTS, the rule-based check
+ * applies regardless of which process created the namespace.
+ */
+TEST_F(setns_cross_process, setns)
+{
+	int ruleset_fd, ns_fd, status;
+	pid_t child;
+	int pipe_parent[2], pipe_child[2];
+	char buf, path[64];
+
+	disable_caps(_metadata);
+
+	/*
+	 * Enable dumpable so the parent can read /proc/<child>/ns/uts.
+	 * Without this, ptrace access checks (PTRACE_MODE_READ) prevent
+	 * opening another process's namespace entries.
+	 */
+	ASSERT_EQ(0, prctl(PR_SET_DUMPABLE, 1, 0, 0, 0));
+
+	ASSERT_EQ(0, pipe2(pipe_parent, O_CLOEXEC));
+	ASSERT_EQ(0, pipe2(pipe_child, O_CLOEXEC));
+
+	child = fork();
+	ASSERT_LE(0, child);
+
+	if (child == 0) {
+		EXPECT_EQ(0, close(pipe_parent[1]));
+		EXPECT_EQ(0, close(pipe_child[0]));
+
+		/* Child: create a UTS namespace. */
+		set_cap(_metadata, CAP_SYS_ADMIN);
+		ASSERT_EQ(0, unshare(CLONE_NEWUTS));
+
+		drop_caps(_metadata);
+		ASSERT_EQ(0, prctl(PR_SET_DUMPABLE, 1, 0, 0, 0));
+
+		/* Signal parent that the namespace is ready. */
+		ASSERT_EQ(1, write(pipe_child[1], ".", 1));
+
+		/* Wait for parent to finish testing. */
+		ASSERT_EQ(1, read(pipe_parent[0], &buf, 1));
+		_exit(_metadata->exit_code);
+	}
+
+	EXPECT_EQ(0, close(pipe_parent[0]));
+	EXPECT_EQ(0, close(pipe_child[1]));
+
+	/* Wait for child namespace. */
+	ASSERT_EQ(1, read(pipe_child[0], &buf, 1));
+	EXPECT_EQ(0, close(pipe_child[0]));
+
+	/* Open the child's NS FD BEFORE creating the domain. */
+	snprintf(path, sizeof(path), "/proc/%d/ns/uts", child);
+	ns_fd = open(path, O_RDONLY);
+	ASSERT_LE(0, ns_fd);
+
+	if (variant->is_sandboxed) {
+		/* Create domain denying UTS entry (no allow rule). */
+		ruleset_fd = create_ns_ruleset();
+		ASSERT_LE(0, ruleset_fd);
+		enforce_ruleset(_metadata, ruleset_fd);
+		EXPECT_EQ(0, close(ruleset_fd));
+	}
+
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	if (variant->expected_setns) {
+		EXPECT_EQ(-1, setns(ns_fd, CLONE_NEWUTS));
+		EXPECT_EQ(variant->expected_setns, errno);
+	} else {
+		EXPECT_EQ(0, setns(ns_fd, CLONE_NEWUTS));
+	}
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(0, close(ns_fd));
+
+	/* Release child. */
+	ASSERT_EQ(1, write(pipe_parent[1], ".", 1));
+	EXPECT_EQ(0, close(pipe_parent[1]));
+	ASSERT_EQ(child, waitpid(child, &status, 0));
+	ASSERT_EQ(1, WIFEXITED(status));
+	ASSERT_EQ(EXIT_SUCCESS, WEXITSTATUS(status));
+}
+
+/*
+ * Verify that both LANDLOCK_PERM_NAMESPACE_ENTER and
+ * LANDLOCK_PERM_CAPABILITY_USE apply simultaneously: creating or
+ * entering a non-user namespace requires both the namespace type and
+ * CAP_SYS_ADMIN to be allowed.  User namespace creation is the
+ * exception (no capable() call from the kernel).
+ */
+TEST(setns_and_create)
+{
+	int ruleset_fd, ns_fd;
+	const struct landlock_ruleset_attr attr = {
+		.handled_perm = LANDLOCK_PERM_NAMESPACE_ENTER |
+				LANDLOCK_PERM_CAPABILITY_USE,
+	};
+	const struct landlock_namespace_attr ns_attr = {
+		.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+		.namespace_types = CLONE_NEWUTS,
+	};
+	const struct landlock_capability_attr cap_attr = {
+		.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE,
+		.capabilities = (1ULL << CAP_SYS_ADMIN),
+	};
+
+	disable_caps(_metadata);
+
+	ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);
+	ASSERT_LE(0, ruleset_fd);
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &ns_attr, 0));
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+				       &cap_attr, 0));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/* UTS unshare: allowed by NS rule + CAP_SYS_ADMIN allowed. */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	ASSERT_EQ(0, unshare(CLONE_NEWUTS));
+
+	/* IPC unshare: denied by NS rule (type not allowed). */
+	EXPECT_EQ(-1, unshare(CLONE_NEWIPC));
+	EXPECT_EQ(EPERM, errno);
+
+	/* setns into current UTS: allowed by NS rule. */
+	ns_fd = open("/proc/self/ns/uts", O_RDONLY);
+	ASSERT_LE(0, ns_fd);
+	EXPECT_EQ(0, setns(ns_fd, CLONE_NEWUTS));
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(0, close(ns_fd));
+
+	/*
+	 * User namespace creation: only LANDLOCK_PERM_NAMESPACE_ENTER needed
+	 * (no capable() call from the kernel for user NS).  Denied
+	 * because CLONE_NEWUSER is not in the allowed namespace types.
+	 */
+	EXPECT_EQ(-1, unshare(CLONE_NEWUSER));
+	EXPECT_EQ(EPERM, errno);
+}
+
+/*
+ * Verify that LANDLOCK_PERM_CAPABILITY_USE can deny the CAP_SYS_ADMIN check
+ * that the kernel performs before the Landlock namespace hook is
+ * reached.  The NS type is allowed but the required capability is not,
+ * so the operation fails on the capability check.
+ *
+ * User namespace creation is the exception: no capable() call, so the
+ * operation succeeds with just LANDLOCK_PERM_NAMESPACE_ENTER.
+ */
+TEST(two_perm_cap_denied)
+{
+	const struct landlock_ruleset_attr attr = {
+		.handled_perm = LANDLOCK_PERM_NAMESPACE_ENTER |
+				LANDLOCK_PERM_CAPABILITY_USE,
+	};
+	const struct landlock_namespace_attr ns_attr = {
+		.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+		.namespace_types = CLONE_NEWUTS | CLONE_NEWUSER,
+	};
+	/* CAP_SYS_ADMIN is NOT allowed. */
+	const struct landlock_capability_attr cap_attr = {
+		.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE,
+		.capabilities = (1ULL << CAP_SYS_CHROOT),
+	};
+	int ruleset_fd;
+
+	disable_caps(_metadata);
+
+	ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);
+	ASSERT_LE(0, ruleset_fd);
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &ns_attr, 0));
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+				       &cap_attr, 0));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/*
+	 * UTS creation: the process holds CAP_SYS_ADMIN but Landlock
+	 * denies it (not in the cap rule), so the kernel's
+	 * ns_capable(CAP_SYS_ADMIN) gate fails before the namespace
+	 * hook is reached.
+	 */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(-1, unshare(CLONE_NEWUTS));
+	EXPECT_EQ(EPERM, errno);
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+
+	/*
+	 * User NS creation: no capable() call from the kernel, so
+	 * only LANDLOCK_PERM_NAMESPACE_ENTER applies.  CLONE_NEWUSER is in the
+	 * allowed set, so this succeeds.
+	 */
+	EXPECT_EQ(0, unshare(CLONE_NEWUSER));
+}
+
+/*
+ * Mount namespace setns is unique: the kernel checks both
+ * CAP_SYS_ADMIN and CAP_SYS_CHROOT in mntns_install().  Verify that
+ * allowing CAP_SYS_ADMIN alone is not sufficient.
+ */
+TEST(two_perm_mnt_setns)
+{
+	const struct landlock_ruleset_attr attr = {
+		.handled_perm = LANDLOCK_PERM_NAMESPACE_ENTER |
+				LANDLOCK_PERM_CAPABILITY_USE,
+	};
+	const struct landlock_namespace_attr ns_attr = {
+		.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+		.namespace_types = CLONE_NEWNS,
+	};
+	const struct landlock_capability_attr cap_admin = {
+		.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE,
+		.capabilities = (1ULL << CAP_SYS_ADMIN),
+	};
+	const struct landlock_capability_attr cap_admin_chroot = {
+		.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE,
+		.capabilities = (1ULL << CAP_SYS_ADMIN) |
+				(1ULL << CAP_SYS_CHROOT),
+	};
+	int ruleset_fd, ns_fd;
+
+	disable_caps(_metadata);
+
+	/* Layer 1: allow mount NS + CAP_SYS_ADMIN only (no CAP_SYS_CHROOT). */
+	ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);
+	ASSERT_LE(0, ruleset_fd);
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &ns_attr, 0));
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+				       &cap_admin, 0));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	ns_fd = open("/proc/self/ns/mnt", O_RDONLY);
+	ASSERT_LE(0, ns_fd);
+
+	/*
+	 * Fails: mntns_install() checks CAP_SYS_ADMIN (allowed) then
+	 * CAP_SYS_CHROOT (denied by LANDLOCK_PERM_CAPABILITY_USE).
+	 */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	set_cap(_metadata, CAP_SYS_CHROOT);
+	EXPECT_EQ(-1, setns(ns_fd, CLONE_NEWNS));
+	EXPECT_EQ(EPERM, errno);
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+	clear_cap(_metadata, CAP_SYS_CHROOT);
+
+	/* Layer 2: also allows CAP_SYS_CHROOT. */
+	ruleset_fd = landlock_create_ruleset(&attr, sizeof(attr), 0);
+	ASSERT_LE(0, ruleset_fd);
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				       &ns_attr, 0));
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+				       &cap_admin_chroot, 0));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/*
+	 * Still fails: layer 1 still denies CAP_SYS_CHROOT.
+	 * Landlock layer stacking means the most restrictive layer wins.
+	 */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	set_cap(_metadata, CAP_SYS_CHROOT);
+	EXPECT_EQ(-1, setns(ns_fd, CLONE_NEWNS));
+	EXPECT_EQ(EPERM, errno);
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+	clear_cap(_metadata, CAP_SYS_CHROOT);
+	EXPECT_EQ(0, close(ns_fd));
+}
+
+/* Audit tests */
+
+static int matches_log_ns_create(int audit_fd, __u64 ns_type)
+{
+	static const char log_template[] = REGEX_LANDLOCK_PREFIX
+		" blockers=perm\\.namespace_enter"
+		" namespace_type=0x%x"
+		" namespace_inum=0$";
+	char log_match[sizeof(log_template) + 10];
+	int log_match_len;
+
+	log_match_len = snprintf(log_match, sizeof(log_match), log_template,
+				 (unsigned int)ns_type);
+	if (log_match_len >= sizeof(log_match))
+		return -E2BIG;
+
+	return audit_match_record(audit_fd, AUDIT_LANDLOCK_ACCESS, log_match,
+				  NULL);
+}
+
+static int matches_log_ns_setns(int audit_fd, __u64 ns_type)
+{
+	static const char log_template[] = REGEX_LANDLOCK_PREFIX
+		" blockers=perm\\.namespace_enter"
+		" namespace_type=0x%x"
+		" namespace_inum=[0-9]\\+$";
+	char log_match[sizeof(log_template) + 10];
+	int log_match_len;
+
+	log_match_len = snprintf(log_match, sizeof(log_match), log_template,
+				 (unsigned int)ns_type);
+	if (log_match_len >= sizeof(log_match))
+		return -E2BIG;
+
+	return audit_match_record(audit_fd, AUDIT_LANDLOCK_ACCESS, log_match,
+				  NULL);
+}
+
+FIXTURE(ns_audit)
+{
+	struct audit_filter audit_filter;
+	int audit_fd;
+};
+
+FIXTURE_SETUP(ns_audit)
+{
+	ASSERT_TRUE(is_in_init_user_ns());
+
+	disable_caps(_metadata);
+
+	set_cap(_metadata, CAP_AUDIT_CONTROL);
+	self->audit_fd = audit_init_with_exe_filter(&self->audit_filter);
+	EXPECT_LE(0, self->audit_fd);
+	clear_cap(_metadata, CAP_AUDIT_CONTROL);
+}
+
+FIXTURE_TEARDOWN(ns_audit)
+{
+	set_cap(_metadata, CAP_AUDIT_CONTROL);
+	EXPECT_EQ(0, audit_cleanup(self->audit_fd, &self->audit_filter));
+}
+
+/*
+ * Verifies that a denied namespace creation produces the expected audit
+ * record with the perm.namespace_enter blocker string and namespace_type.
+ */
+TEST_F(ns_audit, create_denied)
+{
+	struct audit_records records;
+	int ruleset_fd;
+
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(-1, unshare(CLONE_NEWUTS));
+	EXPECT_EQ(EPERM, errno);
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+
+	EXPECT_EQ(0, matches_log_ns_create(self->audit_fd, CLONE_NEWUTS));
+
+	/*
+	 * No extra access records: the denial was already consumed by
+	 * matches_log_ns_create above.  One domain allocation record,
+	 * emitted in the same event as the first access denial for this
+	 * domain.
+	 */
+	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
+	EXPECT_EQ(0, records.access);
+	EXPECT_EQ(1, records.domain);
+}
+
+TEST_F(ns_audit, create_allowed)
+{
+	struct audit_records records;
+	int ruleset_fd;
+
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+	ASSERT_EQ(0, add_ns_rule(ruleset_fd, CLONE_NEWUTS));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(0, unshare(CLONE_NEWUTS));
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+
+	/* No records: allowed operations never trigger audit logging. */
+	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
+	EXPECT_EQ(0, records.access);
+}
+
+TEST_F(ns_audit, setns_allowed)
+{
+	struct audit_records records;
+	int ruleset_fd, ns_fd;
+
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+	ASSERT_EQ(0, add_ns_rule(ruleset_fd, CLONE_NEWUTS));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	ns_fd = open("/proc/self/ns/uts", O_RDONLY);
+	ASSERT_LE(0, ns_fd);
+
+	/* Allowed: should succeed with no audit record. */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(0, setns(ns_fd, CLONE_NEWUTS));
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(0, close(ns_fd));
+
+	/* No records: allowed setns never triggers audit logging. */
+	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
+	EXPECT_EQ(0, records.access);
+}
+
+TEST_F(ns_audit, setns_denied)
+{
+	struct audit_records records;
+	int ruleset_fd, ns_fd;
+
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+	/* No rule allows UTS -> denied. */
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	ns_fd = open("/proc/self/ns/uts", O_RDONLY);
+	ASSERT_LE(0, ns_fd);
+
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(-1, setns(ns_fd, CLONE_NEWUTS));
+	EXPECT_EQ(EPERM, errno);
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(0, close(ns_fd));
+
+	/* Verify the audit record for setns denial. */
+	EXPECT_EQ(0, matches_log_ns_setns(self->audit_fd, CLONE_NEWUTS));
+
+	/*
+	 * No extra access records: the denial was already consumed by
+	 * matches_log_ns_setns above.  One domain allocation record,
+	 * emitted in the same event as the first access denial for this
+	 * domain.
+	 */
+	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
+	EXPECT_EQ(0, records.access);
+	EXPECT_EQ(1, records.domain);
+}
+
+TEST_F(ns_audit, unshare_denied)
+{
+	struct audit_records records;
+	int ruleset_fd;
+
+	ruleset_fd = create_ns_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/* Deny UTS namespace creation (no allow rule). */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(-1, unshare(CLONE_NEWUTS));
+	EXPECT_EQ(EPERM, errno);
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+
+	/* Verify the audit record for namespace creation denial. */
+	EXPECT_EQ(0, matches_log_ns_create(self->audit_fd, CLONE_NEWUTS));
+
+	/*
+	 * No extra access records: the denial was already consumed by
+	 * matches_log_ns_create above.  One domain allocation record,
+	 * emitted in the same event as the first access denial for this
+	 * domain.
+	 */
+	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
+	EXPECT_EQ(0, records.access);
+	EXPECT_EQ(1, records.domain);
+}
+
+TEST_HARNESS_MAIN
diff --git a/tools/testing/selftests/landlock/wrappers.h b/tools/testing/selftests/landlock/wrappers.h
index 65548323e45d..a3266fdb43da 100644
--- a/tools/testing/selftests/landlock/wrappers.h
+++ b/tools/testing/selftests/landlock/wrappers.h
@@ -9,6 +9,7 @@
 
 #define _GNU_SOURCE
 #include <linux/landlock.h>
+#include <linux/sched.h>
 #include <sys/syscall.h>
 #include <sys/types.h>
 #include <unistd.h>
@@ -45,3 +46,8 @@ static inline pid_t sys_gettid(void)
 {
 	return syscall(__NR_gettid);
 }
+
+static inline pid_t sys_clone3(struct clone_args *args, size_t size)
+{
+	return syscall(__NR_clone3, args, size);
+}
-- 
2.53.0


* [RFC PATCH v1 09/11] selftests/landlock: Add capability restriction tests
From: Mickaël Salaün @ 2026-03-12 10:04 UTC (permalink / raw)
  To: Christian Brauner, Günther Noack, Paul Moore,
	Serge E . Hallyn
  Cc: Mickaël Salaün, Justin Suess, Lennart Poettering,
	Mikhail Ivanov, Nicolas Bouchinet, Shervin Oloumi, Tingmao Wang,
	kernel-team, linux-fsdevel, linux-kernel, linux-security-module
In-Reply-To: <20260312100444.2609563-1-mic@digikod.net>

Add tests to exercise LANDLOCK_PERM_CAPABILITY_USE enforcement.  The
tests verify that a sandboxed process is denied a handled capability
when no rule grants it, and that an explicit rule restores the
capability.  They also check that unknown capability values above
CAP_LAST_CAP are silently accepted without effect, which keeps the
allow-list future-proof when new capabilities are added.  A stacking
test creates two nested domains restricting different capability sets
and confirms that both layers' rules are enforced.  Finally, invalid
rule attributes (wrong flags, out-of-range values) are checked to
return the expected errors.
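
The allow-list and stacking semantics described above can be modeled in
a few lines of user-space C.  This is only a sketch under stated
assumptions, not the kernel implementation: struct sketch_layer and the
sketch_* helpers are hypothetical names, and the capability numbers are
copied from include/uapi/linux/capability.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values mirrored from include/uapi/linux/capability.h. */
#define SKETCH_CAP_SYS_CHROOT 18
#define SKETCH_CAP_SYS_ADMIN 21
#define SKETCH_CAP_LAST_CAP 40 /* CAP_CHECKPOINT_RESTORE */

/* Bitmask of every capability this model knows about. */
#define SKETCH_KNOWN_CAPS ((UINT64_C(1) << (SKETCH_CAP_LAST_CAP + 1)) - 1)

/* Hypothetical model of one Landlock layer; not a kernel structure. */
struct sketch_layer {
	bool handles_caps;     /* layer handles LANDLOCK_PERM_CAPABILITY_USE */
	uint64_t allowed_caps; /* union of the layer's capability rules */
};

/* Adds a rule: unknown bits are accepted but not stored. */
static void sketch_add_rule(struct sketch_layer *layer, uint64_t caps)
{
	layer->allowed_caps |= caps & SKETCH_KNOWN_CAPS;
}

/*
 * A capability is usable only if every layer that handles capabilities
 * allows it.  Layers that do not handle capabilities are skipped.
 */
static bool sketch_cap_allowed(const struct sketch_layer *layers,
			       size_t count, int cap)
{
	size_t i;

	/* Modeling assumption: capabilities outside the known range fail. */
	if (cap < 0 || cap > SKETCH_CAP_LAST_CAP)
		return false;

	/* Walk from the youngest layer to the oldest one. */
	for (i = count; i > 0; i--) {
		const struct sketch_layer *layer = &layers[i - 1];

		if (!layer->handles_caps)
			continue; /* e.g. an FS-only layer */
		if (!(layer->allowed_caps & (UINT64_C(1) << cap)))
			return false;
	}
	return true;
}

/* Mirrors a deny-all capability layer stacked under an FS-only layer. */
static void sketch_demo(void)
{
	struct sketch_layer layers[2] = {
		{ .handles_caps = true },
		{ .handles_caps = false },
	};

	assert(!sketch_cap_allowed(layers, 2, SKETCH_CAP_SYS_ADMIN));

	/* An unknown bit changes nothing; a known bit grants that cap. */
	sketch_add_rule(&layers[0], UINT64_C(1) << 63);
	assert(layers[0].allowed_caps == 0);
	sketch_add_rule(&layers[0], UINT64_C(1) << SKETCH_CAP_SYS_ADMIN);
	assert(sketch_cap_allowed(layers, 2, SKETCH_CAP_SYS_ADMIN));
	assert(!sketch_cap_allowed(layers, 2, SKETCH_CAP_SYS_CHROOT));
}
```

With zero layers the model allows everything, which corresponds to the
unsandboxed baseline variants in the tests.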

Two tests exercise non-standard capability gain paths.  The first
enforces a domain via CAP_SYS_ADMIN (no_new_privs is not set) and
verifies that denied capabilities are blocked even when still in the
effective set.  The second creates a user namespace under a Landlock
domain to verify that capabilities gained through the kernel's user
namespace ownership bypass (cap_capable_helper) are still restricted by
the domain's rules.
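
Both paths rely on the same ordering: the regular capability check runs
first, and the Landlock hook can only further restrict its result.  A
minimal user-space sketch of that ordering follows, assuming a
simplified task model; struct sketch_task and the sketch_* functions are
hypothetical stand-ins, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Values mirrored from include/uapi/linux/capability.h. */
#define SKETCH_CAP_SYS_CHROOT 18
#define SKETCH_CAP_SYS_ADMIN 21

/* Hypothetical summary of a task's state; not a kernel structure. */
struct sketch_task {
	uint64_t effective;    /* effective capability set */
	bool handles_caps;     /* domain handles LANDLOCK_PERM_CAPABILITY_USE */
	uint64_t allowed_caps; /* capabilities allowed by the domain's rules */
};

/* Stand-in for the regular check: 0 when the capability is effective. */
static int sketch_commoncap(const struct sketch_task *task, int cap)
{
	return (task->effective & (UINT64_C(1) << cap)) ? 0 : -EPERM;
}

/* Stand-in for the Landlock hook: it can deny but never grant. */
static int sketch_landlock(const struct sketch_task *task, int cap)
{
	if (!task->handles_caps)
		return 0;
	return (task->allowed_caps & (UINT64_C(1) << cap)) ? 0 : -EPERM;
}

/* The first failing check decides the outcome. */
static int sketch_capable(const struct sketch_task *task, int cap)
{
	int ret = sketch_commoncap(task, cap);

	if (ret)
		return ret;
	return sketch_landlock(task, cap);
}

/* A full effective set, as after creating a user namespace. */
static void sketch_demo(void)
{
	const struct sketch_task task = {
		.effective = UINT64_MAX, /* stand-in for CAP_FULL_SET */
		.handles_caps = true,
		.allowed_caps = UINT64_C(1) << SKETCH_CAP_SYS_ADMIN,
	};

	assert(sketch_capable(&task, SKETCH_CAP_SYS_ADMIN) == 0);
	assert(sketch_capable(&task, SKETCH_CAP_SYS_CHROOT) == -EPERM);
}
```

In this model a rule can never grant a capability that is missing from
the effective set, which matches the restriction-only behaviour the
tests expect.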

Audit tests verify that denied capabilities produce the correct audit
record with the capability number, and that allowed capabilities
generate no denial record.

Test coverage for security/landlock is 90.7% of 2282 lines according to
LLVM 21.

Cc: Christian Brauner <brauner@kernel.org>
Cc: Günther Noack <gnoack@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
 tools/testing/selftests/landlock/base_test.c |  18 +
 tools/testing/selftests/landlock/cap_test.c  | 614 +++++++++++++++++++
 2 files changed, 632 insertions(+)
 create mode 100644 tools/testing/selftests/landlock/cap_test.c

diff --git a/tools/testing/selftests/landlock/base_test.c b/tools/testing/selftests/landlock/base_test.c
index 30d37234086c..a55e8111bbde 100644
--- a/tools/testing/selftests/landlock/base_test.c
+++ b/tools/testing/selftests/landlock/base_test.c
@@ -142,6 +142,24 @@ TEST(errata)
 	ASSERT_EQ(EINVAL, errno);
 }
 
+#define PERM_LAST LANDLOCK_PERM_CAPABILITY_USE
+
+TEST(ruleset_with_unknown_perm)
+{
+	__u64 perm_mask;
+
+	for (perm_mask = 1ULL << 63; perm_mask != PERM_LAST; perm_mask >>= 1) {
+		struct landlock_ruleset_attr ruleset_attr = {
+			.handled_perm = perm_mask,
+		};
+
+		/* Unknown handled_perm values must be rejected. */
+		ASSERT_EQ(-1, landlock_create_ruleset(&ruleset_attr,
+						      sizeof(ruleset_attr), 0));
+		ASSERT_EQ(EINVAL, errno);
+	}
+}
+
 /* Tests ordering of syscall argument checks. */
 TEST(create_ruleset_checks_ordering)
 {
diff --git a/tools/testing/selftests/landlock/cap_test.c b/tools/testing/selftests/landlock/cap_test.c
new file mode 100644
index 000000000000..7ae978dff808
--- /dev/null
+++ b/tools/testing/selftests/landlock/cap_test.c
@@ -0,0 +1,614 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Landlock tests - Capability restriction
+ *
+ * Copyright © 2026 Cloudflare
+ */
+
+#define _GNU_SOURCE
+#include <errno.h>
+#include <fcntl.h>
+#include <linux/capability.h>
+#include <linux/landlock.h>
+#include <sched.h>
+#include <stdio.h>
+#include <string.h>
+#include <sys/wait.h>
+#include <unistd.h>
+
+#include "audit.h"
+#include "common.h"
+
+static int create_cap_ruleset(void)
+{
+	const struct landlock_ruleset_attr attr = {
+		.handled_perm = LANDLOCK_PERM_CAPABILITY_USE,
+	};
+
+	return landlock_create_ruleset(&attr, sizeof(attr), 0);
+}
+
+static int add_cap_rule(int ruleset_fd, __u64 cap)
+{
+	const struct landlock_capability_attr attr = {
+		.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE,
+		.capabilities = (1ULL << cap),
+	};
+
+	return landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY, &attr,
+				 0);
+}
+
+TEST(add_rule_bad_attr)
+{
+	const struct landlock_ruleset_attr ns_only_attr = {
+		.handled_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+	};
+	int ruleset_fd;
+	struct landlock_capability_attr attr = {};
+
+	ruleset_fd = create_cap_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+
+	/* Empty allowed_perm returns ENOMSG (useless deny rule). */
+	attr.allowed_perm = 0;
+	attr.capabilities = (1ULL << CAP_NET_RAW);
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+					&attr, 0));
+	ASSERT_EQ(ENOMSG, errno);
+
+	/* Useless rule: empty capabilities bitmask. */
+	attr.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE;
+	attr.capabilities = 0;
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+					&attr, 0));
+	ASSERT_EQ(ENOMSG, errno);
+
+	/* allowed_perm with unhandled bit. */
+	attr.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE |
+			    LANDLOCK_PERM_NAMESPACE_ENTER;
+	attr.capabilities = (1ULL << CAP_NET_RAW);
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+					&attr, 0));
+	ASSERT_EQ(EINVAL, errno);
+
+	/* allowed_perm with wrong type. */
+	attr.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER;
+	attr.capabilities = (1ULL << CAP_NET_RAW);
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+					&attr, 0));
+	ASSERT_EQ(EINVAL, errno);
+
+	/*
+	 * Unknown capability bits (e.g. bit 63) are silently accepted
+	 * for forward compatibility.  Only known bits are stored.
+	 */
+	attr.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE;
+	attr.capabilities = 1ULL << 63;
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+				       &attr, 0));
+
+	/* Non-zero flags must be rejected. */
+	attr.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE;
+	attr.capabilities = (1ULL << CAP_NET_RAW);
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+					&attr, 1));
+	ASSERT_EQ(EINVAL, errno);
+
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/*
+	 * Ruleset handles PERM_NAMESPACE_ENTER but not PERM_CAPABILITY_USE:
+	 * adding a capability rule must be rejected.
+	 */
+	ruleset_fd =
+		landlock_create_ruleset(&ns_only_attr, sizeof(ns_only_attr), 0);
+	ASSERT_LE(0, ruleset_fd);
+	attr.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE;
+	attr.capabilities = (1ULL << CAP_NET_RAW);
+	ASSERT_EQ(-1, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+					&attr, 0));
+	ASSERT_EQ(EINVAL, errno);
+	EXPECT_EQ(0, close(ruleset_fd));
+}
+
+/*
+ * Unknown capability values above CAP_LAST_CAP are silently accepted
+ * (allow-list: they have no effect since the kernel never checks them).
+ */
+TEST(add_rule_unknown)
+{
+	int ruleset_fd;
+	struct landlock_capability_attr attr = {
+		.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE,
+	};
+
+	ruleset_fd = create_cap_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+
+	/* Just above CAP_LAST_CAP should succeed. */
+	attr.capabilities = (1ULL << (CAP_LAST_CAP + 1));
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+				       &attr, 0));
+
+	/* High values (below bit 63) should succeed. */
+	attr.capabilities = (1ULL << 62);
+	ASSERT_EQ(0, landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+				       &attr, 0));
+
+	EXPECT_EQ(0, close(ruleset_fd));
+}
+
+/* clang-format off */
+FIXTURE(cap_enforce) {};
+/* clang-format on */
+
+FIXTURE_VARIANT(cap_enforce)
+{
+	const bool is_sandboxed;
+	const bool handle_caps;
+	const __u64 allowed_cap;
+	const int expected_sysadmin;
+	const int expected_chroot;
+};
+
+/*
+ * Unsandboxed baseline: no Landlock domain is enforced.
+ * Both capabilities should work normally.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(cap_enforce, unsandboxed) {
+	/* clang-format on */
+	.is_sandboxed = false,	.handle_caps = false, .allowed_cap = 0,
+	.expected_sysadmin = 0, .expected_chroot = 0,
+};
+
+/*
+ * Denied: capabilities are handled but no rule allows them.
+ * All capability checks must be denied by Landlock even if the
+ * capability is effective.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(cap_enforce, denied) {
+	/* clang-format on */
+	.is_sandboxed = true,	    .handle_caps = true,      .allowed_cap = 0,
+	.expected_sysadmin = EPERM, .expected_chroot = EPERM,
+};
+
+/*
+ * Allowed: CAP_SYS_ADMIN is allowed by rule, CAP_SYS_CHROOT is not.
+ * Only the explicitly allowed capability should succeed.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(cap_enforce, allowed) {
+	/* clang-format on */
+	.is_sandboxed = true,	      .handle_caps = true,
+	.allowed_cap = CAP_SYS_ADMIN, .expected_sysadmin = 0,
+	.expected_chroot = EPERM,
+};
+
+/*
+ * Unhandled: the ruleset does not handle LANDLOCK_PERM_CAPABILITY_USE
+ * at all (only handles FS access).  Both capabilities should work
+ * since the domain does not restrict them.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(cap_enforce, unhandled) {
+	/* clang-format on */
+	.is_sandboxed = true,	.handle_caps = false, .allowed_cap = 0,
+	.expected_sysadmin = 0, .expected_chroot = 0,
+};
+
+FIXTURE_SETUP(cap_enforce)
+{
+	disable_caps(_metadata);
+}
+
+FIXTURE_TEARDOWN(cap_enforce)
+{
+}
+
+/*
+ * Capability enforcement: tests the four fundamental enforcement
+ * scenarios (unsandboxed baseline, denied, allowed, unhandled) using
+ * two independent capability checks (sethostname for CAP_SYS_ADMIN,
+ * chroot for CAP_SYS_CHROOT).
+ */
+TEST_F(cap_enforce, use)
+{
+	int ruleset_fd;
+
+	/* Isolate hostname changes from other tests. */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	ASSERT_EQ(0, unshare(CLONE_NEWUTS));
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+
+	if (variant->is_sandboxed) {
+		if (variant->handle_caps) {
+			ruleset_fd = create_cap_ruleset();
+		} else {
+			const struct landlock_ruleset_attr attr = {
+				.handled_access_fs =
+					LANDLOCK_ACCESS_FS_READ_FILE,
+			};
+
+			ruleset_fd =
+				landlock_create_ruleset(&attr, sizeof(attr), 0);
+		}
+		ASSERT_LE(0, ruleset_fd);
+
+		if (variant->allowed_cap)
+			ASSERT_EQ(0, add_cap_rule(ruleset_fd,
+						  variant->allowed_cap));
+
+		enforce_ruleset(_metadata, ruleset_fd);
+		EXPECT_EQ(0, close(ruleset_fd));
+	}
+
+	/* Test CAP_SYS_ADMIN via sethostname. */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	if (variant->expected_sysadmin) {
+		EXPECT_EQ(-1, sethostname("test", 4));
+		EXPECT_EQ(variant->expected_sysadmin, errno);
+	} else {
+		EXPECT_EQ(0, sethostname("test", 4));
+	}
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+
+	/* Test CAP_SYS_CHROOT via chroot. */
+	set_cap(_metadata, CAP_SYS_CHROOT);
+	if (variant->expected_chroot) {
+		EXPECT_EQ(-1, chroot("/"));
+		EXPECT_EQ(variant->expected_chroot, errno);
+	} else {
+		EXPECT_EQ(0, chroot("/"));
+	}
+}
+
+/*
+ * Layer stacking: layer 1 allows CAP_SYS_ADMIN (or nothing in the
+ * mixed_layers variant).  Layer 2 either allows it (both layers agree
+ * -> success), denies it, or is FS-only (any layer can deny -> failure).
+ */
+/* clang-format off */
+FIXTURE(cap_stacking) {};
+/* clang-format on */
+
+FIXTURE_VARIANT(cap_stacking)
+{
+	const bool is_sandboxed;
+	const bool second_layer_allows;
+	const bool second_layer_is_fs_only;
+	const int expected_sysadmin;
+	const int expected_chroot;
+};
+
+/*
+ * Unsandboxed baseline: no Landlock layers are stacked.
+ * Both capabilities should work normally.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(cap_stacking, unsandboxed) {
+	/* clang-format on */
+	.is_sandboxed = false,
+	.second_layer_allows = false,
+	.expected_sysadmin = 0,
+	.expected_chroot = 0,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(cap_stacking, deny) {
+	/* clang-format on */
+	.is_sandboxed = true,
+	.second_layer_allows = false,
+	.expected_sysadmin = EPERM,
+	.expected_chroot = EPERM,
+};
+
+/* clang-format off */
+FIXTURE_VARIANT_ADD(cap_stacking, allow) {
+	/* clang-format on */
+	.is_sandboxed = true,
+	.second_layer_allows = true,
+	.expected_sysadmin = 0,
+	.expected_chroot = EPERM,
+};
+
+/*
+ * Mixed layers: first layer handles PERM_CAPABILITY_USE (denies all
+ * caps), second layer is FS-only (does not handle it).  The perm
+ * walker iterates from youngest (layer 1) to oldest (layer 0) and
+ * must skip the FS-only layer to find the denying layer beneath.
+ */
+/* clang-format off */
+FIXTURE_VARIANT_ADD(cap_stacking, mixed_layers) {
+	/* clang-format on */
+	.is_sandboxed = true,
+	.second_layer_is_fs_only = true,
+	.expected_sysadmin = EPERM,
+	.expected_chroot = EPERM,
+};
+
+FIXTURE_SETUP(cap_stacking)
+{
+	disable_caps(_metadata);
+}
+
+FIXTURE_TEARDOWN(cap_stacking)
+{
+}
+
+TEST_F(cap_stacking, two_layers)
+{
+	int ruleset_fd;
+
+	if (variant->is_sandboxed) {
+		/* First layer: always handles PERM_CAPABILITY_USE. */
+		ruleset_fd = create_cap_ruleset();
+		ASSERT_LE(0, ruleset_fd);
+		if (!variant->second_layer_is_fs_only)
+			ASSERT_EQ(0, add_cap_rule(ruleset_fd, CAP_SYS_ADMIN));
+
+		enforce_ruleset(_metadata, ruleset_fd);
+		EXPECT_EQ(0, close(ruleset_fd));
+
+		if (variant->second_layer_is_fs_only) {
+			/*
+			 * Second layer: FS-only (does not handle
+			 * PERM_CAPABILITY_USE).  The perm walker must
+			 * skip this layer.
+			 */
+			const struct landlock_ruleset_attr fs_attr = {
+				.handled_access_fs =
+					LANDLOCK_ACCESS_FS_READ_FILE,
+			};
+
+			ruleset_fd = landlock_create_ruleset(
+				&fs_attr, sizeof(fs_attr), 0);
+		} else {
+			/* Second layer: cap allow or deny. */
+			ruleset_fd = create_cap_ruleset();
+			if (variant->second_layer_allows)
+				ASSERT_EQ(0, add_cap_rule(ruleset_fd,
+							  CAP_SYS_ADMIN));
+		}
+		ASSERT_LE(0, ruleset_fd);
+		enforce_ruleset(_metadata, ruleset_fd);
+		EXPECT_EQ(0, close(ruleset_fd));
+	}
+
+	/* Test CAP_SYS_ADMIN via sethostname. */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	if (variant->expected_sysadmin) {
+		EXPECT_EQ(-1, sethostname("test", 4));
+		EXPECT_EQ(variant->expected_sysadmin, errno);
+	} else {
+		EXPECT_EQ(0, sethostname("test", 4));
+	}
+	clear_cap(_metadata, CAP_SYS_ADMIN);
+
+	/* Test CAP_SYS_CHROOT via chroot. */
+	set_cap(_metadata, CAP_SYS_CHROOT);
+	if (variant->expected_chroot) {
+		EXPECT_EQ(-1, chroot("/"));
+		EXPECT_EQ(variant->expected_chroot, errno);
+	} else {
+		EXPECT_EQ(0, chroot("/"));
+	}
+	clear_cap(_metadata, CAP_SYS_CHROOT);
+}
+
+/*
+ * Verify that LANDLOCK_PERM_CAPABILITY_USE is enforced when the domain
+ * is applied without no_new_privs, relying on CAP_SYS_ADMIN to authorize
+ * landlock_restrict_self() instead.  Privileged processes (e.g.
+ * container managers) can sandbox themselves this way.
+ */
+TEST(cap_without_nnp)
+{
+	int ruleset_fd;
+
+	disable_caps(_metadata);
+
+	ruleset_fd = create_cap_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+
+	/* Allow CAP_SYS_CHROOT but not CAP_SYS_ADMIN. */
+	ASSERT_EQ(0, add_cap_rule(ruleset_fd, CAP_SYS_CHROOT));
+
+	/*
+	 * Enforce WITHOUT NNP: landlock_restrict_self() succeeds when
+	 * the caller has CAP_SYS_ADMIN (checked before the new domain
+	 * takes effect).
+	 */
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	ASSERT_EQ(0, landlock_restrict_self(ruleset_fd, 0));
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/*
+	 * CAP_SYS_ADMIN is still in effective set but Landlock denies it:
+	 * cap_capable() returns 0, then hook_capable() returns -EPERM.
+	 */
+	EXPECT_EQ(-1, sethostname("test", 4));
+	EXPECT_EQ(EPERM, errno);
+
+	/* CAP_SYS_CHROOT is allowed by the rule. */
+	set_cap(_metadata, CAP_SYS_CHROOT);
+	EXPECT_EQ(0, chroot("/"));
+}
+
+/*
+ * Verify that capabilities gained through user namespace ownership are
+ * still restricted by LANDLOCK_PERM_CAPABILITY_USE.  When a process creates a
+ * user namespace, the kernel grants CAP_FULL_SET in the new namespace
+ * via cap_capable_helper()'s ownership bypass.  Landlock's hook_capable()
+ * must still deny capabilities not in the allowed set, ensuring that
+ * user namespace creation cannot be used to escape capability restrictions.
+ */
+TEST(cap_userns_ownership_bypass)
+{
+	pid_t child;
+	int status;
+
+	child = fork();
+	ASSERT_LE(0, child);
+	if (child == 0) {
+		int ruleset_fd;
+
+		disable_caps(_metadata);
+
+		ruleset_fd = create_cap_ruleset();
+		ASSERT_LE(0, ruleset_fd);
+
+		/* Allow CAP_SYS_ADMIN only. */
+		ASSERT_EQ(0, add_cap_rule(ruleset_fd, CAP_SYS_ADMIN));
+		enforce_ruleset(_metadata, ruleset_fd);
+		EXPECT_EQ(0, close(ruleset_fd));
+
+		/*
+		 * Create a user namespace.  This is unprivileged and
+		 * does not require capabilities.  LANDLOCK_PERM_NAMESPACE_ENTER
+		 * is not handled so namespace creation is unrestricted.
+		 */
+		ASSERT_EQ(0, unshare(CLONE_NEWUSER));
+
+		/*
+		 * After unshare(CLONE_NEWUSER), the kernel set
+		 * cap_effective = CAP_FULL_SET in the new namespace.
+		 * Create a UTS namespace (requires CAP_SYS_ADMIN in
+		 * the new user NS).  Landlock allows CAP_SYS_ADMIN.
+		 */
+		ASSERT_EQ(0, unshare(CLONE_NEWUTS))
+		{
+			TH_LOG("unshare(CLONE_NEWUTS): %s", strerror(errno));
+		}
+
+		/*
+		 * sethostname checks against uts_ns->user_ns, which is
+		 * now the new user NS.  CAP_SYS_ADMIN is allowed.
+		 */
+		EXPECT_EQ(0, sethostname("test", 4));
+
+		/*
+		 * chroot checks against current_user_ns(), which is
+		 * the new user NS.  The process has CAP_SYS_CHROOT in
+		 * cap_effective (from user NS creation), so cap_capable()
+		 * returns 0.  But Landlock denies because no rule
+		 * allows CAP_SYS_CHROOT.
+		 */
+		EXPECT_EQ(-1, chroot("/"));
+		EXPECT_EQ(EPERM, errno);
+
+		_exit(_metadata->exit_code);
+		return;
+	}
+
+	ASSERT_EQ(child, waitpid(child, &status, 0));
+	if (WIFSIGNALED(status) || !WIFEXITED(status) ||
+	    WEXITSTATUS(status) != EXIT_SUCCESS)
+		_metadata->exit_code = KSFT_FAIL;
+}
+
+/* Audit tests */
+
+static int matches_log_cap(int audit_fd, int cap_number)
+{
+	static const char log_template[] = REGEX_LANDLOCK_PREFIX
+		" blockers=perm\\.capability_use capability=%d $";
+	char log_match[sizeof(log_template) + 10];
+	int log_match_len;
+
+	log_match_len = snprintf(log_match, sizeof(log_match), log_template,
+				 cap_number);
+	if (log_match_len >= sizeof(log_match))
+		return -E2BIG;
+
+	return audit_match_record(audit_fd, AUDIT_LANDLOCK_ACCESS, log_match,
+				  NULL);
+}
+
+FIXTURE(cap_audit)
+{
+	struct audit_filter audit_filter;
+	int audit_fd;
+};
+
+FIXTURE_SETUP(cap_audit)
+{
+	ASSERT_TRUE(is_in_init_user_ns());
+
+	disable_caps(_metadata);
+
+	set_cap(_metadata, CAP_AUDIT_CONTROL);
+	self->audit_fd = audit_init_with_exe_filter(&self->audit_filter);
+	EXPECT_LE(0, self->audit_fd);
+	clear_cap(_metadata, CAP_AUDIT_CONTROL);
+}
+
+FIXTURE_TEARDOWN(cap_audit)
+{
+	set_cap(_metadata, CAP_AUDIT_CONTROL);
+	EXPECT_EQ(0, audit_cleanup(self->audit_fd, &self->audit_filter));
+}
+
+/*
+ * Verifies that a denied capability produces the expected audit record
+ * with the correct capability number and blocker string.
+ */
+TEST_F(cap_audit, denied)
+{
+	struct audit_records records;
+	int ruleset_fd;
+
+	/* Baseline: chroot works before Landlock. */
+	set_cap(_metadata, CAP_SYS_CHROOT);
+	ASSERT_EQ(0, chroot("/"));
+	clear_cap(_metadata, CAP_SYS_CHROOT);
+
+	ruleset_fd = create_cap_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+	/* Allow CAP_AUDIT_CONTROL for child-side audit cleanup. */
+	ASSERT_EQ(0, add_cap_rule(ruleset_fd, CAP_AUDIT_CONTROL));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	/* Deny CAP_SYS_CHROOT (no allow rule). */
+	set_cap(_metadata, CAP_SYS_CHROOT);
+	EXPECT_EQ(-1, chroot("/"));
+	EXPECT_EQ(EPERM, errno);
+	clear_cap(_metadata, CAP_SYS_CHROOT);
+
+	EXPECT_EQ(0, matches_log_cap(self->audit_fd, CAP_SYS_CHROOT));
+
+	/*
+	 * No extra access records: the denial was already consumed by
+	 * matches_log_cap above.  One domain allocation record, emitted
+	 * in the same event as the first access denial for this domain.
+	 */
+	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
+	EXPECT_EQ(0, records.access);
+	EXPECT_EQ(1, records.domain);
+}
+
+TEST_F(cap_audit, allowed)
+{
+	struct audit_records records;
+	int ruleset_fd;
+
+	ruleset_fd = create_cap_ruleset();
+	ASSERT_LE(0, ruleset_fd);
+	ASSERT_EQ(0, add_cap_rule(ruleset_fd, CAP_SYS_ADMIN));
+	/* Allow CAP_AUDIT_CONTROL for child-side audit cleanup. */
+	ASSERT_EQ(0, add_cap_rule(ruleset_fd, CAP_AUDIT_CONTROL));
+	enforce_ruleset(_metadata, ruleset_fd);
+	EXPECT_EQ(0, close(ruleset_fd));
+
+	set_cap(_metadata, CAP_SYS_ADMIN);
+	EXPECT_EQ(0, sethostname("test", 4));
+
+	/* No records: allowed operations never trigger audit logging. */
+	EXPECT_EQ(0, audit_count_records(self->audit_fd, &records));
+	EXPECT_EQ(0, records.access);
+}
+
+TEST_HARNESS_MAIN
-- 
2.53.0



* [RFC PATCH v1 10/11] samples/landlock: Add capability and namespace restriction support
From: Mickaël Salaün @ 2026-03-12 10:04 UTC (permalink / raw)
  To: Christian Brauner, Günther Noack, Paul Moore,
	Serge E . Hallyn
  Cc: Mickaël Salaün, Justin Suess, Lennart Poettering,
	Mikhail Ivanov, Nicolas Bouchinet, Shervin Oloumi, Tingmao Wang,
	kernel-team, linux-fsdevel, linux-kernel, linux-security-module
In-Reply-To: <20260312100444.2609563-1-mic@digikod.net>

Extend the sandboxer sample to demonstrate the new Landlock capability
and namespace restriction features.  The LL_CAPS environment variable
takes a colon-delimited list of allowed capability numbers (e.g. "18"
for CAP_SYS_CHROOT).  The LL_NS variable takes a colon-delimited list of
allowed namespace types by short name (e.g. "user:uts:net").  Update
LANDLOCK_ABI_LAST to 9 and add best-effort degradation for older
kernels.
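
As an illustration of the LL_NS format described above, here is a
minimal sketch that turns a colon-delimited list of short names into a
CLONE_NEW* bitmask.  The helper names ns_name_to_flag() and
ns_list_to_mask() are hypothetical and are not part of this patch; the
sandboxer itself adds one rule per listed type instead of building a
single mask.  Empty list items are skipped, as in the sample.

```c
#include <assert.h>
#include <linux/sched.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Older UAPI headers may lack CLONE_NEWTIME; this is the kernel's value. */
#ifndef CLONE_NEWTIME
#define CLONE_NEWTIME 0x00000080
#endif

/*
 * Hypothetical helper: maps one short name of length @len to its
 * CLONE_NEW* flag.  Returns 0 when the name is unknown.
 */
static uint64_t ns_name_to_flag(const char *name, size_t len)
{
	static const struct {
		const char *name;
		uint64_t flag;
	} map[] = {
		{ "cgroup", CLONE_NEWCGROUP }, { "ipc", CLONE_NEWIPC },
		{ "mnt", CLONE_NEWNS },	       { "net", CLONE_NEWNET },
		{ "pid", CLONE_NEWPID },       { "time", CLONE_NEWTIME },
		{ "user", CLONE_NEWUSER },     { "uts", CLONE_NEWUTS },
	};
	size_t i;

	for (i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if (strlen(map[i].name) == len &&
		    memcmp(name, map[i].name, len) == 0)
			return map[i].flag;
	}
	return 0;
}

/*
 * Hypothetical helper: ORs the flags of every name in a colon-delimited
 * list into @mask.  Returns 0 on success, -1 on the first unknown name
 * (in which case @mask holds only the names parsed so far).
 */
static int ns_list_to_mask(const char *list, uint64_t *mask)
{
	const char *tok = list;

	assert(list && mask);
	*mask = 0;
	for (;;) {
		const char *end = strchr(tok, ':');
		size_t len = end ? (size_t)(end - tok) : strlen(tok);

		if (len > 0) {
			uint64_t flag = ns_name_to_flag(tok, len);

			if (!flag)
				return -1;
			*mask |= flag;
		}
		if (!end)
			return 0;
		tok = end + 1;
	}
}
```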

Allow creating user and UTS namespaces but deny network namespaces
(works as an unprivileged user).  All capabilities are available
(LL_CAPS is not set), but namespace creation is still restricted to the
types listed in LL_NS.  The first command succeeds because user and UTS
types are in the allowed set, and sets the hostname inside the new UTS
namespace.  The second command fails because the network namespace type
is not allowed by the LANDLOCK_PERM_NAMESPACE_ENTER rule:

  LL_FS_RO=/ LL_FS_RW=/proc LL_NS="user:uts" \
    ./sandboxer /bin/sh -c \
    "unshare --user --uts --map-root-user hostname sandbox \
    && ! unshare --user --net true"

Allow only user namespace creation and CAP_SYS_CHROOT (18), denying all
other capabilities and namespace types (works as an unprivileged user).
An unprivileged process creates a user namespace (no capability
required) and calls chroot inside it using the CAP_SYS_CHROOT granted
within the new namespace:

  LL_FS_RO=/ LL_FS_RW="" LL_NS="user" LL_CAPS="18" \
    ./sandboxer /bin/sh -c \
    "unshare --user --keep-caps chroot / true"
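
For reference, the LL_CAPS value "18" used above corresponds to a
one-bit capability mask.  The following is a minimal sketch of that
conversion with the same range check as the sample (the number must fit
in a 64-bit mask).  The helper name cap_str_to_mask() is hypothetical,
and it uses strtoull() rather than the sandboxer's own str2num().

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: converts a decimal capability number such as
 * "18" (CAP_SYS_CHROOT) into a single-bit mask.  Returns 0 on success
 * and -1 if the string is not a plain decimal number or does not fit
 * in the 64-bit mask.
 */
static int cap_str_to_mask(const char *str, uint64_t *mask)
{
	unsigned long long cap;
	char *end;

	assert(str && mask);

	/* Rejects empty strings, signs, and leading whitespace. */
	if (str[0] < '0' || str[0] > '9')
		return -1;

	errno = 0;
	cap = strtoull(str, &end, 10);
	if (errno || *end != '\0')
		return -1;

	/* Same bound as BITS_PER_TYPE() on a 64-bit field. */
	if (cap >= sizeof(*mask) * 8)
		return -1;

	*mask = 1ULL << cap;
	return 0;
}
```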

Cc: Christian Brauner <brauner@kernel.org>
Cc: Günther Noack <gnoack@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
 samples/landlock/sandboxer.c | 164 +++++++++++++++++++++++++++++++++--
 1 file changed, 155 insertions(+), 9 deletions(-)

diff --git a/samples/landlock/sandboxer.c b/samples/landlock/sandboxer.c
index 9f21088c0855..09c499703835 100644
--- a/samples/landlock/sandboxer.c
+++ b/samples/landlock/sandboxer.c
@@ -14,6 +14,8 @@
 #include <fcntl.h>
 #include <linux/landlock.h>
 #include <linux/socket.h>
+#include <sched.h>
+#include <stdbool.h>
 #include <stddef.h>
 #include <stdio.h>
 #include <stdlib.h>
@@ -22,12 +24,16 @@
 #include <sys/stat.h>
 #include <sys/syscall.h>
 #include <unistd.h>
-#include <stdbool.h>
 
 #if defined(__GLIBC__)
 #include <linux/prctl.h>
 #endif
 
+/* From include/linux/bits.h, not available in userspace. */
+#ifndef BITS_PER_TYPE
+#define BITS_PER_TYPE(type) (sizeof(type) * 8)
+#endif
+
 #ifndef landlock_create_ruleset
 static inline int
 landlock_create_ruleset(const struct landlock_ruleset_attr *const attr,
@@ -60,6 +66,8 @@ static inline int landlock_restrict_self(const int ruleset_fd,
 #define ENV_FS_RW_NAME "LL_FS_RW"
 #define ENV_TCP_BIND_NAME "LL_TCP_BIND"
 #define ENV_TCP_CONNECT_NAME "LL_TCP_CONNECT"
+#define ENV_CAPS_NAME "LL_CAPS"
+#define ENV_NS_NAME "LL_NS"
 #define ENV_SCOPED_NAME "LL_SCOPED"
 #define ENV_FORCE_LOG_NAME "LL_FORCE_LOG"
 #define ENV_DELIMITER ":"
@@ -226,11 +234,125 @@ static int populate_ruleset_net(const char *const env_var, const int ruleset_fd,
 	return ret;
 }
 
+static __u64 str2ns(const char *const name)
+{
+	static const struct {
+		const char *name;
+		__u64 value;
+	} ns_map[] = {
+		/* clang-format off */
+		{ "cgroup",	CLONE_NEWCGROUP },
+		{ "ipc",	CLONE_NEWIPC },
+		{ "mnt",	CLONE_NEWNS },
+		{ "net",	CLONE_NEWNET },
+		{ "pid",	CLONE_NEWPID },
+		{ "time",	CLONE_NEWTIME },
+		{ "user",	CLONE_NEWUSER },
+		{ "uts",	CLONE_NEWUTS },
+		/* clang-format on */
+	};
+	size_t i;
+
+	for (i = 0; i < sizeof(ns_map) / sizeof(ns_map[0]); i++) {
+		if (strcmp(name, ns_map[i].name) == 0)
+			return ns_map[i].value;
+	}
+	return 0;
+}
+
+static int populate_ruleset_caps(const char *const env_var,
+				 const int ruleset_fd)
+{
+	int ret = 1;
+	char *env_cap_name, *env_cap_name_next, *strcap;
+	struct landlock_capability_attr cap_attr = {
+		.allowed_perm = LANDLOCK_PERM_CAPABILITY_USE,
+	};
+
+	env_cap_name = getenv(env_var);
+	if (!env_cap_name)
+		return 0;
+	env_cap_name = strdup(env_cap_name);
+	unsetenv(env_var);
+
+	env_cap_name_next = env_cap_name;
+	while ((strcap = strsep(&env_cap_name_next, ENV_DELIMITER))) {
+		__u64 cap;
+
+		if (strcmp(strcap, "") == 0)
+			continue;
+
+		if (str2num(strcap, &cap) ||
+		    cap >= BITS_PER_TYPE(cap_attr.capabilities)) {
+			fprintf(stderr,
+				"Failed to parse capability at \"%s\"\n",
+				strcap);
+			goto out_free_name;
+		}
+		cap_attr.capabilities = 1ULL << cap;
+		if (landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+				      &cap_attr, 0)) {
+			fprintf(stderr,
+				"Failed to update the ruleset with capability \"%llu\": %s\n",
+				(unsigned long long)cap, strerror(errno));
+			goto out_free_name;
+		}
+	}
+	ret = 0;
+
+out_free_name:
+	free(env_cap_name);
+	return ret;
+}
+
+static int populate_ruleset_ns(const char *const env_var, const int ruleset_fd)
+{
+	int ret = 1;
+	char *env_ns_name, *env_ns_name_next, *strns;
+	struct landlock_namespace_attr ns_attr = {
+		.allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+	};
+
+	env_ns_name = getenv(env_var);
+	if (!env_ns_name)
+		return 0;
+	env_ns_name = strdup(env_ns_name);
+	unsetenv(env_var);
+
+	env_ns_name_next = env_ns_name;
+	while ((strns = strsep(&env_ns_name_next, ENV_DELIMITER))) {
+		__u64 ns_type;
+
+		if (strcmp(strns, "") == 0)
+			continue;
+
+		ns_type = str2ns(strns);
+		if (!ns_type) {
+			fprintf(stderr, "Unknown namespace type \"%s\"\n",
+				strns);
+			goto out_free_name;
+		}
+		ns_attr.namespace_types = ns_type;
+		if (landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+				      &ns_attr, 0)) {
+			fprintf(stderr,
+				"Failed to update the ruleset with namespace \"%s\": %s\n",
+				strns, strerror(errno));
+			goto out_free_name;
+		}
+	}
+	ret = 0;
+
+out_free_name:
+	free(env_ns_name);
+	return ret;
+}
+
 /* Returns true on error, false otherwise. */
 static bool check_ruleset_scope(const char *const env_var,
 				struct landlock_ruleset_attr *ruleset_attr)
 {
-	char *env_type_scope, *env_type_scope_next, *ipc_scoping_name;
+	char *env_type_scope, *env_type_scope_next, *scope_name;
 	bool error = false;
 	bool abstract_scoping = false;
 	bool signal_scoping = false;
@@ -247,16 +369,14 @@ static bool check_ruleset_scope(const char *const env_var,
 
 	env_type_scope = strdup(env_type_scope);
 	env_type_scope_next = env_type_scope;
-	while ((ipc_scoping_name =
-			strsep(&env_type_scope_next, ENV_DELIMITER))) {
-		if (strcmp("a", ipc_scoping_name) == 0 && !abstract_scoping) {
+	while ((scope_name = strsep(&env_type_scope_next, ENV_DELIMITER))) {
+		if (strcmp("a", scope_name) == 0 && !abstract_scoping) {
 			abstract_scoping = true;
-		} else if (strcmp("s", ipc_scoping_name) == 0 &&
-			   !signal_scoping) {
+		} else if (strcmp("s", scope_name) == 0 && !signal_scoping) {
 			signal_scoping = true;
 		} else {
 			fprintf(stderr, "Unknown or duplicate scope \"%s\"\n",
-				ipc_scoping_name);
+				scope_name);
 			error = true;
 			goto out_free_name;
 		}
@@ -299,7 +419,7 @@ static bool check_ruleset_scope(const char *const env_var,
 
 /* clang-format on */
 
-#define LANDLOCK_ABI_LAST 8
+#define LANDLOCK_ABI_LAST 9
 
 #define XSTR(s) #s
 #define STR(s) XSTR(s)
@@ -322,6 +442,10 @@ static const char help[] =
 	"means an empty list):\n"
 	"* " ENV_TCP_BIND_NAME ": ports allowed to bind (server)\n"
 	"* " ENV_TCP_CONNECT_NAME ": ports allowed to connect (client)\n"
+	"* " ENV_CAPS_NAME ": capability numbers allowed to use "
+	"(e.g. 10 for CAP_NET_BIND_SERVICE, 21 for CAP_SYS_ADMIN)\n"
+	"* " ENV_NS_NAME ": namespace types allowed to enter "
+	"(cgroup, ipc, mnt, net, pid, time, user, uts)\n"
 	"* " ENV_SCOPED_NAME ": actions denied on the outside of the landlock domain\n"
 	"  - \"a\" to restrict opening abstract unix sockets\n"
 	"  - \"s\" to restrict sending signals\n"
@@ -334,6 +458,8 @@ static const char help[] =
 	ENV_FS_RW_NAME "=\"/dev/null:/dev/full:/dev/zero:/dev/pts:/tmp\" "
 	ENV_TCP_BIND_NAME "=\"9418\" "
 	ENV_TCP_CONNECT_NAME "=\"80:443\" "
+	ENV_CAPS_NAME "=\"21\" "
+	ENV_NS_NAME "=\"user:uts:net\" "
 	ENV_SCOPED_NAME "=\"a:s\" "
 	"%1$s bash -i\n"
 	"\n"
@@ -357,6 +483,8 @@ int main(const int argc, char *const argv[], char *const *const envp)
 				      LANDLOCK_ACCESS_NET_CONNECT_TCP,
 		.scoped = LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET |
 			  LANDLOCK_SCOPE_SIGNAL,
+		.handled_perm = LANDLOCK_PERM_CAPABILITY_USE |
+				LANDLOCK_PERM_NAMESPACE_ENTER,
 	};
 	int supported_restrict_flags = LANDLOCK_RESTRICT_SELF_LOG_NEW_EXEC_ON;
 	int set_restrict_flags = 0;
@@ -438,6 +566,10 @@ int main(const int argc, char *const argv[], char *const *const envp)
 			~LANDLOCK_RESTRICT_SELF_LOG_NEW_EXEC_ON;
 		__attribute__((fallthrough));
 	case 7:
+		__attribute__((fallthrough));
+	case 8:
+		/* Removes permission support for ABI < 9 */
+		ruleset_attr.handled_perm = 0;
 		/* Must be printed for any ABI < LANDLOCK_ABI_LAST. */
 		fprintf(stderr,
 			"Hint: You should update the running kernel "
@@ -470,6 +602,14 @@ int main(const int argc, char *const argv[], char *const *const envp)
 			~LANDLOCK_ACCESS_NET_CONNECT_TCP;
 	}
 
+	/* Removes capability handling if not set by a user. */
+	if (!getenv(ENV_CAPS_NAME))
+		ruleset_attr.handled_perm &= ~LANDLOCK_PERM_CAPABILITY_USE;
+
+	/* Removes namespace handling if not set by a user. */
+	if (!getenv(ENV_NS_NAME))
+		ruleset_attr.handled_perm &= ~LANDLOCK_PERM_NAMESPACE_ENTER;
+
 	if (check_ruleset_scope(ENV_SCOPED_NAME, &ruleset_attr))
 		return 1;
 
@@ -514,6 +654,12 @@ int main(const int argc, char *const argv[], char *const *const envp)
 		goto err_close_ruleset;
 	}
 
+	if (populate_ruleset_caps(ENV_CAPS_NAME, ruleset_fd))
+		goto err_close_ruleset;
+
+	if (populate_ruleset_ns(ENV_NS_NAME, ruleset_fd))
+		goto err_close_ruleset;
+
 	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
 		perror("Failed to restrict privileges");
 		goto err_close_ruleset;
-- 
2.53.0



* [RFC PATCH v1 11/11] landlock: Add documentation for capability and namespace restrictions
From: Mickaël Salaün @ 2026-03-12 10:04 UTC (permalink / raw)
  To: Christian Brauner, Günther Noack, Paul Moore,
	Serge E . Hallyn
  Cc: Mickaël Salaün, Justin Suess, Lennart Poettering,
	Mikhail Ivanov, Nicolas Bouchinet, Shervin Oloumi, Tingmao Wang,
	kernel-team, linux-fsdevel, linux-kernel, linux-security-module
In-Reply-To: <20260312100444.2609563-1-mic@digikod.net>

Document the two new Landlock permission categories in the userspace
API guide, admin guide, and kernel security documentation.

The userspace API guide adds sections on capability restriction
(LANDLOCK_PERM_CAPABILITY_USE with LANDLOCK_RULE_CAPABILITY), namespace
restriction (LANDLOCK_PERM_NAMESPACE_ENTER with LANDLOCK_RULE_NAMESPACE
covering creation via unshare/clone and entry via setns), and the
backward-compatible degradation pattern for ABI < 9.  A table documents
the per-namespace-type capability requirements for both creation and
entry.
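
The degradation pattern for ABI < 9 can be sketched as a pure function
that is easy to test in isolation.  This is only an illustration: the
EXAMPLE_PERM_* values are placeholders chosen here, not the real
LANDLOCK_PERM_* bit values, and degrade_handled_perm() is a hypothetical
name.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Placeholder bit values for illustration only.  Real code would use
 * the LANDLOCK_PERM_* constants from <linux/landlock.h>.
 */
#define EXAMPLE_PERM_CAPABILITY_USE (1ULL << 0)
#define EXAMPLE_PERM_NAMESPACE_ENTER (1ULL << 1)

/*
 * Hypothetical helper: returns the handled_perm value to request for
 * the given Landlock ABI version.  Kernels older than ABI 9 do not know
 * handled_perm, so every permission bit is dropped for them.
 * The caller must already have checked that Landlock is available,
 * i.e. that the reported ABI version is at least 1.
 */
static uint64_t degrade_handled_perm(int abi, uint64_t wanted)
{
	assert(abi >= 1);

	if (abi < 9)
		return 0;
	return wanted;
}
```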

The admin guide adds the new perm.namespace_enter and
perm.capability_use audit blocker names with their object identification
fields (namespace_type, namespace_inum, capability).
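
To show how a log consumer might use the new fields, here is a rough
sketch that extracts the capability number from a record carrying the
perm.capability_use blocker.  The helper name denied_capability() is
hypothetical, and the sketch assumes only that the record contains the
blocker name followed later by a " capability=<number>" field; it makes
no claim about the rest of the record layout.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns the denied capability number found in an
 * audit record text, or -1 if the record does not carry the
 * perm.capability_use blocker or has no parsable capability field.
 */
static int denied_capability(const char *record)
{
	static const char blocker[] = "perm.capability_use";
	static const char field[] = " capability=";
	const char *pos;
	int cap;

	assert(record);

	pos = strstr(record, blocker);
	if (!pos)
		return -1;

	/* Looks for the capability field after the blocker name. */
	pos = strstr(pos + sizeof(blocker) - 1, field);
	if (!pos)
		return -1;

	if (sscanf(pos + sizeof(field) - 1, "%d", &cap) != 1 || cap < 0)
		return -1;
	return cap;
}
```

A made-up record such as "domain=1a2b3c blockers=perm.capability_use
capability=18" would yield 18 (CAP_SYS_CHROOT).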

The kernel security documentation adds a "Ruleset restriction models"
section defining the three models (handled_access_*, handled_perm,
scoped), their coverage and compatibility properties, and the criteria
for choosing between them for future features.  It also documents
composability with user namespaces and adds kernel-doc references for
the new capability and namespace headers.

Cc: Christian Brauner <brauner@kernel.org>
Cc: Günther Noack <gnoack@google.com>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Serge E. Hallyn <serge@hallyn.com>
Signed-off-by: Mickaël Salaün <mic@digikod.net>
---
 Documentation/admin-guide/LSM/landlock.rst |  19 ++-
 Documentation/security/landlock.rst        |  80 ++++++++++-
 Documentation/userspace-api/landlock.rst   | 156 ++++++++++++++++++++-
 3 files changed, 245 insertions(+), 10 deletions(-)

diff --git a/Documentation/admin-guide/LSM/landlock.rst b/Documentation/admin-guide/LSM/landlock.rst
index 9923874e2156..99c6a599ce9e 100644
--- a/Documentation/admin-guide/LSM/landlock.rst
+++ b/Documentation/admin-guide/LSM/landlock.rst
@@ -6,7 +6,7 @@ Landlock: system-wide management
 ================================
 
 :Author: Mickaël Salaün
-:Date: January 2026
+:Date: March 2026
 
 Landlock can leverage the audit framework to log events.
 
@@ -59,14 +59,25 @@ AUDIT_LANDLOCK_ACCESS
         - scope.abstract_unix_socket - Abstract UNIX socket connection denied
         - scope.signal - Signal sending denied
 
+    **perm.*** - Permission restrictions (ABI 9+):
+        - perm.namespace_enter - Namespace entry was denied (creation via
+          :manpage:`unshare(2)` / :manpage:`clone(2)` or joining via
+          :manpage:`setns(2)`);
+          ``namespace_type`` indicates the type (hex CLONE_NEW* bitmask),
+          ``namespace_inum`` identifies the target namespace for
+          :manpage:`setns(2)` operations
+        - perm.capability_use - Capability use was denied;
+          ``capability`` indicates the capability number
+
     Multiple blockers can appear in a single event (comma-separated) when
     multiple access rights are missing. For example, creating a regular file
     in a directory that lacks both ``make_reg`` and ``refer`` rights would show
     ``blockers=fs.make_reg,fs.refer``.
 
-    The object identification fields (path, dev, ino for filesystem; opid,
-    ocomm for signals) depend on the type of access being blocked and provide
-    context about what resource was involved in the denial.
+    The object identification fields depend on the type of access being blocked:
+    ``path``, ``dev``, ``ino`` for filesystem; ``opid``, ``ocomm`` for signals;
+    ``namespace_type`` and ``namespace_inum`` for namespace operations;
+    ``capability`` for capability use.
 
 
 AUDIT_LANDLOCK_DOMAIN
diff --git a/Documentation/security/landlock.rst b/Documentation/security/landlock.rst
index 3e4d4d04cfae..cd3d640ca5c9 100644
--- a/Documentation/security/landlock.rst
+++ b/Documentation/security/landlock.rst
@@ -7,7 +7,7 @@ Landlock LSM: kernel documentation
 ==================================
 
 :Author: Mickaël Salaün
-:Date: September 2025
+:Date: March 2026
 
 Landlock's goal is to create scoped access-control (i.e. sandboxing).  To
 harden a whole system, this feature should be available to any process,
@@ -89,6 +89,72 @@ this is required to keep access controls consistent over the whole system, and
 this avoids unattended bypasses through file descriptor passing (i.e. confused
 deputy attack).
 
+Composability with user namespaces
+----------------------------------
+
+Landlock domain-based scoping and the kernel's user namespace-based capability
+scoping enforce isolation over independent hierarchies.  Landlock checks domain
+ancestry; the kernel's ``ns_capable()`` checks user namespace ancestry.  These
+hierarchies are orthogonal: Landlock enforcement is deterministic with respect
+to its own configuration, regardless of namespace or capability state, and vice
+versa.  This orthogonality is a design invariant that must hold for all new
+scoped features.
+
+Ruleset restriction models
+--------------------------
+
+Landlock provides three restriction models, each with different coverage
+and compatibility properties.
+
+Access rights (``handled_access_*``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Access rights control **enumerated operations on kernel objects**
+identified by a rule key (a file hierarchy or a network port).  Each
+``handled_access_*`` field declares a set of access rights that the
+ruleset restricts.  Multiple access rights share a single rule type.
+Operations for which no access right exists yet remain uncontrolled;
+new rights are added incrementally across ABI versions.
+
+Permissions (``handled_perm``)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Permissions control **broad operations enforced at single kernel
+chokepoints**, achieving complete deny-by-default coverage.  Each
+``LANDLOCK_PERM_*`` flag maps to its own rule type.  When a ruleset
+handles a permission, all instances of that operation are denied unless
+explicitly allowed by a rule.  New kernel values (new ``CAP_*``
+capabilities, new ``CLONE_NEW*`` namespace types) are automatically
+denied without any Landlock update.
+
+Each permission flag names a single gateway operation whose control
+transitively covers an open-ended set of downstream operations: for
+example, exercising a capability enables privileged operations across
+many subsystems; entering a namespace enables gaining capabilities in a
+new context.
+
+Permission rules identify what to allow using constants defined by other
+kernel subsystems (``CAP_*``, ``CLONE_NEW*``).  Unknown values are
+silently ignored because deny-by-default ensures they are denied anyway.
+In contrast, unknown ``LANDLOCK_PERM_*`` flags in ``handled_perm`` are
+rejected (``-EINVAL``), since Landlock itself defines that set of flags.
+
+Scopes (``scoped``)
+~~~~~~~~~~~~~~~~~~~~
+
+Scopes restrict **cross-domain interactions** categorically, without
+rules.  Setting a scope flag (e.g. ``LANDLOCK_SCOPE_SIGNAL``) denies the
+operation to targets outside the Landlock domain or its children.  Like
+permissions, scopes provide complete coverage of the controlled
+operation.
+
+When adding new Landlock features, new operations on existing rule types
+extend the corresponding ``handled_access_*`` field (e.g. a new
+filesystem operation extends ``handled_access_fs``).  A new object
+category with multiple fine-grained operations would use a new
+``handled_access_*`` field.  New rule types that control a single
+chokepoint operation use ``handled_perm``.
+
 Tests
 =====
 
@@ -110,6 +176,18 @@ Filesystem
 .. kernel-doc:: security/landlock/fs.h
     :identifiers:
 
+Namespace
+---------
+
+.. kernel-doc:: security/landlock/ns.h
+    :identifiers:
+
+Capability
+----------
+
+.. kernel-doc:: security/landlock/cap.h
+    :identifiers:
+
 Process credential
 ------------------
 
diff --git a/Documentation/userspace-api/landlock.rst b/Documentation/userspace-api/landlock.rst
index 13134bccdd39..238d30a18162 100644
--- a/Documentation/userspace-api/landlock.rst
+++ b/Documentation/userspace-api/landlock.rst
@@ -8,7 +8,7 @@ Landlock: unprivileged access control
 =====================================
 
 :Author: Mickaël Salaün
-:Date: January 2026
+:Date: March 2026
 
 The goal of Landlock is to enable restriction of ambient rights (e.g. global
 filesystem or network access) for a set of processes.  Because Landlock
@@ -33,7 +33,7 @@ A Landlock rule describes an action on an object which the process intends to
 perform.  A set of rules is aggregated in a ruleset, which can then restrict
 the thread enforcing it, and its future children.
 
-The two existing types of rules are:
+The existing types of rules are:
 
 Filesystem rules
     For these rules, the object is a file hierarchy,
@@ -44,6 +44,14 @@ Network rules (since ABI v4)
     For these rules, the object is a TCP port,
     and the related actions are defined with `network access rights`.
 
+Capability rules (since ABI v9)
+    For these rules, the object is a set of Linux capabilities,
+    and the related actions are defined with `permission flags`.
+
+Namespace rules (since ABI v9)
+    For these rules, the object is a set of namespace types,
+    and the related actions are defined with `permission flags`.
+
 Defining and enforcing a security policy
 ----------------------------------------
 
@@ -84,6 +92,9 @@ to be explicit about the denied-by-default access rights.
         .scoped =
             LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET |
             LANDLOCK_SCOPE_SIGNAL,
+        .handled_perm =
+            LANDLOCK_PERM_CAPABILITY_USE |
+            LANDLOCK_PERM_NAMESPACE_ENTER,
     };
 
 Because we may not know which kernel version an application will be executed
@@ -127,6 +138,12 @@ version, and only use the available subset of access rights:
         /* Removes LANDLOCK_SCOPE_* for ABI < 6 */
         ruleset_attr.scoped &= ~(LANDLOCK_SCOPE_ABSTRACT_UNIX_SOCKET |
                                  LANDLOCK_SCOPE_SIGNAL);
+        __attribute__((fallthrough));
+    case 6:
+    case 7:
+    case 8:
+        /* Removes permission support for ABI < 9 */
+        ruleset_attr.handled_perm = 0;
     }
 
 This enables the creation of an inclusive ruleset that will contain our rules.
@@ -191,6 +208,42 @@ number for a specific action: HTTPS connections.
     err = landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NET_PORT,
                             &net_port, 0);
 
+For capability access-control, we can add rules that allow specific
+capabilities.  For instance, to allow ``CAP_SYS_CHROOT`` (so the sandboxed
+process can call :manpage:`chroot(2)` inside a user namespace):
+
+.. code-block:: c
+
+    struct landlock_capability_attr cap_attr = {
+        .allowed_perm = LANDLOCK_PERM_CAPABILITY_USE,
+        .capabilities = (1ULL << CAP_SYS_CHROOT),
+    };
+
+    err = landlock_add_rule(ruleset_fd, LANDLOCK_RULE_CAPABILITY,
+                            &cap_attr, 0);
+
+For namespace access-control, we can add rules that allow entering specific
+namespace types (creating them via :manpage:`unshare(2)` / :manpage:`clone(2)`
+or joining them via :manpage:`setns(2)`).  For instance, to allow creating user
+namespaces (which grants all capabilities inside the new namespace):
+
+.. code-block:: c
+
+    struct landlock_namespace_attr ns_attr = {
+        .allowed_perm = LANDLOCK_PERM_NAMESPACE_ENTER,
+        .namespace_types = CLONE_NEWUSER,
+    };
+
+    err = landlock_add_rule(ruleset_fd, LANDLOCK_RULE_NAMESPACE,
+                            &ns_attr, 0);
+
+Together, these two rules allow an unprivileged process to create a user
+namespace and call :manpage:`chroot(2)` inside it, while denying all other
+capabilities and namespace types.  Creating a user namespace is the only
+namespace operation that needs no capability, so it needs no capability rule.
+See `Capability and namespace restrictions`_ for details on capability
+requirements.
+
 When passing a non-zero ``flags`` argument to ``landlock_restrict_self()``, a
 similar backwards compatibility check is needed for the restrict flags
 (see sys_landlock_restrict_self() documentation for available flags):
@@ -354,10 +407,87 @@ The operations which can be scoped are:
     A :manpage:`sendto(2)` on a socket which was previously connected will not
     be restricted.  This works for both datagram and stream sockets.
 
-IPC scoping does not support exceptions via :manpage:`landlock_add_rule(2)`.
+Scoping does not support exceptions via :manpage:`landlock_add_rule(2)`.
 If an operation is scoped within a domain, no rules can be added to allow access
 to resources or processes outside of the scope.
 
+Capability and namespace restrictions
+-------------------------------------
+
+See Documentation/security/landlock.rst for the design rationale behind
+the permission model (``handled_perm``) and how it differs from access
+rights (``handled_access_*``) and scopes (``scoped``).
+When a process creates a user namespace, the kernel grants all capabilities
+within that namespace.  While these capabilities cannot directly bypass Landlock
+restrictions (Landlock enforces access controls independently of capability
+checks), they open kernel code paths that are normally unreachable to
+unprivileged users and may contain exploitable bugs.
+
+Landlock provides two complementary permissions to address this.
+``LANDLOCK_PERM_CAPABILITY_USE`` restricts which capabilities a process can use,
+even when it holds them.  ``LANDLOCK_PERM_NAMESPACE_ENTER`` restricts which
+namespace types a process can create (via :manpage:`unshare(2)` or
+:manpage:`clone(2)`) or join (via :manpage:`setns(2)`).  After creating a user
+namespace, the granted capabilities are scoped to namespaces owned by that user
+namespace or its descendants; to exercise a capability such as
+``CAP_NET_ADMIN``, the process must create a namespace of the corresponding type
+(e.g., a network namespace).  Configuring both permissions together provides
+full coverage: ``LANDLOCK_PERM_CAPABILITY_USE`` restricts which capabilities are
+available, while ``LANDLOCK_PERM_NAMESPACE_ENTER`` restricts the namespaces in
+which they can be used.
+
+When a Landlock domain handles ``LANDLOCK_PERM_CAPABILITY_USE``, all Linux
+:manpage:`capabilities(7)` are denied by default unless a rule explicitly allows
+them.  This is purely restrictive: Landlock can only deny capabilities that the
+traditional capability mechanism would have allowed, never grant additional ones.
+Rules are added with ``LANDLOCK_RULE_CAPABILITY`` using a
+&struct landlock_capability_attr.  Each rule specifies a set of ``CAP_*`` values
+(as a bitmask) to allow.  Capabilities above ``CAP_LAST_CAP`` are silently
+accepted but have no effect since the kernel never checks them; this means new
+capabilities introduced by future kernels are automatically denied.
+
+When a Landlock domain handles ``LANDLOCK_PERM_NAMESPACE_ENTER``, namespace
+creation and entry are denied by default unless a rule explicitly allows them.
+Rules are added with ``LANDLOCK_RULE_NAMESPACE`` using a
+&struct landlock_namespace_attr.  Each rule specifies a set of ``CLONE_NEW*``
+flags to allow.
+
+In practice, unprivileged processes first create a user namespace (which requires
+no capability and grants all capabilities within it), then use those capabilities
+to create other namespace types.  All non-user namespace types require
+``CAP_SYS_ADMIN`` for both creation and :manpage:`setns(2)` entry; mount
+namespace entry additionally requires ``CAP_SYS_CHROOT``.  For
+:manpage:`setns(2)`, capabilities are checked relative to the target namespace,
+so a process in an ancestor user namespace naturally satisfies them; this
+includes joining user namespaces, which requires ``CAP_SYS_ADMIN``.  When
+``LANDLOCK_PERM_CAPABILITY_USE`` is also handled, each of these capabilities
+must be explicitly allowed by a rule.
+
+When combining ``CLONE_NEWUSER`` with other ``CLONE_NEW*`` flags in a single
+:manpage:`unshare(2)` call, the ``CAP_SYS_ADMIN`` check targets the newly
+created user namespace, which is handled by ``LANDLOCK_PERM_NAMESPACE_ENTER``
+independently from ``LANDLOCK_PERM_CAPABILITY_USE``.  Performing the user
+namespace creation and the additional namespace creation in two separate
+:manpage:`unshare(2)` calls requires a rule allowing ``CAP_SYS_ADMIN`` if the
+domain also handles ``LANDLOCK_PERM_CAPABILITY_USE``.
+
+More generally, Landlock domains and user namespaces form independent
+hierarchies: Landlock domains restrict what actions are allowed (each stacked
+layer narrows the permitted set), while user namespaces restrict where
+capabilities take effect (only within the process's own namespace and its
+descendants).  Landlock access controls are fully determined by the domain
+configuration, regardless of the process's position in the user namespace
+hierarchy.  When creating child user namespaces, it is recommended to also
+create a dedicated Landlock domain with restrictions relevant to each namespace
+context.
+
+Note that ``LANDLOCK_PERM_CAPABILITY_USE`` restricts the *use* of capabilities,
+not their presence in the process's credential.  Capability sets can change
+after a domain is enforced through user namespace entry, :manpage:`execve(2)` of
+binaries with file capabilities, or :manpage:`capset(2)`.  In all cases,
+:manpage:`capget(2)` will report the credential's capability sets, but any
+denied capability will fail with ``EPERM`` when exercised.
+
 Truncating files
 ----------------
 
@@ -515,7 +645,7 @@ Access rights
 -------------
 
 .. kernel-doc:: include/uapi/linux/landlock.h
-    :identifiers: fs_access net_access scope
+    :identifiers: fs_access net_access scope perm
 
 Creating a new ruleset
 ----------------------
@@ -534,7 +664,8 @@ Extending a ruleset
 
 .. kernel-doc:: include/uapi/linux/landlock.h
     :identifiers: landlock_rule_type landlock_path_beneath_attr
-                  landlock_net_port_attr
+                  landlock_net_port_attr landlock_capability_attr
+                  landlock_namespace_attr
 
 Enforcing a ruleset
 -------------------
@@ -685,6 +816,21 @@ enforce Landlock rulesets across all threads of the calling process
 using the ``LANDLOCK_RESTRICT_SELF_TSYNC`` flag passed to
 sys_landlock_restrict_self().
 
+Capability restriction (ABI < 9)
+--------------------------------
+
+Starting with the Landlock ABI version 9, it is possible to restrict
+:manpage:`capabilities(7)` with the new ``LANDLOCK_PERM_CAPABILITY_USE``
+permission flag and ``LANDLOCK_RULE_CAPABILITY`` rule type.
+
+Namespace restriction (ABI < 9)
+-------------------------------
+
+Starting with the Landlock ABI version 9, it is possible to restrict
+namespace creation (:manpage:`unshare(2)`, :manpage:`clone(2)`) and entry
+(:manpage:`setns(2)`) with the new ``LANDLOCK_PERM_NAMESPACE_ENTER`` permission
+flag and ``LANDLOCK_RULE_NAMESPACE`` rule type.
+
 .. _kernel_support:
 
 Kernel support
-- 
2.53.0

