* [RFC][BUG] ns_mkdir_op() locking is FUBAR
@ 2025-06-23 21:37 Al Viro
2025-06-23 22:23 ` Al Viro
2025-06-23 22:28 ` Al Viro
0 siblings, 2 replies; 4+ messages in thread
From: Al Viro @ 2025-06-23 21:37 UTC (permalink / raw)
To: John Johansen; +Cc: linux-fsdevel
AFAICS, you are trying to put your locks outside of the
->i_rwsem *and* take them from inside your ->mkdir() instance.
This
/* we have to unlock and then relock to get locking order right
* for pin_fs
*/
inode_unlock(dir);
error = simple_pin_fs(&aafs_ops, &aafs_mnt, &aafs_count);
mutex_lock_nested(&parent->lock, parent->level);
inode_lock_nested(dir, I_MUTEX_PARENT);
if (error)
goto out;
error = __aafs_setup_d_inode(dir, dentry, mode | S_IFDIR, NULL,
NULL, NULL, NULL);
if (error)
goto out_pin;
ns = __aa_find_or_create_ns(parent, READ_ONCE(dentry->d_name.name),
dentry);
is completely broken.
Think what happens if two threads call mkdir() on the same thing.
OK, the first one got through vfs_mkdir() to the point where it
calls ->mkdir(). Parent is locked (->i_rwsem, exclusive), dentry
happens to be a hashed negative (e.g. from a slightly overlapping
earlier stat(2), whatever).
Your ->mkdir() drops the lock on parent, which is the only thing
holding the second thread back. Now, the second thread gets in
and it happens to grab parent->lock first. It makes dentry hashed
positive, drops the lock and buggers off. Your first thread gets
parent->lock... and calls d_instantiate() on an already positive
dentry. At which point the data structures are FUBAR.
Better yet, have mkdir A/B race with rmdir A. After dropping
the locks in ->mkdir() you get the second thread getting through
the entire rmdir(2). You get to __aa_find_or_create_ns(), which
calls __aa_create_ns(), which gets to
error = __aafs_ns_mkdir(ns, ns_subns_dir(parent), name, dir);
Unfortunately, ns_subns_dir(parent) is already NULL, so you hit
AA_BUG(!parent);
in __aafd_ns_mkdir().
AFAICS, rmdir/rmdir races are also there and also rather unpleasant.
Could you explain what exclusion are you trying to get there?
The mechanism is currently broken, but what is it trying to achieve?
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC][BUG] ns_mkdir_op() locking is FUBAR
2025-06-23 21:37 [RFC][BUG] ns_mkdir_op() locking is FUBAR Al Viro
@ 2025-06-23 22:23 ` Al Viro
2025-06-24 17:25 ` John Johansen
2025-06-23 22:28 ` Al Viro
1 sibling, 1 reply; 4+ messages in thread
From: Al Viro @ 2025-06-23 22:23 UTC (permalink / raw)
To: John Johansen; +Cc: linux-fsdevel
On Mon, Jun 23, 2025 at 10:37:47PM +0100, Al Viro wrote:
> Could you explain what exclusion are you trying to get there?
> The mechanism is currently broken, but what is it trying to achieve?
While we are at it:
root@kvm1:~# cd /sys/kernel/security/apparmor/policy
root@kvm1:/sys/kernel/security/apparmor/policy# (for i in `seq 270`; do mkdir namespaces/$i; cd namespaces/$i; done)
root@kvm1:/sys/kernel/security/apparmor/policy# rmdir namespaces/1
[ 40.980453] Oops: stack guard page: 0000 [#1] PREEMPT SMP NOPTI
[ 40.980457] CPU: 3 UID: 0 PID: 2223 Comm: rmdir Not tainted 6.12.27-amd64 #11
[ 40.980459] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.164
[ 40.980460] RIP: 0010:inode_set_ctime_current+0x2c/0x100
[ 40.980490] Code: 1e fa 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 31 db 48 8f
[ 40.980491] RSP: 0018:ffffc1cbc2cfbff8 EFLAGS: 00010292
[ 40.980493] RAX: 0000000000400000 RBX: 0000000000000000 RCX: ffff9dbcc358ac70
[ 40.980494] RDX: 0000000000000001 RSI: ffff9dbcc48c0300 RDI: ffffc1cbc2cfbff8
[ 40.980495] RBP: ffffc1cbc2cfc028 R08: 0000000000000000 R09: ffffffffa484c6c0
[ 40.980495] R10: ffff9dbcc0729cc0 R11: 0000000000000002 R12: ffff9dbcc4a75b28
[ 40.980496] R13: ffff9dbcc4a75b28 R14: ffff9dbcc01fe600 R15: ffff9dbcc51a9e00
[ 40.980498] FS: 00007ffb70ea4740(0000) GS:ffff9dbfefd80000(0000) knlGS:00000
[ 40.980499] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 40.980499] CR2: ffffc1cbc2cfbfe8 CR3: 000000010619a000 CR4: 00000000000006f0
[ 40.980501] Call Trace:
[ 40.980510] <TASK>
[ 40.980513] simple_unlink+0x24/0x50
[ 40.980526] aafs_remove+0x9a/0xb0
[ 40.980543] __aafs_ns_rmdir+0x2ec/0x3b0
[ 40.980548] destroy_ns.part.0+0x9f/0xc0
[ 40.980558] __aa_remove_ns+0x44/0x90
[ 40.980560] destroy_ns.part.0+0x40/0xc0
[ 40.980562] __aa_remove_ns+0x44/0x90
[ 40.980563] destroy_ns.part.0+0x40/0xc0
.....
[ 40.981324] ns_rmdir_op+0x189/0x300
[ 40.981327] vfs_rmdir+0x9b/0x200
[ 40.981335] do_rmdir+0x1ac/0x1c0
[ 40.981340] __x64_sys_rmdir+0x3f/0x70
[ 40.981342] do_syscall_64+0x82/0x190
[ 40.981360] ? do_fault+0x31a/0x550
[ 40.981372] ? __handle_mm_fault+0x7c2/0xf70
[ 40.981373] ? syscall_exit_to_user_mode_prepare+0x149/0x170
[ 40.981388] ? __count_memcg_events+0x53/0xf0
[ 40.981392] ? count_memcg_events.constprop.0+0x1a/0x30
[ 40.981394] ? handle_mm_fault+0x1bb/0x2c0
[ 40.981396] ? do_user_addr_fault+0x36c/0x620
[ 40.981408] ? exc_page_fault+0x7e/0x180
[ 40.981412] entry_SYSCALL_64_after_hwframe+0x76/0x7e
.....
[ 40.981486] Kernel panic - not syncing: Fatal exception in interrupt
I realize that anyone who can play with apparmor config can screw the
box into the ground in a lot of ways, but... when you have a recursion
kernel-side, it would be nice to have its depth bounded. Not even root
should be able to panic the box with a single call of rmdir(2)...
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC][BUG] ns_mkdir_op() locking is FUBAR
2025-06-23 21:37 [RFC][BUG] ns_mkdir_op() locking is FUBAR Al Viro
2025-06-23 22:23 ` Al Viro
@ 2025-06-23 22:28 ` Al Viro
1 sibling, 0 replies; 4+ messages in thread
From: Al Viro @ 2025-06-23 22:28 UTC (permalink / raw)
To: John Johansen; +Cc: linux-fsdevel
On Mon, Jun 23, 2025 at 10:37:47PM +0100, Al Viro wrote:
> Better yet, have mkdir A/B race with rmdir A. After dropping
mkdir A/namespaces/B vs rmdir A, sorry.
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [RFC][BUG] ns_mkdir_op() locking is FUBAR
2025-06-23 22:23 ` Al Viro
@ 2025-06-24 17:25 ` John Johansen
0 siblings, 0 replies; 4+ messages in thread
From: John Johansen @ 2025-06-24 17:25 UTC (permalink / raw)
To: Al Viro, John Johansen; +Cc: linux-fsdevel, apparmor
On 6/23/25 15:23, Al Viro wrote:
> On Mon, Jun 23, 2025 at 10:37:47PM +0100, Al Viro wrote:
>
>> Could you explain what exclusion are you trying to get there?
>> The mechanism is currently broken, but what is it trying to achieve?
> While we are at it:
>
> root@kvm1:~# cd /sys/kernel/security/apparmor/policy
> root@kvm1:/sys/kernel/security/apparmor/policy# (for i in `seq 270`; do mkdir namespaces/$i; cd namespaces/$i; done)
> root@kvm1:/sys/kernel/security/apparmor/policy# rmdir namespaces/1
> [ 40.980453] Oops: stack guard page: 0000 [#1] PREEMPT SMP NOPTI
> [ 40.980457] CPU: 3 UID: 0 PID: 2223 Comm: rmdir Not tainted 6.12.27-amd64 #11
> [ 40.980459] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.164
> [ 40.980460] RIP: 0010:inode_set_ctime_current+0x2c/0x100
> [ 40.980490] Code: 1e fa 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 31 db 48 8f
> [ 40.980491] RSP: 0018:ffffc1cbc2cfbff8 EFLAGS: 00010292
> [ 40.980493] RAX: 0000000000400000 RBX: 0000000000000000 RCX: ffff9dbcc358ac70
> [ 40.980494] RDX: 0000000000000001 RSI: ffff9dbcc48c0300 RDI: ffffc1cbc2cfbff8
> [ 40.980495] RBP: ffffc1cbc2cfc028 R08: 0000000000000000 R09: ffffffffa484c6c0
> [ 40.980495] R10: ffff9dbcc0729cc0 R11: 0000000000000002 R12: ffff9dbcc4a75b28
> [ 40.980496] R13: ffff9dbcc4a75b28 R14: ffff9dbcc01fe600 R15: ffff9dbcc51a9e00
> [ 40.980498] FS: 00007ffb70ea4740(0000) GS:ffff9dbfefd80000(0000) knlGS:00000
> [ 40.980499] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 40.980499] CR2: ffffc1cbc2cfbfe8 CR3: 000000010619a000 CR4: 00000000000006f0
> [ 40.980501] Call Trace:
> [ 40.980510] <TASK>
> [ 40.980513] simple_unlink+0x24/0x50
> [ 40.980526] aafs_remove+0x9a/0xb0
> [ 40.980543] __aafs_ns_rmdir+0x2ec/0x3b0
> [ 40.980548] destroy_ns.part.0+0x9f/0xc0
> [ 40.980558] __aa_remove_ns+0x44/0x90
> [ 40.980560] destroy_ns.part.0+0x40/0xc0
> [ 40.980562] __aa_remove_ns+0x44/0x90
> [ 40.980563] destroy_ns.part.0+0x40/0xc0
> .....
> [ 40.981324] ns_rmdir_op+0x189/0x300
> [ 40.981327] vfs_rmdir+0x9b/0x200
> [ 40.981335] do_rmdir+0x1ac/0x1c0
> [ 40.981340] __x64_sys_rmdir+0x3f/0x70
> [ 40.981342] do_syscall_64+0x82/0x190
> [ 40.981360] ? do_fault+0x31a/0x550
> [ 40.981372] ? __handle_mm_fault+0x7c2/0xf70
> [ 40.981373] ? syscall_exit_to_user_mode_prepare+0x149/0x170
> [ 40.981388] ? __count_memcg_events+0x53/0xf0
> [ 40.981392] ? count_memcg_events.constprop.0+0x1a/0x30
> [ 40.981394] ? handle_mm_fault+0x1bb/0x2c0
> [ 40.981396] ? do_user_addr_fault+0x36c/0x620
> [ 40.981408] ? exc_page_fault+0x7e/0x180
> [ 40.981412] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> .....
> [ 40.981486] Kernel panic - not syncing: Fatal exception in interrupt
>
> I realize that anyone who can play with apparmor config can screw the
> box into the ground in a lot of ways, but... when you have a recursion
> kernel-side, it would be nice to have its depth bounded. Not even root
> should be able to panic the box with a single call of rmdir(2)...
Indeed, we can cap this something about ~3x what realistically we would see with
nesting of user namespaces. Some where between between 8-16 deep should be enough.
Something like
diff --git a/security/apparmor/include/policy_ns.h b/security/apparmor/include/policy_ns.h
index d646070fd966..081a0d5988d4 100644
--- a/security/apparmor/include/policy_ns.h
+++ b/security/apparmor/include/policy_ns.h
@@ -19,6 +19,9 @@
#include "policy.h"
+/* maximum nesting of policy namespaces */
+#define MAX_NS_DEPTH 8
+
/* struct aa_ns_acct - accounting of profiles in namespace
* @max_size: maximum space allowed for all profiles in namespace
* @max_count: maximum number of profiles that can be in this namespace
diff --git a/security/apparmor/policy_ns.c b/security/apparmor/policy_ns.c
index 64783ca3b0f2..89a6fecfd39a 100644
--- a/security/apparmor/policy_ns.c
+++ b/security/apparmor/policy_ns.c
@@ -223,6 +223,9 @@ static struct aa_ns *__aa_create_ns(struct aa_ns *parent, const char *name,
AA_BUG(!name);
AA_BUG(!mutex_is_locked(&parent->lock));
+ if (ns->level >= MAX_NS_LEVEL)
+ return ERR_PTR(-EPERM);
+
ns = alloc_ns(parent->base.hname, name);
if (!ns)
return ERR_PTR(-ENOMEM);
would do for the creation side, which should be enough. But We could also
throw in a bug check and bail against the the ns->level on the rmdir as well.
^ permalink raw reply related [flat|nested] 4+ messages in thread
end of thread, other threads:[~2025-06-24 17:25 UTC | newest]
Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-23 21:37 [RFC][BUG] ns_mkdir_op() locking is FUBAR Al Viro
2025-06-23 22:23 ` Al Viro
2025-06-24 17:25 ` John Johansen
2025-06-23 22:28 ` Al Viro
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).