Re: [PATCH] capabilities: add capability cgroup controller

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Topi Miettinen <toiwoton@gmail.com>
To: Petr Mladek <pmladek@suse.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Tejun Heo <tj@kernel.org>, lkml <linux-kernel@vger.kernel.org>,
	luto@kernel.org, Kees Cook <keescook@chromium.org>,
	Jonathan Corbet <corbet@lwn.net>, Li Zefan <lizefan@huawei.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	James Morris <james.l.morris@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Howells <dhowells@redhat.com>,
	David Woodhouse <David.Woodhouse@intel.com>,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	"open list:CONTROL GROUP (CGROUP)" <cgroups@vger.kernel.org>,
	"open list:CAPABILITIES" <linux-security-module@vger.kernel.org>
Subject: Re: [PATCH] capabilities: add capability cgroup controller
Date: Sun, 10 Jul 2016 09:04:19 +0000	[thread overview]
Message-ID: <5e3cd49c-2ae1-efc1-ea85-afa1ff7b22cb@gmail.com> (raw)
In-Reply-To: <20160708091332.GD3556@pathway.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 10985 bytes --]

On 07/08/16 09:13, Petr Mladek wrote:
> On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:
>> On 07/07/16 09:16, Petr Mladek wrote:
>>> On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
>>>> The attached patch would make any uses of capabilities generate audit
>>>> messages. It works for simple tests as you can see from the commit
>>>> message, but unfortunately the call to audit_cgroup_list() deadlocks the
>>>> system when booting a full blown OS. There's no deadlock when the call
>>>> is removed.
>>>>
>>>> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
>>>> already held earlier before entering audit_cgroup_list(). Holding the
>>>> locks is however required by task_cgroup_from_root(). Is there any way
>>>> to avoid this? For example, only print some kind of cgroup ID numbers
>>>> (are there unique and stable IDs, available without locks?) for those
>>>> cgroups where the task is registered in the audit message?
>>>
>>> I am not sure if anyone know what really happens here. I suggest to
>>> enable lockdep. It might detect possible deadlock even before it
>>> really happens, see Documentation/locking/lockdep-design.txt
>>>
>>> It can be enabled by
>>>
>>>    CONFIG_PROVE_LOCKING=y
>>>
>>> It depends on
>>>
>>>     CONFIG_DEBUG_KERNEL=y
>>>
>>> and maybe some more options, see lib/Kconfig.debug
>>
>> Thanks a lot! I caught this stack dump:
>>
>> starting version 230
>> [    3.416647] ------------[ cut here ]------------
>> [    3.417310] WARNING: CPU: 0 PID: 95 at
>> /home/topi/d/linux.git/kernel/locking/lockdep.c:2871
>> lockdep_trace_alloc+0xb4/0xc0
>> [    3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
>> [    3.417923] Modules linked in:
>> [    3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
>> [    3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS Debian-1.8.2-1 04/01/2014
>> [    3.418726]  0000000000000086 000000007970f3b0 ffff88000016fb00
>> ffffffff813c9c45
>> [    3.418993]  ffff88000016fb50 0000000000000000 ffff88000016fb40
>> ffffffff81091e9b
>> [    3.419176]  00000b3705e2c798 0000000000000046 0000000000000410
>> 00000000ffffffff
>> [    3.419374] Call Trace:
>> [    3.419511]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
>> [    3.419644]  [<ffffffff81091e9b>] __warn+0xcb/0xf0
>> [    3.419745]  [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
>> [    3.419868]  [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
>> [    3.419988]  [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
>> [    3.420156]  [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
>> [    3.420170]  [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
>> [    3.420170]  [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
>> [    3.420170]  [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
>> [    3.420170]  [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
>> [    3.420170]  [<ffffffff8109cd75>] ns_capable+0x45/0x70
>> [    3.420170]  [<ffffffff8109cdb7>] capable+0x17/0x20
>> [    3.420170]  [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
>> [    3.420170]  [<ffffffff81230997>] __vfs_write+0x37/0x160
>> [    3.420170]  [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
>> [    3.420170]  [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81231048>] vfs_write+0xb8/0x1b0
>> [    3.420170]  [<ffffffff812533c6>] ? __fget_light+0x66/0x90
>> [    3.420170]  [<ffffffff81232078>] SyS_write+0x58/0xc0
>> [    3.420170]  [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
>> [    3.420170]  [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
>> [    3.420170] ---[ end trace fb586899fb556a5e ]---
>> [    3.447922] random: systemd-udevd urandom read with 3 bits of entropy
>> available
>> [    4.014078] clocksource: Switched to clocksource tsc
>> Begin: Loading essential drivers ... done.
>>
>> This is with qemu and the boot continues normally. With real computer,
>> there's no such output and system just seems to freeze.
>>
>> Could it be possible that the deadlock happens because there's some IO
>> towards /sys/fs/cgroup, which causes a capability check and that in turn
>> causes locking problems when we try to print cgroup list?
> 
> The above warning is printed by the code from
> kernel/locking/lockdep.c:2871
> 
> static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> {
> [...]
> 	/* We're only interested __GFP_FS allocations for now */
> 	if (!(gfp_mask & __GFP_FS))
> 		return;
> 
> 	/*
> 	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
> 	 */
> 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
> 		return;
> 
> 
> The backtrace shows that your new audit_log_cap_use() is called
> from vfs_write(). You might try to use audit_log_start() with
> GFP_NOFS instead of GFP_KERNEL.
> 
> Note that this is rather intuitive advice. I still need to learn a lot
> about memory management and kernel in general to be more sure about
> a correct solution.
> 
> Best Regards,
> Petr
> 

With the attached patch, the system boots without deadlock. Locking
problems still remain:

[    3.652221] ======================================================
[    3.652221] [ INFO: possible circular locking dependency detected ]
[    3.652221] 4.7.0-rc5+ #101 Not tainted
[    3.652221] -------------------------------------------------------
[    3.652221] systemd/1 is trying to acquire lock:
[    3.652221]  (tasklist_lock){.+.+..}, at: [<ffffffff81137ddd>]
cgroup_mount+0x7ed/0xbc0
[    3.652221]
               but task is already holding lock:
[    3.652221]  (css_set_lock){......}, at: [<ffffffff81137c59>]
cgroup_mount+0x669/0xbc0
[    3.652221]
               which lock already depends on the new lock.

[    3.652221]
               the existing dependency chain (in reverse order) is:
[    3.652221]
               -> #3 (css_set_lock){......}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184e137>] _raw_spin_lock_irq+0x37/0x50
[    3.652221]        [<ffffffff811374be>] cgroup_setup_root+0x19e/0x2d0
[    3.652221]        [<ffffffff821911fc>] cgroup_init+0xec/0x41d
[    3.652221]        [<ffffffff82171f68>] start_kernel+0x40c/0x465
[    3.652221]        [<ffffffff82171294>]
x86_64_start_reservations+0x2f/0x31
[    3.652221]        [<ffffffff8217140e>] x86_64_start_kernel+0x178/0x18b
[    3.652221]
               -> #2 (cgroup_mutex){+.+...}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184af5f>] mutex_lock_nested+0x5f/0x350
[    3.652221]        [<ffffffff8113962a>] audit_cgroup_list+0x4a/0x2f0
[    3.652221]        [<ffffffff81145d19>] audit_log_cap_use+0xd9/0xf0
[    3.652221]        [<ffffffff8109cd75>] ns_capable+0x45/0x70
[    3.652221]        [<ffffffff8109cdb7>] capable+0x17/0x20
[    3.652221]        [<ffffffff812a2f00>] oom_score_adj_write+0x150/0x2f0
[    3.652221]        [<ffffffff81230947>] __vfs_write+0x37/0x160
[    3.652221]        [<ffffffff81230ff8>] vfs_write+0xb8/0x1b0
[    3.652221]        [<ffffffff81232028>] SyS_write+0x58/0xc0
[    3.652221]        [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
[    3.652221]        [<ffffffff8184ea1a>] return_from_SYSCALL_64+0x0/0x7a
[    3.652221]
               -> #1 (&(&sighand->siglock)->rlock){+.+...}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184dfc1>] _raw_spin_lock+0x31/0x40
[    3.652221]        [<ffffffff810901d9>]
copy_process.part.34+0x10f9/0x1b40
[    3.652221]        [<ffffffff81090e23>] _do_fork+0xf3/0x6b0
[    3.652221]        [<ffffffff81091409>] kernel_thread+0x29/0x30
[    3.652221]        [<ffffffff810b71d7>] kthreadd+0x187/0x1e0
[    3.652221]        [<ffffffff8184eb7f>] ret_from_fork+0x1f/0x40
[    3.652221]
               -> #0 (tasklist_lock){.+.+..}:
[    3.652221]        [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184e3f4>] _raw_read_lock+0x34/0x50
[    3.652221]        [<ffffffff81137ddd>] cgroup_mount+0x7ed/0xbc0
[    3.652221]        [<ffffffff81234d98>] mount_fs+0x38/0x170
[    3.652221]        [<ffffffff8125626b>] vfs_kern_mount+0x6b/0x150
[    3.652221]        [<ffffffff81258f8c>] do_mount+0x24c/0xe30
[    3.652221]        [<ffffffff81259ea5>] SyS_mount+0x95/0xe0
[    3.652221]        [<ffffffff8184e965>]
entry_SYSCALL_64_fastpath+0x18/0xa8
[    3.652221]
               other info that might help us debug this:

[    3.652221] Chain exists of:
                 tasklist_lock --> cgroup_mutex --> css_set_lock

[    3.652221]  Possible unsafe locking scenario:

[    3.652221]        CPU0                    CPU1
[    3.652221]        ----                    ----
[    3.652221]   lock(css_set_lock);
[    3.652221]                                lock(cgroup_mutex);
[    3.652221]                                lock(css_set_lock);
[    3.652221]   lock(tasklist_lock);
[    3.652221]
                *** DEADLOCK ***

[    3.652221] 1 lock held by systemd/1:
[    3.652221]  #0:  (css_set_lock){......}, at: [<ffffffff81137c59>]
cgroup_mount+0x669/0xbc0
[    3.652221]
               stack backtrace:
[    3.652221] CPU: 0 PID: 1 Comm: systemd Not tainted 4.7.0-rc5+ #101
[    3.652221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Debian-1.8.2-1 04/01/2014
[    3.652221]  0000000000000086 0000000024b7c1ed ffff880006d13bb0
ffffffff813c9bf5
[    3.652221]  ffffffff829dbd20 ffffffff829cf2a0 ffff880006d13bf0
ffffffff810e60a3
[    3.652221]  ffff880006d13c30 ffff880006d067b0 ffff880006d06040
0000000000000001
[    3.652221] Call Trace:
[    3.652221]  [<ffffffff813c9bf5>] dump_stack+0x67/0x92
[    3.652221]  [<ffffffff810e60a3>] print_circular_bug+0x1e3/0x250
[    3.652221]  [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[    3.652221]  [<ffffffff81210bcd>] ? __kmalloc_track_caller+0x2bd/0x630
[    3.652221]  [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]  [<ffffffff81137ddd>] ? cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff8184e3f4>] _raw_read_lock+0x34/0x50
[    3.652221]  [<ffffffff81137ddd>] ? cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff81137ddd>] cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff810e5637>] ? lockdep_init_map+0x57/0x1f0
[    3.652221]  [<ffffffff81234d98>] mount_fs+0x38/0x170
[    3.652221]  [<ffffffff8125626b>] vfs_kern_mount+0x6b/0x150
[    3.652221]  [<ffffffff81258f8c>] do_mount+0x24c/0xe30
[    3.652221]  [<ffffffff812105bb>] ? kmem_cache_alloc_trace+0x28b/0x5e0
[    3.652221]  [<ffffffff811cc176>] ? strndup_user+0x46/0x80
[    3.652221]  [<ffffffff81259ea5>] SyS_mount+0x95/0xe0
[    3.652221]  [<ffffffff8184e965>] entry_SYSCALL_64_fastpath+0x18/0xa8

Rate limiting would not be a bad idea, there were 329 audit log entries
about capability use.

-Topi


[-- Attachment #2: 0001-cgroup-check-options-and-capabilities-before-locking.patch --]
[-- Type: text/x-patch, Size: 2087 bytes --]

From b857ec7d436f6fbb8c33e8d6a16d2f16175517d8 Mon Sep 17 00:00:00 2001
From: Topi Miettinen <toiwoton@gmail.com>
Date: Sun, 10 Jul 2016 11:45:30 +0300
Subject: [PATCH] cgroup: check options and capabilities before locking to
 avoid a deadlock

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 kernel/cgroup.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1931679..1190559 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2100,6 +2100,21 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		return ERR_PTR(-EPERM);
 	}
 
+	/* First find the desired set of subsystems */
+	ret = parse_cgroupfs_options(data, &opts);
+	if (ret)
+		goto out_free;
+
+	/*
+	 * We know this subsystem has not yet been bound.  Users in a non-init
+	 * user namespace may only mount hierarchies with no bound subsystems,
+	 * i.e. 'none,name=user1'
+	 */
+	if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+		ret = -EPERM;
+		goto out_free;
+	}
+
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
 	 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2121,11 +2136,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 
 	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
 
-	/* First find the desired set of subsystems */
-	ret = parse_cgroupfs_options(data, &opts);
-	if (ret)
-		goto out_unlock;
-
 	/*
 	 * Destruction of cgroup root is asynchronous, so subsystems may
 	 * still be dying after the previous unmount.  Let's drain the
@@ -2216,16 +2226,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		goto out_unlock;
 	}
 
-	/*
-	 * We know this subsystem has not yet been bound.  Users in a non-init
-	 * user namespace may only mount hierarchies with no bound subsystems,
-	 * i.e. 'none,name=user1'
-	 */
-	if (!opts.none && !capable(CAP_SYS_ADMIN)) {
-		ret = -EPERM;
-		goto out_unlock;
-	}
-
 	root = kzalloc(sizeof(*root), GFP_KERNEL);
 	if (!root) {
 		ret = -ENOMEM;
-- 
2.8.1

WARNING: multiple messages have this Message-ID (diff)

From: Topi Miettinen <toiwoton@gmail.com>
To: Petr Mladek <pmladek@suse.com>
Cc: "Serge E. Hallyn" <serge@hallyn.com>,
	"Eric W. Biederman" <ebiederm@xmission.com>,
	Tejun Heo <tj@kernel.org>, lkml <linux-kernel@vger.kernel.org>,
	luto@kernel.org, Kees Cook <keescook@chromium.org>,
	Jonathan Corbet <corbet@lwn.net>, Li Zefan <lizefan@huawei.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	James Morris <james.l.morris@oracle.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	David Howells <dhowells@redhat.com>,
	David Woodhouse <David.Woodhouse@intel.com>,
	Ard Biesheuvel <ard.biesheuvel@linaro.org>,
	"Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
	"open list:DOCUMENTATION" <linux-doc@vger.kernel.org>,
	"open list:CONTROL GROUP (CGROUP)" <cgroups@vger.kernel.org>,
	"open list:CAPABILITIES" <linux-security-module@vger.kernel.org>
Subject: Re: [PATCH] capabilities: add capability cgroup controller
Date: Sun, 10 Jul 2016 09:04:19 +0000	[thread overview]
Message-ID: <5e3cd49c-2ae1-efc1-ea85-afa1ff7b22cb@gmail.com> (raw)
In-Reply-To: <20160708091332.GD3556@pathway.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 10985 bytes --]

On 07/08/16 09:13, Petr Mladek wrote:
> On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:
>> On 07/07/16 09:16, Petr Mladek wrote:
>>> On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
>>>> The attached patch would make any uses of capabilities generate audit
>>>> messages. It works for simple tests as you can see from the commit
>>>> message, but unfortunately the call to audit_cgroup_list() deadlocks the
>>>> system when booting a full blown OS. There's no deadlock when the call
>>>> is removed.
>>>>
>>>> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
>>>> already held earlier before entering audit_cgroup_list(). Holding the
>>>> locks is however required by task_cgroup_from_root(). Is there any way
>>>> to avoid this? For example, only print some kind of cgroup ID numbers
>>>> (are there unique and stable IDs, available without locks?) for those
>>>> cgroups where the task is registered in the audit message?
>>>
>>> I am not sure if anyone know what really happens here. I suggest to
>>> enable lockdep. It might detect possible deadlock even before it
>>> really happens, see Documentation/locking/lockdep-design.txt
>>>
>>> It can be enabled by
>>>
>>>    CONFIG_PROVE_LOCKING=y
>>>
>>> It depends on
>>>
>>>     CONFIG_DEBUG_KERNEL=y
>>>
>>> and maybe some more options, see lib/Kconfig.debug
>>
>> Thanks a lot! I caught this stack dump:
>>
>> starting version 230
>> [    3.416647] ------------[ cut here ]------------
>> [    3.417310] WARNING: CPU: 0 PID: 95 at
>> /home/topi/d/linux.git/kernel/locking/lockdep.c:2871
>> lockdep_trace_alloc+0xb4/0xc0
>> [    3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
>> [    3.417923] Modules linked in:
>> [    3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
>> [    3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
>> BIOS Debian-1.8.2-1 04/01/2014
>> [    3.418726]  0000000000000086 000000007970f3b0 ffff88000016fb00
>> ffffffff813c9c45
>> [    3.418993]  ffff88000016fb50 0000000000000000 ffff88000016fb40
>> ffffffff81091e9b
>> [    3.419176]  00000b3705e2c798 0000000000000046 0000000000000410
>> 00000000ffffffff
>> [    3.419374] Call Trace:
>> [    3.419511]  [<ffffffff813c9c45>] dump_stack+0x67/0x92
>> [    3.419644]  [<ffffffff81091e9b>] __warn+0xcb/0xf0
>> [    3.419745]  [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
>> [    3.419868]  [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
>> [    3.419988]  [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
>> [    3.420156]  [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
>> [    3.420170]  [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
>> [    3.420170]  [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
>> [    3.420170]  [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
>> [    3.420170]  [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
>> [    3.420170]  [<ffffffff8109cd75>] ns_capable+0x45/0x70
>> [    3.420170]  [<ffffffff8109cdb7>] capable+0x17/0x20
>> [    3.420170]  [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
>> [    3.420170]  [<ffffffff81230997>] __vfs_write+0x37/0x160
>> [    3.420170]  [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
>> [    3.420170]  [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
>> [    3.420170]  [<ffffffff81231048>] vfs_write+0xb8/0x1b0
>> [    3.420170]  [<ffffffff812533c6>] ? __fget_light+0x66/0x90
>> [    3.420170]  [<ffffffff81232078>] SyS_write+0x58/0xc0
>> [    3.420170]  [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
>> [    3.420170]  [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
>> [    3.420170] ---[ end trace fb586899fb556a5e ]---
>> [    3.447922] random: systemd-udevd urandom read with 3 bits of entropy
>> available
>> [    4.014078] clocksource: Switched to clocksource tsc
>> Begin: Loading essential drivers ... done.
>>
>> This is with qemu and the boot continues normally. With real computer,
>> there's no such output and system just seems to freeze.
>>
>> Could it be possible that the deadlock happens because there's some IO
>> towards /sys/fs/cgroup, which causes a capability check and that in turn
>> causes locking problems when we try to print cgroup list?
> 
> The above warning is printed by the code from
> kernel/locking/lockdep.c:2871
> 
> static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
> {
> [...]
> 	/* We're only interested __GFP_FS allocations for now */
> 	if (!(gfp_mask & __GFP_FS))
> 		return;
> 
> 	/*
> 	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
> 	 */
> 	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
> 		return;
> 
> 
> The backtrace shows that your new audit_log_cap_use() is called
> from vfs_write(). You might try to use audit_log_start() with
> GFP_NOFS instead of GFP_KERNEL.
> 
> Note that this is rather intuitive advice. I still need to learn a lot
> about memory management and kernel in general to be more sure about
> a correct solution.
> 
> Best Regards,
> Petr
> 

With the attached patch, the system boots without deadlock. Locking
problems still remain:

[    3.652221] ======================================================
[    3.652221] [ INFO: possible circular locking dependency detected ]
[    3.652221] 4.7.0-rc5+ #101 Not tainted
[    3.652221] -------------------------------------------------------
[    3.652221] systemd/1 is trying to acquire lock:
[    3.652221]  (tasklist_lock){.+.+..}, at: [<ffffffff81137ddd>]
cgroup_mount+0x7ed/0xbc0
[    3.652221]
               but task is already holding lock:
[    3.652221]  (css_set_lock){......}, at: [<ffffffff81137c59>]
cgroup_mount+0x669/0xbc0
[    3.652221]
               which lock already depends on the new lock.

[    3.652221]
               the existing dependency chain (in reverse order) is:
[    3.652221]
               -> #3 (css_set_lock){......}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184e137>] _raw_spin_lock_irq+0x37/0x50
[    3.652221]        [<ffffffff811374be>] cgroup_setup_root+0x19e/0x2d0
[    3.652221]        [<ffffffff821911fc>] cgroup_init+0xec/0x41d
[    3.652221]        [<ffffffff82171f68>] start_kernel+0x40c/0x465
[    3.652221]        [<ffffffff82171294>]
x86_64_start_reservations+0x2f/0x31
[    3.652221]        [<ffffffff8217140e>] x86_64_start_kernel+0x178/0x18b
[    3.652221]
               -> #2 (cgroup_mutex){+.+...}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184af5f>] mutex_lock_nested+0x5f/0x350
[    3.652221]        [<ffffffff8113962a>] audit_cgroup_list+0x4a/0x2f0
[    3.652221]        [<ffffffff81145d19>] audit_log_cap_use+0xd9/0xf0
[    3.652221]        [<ffffffff8109cd75>] ns_capable+0x45/0x70
[    3.652221]        [<ffffffff8109cdb7>] capable+0x17/0x20
[    3.652221]        [<ffffffff812a2f00>] oom_score_adj_write+0x150/0x2f0
[    3.652221]        [<ffffffff81230947>] __vfs_write+0x37/0x160
[    3.652221]        [<ffffffff81230ff8>] vfs_write+0xb8/0x1b0
[    3.652221]        [<ffffffff81232028>] SyS_write+0x58/0xc0
[    3.652221]        [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
[    3.652221]        [<ffffffff8184ea1a>] return_from_SYSCALL_64+0x0/0x7a
[    3.652221]
               -> #1 (&(&sighand->siglock)->rlock){+.+...}:
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184dfc1>] _raw_spin_lock+0x31/0x40
[    3.652221]        [<ffffffff810901d9>]
copy_process.part.34+0x10f9/0x1b40
[    3.652221]        [<ffffffff81090e23>] _do_fork+0xf3/0x6b0
[    3.652221]        [<ffffffff81091409>] kernel_thread+0x29/0x30
[    3.652221]        [<ffffffff810b71d7>] kthreadd+0x187/0x1e0
[    3.652221]        [<ffffffff8184eb7f>] ret_from_fork+0x1f/0x40
[    3.652221]
               -> #0 (tasklist_lock){.+.+..}:
[    3.652221]        [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[    3.652221]        [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]        [<ffffffff8184e3f4>] _raw_read_lock+0x34/0x50
[    3.652221]        [<ffffffff81137ddd>] cgroup_mount+0x7ed/0xbc0
[    3.652221]        [<ffffffff81234d98>] mount_fs+0x38/0x170
[    3.652221]        [<ffffffff8125626b>] vfs_kern_mount+0x6b/0x150
[    3.652221]        [<ffffffff81258f8c>] do_mount+0x24c/0xe30
[    3.652221]        [<ffffffff81259ea5>] SyS_mount+0x95/0xe0
[    3.652221]        [<ffffffff8184e965>]
entry_SYSCALL_64_fastpath+0x18/0xa8
[    3.652221]
               other info that might help us debug this:

[    3.652221] Chain exists of:
                 tasklist_lock --> cgroup_mutex --> css_set_lock

[    3.652221]  Possible unsafe locking scenario:

[    3.652221]        CPU0                    CPU1
[    3.652221]        ----                    ----
[    3.652221]   lock(css_set_lock);
[    3.652221]                                lock(cgroup_mutex);
[    3.652221]                                lock(css_set_lock);
[    3.652221]   lock(tasklist_lock);
[    3.652221]
                *** DEADLOCK ***

[    3.652221] 1 lock held by systemd/1:
[    3.652221]  #0:  (css_set_lock){......}, at: [<ffffffff81137c59>]
cgroup_mount+0x669/0xbc0
[    3.652221]
               stack backtrace:
[    3.652221] CPU: 0 PID: 1 Comm: systemd Not tainted 4.7.0-rc5+ #101
[    3.652221] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS Debian-1.8.2-1 04/01/2014
[    3.652221]  0000000000000086 0000000024b7c1ed ffff880006d13bb0
ffffffff813c9bf5
[    3.652221]  ffffffff829dbd20 ffffffff829cf2a0 ffff880006d13bf0
ffffffff810e60a3
[    3.652221]  ffff880006d13c30 ffff880006d067b0 ffff880006d06040
0000000000000001
[    3.652221] Call Trace:
[    3.652221]  [<ffffffff813c9bf5>] dump_stack+0x67/0x92
[    3.652221]  [<ffffffff810e60a3>] print_circular_bug+0x1e3/0x250
[    3.652221]  [<ffffffff810e8dfb>] __lock_acquire+0x13cb/0x1440
[    3.652221]  [<ffffffff81210bcd>] ? __kmalloc_track_caller+0x2bd/0x630
[    3.652221]  [<ffffffff810e92b3>] lock_acquire+0xe3/0x1c0
[    3.652221]  [<ffffffff81137ddd>] ? cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff8184e3f4>] _raw_read_lock+0x34/0x50
[    3.652221]  [<ffffffff81137ddd>] ? cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff81137ddd>] cgroup_mount+0x7ed/0xbc0
[    3.652221]  [<ffffffff810e5637>] ? lockdep_init_map+0x57/0x1f0
[    3.652221]  [<ffffffff81234d98>] mount_fs+0x38/0x170
[    3.652221]  [<ffffffff8125626b>] vfs_kern_mount+0x6b/0x150
[    3.652221]  [<ffffffff81258f8c>] do_mount+0x24c/0xe30
[    3.652221]  [<ffffffff812105bb>] ? kmem_cache_alloc_trace+0x28b/0x5e0
[    3.652221]  [<ffffffff811cc176>] ? strndup_user+0x46/0x80
[    3.652221]  [<ffffffff81259ea5>] SyS_mount+0x95/0xe0
[    3.652221]  [<ffffffff8184e965>] entry_SYSCALL_64_fastpath+0x18/0xa8

Rate limiting would not be a bad idea, there were 329 audit log entries
about capability use.

-Topi


[-- Attachment #2: 0001-cgroup-check-options-and-capabilities-before-locking.patch --]
[-- Type: text/x-patch, Size: 2088 bytes --]

>From b857ec7d436f6fbb8c33e8d6a16d2f16175517d8 Mon Sep 17 00:00:00 2001
From: Topi Miettinen <toiwoton@gmail.com>
Date: Sun, 10 Jul 2016 11:45:30 +0300
Subject: [PATCH] cgroup: check options and capabilities before locking to
 avoid a deadlock

Signed-off-by: Topi Miettinen <toiwoton@gmail.com>
---
 kernel/cgroup.c | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 1931679..1190559 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2100,6 +2100,21 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		return ERR_PTR(-EPERM);
 	}
 
+	/* First find the desired set of subsystems */
+	ret = parse_cgroupfs_options(data, &opts);
+	if (ret)
+		goto out_free;
+
+	/*
+	 * We know this subsystem has not yet been bound.  Users in a non-init
+	 * user namespace may only mount hierarchies with no bound subsystems,
+	 * i.e. 'none,name=user1'
+	 */
+	if (!opts.none && !capable(CAP_SYS_ADMIN)) {
+		ret = -EPERM;
+		goto out_free;
+	}
+
 	/*
 	 * The first time anyone tries to mount a cgroup, enable the list
 	 * linking each css_set to its tasks and fix up all existing tasks.
@@ -2121,11 +2136,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 
 	cgroup_lock_and_drain_offline(&cgrp_dfl_root.cgrp);
 
-	/* First find the desired set of subsystems */
-	ret = parse_cgroupfs_options(data, &opts);
-	if (ret)
-		goto out_unlock;
-
 	/*
 	 * Destruction of cgroup root is asynchronous, so subsystems may
 	 * still be dying after the previous unmount.  Let's drain the
@@ -2216,16 +2226,6 @@ static struct dentry *cgroup_mount(struct file_system_type *fs_type,
 		goto out_unlock;
 	}
 
-	/*
-	 * We know this subsystem has not yet been bound.  Users in a non-init
-	 * user namespace may only mount hierarchies with no bound subsystems,
-	 * i.e. 'none,name=user1'
-	 */
-	if (!opts.none && !capable(CAP_SYS_ADMIN)) {
-		ret = -EPERM;
-		goto out_unlock;
-	}
-
 	root = kzalloc(sizeof(*root), GFP_KERNEL);
 	if (!root) {
 		ret = -ENOMEM;
-- 
2.8.1

next prev parent reply	other threads:[~2016-07-10  9:04 UTC|newest]

Thread overview: 46+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-06-23 15:07 [PATCH] capabilities: add capability cgroup controller Topi Miettinen
2016-06-23 15:07 ` Topi Miettinen
     [not found] ` <1466694434-1420-1-git-send-email-toiwoton-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-06-23 21:03   ` Kees Cook
2016-06-23 21:03     ` Kees Cook
2016-06-23 21:38 ` Tejun Heo
2016-06-24  0:22   ` Topi Miettinen
2016-06-24 15:48     ` Tejun Heo
2016-06-24 15:59       ` Serge E. Hallyn
     [not found]         ` <20160624155916.GA8759-7LNsyQBKDXoIagZqoN9o3w@public.gmane.org>
2016-06-24 16:35           ` Tejun Heo
2016-06-24 16:35             ` Tejun Heo
2016-06-24 16:59             ` Serge E. Hallyn
2016-06-24 17:21               ` Eric W. Biederman
2016-06-24 17:21                 ` Eric W. Biederman
2016-06-24 17:39                 ` Serge E. Hallyn
     [not found]                 ` <87mvmaa4f6.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-06-26 19:03                   ` Topi Miettinen
2016-06-26 19:03                     ` Topi Miettinen
2016-06-28  4:57                     ` Eric W. Biederman
2016-06-28  4:57                       ` Eric W. Biederman
     [not found]                       ` <87por1syg1.fsf-JOvCrm2gF+uungPnsOpG7nhyD016LWXt@public.gmane.org>
2016-07-02 11:20                         ` Topi Miettinen
2016-07-02 11:20                           ` Topi Miettinen
2016-06-24 17:24               ` Tejun Heo
2016-06-26 19:14                 ` Topi Miettinen
2016-06-26 22:26                   ` Tejun Heo
2016-06-27 14:54                     ` Serge E. Hallyn
2016-06-27 19:10                       ` Topi Miettinen
2016-06-27 19:17                         ` Tejun Heo
     [not found]                           ` <CAOS58YM+h0w_UciXLbiJcKizPkXV66FL57LT7Mc+RRWspN+Y2Q-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-27 19:49                             ` Serge E. Hallyn
2016-06-27 19:49                               ` Serge E. Hallyn
2016-07-03 15:08                               ` Topi Miettinen
2016-07-03 15:08                                 ` Topi Miettinen
     [not found]                                 ` <218f2bef-5e5e-89c4-154b-24dc49c82c31-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-07-03 16:13                                   ` [PATCH] capabilities: audit capability use kbuild test robot
2016-07-03 16:13                                     ` kbuild test robot
2016-07-07  9:16                                 ` [PATCH] capabilities: add capability cgroup controller Petr Mladek
2016-07-07 20:27                                   ` Topi Miettinen
2016-07-08  9:13                                     ` Petr Mladek
2016-07-09 16:38                                       ` Topi Miettinen
2016-07-10  9:04                                       ` Topi Miettinen [this message]
2016-07-10  9:04                                         ` Topi Miettinen
2016-06-23 23:46 ` Andrew Morton
2016-06-23 23:46   ` Andrew Morton
     [not found]   ` <20160623164614.cc871a52402fca6179bef246-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b@public.gmane.org>
2016-06-24  1:14     ` Topi Miettinen
2016-06-24  1:14       ` Topi Miettinen
     [not found]       ` <6a129f00-dbef-9916-ffaa-792b6b413362-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-06-24  4:15         ` Andy Lutomirski
2016-06-24  4:15           ` Andy Lutomirski
     [not found]           ` <CALCETrUBLxL116XrD6V9--H17q84KrMTYHm8UJF_kO3Reoh2pw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-06-25 18:00             ` Djalal Harouni
2016-06-25 18:00               ` Djalal Harouni

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:1931679 dfblob:1190559 dfblob:1931679 dfblob:1190559 )
 OR (
bs:"cgroup: check options and capabilities before locking to" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5e3cd49c-2ae1-efc1-ea85-afa1ff7b22cb@gmail.com \
    --to=toiwoton@gmail.com \
    --cc=David.Woodhouse@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=ard.biesheuvel@linaro.org \
    --cc=cgroups@vger.kernel.org \
    --cc=corbet@lwn.net \
    --cc=dhowells@redhat.com \
    --cc=ebiederm@xmission.com \
    --cc=hannes@cmpxchg.org \
    --cc=james.l.morris@oracle.com \
    --cc=keescook@chromium.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-security-module@vger.kernel.org \
    --cc=lizefan@huawei.com \
    --cc=luto@kernel.org \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=pmladek@suse.com \
    --cc=serge@hallyn.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.