From: Oleg Nesterov <oleg@redhat.com>
To: Brent Lovelace <brent.lovelace@candelatech.com>
Cc: linux-kernel@vger.kernel.org, lizefan@huawei.com, tj@kernel.org,
Balbir Singh <bsingharora@gmail.com>
Subject: Re: Bisected: Kernel deadlock bug related to cgroups.
Date: Wed, 31 Aug 2016 12:16:40 +0200
Message-ID: <20160831101639.GA3919@redhat.com>
In-Reply-To: <4395815f-e115-6666-83d6-eb2f67d22227@candelatech.com>
On 08/30, Brent Lovelace wrote:
>
> I found a kernel deadlock regression bug introduced in the 4.4 kernel.
...
> I bisected to this commit id:
> ----------------------------------------------------------------------------------
> commit c9e75f0492b248aeaa7af8991a6fc9a21506bc96
> Author: Oleg Nesterov <oleg@redhat.com>
> Date: Fri Nov 27 19:57:19 2015 +0100
>
> cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()
Thanks Brent!
> systemd D ffff88007dfcfd10 0 1 0 0x00000000
> ffff88007dfcfd10 00ff88007dfcfcd8 0000000000000001 ffff88007e296f80
> ffff88007dff8000 ffff88007dfd0000 ffffffff82bae220 ffff880036758c00
> fffffffffffffff2 ffff88005e327b00 ffff88007dfcfd28 ffffffff816c571f
> Call Trace:
> [<ffffffff816c571f>] schedule+0x7a/0x8f
> [<ffffffff81126ed2>] percpu_down_write+0xad/0xc4
> [<ffffffff811220c8>] ? wake_up_atomic_t+0x25/0x25
> [<ffffffff81169b4d>] __cgroup_procs_write+0x72/0x229
> [<ffffffff8112b2eb>] ? lock_acquire+0x103/0x18f
So systemd sleeps in wait_event() inside percpu_down_write(), waiting
for the active readers to leave, and any new readers will block behind
it. In particular, do_exit() will block.
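To make the "pending writer blocks new readers" behaviour concrete, here
is a minimal user-space sketch. It is a toy model, not the kernel's
percpu_rw_semaphore; the struct and the toy_* function names are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a writer-preferring rwsem (NOT the kernel implementation). */
struct toy_rwsem {
	int  active_readers;	/* readers currently inside the read section */
	bool writer_pending;	/* a writer has announced itself and is waiting */
};

/* Returns true if the reader got in, false if it would have to sleep. */
static bool toy_try_down_read(struct toy_rwsem *s)
{
	if (s->writer_pending)
		return false;	/* new readers queue up behind the writer */
	s->active_readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *s)
{
	assert(s->active_readers > 0);
	s->active_readers--;
}

/* Announce the writer; returns true only once no active reader remains. */
static bool toy_try_down_write(struct toy_rwsem *s)
{
	s->writer_pending = true;
	return s->active_readers == 0;
}
```

In this model, once toy_try_down_write() has been called while a reader
is still active, every later toy_try_down_read() fails until that reader
calls toy_up_read() and the writer gets through. That is the situation
the exiting tasks are in here.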
> kworker/u8:2 D ffff880036befb58 0 185 2 0x00000000
> Workqueue: netns cleanup_net
> ffff880036befb58 00ff880036befbd0 ffffffff00000002 ffff88007e316f80
> ffff8800783e8000 ffff880036bf0000 ffff88005917bed0 ffff8800783e8000
> ffffffff816c8953 ffff88005917bed8 ffff880036befb70 ffffffff816c571f
> Call Trace:
> [<ffffffff816c8953>] ? usleep_range+0x3a/0x3a
> [<ffffffff816c571f>] schedule+0x7a/0x8f
> [<ffffffff816c8982>] schedule_timeout+0x2f/0xd8
> [<ffffffff816c9204>] ? _raw_spin_unlock_irq+0x27/0x3f
> [<ffffffff816c8953>] ? usleep_range+0x3a/0x3a
> [<ffffffff81129ea2>] ? trace_hardirqs_on_caller+0x16f/0x18b
> [<ffffffff816c5f44>] do_wait_for_common+0xf0/0x127
> [<ffffffff816c5f44>] ? do_wait_for_common+0xf0/0x127
> [<ffffffff811101c3>] ? wake_up_q+0x42/0x42
> [<ffffffff816c5ff2>] wait_for_common+0x36/0x50
> [<ffffffff816c6024>] wait_for_completion+0x18/0x1a
> [<ffffffff81106b5f>] kthread_stop+0xc8/0x217
> [<ffffffffa06d440c>] pg_net_exit+0xbc/0x112 [pktgen]
> [<ffffffff81604fa2>] ops_exit_list+0x3d/0x4e
> [<ffffffff81606123>] cleanup_net+0x19f/0x234
> [<ffffffff81101aa3>] process_one_work+0x237/0x46b
> [<ffffffff8110216d>] worker_thread+0x1e7/0x292
> [<ffffffff81101f86>] ? rescuer_thread+0x285/0x285
> [<ffffffff81106ded>] kthread+0xc4/0xcc
> [<ffffffff81106d29>] ? kthread_parkme+0x1f/0x1f
> [<ffffffff816c9d6f>] ret_from_fork+0x3f/0x70
> [<ffffffff81106d29>] ? kthread_parkme+0x1f/0x1f
> 3 locks held by kworker/u8:2/185:
> #0: ("%s""netns"){.+.+.+}, at: [<ffffffff811019ad>]
> process_one_work+0x141/0x46b
> #1: (net_cleanup_work){+.+.+.}, at: [<ffffffff811019ad>]
> process_one_work+0x141/0x46b
> #2: (net_mutex){+.+.+.}, at: [<ffffffff81605ffe>] cleanup_net+0x7a/0x234
Note that it sleeps with net_mutex held, probably in kthread_stop()
waiting for one of the kpktgend_* threads below to exit.
> vsftpd D ffff880054867c68 0 4352 2611 0x00000000
> ffff880054867c68 00ff88005933a480 ffff880000000000 ffff88007e216f80
> ffff88005933a480 ffff880054868000 0000000000000246 ffff880054867cc0
> ffff88005933a480 ffffffff81cea268 ffff880054867c80 ffffffff816c571f
> Call Trace:
> [<ffffffff816c571f>] schedule+0x7a/0x8f
> [<ffffffff816c59b1>] schedule_preempt_disabled+0x10/0x19
> [<ffffffff816c64c0>] mutex_lock_nested+0x1c0/0x3a0
> [<ffffffff81606233>] ? copy_net_ns+0x7b/0xf8
> [<ffffffff81606233>] copy_net_ns+0x7b/0xf8
> [<ffffffff81606233>] ? copy_net_ns+0x7b/0xf8
> [<ffffffff811073c9>] create_new_namespaces+0xfc/0x16b
> [<ffffffff8110759c>] copy_namespaces+0x164/0x186
> [<ffffffff810ea6b1>] copy_process+0x10d2/0x195d
> [<ffffffff810eb094>] _do_fork+0x8c/0x2fb
> [<ffffffff81003044>] ? lockdep_sys_exit_thunk+0x12/0x14
> [<ffffffff810eb375>] SyS_clone+0x14/0x16
> [<ffffffff816c99b6>] entry_SYSCALL_64_fastpath+0x16/0x76
> 2 locks held by vsftpd/4352:
> #0: (&cgroup_threadgroup_rwsem){++++++}, at: [<ffffffff810e9b97>]
> copy_process+0x5b8/0x195d
> #1: (net_mutex){+.+.+.}, at: [<ffffffff81606233>] copy_net_ns+0x7b/0xf8
This waits for net_mutex, held by kworker/u8:2 above, and it does so
with cgroup_threadgroup_rwsem acquired for reading; that is why systemd
above hangs.
> kpktgend_0 D ffff88005917bce8 0 4354 2 0x00000000
> ffff88005917bce8 00ffffffa06d5d06 ffff880000000000 ffff88007e216f80
> ffff88007a4ec900 ffff88005917c000 ffff88007a4ec900 ffffffffa06d5d06
> ffff88005917bed0 0000000000000000 ffff88005917bd00 ffffffff816c571f
> Call Trace:
> [<ffffffffa06d5d06>] ? pg_net_init+0x346/0x346 [pktgen]
> [<ffffffff816c571f>] schedule+0x7a/0x8f
> [<ffffffff816c85ee>] rwsem_down_read_failed+0xdc/0xf8
> [<ffffffff8136d634>] call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff8136d634>] ? call_rwsem_down_read_failed+0x14/0x30
> [<ffffffff810f91e0>] ? exit_signals+0x17/0x103
> [<ffffffff81126dc4>] ? percpu_down_read+0x4d/0x5f
> [<ffffffff810f91e0>] exit_signals+0x17/0x103
> [<ffffffff810ed342>] do_exit+0x105/0x9a4
> [<ffffffffa06d5d06>] ? pg_net_init+0x346/0x346 [pktgen]
> [<ffffffff81106df5>] kthread+0xcc/0xcc
> [<ffffffff81106d29>] ? kthread_parkme+0x1f/0x1f
> [<ffffffff816c9d6f>] ret_from_fork+0x3f/0x70
> [<ffffffff81106d29>] ? kthread_parkme+0x1f/0x1f
> 1 lock held by kpktgend_0/4354:
> #0: (&cgroup_threadgroup_rwsem){++++++}, at: [<ffffffff810f91e0>]
It can't take cgroup_threadgroup_rwsem for reading while systemd's
percpu_down_write() is pending, so it can't exit, and that is why
kworker/u8:2 hangs.
> kpktgend_1 D ffff88007a4e3ce8 0 4355 2 0x00000000
...
> kpktgend_2 D ffff8800549f7ce8 0 4356 2 0x00000000
...
> kpktgend_3 D ffff88005e2b7ce8 0 4357 2 0x00000000
...
The same.
Could you try the recent 568ac888215c7fb2fab "cgroup: reduce read
locked section of cgroup_threadgroup_rwsem during fork" patch?
Attached below.
With this patch copy_net_ns() should be called outside of
cgroup_threadgroup_rwsem, so the deadlock should hopefully go away.
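To summarize the chain above, here is a minimal user-space sketch of the
wait-for relation between the four tasks. The task names come from the
traces; the enum, the two tables and cycle_length() are hypothetical
helpers, and the "after" table assumes the patch has its intended
effect (vsftpd no longer holds the rwsem while it waits for net_mutex).

```c
#include <assert.h>

enum task { NOBODY = -1, SYSTEMD, VSFTPD, KWORKER, KPKTGEND, NR_TASKS };

/* before_patch[t] is the task that t cannot make progress without. */
static const enum task before_patch[NR_TASKS] = {
	[SYSTEMD]  = VSFTPD,	/* writer waits for the active reader */
	[VSFTPD]   = KWORKER,	/* wants net_mutex, held in cleanup_net() */
	[KWORKER]  = KPKTGEND,	/* kthread_stop() waits for the thread to exit */
	[KPKTGEND] = SYSTEMD,	/* exit needs the rwsem for reading, writer pending */
};

/* Assumed effect of the patch: the writer no longer waits on vsftpd. */
static const enum task after_patch[NR_TASKS] = {
	[SYSTEMD]  = NOBODY,
	[VSFTPD]   = KWORKER,
	[KWORKER]  = KPKTGEND,
	[KPKTGEND] = SYSTEMD,
};

/*
 * Follow the wait-for edges from @start. Return the number of hops
 * needed to get back to @start, or 0 if the chain ends first.
 */
static int cycle_length(const enum task waits_on[NR_TASKS], enum task start)
{
	enum task t = start;

	assert(start >= 0 && start < NR_TASKS);
	for (int hops = 1; hops <= NR_TASKS; hops++) {
		t = waits_on[t];
		if (t == NOBODY)
			return 0;	/* chain ends, somebody can run */
		if (t == start)
			return hops;	/* we came back: deadlock cycle */
	}
	return 0;
}
```

With before_patch, starting from any of the four tasks leads back to the
same task after four hops, which is the deadlock. With after_patch the
chain ends at systemd, so the tasks can make progress in turn.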
Thanks,
Oleg.
---
commit 568ac888215c7fb2fabe8ea739b00ec3c1f5d440
Author: Balbir Singh <bsingharora@gmail.com>
Date: Wed Aug 10 15:43:06 2016 -0400
cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork
cgroup_threadgroup_rwsem is acquired in read mode during process exit
and fork. It is also grabbed in write mode during
__cgroup_procs_write(). I've recently run into a scenario with lots
of memory pressure and OOM and I am beginning to see
systemd
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
percpu_down_write+0x114/0x170
__cgroup_procs_write.isra.12+0xb8/0x3c0
cgroup_file_write+0x74/0x1a0
kernfs_fop_write+0x188/0x200
__vfs_write+0x6c/0xe0
vfs_write+0xc0/0x230
SyS_write+0x6c/0x110
system_call+0x38/0xb4
This thread is waiting on the reader of cgroup_threadgroup_rwsem to
exit. The reader itself is under memory pressure and has gone into
reclaim after fork. At times the reader also ends up waiting on
oom_lock.
__switch_to+0x1f8/0x350
__schedule+0x30c/0x990
schedule+0x48/0xc0
jbd2_log_wait_commit+0xd4/0x180
ext4_evict_inode+0x88/0x5c0
evict+0xf8/0x2a0
dispose_list+0x50/0x80
prune_icache_sb+0x6c/0x90
super_cache_scan+0x190/0x210
shrink_slab.part.15+0x22c/0x4c0
shrink_zone+0x288/0x3c0
do_try_to_free_pages+0x1dc/0x590
try_to_free_pages+0xdc/0x260
__alloc_pages_nodemask+0x72c/0xc90
alloc_pages_current+0xb4/0x1a0
page_table_alloc+0xc0/0x170
__pte_alloc+0x58/0x1f0
copy_page_range+0x4ec/0x950
copy_process.isra.5+0x15a0/0x1870
_do_fork+0xa8/0x4b0
ppc_clone+0x8/0xc
In the meantime, all exiting/forking processes are blocked, almost
stalling the system.
This patch moves the threadgroup_change_begin from before
cgroup_fork() to just before cgroup_canfork(). There is no need to
worry about threadgroup changes till the task is actually added to the
threadgroup. This avoids having to call reclaim with
cgroup_threadgroup_rwsem held.
tj: Subject and description edits.
Signed-off-by: Balbir Singh <bsingharora@gmail.com>
Acked-by: Zefan Li <lizefan@huawei.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org # v4.2+
Signed-off-by: Tejun Heo <tj@kernel.org>
diff --git a/kernel/fork.c b/kernel/fork.c
index 52e725d..aaf7823 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1404,7 +1404,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->real_start_time = ktime_get_boot_ns();
p->io_context = NULL;
p->audit_context = NULL;
- threadgroup_change_begin(current);
cgroup_fork(p);
#ifdef CONFIG_NUMA
p->mempolicy = mpol_dup(p->mempolicy);
@@ -1556,6 +1555,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
INIT_LIST_HEAD(&p->thread_group);
p->task_works = NULL;
+ threadgroup_change_begin(current);
/*
* Ensure that the cgroup subsystem policies allow the new process to be
* forked. It should be noted the the new process's css_set can be changed
@@ -1656,6 +1656,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
bad_fork_cancel_cgroup:
cgroup_cancel_fork(p);
bad_fork_free_pid:
+ threadgroup_change_end(current);
if (pid != &init_struct_pid)
free_pid(pid);
bad_fork_cleanup_thread:
@@ -1688,7 +1689,6 @@ bad_fork_cleanup_policy:
mpol_put(p->mempolicy);
bad_fork_cleanup_threadgroup_lock:
#endif
- threadgroup_change_end(current);
delayacct_tsk_free(p);
bad_fork_cleanup_count:
atomic_dec(&p->cred->user->processes);