Re: Bisected: Kernel deadlock bug related to cgroups.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Oleg Nesterov <oleg@redhat.com>
To: Brent Lovelace <brent.lovelace@candelatech.com>
Cc: linux-kernel@vger.kernel.org, lizefan@huawei.com, tj@kernel.org,
	Balbir Singh <bsingharora@gmail.com>
Subject: Re: Bisected: Kernel deadlock bug related to cgroups.
Date: Wed, 31 Aug 2016 12:16:40 +0200	[thread overview]
Message-ID: <20160831101639.GA3919@redhat.com> (raw)
In-Reply-To: <4395815f-e115-6666-83d6-eb2f67d22227@candelatech.com>

On 08/30, Brent Lovelace wrote:
>
> I found a kernel deadlock regression bug introduced in the 4.4 kernel.
...
> I bisected to this commit id:
> ----------------------------------------------------------------------------------
> commit c9e75f0492b248aeaa7af8991a6fc9a21506bc96
> Author: Oleg Nesterov <oleg@redhat.com>
> Date:   Fri Nov 27 19:57:19 2015 +0100
>
>      cgroup: pids: fix race between cgroup_post_fork() and cgroup_migrate()

Thanks Brent!

> systemd         D ffff88007dfcfd10     0     1      0 0x00000000
>   ffff88007dfcfd10 00ff88007dfcfcd8 0000000000000001 ffff88007e296f80
>   ffff88007dff8000 ffff88007dfd0000 ffffffff82bae220 ffff880036758c00
>   fffffffffffffff2 ffff88005e327b00 ffff88007dfcfd28 ffffffff816c571f
> Call Trace:
>   [<ffffffff816c571f>] schedule+0x7a/0x8f
>   [<ffffffff81126ed2>] percpu_down_write+0xad/0xc4
>   [<ffffffff811220c8>] ? wake_up_atomic_t+0x25/0x25
>   [<ffffffff81169b4d>] __cgroup_procs_write+0x72/0x229
>   [<ffffffff8112b2eb>] ? lock_acquire+0x103/0x18f

so it sleeps in wait_event() waiting for active readers, and the new
readers will block. In particular, do_exit() will block.

> kworker/u8:2    D ffff880036befb58     0   185      2 0x00000000
> Workqueue: netns cleanup_net
>   ffff880036befb58 00ff880036befbd0 ffffffff00000002 ffff88007e316f80
>   ffff8800783e8000 ffff880036bf0000 ffff88005917bed0 ffff8800783e8000
>   ffffffff816c8953 ffff88005917bed8 ffff880036befb70 ffffffff816c571f
> Call Trace:
>   [<ffffffff816c8953>] ? usleep_range+0x3a/0x3a
>   [<ffffffff816c571f>] schedule+0x7a/0x8f
>   [<ffffffff816c8982>] schedule_timeout+0x2f/0xd8
>   [<ffffffff816c9204>] ? _raw_spin_unlock_irq+0x27/0x3f
>   [<ffffffff816c8953>] ? usleep_range+0x3a/0x3a
>   [<ffffffff81129ea2>] ? trace_hardirqs_on_caller+0x16f/0x18b
>   [<ffffffff816c5f44>] do_wait_for_common+0xf0/0x127
>   [<ffffffff816c5f44>] ? do_wait_for_common+0xf0/0x127
>   [<ffffffff811101c3>] ? wake_up_q+0x42/0x42
>   [<ffffffff816c5ff2>] wait_for_common+0x36/0x50
>   [<ffffffff816c6024>] wait_for_completion+0x18/0x1a
>   [<ffffffff81106b5f>] kthread_stop+0xc8/0x217
>   [<ffffffffa06d440c>] pg_net_exit+0xbc/0x112 [pktgen]
>   [<ffffffff81604fa2>] ops_exit_list+0x3d/0x4e
>   [<ffffffff81606123>] cleanup_net+0x19f/0x234
>   [<ffffffff81101aa3>] process_one_work+0x237/0x46b
>   [<ffffffff8110216d>] worker_thread+0x1e7/0x292
>   [<ffffffff81101f86>] ? rescuer_thread+0x285/0x285
>   [<ffffffff81106ded>] kthread+0xc4/0xcc
>   [<ffffffff81106d29>] ? kthread_parkme+0x1f/0x1f
>   [<ffffffff816c9d6f>] ret_from_fork+0x3f/0x70
>   [<ffffffff81106d29>] ? kthread_parkme+0x1f/0x1f
> 3 locks held by kworker/u8:2/185:
>   #0:  ("%s""netns"){.+.+.+}, at: [<ffffffff811019ad>]
> process_one_work+0x141/0x46b
>   #1:  (net_cleanup_work){+.+.+.}, at: [<ffffffff811019ad>]
> process_one_work+0x141/0x46b
>   #2:  (net_mutex){+.+.+.}, at: [<ffffffff81605ffe>] cleanup_net+0x7a/0x234

Note that it sleeps with net_mutex held. Probably waiting for kpktgend_*
below.

> vsftpd          D ffff880054867c68     0  4352   2611 0x00000000
>   ffff880054867c68 00ff88005933a480 ffff880000000000 ffff88007e216f80
>   ffff88005933a480 ffff880054868000 0000000000000246 ffff880054867cc0
>   ffff88005933a480 ffffffff81cea268 ffff880054867c80 ffffffff816c571f
> Call Trace:
>   [<ffffffff816c571f>] schedule+0x7a/0x8f
>   [<ffffffff816c59b1>] schedule_preempt_disabled+0x10/0x19
>   [<ffffffff816c64c0>] mutex_lock_nested+0x1c0/0x3a0
>   [<ffffffff81606233>] ? copy_net_ns+0x7b/0xf8
>   [<ffffffff81606233>] copy_net_ns+0x7b/0xf8
>   [<ffffffff81606233>] ? copy_net_ns+0x7b/0xf8
>   [<ffffffff811073c9>] create_new_namespaces+0xfc/0x16b
>   [<ffffffff8110759c>] copy_namespaces+0x164/0x186
>   [<ffffffff810ea6b1>] copy_process+0x10d2/0x195d
>   [<ffffffff810eb094>] _do_fork+0x8c/0x2fb
>   [<ffffffff81003044>] ? lockdep_sys_exit_thunk+0x12/0x14
>   [<ffffffff810eb375>] SyS_clone+0x14/0x16
>   [<ffffffff816c99b6>] entry_SYSCALL_64_fastpath+0x16/0x76
> 2 locks held by vsftpd/4352:
>   #0:  (&cgroup_threadgroup_rwsem){++++++}, at: [<ffffffff810e9b97>]
> copy_process+0x5b8/0x195d
>   #1:  (net_mutex){+.+.+.}, at: [<ffffffff81606233>] copy_net_ns+0x7b/0xf8

This waits for net_mutex held by kworker/u8:2 above. And with
cgroup_threadgroup_rwsem acquired for reading, that is why systemd
above hangs.

> kpktgend_0      D ffff88005917bce8     0  4354      2 0x00000000
>   ffff88005917bce8 00ffffffa06d5d06 ffff880000000000 ffff88007e216f80
>   ffff88007a4ec900 ffff88005917c000 ffff88007a4ec900 ffffffffa06d5d06
>   ffff88005917bed0 0000000000000000 ffff88005917bd00 ffffffff816c571f
> Call Trace:
>   [<ffffffffa06d5d06>] ? pg_net_init+0x346/0x346 [pktgen]
>   [<ffffffff816c571f>] schedule+0x7a/0x8f
>   [<ffffffff816c85ee>] rwsem_down_read_failed+0xdc/0xf8
>   [<ffffffff8136d634>] call_rwsem_down_read_failed+0x14/0x30
>   [<ffffffff8136d634>] ? call_rwsem_down_read_failed+0x14/0x30
>   [<ffffffff810f91e0>] ? exit_signals+0x17/0x103
>   [<ffffffff81126dc4>] ? percpu_down_read+0x4d/0x5f
>   [<ffffffff810f91e0>] exit_signals+0x17/0x103
>   [<ffffffff810ed342>] do_exit+0x105/0x9a4
>   [<ffffffffa06d5d06>] ? pg_net_init+0x346/0x346 [pktgen]
>   [<ffffffff81106df5>] kthread+0xcc/0xcc
>   [<ffffffff81106d29>] ? kthread_parkme+0x1f/0x1f
>   [<ffffffff816c9d6f>] ret_from_fork+0x3f/0x70
>   [<ffffffff81106d29>] ? kthread_parkme+0x1f/0x1f
> 1 lock held by kpktgend_0/4354:
>   #0:  (&cgroup_threadgroup_rwsem){++++++}, at: [<ffffffff810f91e0>]

it can't take cgroup_threadgroup_rwsem for reading, so it can't exit,
and that is why kworker/u8:2 hangs.

> kpktgend_1      D ffff88007a4e3ce8     0  4355      2 0x00000000
...
> kpktgend_2      D ffff8800549f7ce8     0  4356      2 0x00000000
...
> kpktgend_3      D ffff88005e2b7ce8     0  4357      2 0x00000000
...

The same.

Could you try the recent 568ac888215c7fb2fab "cgroup: reduce read
locked section of cgroup_threadgroup_rwsem during fork" patch?
Attached below.


With this patch copy_net_ns() should be called outside of
cgroup_threadgroup_rwsem, the deadlock should hopefully go away.

Thanks,

Oleg.
---

commit 568ac888215c7fb2fabe8ea739b00ec3c1f5d440
Author: Balbir Singh <bsingharora@gmail.com>
Date:   Wed Aug 10 15:43:06 2016 -0400

    cgroup: reduce read locked section of cgroup_threadgroup_rwsem during fork
    
    cgroup_threadgroup_rwsem is acquired in read mode during process exit
    and fork.  It is also grabbed in write mode during
    __cgroups_proc_write().  I've recently run into a scenario with lots
    of memory pressure and OOM and I am beginning to see
    
    systemd
    
     __switch_to+0x1f8/0x350
     __schedule+0x30c/0x990
     schedule+0x48/0xc0
     percpu_down_write+0x114/0x170
     __cgroup_procs_write.isra.12+0xb8/0x3c0
     cgroup_file_write+0x74/0x1a0
     kernfs_fop_write+0x188/0x200
     __vfs_write+0x6c/0xe0
     vfs_write+0xc0/0x230
     SyS_write+0x6c/0x110
     system_call+0x38/0xb4
    
    This thread is waiting on the reader of cgroup_threadgroup_rwsem to
    exit.  The reader itself is under memory pressure and has gone into
    reclaim after fork. There are times the reader also ends up waiting on
    oom_lock as well.
    
     __switch_to+0x1f8/0x350
     __schedule+0x30c/0x990
     schedule+0x48/0xc0
     jbd2_log_wait_commit+0xd4/0x180
     ext4_evict_inode+0x88/0x5c0
     evict+0xf8/0x2a0
     dispose_list+0x50/0x80
     prune_icache_sb+0x6c/0x90
     super_cache_scan+0x190/0x210
     shrink_slab.part.15+0x22c/0x4c0
     shrink_zone+0x288/0x3c0
     do_try_to_free_pages+0x1dc/0x590
     try_to_free_pages+0xdc/0x260
     __alloc_pages_nodemask+0x72c/0xc90
     alloc_pages_current+0xb4/0x1a0
     page_table_alloc+0xc0/0x170
     __pte_alloc+0x58/0x1f0
     copy_page_range+0x4ec/0x950
     copy_process.isra.5+0x15a0/0x1870
     _do_fork+0xa8/0x4b0
     ppc_clone+0x8/0xc
    
    In the meanwhile, all processes exiting/forking are blocked almost
    stalling the system.
    
    This patch moves the threadgroup_change_begin from before
    cgroup_fork() to just before cgroup_canfork().  There is no nee to
    worry about threadgroup changes till the task is actually added to the
    threadgroup.  This avoids having to call reclaim with
    cgroup_threadgroup_rwsem held.
    
    tj: Subject and description edits.
    
    Signed-off-by: Balbir Singh <bsingharora@gmail.com>
    Acked-by: Zefan Li <lizefan@huawei.com>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Andrew Morton <akpm@linux-foundation.org>
    Cc: stable@vger.kernel.org # v4.2+
    Signed-off-by: Tejun Heo <tj@kernel.org>

diff --git a/kernel/fork.c b/kernel/fork.c
index 52e725d..aaf7823 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1404,7 +1404,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	p->real_start_time = ktime_get_boot_ns();
 	p->io_context = NULL;
 	p->audit_context = NULL;
-	threadgroup_change_begin(current);
 	cgroup_fork(p);
 #ifdef CONFIG_NUMA
 	p->mempolicy = mpol_dup(p->mempolicy);
@@ -1556,6 +1555,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 	INIT_LIST_HEAD(&p->thread_group);
 	p->task_works = NULL;
 
+	threadgroup_change_begin(current);
 	/*
 	 * Ensure that the cgroup subsystem policies allow the new process to be
 	 * forked. It should be noted the the new process's css_set can be changed
@@ -1656,6 +1656,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 bad_fork_cancel_cgroup:
 	cgroup_cancel_fork(p);
 bad_fork_free_pid:
+	threadgroup_change_end(current);
 	if (pid != &init_struct_pid)
 		free_pid(pid);
 bad_fork_cleanup_thread:
@@ -1688,7 +1689,6 @@ bad_fork_cleanup_policy:
 	mpol_put(p->mempolicy);
 bad_fork_cleanup_threadgroup_lock:
 #endif
-	threadgroup_change_end(current);
 	delayacct_tsk_free(p);
 bad_fork_cleanup_count:
 	atomic_dec(&p->cred->user->processes);

next prev parent reply	other threads:[~2016-08-31 10:17 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-08-30 18:39 Bisected: Kernel deadlock bug related to cgroups Brent Lovelace
2016-08-31 10:16 ` Oleg Nesterov [this message]
2016-08-31 10:33   ` Balbir Singh
2016-08-31 23:19     ` Brent Lovelace

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:52e725d dfblob:aaf7823 )
 OR (
bs:"Re: Bisected: Kernel deadlock bug related to cgroups." )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20160831101639.GA3919@redhat.com \
    --to=oleg@redhat.com \
    --cc=brent.lovelace@candelatech.com \
    --cc=bsingharora@gmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=lizefan@huawei.com \
    --cc=tj@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.