[stable-3.10.y] possible unsafe locking warning

* [stable-3.10.y] possible unsafe locking warning
@ 2014-05-28 10:06 Gu Zheng
  2014-05-28 14:26 ` Greg KH
  2014-05-28 15:48 ` Tejun Heo
  0 siblings, 2 replies; 8+ messages in thread
From: Gu Zheng @ 2014-05-28 10:06 UTC (permalink / raw)
  To: stable; +Cc: Cgroups, linux-kernel, Yasuaki Ishimatsu, tangchen

Hi all,
When offlining all the memory of a movable NUMA node on kernel stable-3.10.y,
the following possible deadlock warning occurs.

[ 2457.467359] 
[ 2457.485175] =================================
[ 2457.537325] [ INFO: inconsistent lock state ]
[ 2457.589476] 3.10.39+ #4 Not tainted
[ 2457.631218] ---------------------------------
[ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
[ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
[ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
[ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
[ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
[ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
[ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
[ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
[ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
[ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
[ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
[ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
[ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
[ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
[ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
[ 2458.745214] irq event stamp: 49
[ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
[ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
[ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
[ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
[ 2459.195852] 
[ 2459.195852] other info that might help us debug this:
[ 2459.274024]  Possible unsafe locking scenario:
[ 2459.274024] 
[ 2459.344911]        CPU0
[ 2459.374161]        ----
[ 2459.403408]   lock(&sig->group_rwsem);
[ 2459.448490]   <Interrupt>
[ 2459.479825]     lock(&sig->group_rwsem);
[ 2459.526979] 
[ 2459.526979]  *** DEADLOCK ***
[ 2459.526979] 
[ 2459.597866] no locks held by kswapd2/1151.
[ 2459.646896] 
[ 2459.646896] stack backtrace:
[ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
[ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
[ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
[ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
[ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
[ 2460.163087] Call Trace:
[ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
[ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
[ 2460.322679]  [<ffffffff810be0e0>] ? check_usage_backwards+0x160/0x160
[ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
[ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
[ 2460.530136]  [<ffffffff8101acd3>] ? native_sched_clock+0x13/0x80
[ 2460.602065]  [<ffffffff8101ad49>] ? sched_clock+0x9/0x10
[ 2460.665668]  [<ffffffff81096f05>] ? sched_clock_cpu+0xb5/0x100
[ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
[ 2460.800156]  [<ffffffff81071864>] ? exit_signals+0x24/0x130
[ 2460.866885]  [<ffffffff81158ca0>] ? balance_pgdat+0x5e0/0x5e0
[ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
[ 2460.996166]  [<ffffffff81071864>] ? exit_signals+0x24/0x130
[ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
[ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
[ 2461.186976]  [<ffffffff810841e0>] ? wake_up_bit+0x30/0x30
[ 2461.251629]  [<ffffffff81158ca0>] ? balance_pgdat+0x5e0/0x5e0
[ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
[ 2461.379870]  [<ffffffff815e12eb>] ? wait_for_completion+0x3b/0x110
[ 2461.453879]  [<ffffffff81082f60>] ? kthread_create_on_node+0x140/0x140
[ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0
[ 2461.596689]  [<ffffffff81082f60>] ? kthread_create_on_node+0x140/0x140

Looking at the related code (kernel 3.10.y), it seems that cgroup_attach_task() (thread-2,
attaching kswapd) triggers kswapd (to reclaim memory?) when it allocates memory (flex_array_alloc())
while holding sig->group_rwsem. Meanwhile, kswapd (thread-1) is in its exit routine
(because it was marked KTHREAD_SHOULD_STOP when the page offlining completed), which needs to acquire
sig->group_rwsem in exit_signals(), so the deadlock occurs.

       thread-1                                                  |            thread-2
                                                                 |
__offline_pages():                                               | system_call_fastpath()
|-> kswapd_stop(node);                                           | |-> ......
    |-> kthread_stop(kswapd)                                     | |-> cgroup_file_write()
        |-> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);       | |-> ......
        |-> wake_up_process(k)                                   | |-> attach_task_by_pid()
            |                                                    |     |-> threadgroup_lock(tsk)
|<----------|                                                    |        // Here, got the lock.
|-> kswapd()                                                     |    |-> ...
    |-> if (kthread_should_stop())                               |     |-> cgroup_attach_task()
            return;                                              |         |-> flex_array_alloc()
            |                                                    |             |-> kzalloc()
|<----------|                                                    |                |-> wait for kswapd to reclaim memory
|-> kthread()                                                    |
    |-> do_exit(ret)                                             |
        |-> exit_signals()                                       |
            |-> threadgroup_change_begin(tsk)                    |
                |-> down_read(&tsk->signal->group_rwsem)         |
                    // Here, tries to acquire the lock.

If my analysis is correct, the latest kernel may have the same issue: although the flex_array was replaced
by a list, we still need to allocate memory (e.g. in find_css_set()), so the race may still occur.
Any comments about this? If I missed something, please correct me. :)
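
As a sanity check on the diagram above, the two waits can be written down as a
tiny wait-for graph and searched for a cycle. This is only a Python sketch: the
node names are illustrative labels, and the edge from thread-2 to kswapd
assumes that the allocation really blocks until kswapd makes progress, which is
exactly the open question.

```python
# Toy wait-for graph for the two threads in the diagram.
# Keys wait for the nodes listed as their values.
WAITS_FOR = {
    # Assumption: thread-2 holds group_rwsem and its allocation cannot
    # complete until kswapd reclaims memory.
    "thread-2": ["kswapd"],
    # kswapd is exiting and blocks in down_read() on the rwsem that
    # thread-2 holds.
    "kswapd": ["thread-2"],
}


def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    visiting, done = set(), set()

    def visit(node, path):
        if node in visiting:
            # node is one of its own ancestors: cut the path at its first use
            return path[path.index(node):] + [node]
        if node in done:
            return None
        visiting.add(node)
        for nxt in graph.get(node, []):
            cycle = visit(nxt, path + [node])
            if cycle:
                return cycle
        visiting.discard(node)
        done.add(node)
        return None

    for start in graph:
        cycle = visit(start, [])
        if cycle:
            return cycle
    return None


if __name__ == "__main__":
    print(" -> ".join(find_cycle(WAITS_FOR)))
```

If either edge is removed (for example, if the allocation does not actually
wait for this kswapd), find_cycle() returns None and there is no deadlock in
the model.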

Regards,
Gu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [stable-3.10.y] possible unsafe locking warning
  2014-05-28 10:06 [stable-3.10.y] possible unsafe locking warning Gu Zheng
@ 2014-05-28 14:26 ` Greg KH
  2014-05-29  2:53   ` Gu Zheng
  2014-05-29  2:53   ` Gu Zheng
  2014-05-28 15:48 ` Tejun Heo
  1 sibling, 2 replies; 8+ messages in thread
From: Greg KH @ 2014-05-28 14:26 UTC (permalink / raw)
  To: Gu Zheng; +Cc: stable, Cgroups, linux-kernel, Yasuaki Ishimatsu, tangchen

On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
> Hi all,
> When offlining all the memory of a movable NUMA node on kernel stable-3.10.y,
> the following possible deadlock warning occurs.
> 
> [ 2457.467359] 
> [ 2457.485175] =================================
> [ 2457.537325] [ INFO: inconsistent lock state ]
> [ 2457.589476] 3.10.39+ #4 Not tainted
> [ 2457.631218] ---------------------------------
> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
> [ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
> [ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
> [ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
> [ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
> [ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
> [ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
> [ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
> [ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
> [ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
> [ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
> [ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
> [ 2458.745214] irq event stamp: 49
> [ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
> [ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
> [ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
> [ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
> [ 2459.195852] 
> [ 2459.195852] other info that might help us debug this:
> [ 2459.274024]  Possible unsafe locking scenario:
> [ 2459.274024] 
> [ 2459.344911]        CPU0
> [ 2459.374161]        ----
> [ 2459.403408]   lock(&sig->group_rwsem);
> [ 2459.448490]   <Interrupt>
> [ 2459.479825]     lock(&sig->group_rwsem);
> [ 2459.526979] 
> [ 2459.526979]  *** DEADLOCK ***
> [ 2459.526979] 
> [ 2459.597866] no locks held by kswapd2/1151.
> [ 2459.646896] 
> [ 2459.646896] stack backtrace:
> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
> [ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
> [ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
> [ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
> [ 2460.163087] Call Trace:
> [ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
> [ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
> [ 2460.322679]  [<ffffffff810be0e0>] ? check_usage_backwards+0x160/0x160
> [ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
> [ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
> [ 2460.530136]  [<ffffffff8101acd3>] ? native_sched_clock+0x13/0x80
> [ 2460.602065]  [<ffffffff8101ad49>] ? sched_clock+0x9/0x10
> [ 2460.665668]  [<ffffffff81096f05>] ? sched_clock_cpu+0xb5/0x100
> [ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
> [ 2460.800156]  [<ffffffff81071864>] ? exit_signals+0x24/0x130
> [ 2460.866885]  [<ffffffff81158ca0>] ? balance_pgdat+0x5e0/0x5e0
> [ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
> [ 2460.996166]  [<ffffffff81071864>] ? exit_signals+0x24/0x130
> [ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
> [ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
> [ 2461.186976]  [<ffffffff810841e0>] ? wake_up_bit+0x30/0x30
> [ 2461.251629]  [<ffffffff81158ca0>] ? balance_pgdat+0x5e0/0x5e0
> [ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
> [ 2461.379870]  [<ffffffff815e12eb>] ? wait_for_completion+0x3b/0x110
> [ 2461.453879]  [<ffffffff81082f60>] ? kthread_create_on_node+0x140/0x140
> [ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0
> [ 2461.596689]  [<ffffffff81082f60>] ? kthread_create_on_node+0x140/0x140
> 
> Looking at the related code (kernel 3.10.y), it seems that cgroup_attach_task() (thread-2,
> attaching kswapd) triggers kswapd (to reclaim memory?) when it allocates memory (flex_array_alloc())
> while holding sig->group_rwsem. Meanwhile, kswapd (thread-1) is in its exit routine
> (because it was marked KTHREAD_SHOULD_STOP when the page offlining completed), which needs to acquire
> sig->group_rwsem in exit_signals(), so the deadlock occurs.
> 
>        thread-1                                                  |            thread-2
>                                                                  |
> __offline_pages():                                               | system_call_fastpath()
> |-> kswapd_stop(node);                                           | |-> ......
>     |-> kthread_stop(kswapd)                                     | |-> cgroup_file_write()
>         |-> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);       | |-> ......
>         |-> wake_up_process(k)                                   | |-> attach_task_by_pid()
>             |                                                    |     |-> threadgroup_lock(tsk)
> |<----------|                                                    |        // Here, got the lock.
> |-> kswapd()                                                     |    |-> ...
>     |-> if (kthread_should_stop())                               |     |-> cgroup_attach_task()
>             return;                                              |         |-> flex_array_alloc()
>             |                                                    |             |-> kzalloc()
> |<----------|                                                    |                |-> wait for kswapd to reclaim memory
> |-> kthread()                                                    |
>     |-> do_exit(ret)                                             |
>         |-> exit_signals()                                       |
>             |-> threadgroup_change_begin(tsk)                    |
>                 |-> down_read(&tsk->signal->group_rwsem)         |
>                     // Here, tries to acquire the lock.
> 
> If my analysis is correct, the latest kernel may have the same issue: although the flex_array was replaced
> by a list, we still need to allocate memory (e.g. in find_css_set()), so the race may still occur.
> Any comments about this? If I missed something, please correct me. :)

Can you test the latest kernel release to verify this?  There's nothing
we can do to an old kernel version that isn't already fixed in upstream
first.

greg k-h

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [stable-3.10.y] possible unsafe locking warning
  2014-05-28 10:06 [stable-3.10.y] possible unsafe locking warning Gu Zheng
  2014-05-28 14:26 ` Greg KH
@ 2014-05-28 15:48 ` Tejun Heo
  2014-06-05  5:44   ` Gu Zheng
  1 sibling, 1 reply; 8+ messages in thread
From: Tejun Heo @ 2014-05-28 15:48 UTC (permalink / raw)
  To: Gu Zheng
  Cc: stable, Cgroups, linux-kernel, Yasuaki Ishimatsu, tangchen,
	Johannes Weiner

(cc'ing Johannes for mm-foo)

Hello,

On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
> [ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
> [ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
> [ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
> [ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
> [ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
> [ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
> [ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
> [ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
> [ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
> [ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
> [ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
> [ 2458.745214] irq event stamp: 49
> [ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
> [ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
> [ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
> [ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
> [ 2459.195852] 
> [ 2459.195852] other info that might help us debug this:
> [ 2459.274024]  Possible unsafe locking scenario:
> [ 2459.274024] 
> [ 2459.344911]        CPU0
> [ 2459.374161]        ----
> [ 2459.403408]   lock(&sig->group_rwsem);
> [ 2459.448490]   <Interrupt>
> [ 2459.479825]     lock(&sig->group_rwsem);
> [ 2459.526979] 
> [ 2459.526979]  *** DEADLOCK ***
> [ 2459.526979] 
> [ 2459.597866] no locks held by kswapd2/1151.
> [ 2459.646896] 
> [ 2459.646896] stack backtrace:
> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
> [ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
> [ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
> [ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
> [ 2460.163087] Call Trace:
> [ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
> [ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
> [ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
> [ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
> [ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
> [ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
> [ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
> [ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
> [ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
> [ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0

The lockdep warning is about threadgroup_lock being grabbed by kswapd,
which memory reclaim depends upon, while the lock may be held by tasks
which may themselves wait on memory reclaim.  From the backtrace, it
looks like the right thing to do is marking kswapd as no longer a
memory reclaimer before it starts exiting.
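
For readers less used to lockdep output, the "inconsistent ... usage" line can
be unpacked mechanically. The Python sketch below is only a reading aid: the
function names are hypothetical and the English glosses are a paraphrase, not
text from the kernel.

```python
import re


def describe_usage(state):
    """Describe a lockdep usage state such as 'IN-RECLAIM_FS-R'."""
    # Exactly one of the 'IN-' prefix or the '-ON' suffix must be present.
    m = re.fullmatch(r"(IN-)?([A-Z_]+?)(-ON)?-([RW])", state)
    if not m or bool(m.group(1)) == bool(m.group(3)):
        raise ValueError(f"unrecognized usage state: {state!r}")
    in_ctx, context, _on, mode = m.groups()
    how = "read" if mode == "R" else "write"
    if in_ctx:
        return f"acquired for {how} while running in {context} context"
    return f"held for {how} while {context} could be entered"


def describe_transition(line):
    """Split 'inconsistent {A} -> {B} usage.' into two descriptions."""
    before, after = re.findall(r"\{([^}]+)\}", line)
    return describe_usage(before), describe_usage(after)


if __name__ == "__main__":
    report = "inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage."
    for text in describe_transition(report):
        print(text)
```

Read this way, the first state comes from the cgroup_attach_task() allocation
made with the lock held, and the second from kswapd taking the same lock class
in exit_signals() while still counted as a reclaimer.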

> Looking at the related code (kernel 3.10.y), it seems that cgroup_attach_task() (thread-2,
> attaching kswapd) triggers kswapd (to reclaim memory?) when it allocates memory (flex_array_alloc())
> while holding sig->group_rwsem. Meanwhile, kswapd (thread-1) is in its exit routine
> (because it was marked KTHREAD_SHOULD_STOP when the page offlining completed), which needs to acquire
> sig->group_rwsem in exit_signals(), so the deadlock occurs.
> 
>        thread-1                                                  |            thread-2
>                                                                  |
> __offline_pages():                                               | system_call_fastpath()
> |-> kswapd_stop(node);                                           | |-> ......
>     |-> kthread_stop(kswapd)                                     | |-> cgroup_file_write()
>         |-> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);       | |-> ......
>         |-> wake_up_process(k)                                   | |-> attach_task_by_pid()
>             |                                                    |     |-> threadgroup_lock(tsk)
> |<----------|                                                    |        // Here, got the lock.
> |-> kswapd()                                                     |    |-> ...
>     |-> if (kthread_should_stop())                               |     |-> cgroup_attach_task()
>             return;                                              |         |-> flex_array_alloc()
>             |                                                    |             |-> kzalloc()
> |<----------|                                                    |                |-> wait for kswapd to reclaim memory
> |-> kthread()                                                    |
>     |-> do_exit(ret)                                             |
>         |-> exit_signals()                                       |
>             |-> threadgroup_change_begin(tsk)                    |
>                 |-> down_read(&tsk->signal->group_rwsem)         |
>                     // Here, tries to acquire the lock.
> 
> If my analysis is correct, the latest kernel may have the same issue: although the flex_array was replaced
> by a list, we still need to allocate memory (e.g. in find_css_set()), so the race may still occur.
> Any comments about this? If I missed something, please correct me. :)

Not sure whether this can actually happen, but if so, the right fix
would be making thread-2 not wait for a kswapd which is exiting and can
no longer serve as a memory reclaimer.
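
To make the idea concrete, here is a toy Python model of the bookkeeping
involved. It is a sketch under assumptions, not kernel code: ToyLockdep, run()
and the boolean reclaimer flag are all hypothetical stand-ins, and the model
says nothing about whether the real deadlock is reachable.

```python
class ToyLockdep:
    """Records, per lock name, two kinds of usage and reports overlaps."""

    def __init__(self):
        self.held_over_alloc = set()   # held across a reclaim-capable alloc
        self.taken_in_reclaim = set()  # acquired by a task marked reclaimer

    def alloc_with(self, held_locks):
        self.held_over_alloc.update(held_locks)

    def acquire(self, lock, task_is_reclaimer):
        if task_is_reclaimer:
            self.taken_in_reclaim.add(lock)

    def conflicts(self):
        # A lock in both sets is the pattern the warning complains about.
        return self.held_over_alloc & self.taken_in_reclaim


def run(clear_flag_before_exit):
    dep = ToyLockdep()
    # thread-2: allocation made while holding the lock
    dep.alloc_with({"group_rwsem"})
    # kswapd: counted as a reclaimer while it is doing its job
    is_reclaimer = True
    if clear_flag_before_exit:
        # the suggestion: stop counting as a reclaimer before exiting
        is_reclaimer = False
    # exit path takes the same lock
    dep.acquire("group_rwsem", is_reclaimer)
    return dep.conflicts()


if __name__ == "__main__":
    print("without clearing:", run(False))
    print("with clearing:   ", run(True))
```

In this model the conflict disappears once the exiting thread is no longer
marked as a reclaimer, because its lock acquisition is then treated like that
of any other task.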

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [stable-3.10.y] possible unsafe locking warning
  2014-05-28 14:26 ` Greg KH
@ 2014-05-29  2:53   ` Gu Zheng
  2014-05-29  2:53   ` Gu Zheng
  1 sibling, 0 replies; 8+ messages in thread
From: Gu Zheng @ 2014-05-29  2:53 UTC (permalink / raw)
  To: Greg KH; +Cc: stable, Cgroups, linux-kernel, Yasuaki Ishimatsu, tangchen

Hi Greg,

On 05/28/2014 10:26 PM, Greg KH wrote:

> On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
>> Hi all,
>> When offlining all the memory of a movable NUMA node on kernel stable-3.10.y,
>> the following possible deadlock warning occurs.
>>
>> [ 2457.467359] 
>> [ 2457.485175] =================================
>> [ 2457.537325] [ INFO: inconsistent lock state ]
>> [ 2457.589476] 3.10.39+ #4 Not tainted
>> [ 2457.631218] ---------------------------------
>> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
>> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
>> [ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
>> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
>> [ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
>> [ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
>> [ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
>> [ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
>> [ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
>> [ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
>> [ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
>> [ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
>> [ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
>> [ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
>> [ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
>> [ 2458.745214] irq event stamp: 49
>> [ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
>> [ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
>> [ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
>> [ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
>> [ 2459.195852] 
>> [ 2459.195852] other info that might help us debug this:
>> [ 2459.274024]  Possible unsafe locking scenario:
>> [ 2459.274024] 
>> [ 2459.344911]        CPU0
>> [ 2459.374161]        ----
>> [ 2459.403408]   lock(&sig->group_rwsem);
>> [ 2459.448490]   <Interrupt>
>> [ 2459.479825]     lock(&sig->group_rwsem);
>> [ 2459.526979] 
>> [ 2459.526979]  *** DEADLOCK ***
>> [ 2459.526979] 
>> [ 2459.597866] no locks held by kswapd2/1151.
>> [ 2459.646896] 
>> [ 2459.646896] stack backtrace:
>> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
>> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
>> [ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
>> [ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
>> [ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
>> [ 2460.163087] Call Trace:
>> [ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
>> [ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
>> [ 2460.322679]  [<ffffffff810be0e0>] ? check_usage_backwards+0x160/0x160
>> [ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
>> [ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
>> [ 2460.530136]  [<ffffffff8101acd3>] ? native_sched_clock+0x13/0x80
>> [ 2460.602065]  [<ffffffff8101ad49>] ? sched_clock+0x9/0x10
>> [ 2460.665668]  [<ffffffff81096f05>] ? sched_clock_cpu+0xb5/0x100
>> [ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
>> [ 2460.800156]  [<ffffffff81071864>] ? exit_signals+0x24/0x130
>> [ 2460.866885]  [<ffffffff81158ca0>] ? balance_pgdat+0x5e0/0x5e0
>> [ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
>> [ 2460.996166]  [<ffffffff81071864>] ? exit_signals+0x24/0x130
>> [ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
>> [ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
>> [ 2461.186976]  [<ffffffff810841e0>] ? wake_up_bit+0x30/0x30
>> [ 2461.251629]  [<ffffffff81158ca0>] ? balance_pgdat+0x5e0/0x5e0
>> [ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
>> [ 2461.379870]  [<ffffffff815e12eb>] ? wait_for_completion+0x3b/0x110
>> [ 2461.453879]  [<ffffffff81082f60>] ? kthread_create_on_node+0x140/0x140
>> [ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0
>> [ 2461.596689]  [<ffffffff81082f60>] ? kthread_create_on_node+0x140/0x140
>>
>> Looking at the related code (kernel 3.10.y), it seems that cgroup_attach_task() (thread-2,
>> attaching kswapd) triggers kswapd (to reclaim memory?) when it allocates memory (flex_array_alloc())
>> while holding sig->group_rwsem. Meanwhile, kswapd (thread-1) is in its exit routine
>> (because it was marked KTHREAD_SHOULD_STOP when the page offlining completed), which needs to acquire
>> sig->group_rwsem in exit_signals(), so the deadlock occurs.
>>
>>        thread-1                                                  |            thread-2
>>                                                                  |
>> __offline_pages():                                               | system_call_fastpath()
>> |-> kswapd_stop(node);                                           | |-> ......
>>     |-> kthread_stop(kswapd)                                     | |-> cgroup_file_write()
>>         |-> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);       | |-> ......
>>         |-> wake_up_process(k)                                   | |-> attach_task_by_pid()
>>             |                                                    |     |-> threadgroup_lock(tsk)
>> |<----------|                                                    |        // Here, got the lock.
>> |-> kswapd()                                                     |    |-> ...
>>     |-> if (kthread_should_stop())                               |     |-> cgroup_attach_task()
>>             return;                                              |         |-> flex_array_alloc()
>>             |                                                    |             |-> kzalloc()
>> |<----------|                                                    |                |-> wait for kswapd to reclaim memory
>> |-> kthread()                                                    |
>>     |-> do_exit(ret)                                             |
>>         |-> exit_signals()                                       |
>>             |-> threadgroup_change_begin(tsk)                    |
>>                 |-> down_read(&tsk->signal->group_rwsem)         |
>>                     // Here, tries to acquire the lock.
>>
>> If my analysis is correct, the latest kernel may have the same issue: although the flex_array was replaced
>> by a list, we still need to allocate memory (e.g. in find_css_set()), so the race may still occur.
>> Any comments about this? If I missed something, please correct me. :)
> 
> Can you test the latest kernel release to verify this?  There's nothing
> we can do to an old kernel version that isn't already fixed in upstream
> first.

The latest kernel hits a different lockdep warning during boot, so
I cannot verify this issue on it for now.

Thanks,
Gu 

> 
> greg k-h
> .
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [stable-3.10.y] possible unsafe locking warning
  2014-05-28 14:26 ` Greg KH
  2014-05-29  2:53   ` Gu Zheng
@ 2014-05-29  2:53   ` Gu Zheng
  1 sibling, 0 replies; 8+ messages in thread
From: Gu Zheng @ 2014-05-29  2:53 UTC (permalink / raw)
  To: Greg KH; +Cc: stable, Cgroups, linux-kernel, Yasuaki Ishimatsu, tangchen

Hi Greg,

On 05/28/2014 10:26 PM, Greg KH wrote:

> On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
>> Hi all,
>> When offlining all the memory of a movable NUMA node on kernel stable-3.10.y,
>> the following possible deadlock warning occurs.
>>
>> [ 2457.467359] 
>> [ 2457.485175] =================================
>> [ 2457.537325] [ INFO: inconsistent lock state ]
>> [ 2457.589476] 3.10.39+ #4 Not tainted
>> [ 2457.631218] ---------------------------------
>> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
>> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
>> [ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
>> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
>> [ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
>> [ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
>> [ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
>> [ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
>> [ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
>> [ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
>> [ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
>> [ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
>> [ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
>> [ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
>> [ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
>> [ 2458.745214] irq event stamp: 49
>> [ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
>> [ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
>> [ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
>> [ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
>> [ 2459.195852] 
>> [ 2459.195852] other info that might help us debug this:
>> [ 2459.274024]  Possible unsafe locking scenario:
>> [ 2459.274024] 
>> [ 2459.344911]        CPU0
>> [ 2459.374161]        ----
>> [ 2459.403408]   lock(&sig->group_rwsem);
>> [ 2459.448490]   <Interrupt>
>> [ 2459.479825]     lock(&sig->group_rwsem);
>> [ 2459.526979] 
>> [ 2459.526979]  *** DEADLOCK ***
>> [ 2459.526979] 
>> [ 2459.597866] no locks held by kswapd2/1151.
>> [ 2459.646896] 
>> [ 2459.646896] stack backtrace:
>> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
>> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
>> [ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
>> [ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
>> [ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
>> [ 2460.163087] Call Trace:
>> [ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
>> [ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
>> [ 2460.322679]  [<ffffffff810be0e0>] ? check_usage_backwards+0x160/0x160
>> [ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
>> [ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
>> [ 2460.530136]  [<ffffffff8101acd3>] ? native_sched_clock+0x13/0x80
>> [ 2460.602065]  [<ffffffff8101ad49>] ? sched_clock+0x9/0x10
>> [ 2460.665668]  [<ffffffff81096f05>] ? sched_clock_cpu+0xb5/0x100
>> [ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
>> [ 2460.800156]  [<ffffffff81071864>] ? exit_signals+0x24/0x130
>> [ 2460.866885]  [<ffffffff81158ca0>] ? balance_pgdat+0x5e0/0x5e0
>> [ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
>> [ 2460.996166]  [<ffffffff81071864>] ? exit_signals+0x24/0x130
>> [ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
>> [ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
>> [ 2461.186976]  [<ffffffff810841e0>] ? wake_up_bit+0x30/0x30
>> [ 2461.251629]  [<ffffffff81158ca0>] ? balance_pgdat+0x5e0/0x5e0
>> [ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
>> [ 2461.379870]  [<ffffffff815e12eb>] ? wait_for_completion+0x3b/0x110
>> [ 2461.453879]  [<ffffffff81082f60>] ? kthread_create_on_node+0x140/0x140
>> [ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0
>> [ 2461.596689]  [<ffffffff81082f60>] ? kthread_create_on_node+0x140/0x140
>>
>> And looking at the related code (kernel-3.10.y), it seems that cgroup_attach_task() (thread-2,
>> attach kswapd) triggers kswapd (to reclaim memory?) when it tries to allocate memory (flex_array_alloc())
>> while holding sig->group_rwsem. Meanwhile kswapd (thread-1) is in its exit path
>> (it was marked SHOULD STOP once offlining the pages completed) and needs to acquire
>> sig->group_rwsem in exit_signals(), so the deadlock occurs.
>>
>>        thread-1                           			 |            thread-2
>>                                                                  |
>> __offline_pages():                                               | system_call_fastpath()
>> |-> kswapd_stop(node);                                           | |-> ......
>>     |-> kthread_stop(kswapd)                                     | |-> cgroup_file_write()
>>         |-> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);       | |-> ......
>>         |-> wake_up_process(k)                                   | |-> attach_task_by_pid()
>>             |                                                    |     |-> threadgroup_lock(tsk)
>> |<----------|                                                    |        // Here, got the lock.
>> |-> kswapd()                                                     |    |-> ...
>>     |-> if (kthread_should_stop())                               |     |-> cgroup_attach_task()
>>             return;                                              |         |-> flex_array_alloc()
>>             |                                                    |             |-> kzalloc()
>> |<----------|                                                    |                |-> wait for kswapd to reclaim memory
>> |-> kthread()                                                    |
>>     |-> do_exit(ret)                                             |
>>         |-> exit_signals()                                       |
>>             |-> threadgroup_change_begin(tsk)                    |
>>                 |-> down_read(&tsk->signal->group_rwsem)         |
>>                     // Here, acquire the lock. 
>>
>> If my analysis is correct, the latest kernel may have the same issue: although the flex_array was replaced
>> by a list, we still need to allocate memory (e.g. in find_css_set()), so the race may still occur.
>> Any comments on this? If I missed something, please correct me. :)
> 
> Can you test the latest kernel release to verify this?  There's nothing
> we can do to an old kernel version that isn't already fixed in upstream
> first.

The latest kernel hits another lockdep warning during boot, so
I cannot verify this issue on it for now.

Thanks,
Gu 

> 
> greg k-h
> .
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [stable-3.10.y] possible unsafe locking warning
  2014-05-28 15:48 ` Tejun Heo
@ 2014-06-05  5:44   ` Gu Zheng
  2014-06-05 13:24     ` Johannes Weiner
  0 siblings, 1 reply; 8+ messages in thread
From: Gu Zheng @ 2014-06-05  5:44 UTC (permalink / raw)
  To: Tejun Heo, Johannes Weiner
  Cc: stable, Cgroups, linux-kernel, Yasuaki Ishimatsu, tangchen

Hi Tejun,
Sorry for the late reply.
On 05/28/2014 11:48 PM, Tejun Heo wrote:

> (cc'ing Johannes for mm-foo)
> 
> Hello,
> 
> On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
>> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
>> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
>> [ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
>> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
>> [ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
>> [ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
>> [ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
>> [ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
>> [ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
>> [ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
>> [ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
>> [ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
>> [ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
>> [ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
>> [ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
>> [ 2458.745214] irq event stamp: 49
>> [ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
>> [ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
>> [ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
>> [ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
>> [ 2459.195852] 
>> [ 2459.195852] other info that might help us debug this:
>> [ 2459.274024]  Possible unsafe locking scenario:
>> [ 2459.274024] 
>> [ 2459.344911]        CPU0
>> [ 2459.374161]        ----
>> [ 2459.403408]   lock(&sig->group_rwsem);
>> [ 2459.448490]   <Interrupt>
>> [ 2459.479825]     lock(&sig->group_rwsem);
>> [ 2459.526979] 
>> [ 2459.526979]  *** DEADLOCK ***
>> [ 2459.526979] 
>> [ 2459.597866] no locks held by kswapd2/1151.
>> [ 2459.646896] 
>> [ 2459.646896] stack backtrace:
>> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
>> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
>> [ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
>> [ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
>> [ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
>> [ 2460.163087] Call Trace:
>> [ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
>> [ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
>> [ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
>> [ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
>> [ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
>> [ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
>> [ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
>> [ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
>> [ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
>> [ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0
> 
> The lockdep warning is about kswapd, which memory reclaim depends on,
> grabbing threadgroup_lock while that lock may be held by tasks which
> may themselves be waiting on memory reclaim.  From the backtrace, it
> looks like the right thing to do is to mark kswapd as no longer a
> memory reclaimer before it starts exiting.
> 
>> And looking at the related code (kernel-3.10.y), it seems that cgroup_attach_task() (thread-2,
>> attach kswapd) triggers kswapd (to reclaim memory?) when it tries to allocate memory (flex_array_alloc())
>> while holding sig->group_rwsem. Meanwhile kswapd (thread-1) is in its exit path
>> (it was marked SHOULD STOP once offlining the pages completed) and needs to acquire
>> sig->group_rwsem in exit_signals(), so the deadlock occurs.
>>
>>        thread-1                           			 |            thread-2
>>                                                                  |
>> __offline_pages():                                               | system_call_fastpath()
>> |-> kswapd_stop(node);                                           | |-> ......
>>     |-> kthread_stop(kswapd)                                     | |-> cgroup_file_write()
>>         |-> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);       | |-> ......
>>         |-> wake_up_process(k)                                   | |-> attach_task_by_pid()
>>             |                                                    |     |-> threadgroup_lock(tsk)
>> |<----------|                                                    |        // Here, got the lock.
>> |-> kswapd()                                                     |    |-> ...
>>     |-> if (kthread_should_stop())                               |     |-> cgroup_attach_task()
>>             return;                                              |         |-> flex_array_alloc()
>>             |                                                    |             |-> kzalloc()
>> |<----------|                                                    |                |-> wait for kswapd to reclaim memory
>> |-> kthread()                                                    |
>>     |-> do_exit(ret)                                             |
>>         |-> exit_signals()                                       |
>>             |-> threadgroup_change_begin(tsk)                    |
>>                 |-> down_read(&tsk->signal->group_rwsem)         |
>>                     // Here, acquire the lock. 
>>
>> If my analysis is correct, the latest kernel may have the same issue: although the flex_array was replaced
>> by a list, we still need to allocate memory (e.g. in find_css_set()), so the race may still occur.
>> Any comments on this? If I missed something, please correct me. :)
> 
> Not sure whether this can actually happen, but if so the right fix
> would be making thread-2 not wait for a kswapd which is exiting and can
> no longer serve as a memory reclaimer.

Thanks for your suggestion; I'll try this approach if possible.

To Johannes:
Any comments on this issue and Tejun's suggestion?

Best regards,
Gu

> 
> Thanks.
> 



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [stable-3.10.y] possible unsafe locking warning
  2014-06-05  5:44   ` Gu Zheng
@ 2014-06-05 13:24     ` Johannes Weiner
  2014-06-12  7:18       ` Gu Zheng
  0 siblings, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2014-06-05 13:24 UTC (permalink / raw)
  To: Gu Zheng
  Cc: Tejun Heo, stable, Cgroups, linux-kernel, Yasuaki Ishimatsu,
	tangchen, Andrew Morton, linux-mm

Hi,

[cc'ing Andrew and linux-mm for patch review and inclusion]

On Thu, Jun 05, 2014 at 01:44:38PM +0800, Gu Zheng wrote:
> Hi Tejun,
> Sorry for the late reply.
> On 05/28/2014 11:48 PM, Tejun Heo wrote:
> 
> > (cc'ing Johannes for mm-foo)
> > 
> > Hello,
> > 
> > On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
> >> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> >> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >> [ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
> >> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
> >> [ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
> >> [ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
> >> [ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
> >> [ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
> >> [ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
> >> [ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
> >> [ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
> >> [ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
> >> [ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
> >> [ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
> >> [ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
> >> [ 2458.745214] irq event stamp: 49
> >> [ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
> >> [ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
> >> [ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
> >> [ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
> >> [ 2459.195852] 
> >> [ 2459.195852] other info that might help us debug this:
> >> [ 2459.274024]  Possible unsafe locking scenario:
> >> [ 2459.274024] 
> >> [ 2459.344911]        CPU0
> >> [ 2459.374161]        ----
> >> [ 2459.403408]   lock(&sig->group_rwsem);
> >> [ 2459.448490]   <Interrupt>
> >> [ 2459.479825]     lock(&sig->group_rwsem);
> >> [ 2459.526979] 
> >> [ 2459.526979]  *** DEADLOCK ***
> >> [ 2459.526979] 
> >> [ 2459.597866] no locks held by kswapd2/1151.
> >> [ 2459.646896] 
> >> [ 2459.646896] stack backtrace:
> >> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
> >> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
> >> [ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
> >> [ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
> >> [ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
> >> [ 2460.163087] Call Trace:
> >> [ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
> >> [ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
> >> [ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
> >> [ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
> >> [ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
> >> [ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
> >> [ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
> >> [ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
> >> [ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
> >> [ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0
> > 
> > The lockdep warning is about kswapd, which memory reclaim depends on,
> > grabbing threadgroup_lock while that lock may be held by tasks which
> > may themselves be waiting on memory reclaim.  From the backtrace, it
> > looks like the right thing to do is to mark kswapd as no longer a
> > memory reclaimer before it starts exiting.

Yeah, that makes sense.  In fact, we can reset *all* the
reclaim-specific per-task states the second it stops performing
reclaim work.
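
To make the lockdep reasoning concrete, here is a minimal userspace sketch in C. It is a toy model and not lockdep's actual implementation: every toy_* name is hypothetical, and the two usage bits only loosely stand in for the RECLAIM_FS-ON and IN-RECLAIM_FS states from the report. The idea it shows is that a lock class becomes suspect once it has been both held across a reclaim-capable allocation and acquired by a task that is still marked as a reclaimer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical usage bits, loosely modelled on the states in the warning. */
enum toy_usage {
	TOY_HELD_OVER_RECLAIM_ALLOC = 1 << 0,	/* cf. RECLAIM_FS-ON */
	TOY_TAKEN_IN_RECLAIM        = 1 << 1,	/* cf. IN-RECLAIM_FS */
};

struct toy_lock_class {
	unsigned int usage;	/* accumulated over every use of the lock */
};

struct toy_task {
	bool reclaimer;		/* still marked as a memory reclaimer? */
};

/* Someone allocated memory (reclaim allowed) while holding the lock. */
static void toy_alloc_while_holding(struct toy_lock_class *lock)
{
	assert(lock != NULL);
	lock->usage |= TOY_HELD_OVER_RECLAIM_ALLOC;
}

/* A task takes the lock; only a task marked as reclaimer taints it. */
static void toy_acquire(struct toy_lock_class *lock, const struct toy_task *tsk)
{
	assert(lock != NULL && tsk != NULL);
	if (tsk->reclaimer)
		lock->usage |= TOY_TAKEN_IN_RECLAIM;
}

/* Both bits on one lock class is the inconsistent state being reported. */
static bool toy_inconsistent(const struct toy_lock_class *lock)
{
	const unsigned int both =
		TOY_HELD_OVER_RECLAIM_ALLOC | TOY_TAKEN_IN_RECLAIM;

	assert(lock != NULL);
	return (lock->usage & both) == both;
}
```

In this model, clearing the reclaimer mark before the exiting task calls toy_acquire() keeps the second bit from ever being set, which is the effect the discussion above is after.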

> >> And looking at the related code (kernel-3.10.y), it seems that cgroup_attach_task() (thread-2,
> >> attach kswapd) triggers kswapd (to reclaim memory?) when it tries to allocate memory (flex_array_alloc())
> >> while holding sig->group_rwsem. Meanwhile kswapd (thread-1) is in its exit path
> >> (it was marked SHOULD STOP once offlining the pages completed) and needs to acquire
> >> sig->group_rwsem in exit_signals(), so the deadlock occurs.
> >>
> >>        thread-1                           			 |            thread-2
> >>                                                                  |
> >> __offline_pages():                                               | system_call_fastpath()
> >> |-> kswapd_stop(node);                                           | |-> ......
> >>     |-> kthread_stop(kswapd)                                     | |-> cgroup_file_write()
> >>         |-> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);       | |-> ......
> >>         |-> wake_up_process(k)                                   | |-> attach_task_by_pid()
> >>             |                                                    |     |-> threadgroup_lock(tsk)
> >> |<----------|                                                    |        // Here, got the lock.
> >> |-> kswapd()                                                     |    |-> ...
> >>     |-> if (kthread_should_stop())                               |     |-> cgroup_attach_task()
> >>             return;                                              |         |-> flex_array_alloc()
> >>             |                                                    |             |-> kzalloc()
> >> |<----------|                                                    |                |-> wait for kswapd to reclaim memory
> >> |-> kthread()                                                    |
> >>     |-> do_exit(ret)                                             |
> >>         |-> exit_signals()                                       |
> >>             |-> threadgroup_change_begin(tsk)                    |
> >>                 |-> down_read(&tsk->signal->group_rwsem)         |
> >>                     // Here, acquire the lock. 
> >>
> >> If my analysis is correct, the latest kernel may have the same issue: although the flex_array was replaced
> >> by a list, we still need to allocate memory (e.g. in find_css_set()), so the race may still occur.
> >> Any comments on this? If I missed something, please correct me. :)
> > 
> > Not sure whether this can actually happen, but if so the right fix
> > would be making thread-2 not wait for a kswapd which is exiting and can
> > no longer serve as a memory reclaimer.

There is never a direct wait for a specific kswapd thread in the
waitqueue sense.  The allocator wakes up the kswapds for all nodes
allowed in the allocation, then retries the allocation a few times in
the hope that kswapd does something before entering reclaim itself.
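
As a rough illustration of that sequence, the following userspace C sketch models it with plain data structures. It is a simplified, single-threaded toy under stated assumptions, not the kernel's page allocator: the toy_* names, the retry count, and the way direct reclaim "frees" pages are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RETRIES 3	/* hypothetical retry count */

struct toy_node {
	bool kswapd_running;	/* false once the node's kswapd has exited */
	bool kswapd_woken;
	int free_pages;
};

enum toy_result { TOY_FROM_FREELIST, TOY_FROM_DIRECT_RECLAIM, TOY_FAILED };

/* Take one page from the first allowed node that has any. */
static bool toy_try_alloc(struct toy_node *nodes, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (nodes[i].free_pages > 0) {
			nodes[i].free_pages--;
			return true;
		}
	}
	return false;
}

static enum toy_result toy_alloc_slowpath(struct toy_node *nodes, size_t n,
					  int reclaimable)
{
	assert(nodes != NULL);

	/* Kick every allowed node's kswapd; no specific thread is waited on. */
	for (size_t i = 0; i < n; i++)
		if (nodes[i].kswapd_running)
			nodes[i].kswapd_woken = true;

	/* Retry a few times in case background reclaim freed something. */
	for (int attempt = 0; attempt < TOY_RETRIES; attempt++)
		if (toy_try_alloc(nodes, n))
			return TOY_FROM_FREELIST;

	/* Otherwise reclaim in the allocating task's own context. */
	if (n > 0 && reclaimable > 0) {
		nodes[0].free_pages += reclaimable;
		if (toy_try_alloc(nodes, n))
			return TOY_FROM_DIRECT_RECLAIM;
	}
	return TOY_FAILED;
}
```

The point of the sketch is that a node whose kswapd has already exited does not stall the caller: the wakeup is simply skipped and the allocation falls through to reclaiming on its own.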

How far back do we need this in stable?

---

From c3d76e3c208bc90b64b804ffefa114b920cab47e Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Thu, 5 Jun 2014 08:37:01 -0400
Subject: [patch] mm: vmscan: clear kswapd's special reclaim powers before
 exiting

When kswapd exits, it can end up taking locks that were previously
held by allocating tasks while they waited for reclaim.  Lockdep
currently warns about this:

On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
> [ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
> [ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
> [ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
> [ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
> [ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
> [ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
> [ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
> [ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
> [ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
> [ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
> [ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
> [ 2458.745214] irq event stamp: 49
> [ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
> [ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
> [ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
> [ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
> [ 2459.195852]
> [ 2459.195852] other info that might help us debug this:
> [ 2459.274024]  Possible unsafe locking scenario:
> [ 2459.274024]
> [ 2459.344911]        CPU0
> [ 2459.374161]        ----
> [ 2459.403408]   lock(&sig->group_rwsem);
> [ 2459.448490]   <Interrupt>
> [ 2459.479825]     lock(&sig->group_rwsem);
> [ 2459.526979]
> [ 2459.526979]  *** DEADLOCK ***
> [ 2459.526979]
> [ 2459.597866] no locks held by kswapd2/1151.
> [ 2459.646896]
> [ 2459.646896] stack backtrace:
> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
> [ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
> [ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
> [ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
> [ 2460.163087] Call Trace:
> [ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
> [ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
> [ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
> [ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
> [ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
> [ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
> [ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
> [ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
> [ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
> [ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0

This is because the kswapd thread is still marked as a reclaimer at
the time of exit.  But because it is exiting, nobody is actually
waiting on it to make reclaim progress anymore, and it's nothing but a
regular thread at this point.  Be tidy and strip it of all its powers
(PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
before returning from the thread function.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/vmscan.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 9a63d13739a6..4ac2eab860d2 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3425,7 +3425,10 @@ static int kswapd(void *p)
 		}
 	}
 
+	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
 	current->reclaim_state = NULL;
+	lockdep_clear_current_reclaim_state();
+
 	return 0;
 }
 
-- 
2.0.0


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [stable-3.10.y] possible unsafe locking warning
  2014-06-05 13:24     ` Johannes Weiner
@ 2014-06-12  7:18       ` Gu Zheng
  0 siblings, 0 replies; 8+ messages in thread
From: Gu Zheng @ 2014-06-12  7:18 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Tejun Heo, stable, Cgroups, linux-kernel, Yasuaki Ishimatsu,
	tangchen, Andrew Morton, linux-mm

Hi Johannes,
Sorry for the late reply.
The patch works well with stable kernel-3.10.y; the warning is gone.
Thank you and Tejun very much. :)

Regards,
Gu
On 06/05/2014 09:24 PM, Johannes Weiner wrote:

> Hi,
> 
> [cc'ing Andrew and linux-mm for patch review and inclusion]
> 
> On Thu, Jun 05, 2014 at 01:44:38PM +0800, Gu Zheng wrote:
>> Hi Tejun,
>> Sorry for the late reply.
>> On 05/28/2014 11:48 PM, Tejun Heo wrote:
>>
>>> (cc'ing Johannes for mm-foo)
>>>
>>> Hello,
>>>
>>> On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
>>>> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
>>>> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
>>>> [ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
>>>> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
>>>> [ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
>>>> [ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
>>>> [ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
>>>> [ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
>>>> [ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
>>>> [ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
>>>> [ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
>>>> [ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
>>>> [ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
>>>> [ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
>>>> [ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
>>>> [ 2458.745214] irq event stamp: 49
>>>> [ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
>>>> [ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
>>>> [ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
>>>> [ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
>>>> [ 2459.195852] 
>>>> [ 2459.195852] other info that might help us debug this:
>>>> [ 2459.274024]  Possible unsafe locking scenario:
>>>> [ 2459.274024] 
>>>> [ 2459.344911]        CPU0
>>>> [ 2459.374161]        ----
>>>> [ 2459.403408]   lock(&sig->group_rwsem);
>>>> [ 2459.448490]   <Interrupt>
>>>> [ 2459.479825]     lock(&sig->group_rwsem);
>>>> [ 2459.526979] 
>>>> [ 2459.526979]  *** DEADLOCK ***
>>>> [ 2459.526979] 
>>>> [ 2459.597866] no locks held by kswapd2/1151.
>>>> [ 2459.646896] 
>>>> [ 2459.646896] stack backtrace:
>>>> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
>>>> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
>>>> [ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
>>>> [ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
>>>> [ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
>>>> [ 2460.163087] Call Trace:
>>>> [ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
>>>> [ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
>>>> [ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
>>>> [ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
>>>> [ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
>>>> [ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
>>>> [ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
>>>> [ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
>>>> [ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
>>>> [ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0
>>>
>>> The lockdep warning is about kswapd, which memory reclaim depends on,
>>> grabbing threadgroup_lock while that lock may be held by tasks which
>>> may themselves be waiting on memory reclaim.  From the backtrace, it
>>> looks like the right thing to do is to mark kswapd as no longer a
>>> memory reclaimer before it starts exiting.
> 
> Yeah, that makes sense.  In fact, we can reset *all* the
> reclaim-specific per-task states the second it stops performing
> reclaim work.
> 
>>>> And looking at the related code (kernel-3.10.y), it seems that cgroup_attach_task() (thread-2,
>>>> attach kswapd) triggers kswapd (to reclaim memory?) when it tries to allocate memory (flex_array_alloc())
>>>> while holding sig->group_rwsem. Meanwhile kswapd (thread-1) is in its exit path
>>>> (it was marked SHOULD STOP once offlining the pages completed) and needs to acquire
>>>> sig->group_rwsem in exit_signals(), so the deadlock occurs.
>>>>
>>>>        thread-1                           			 |            thread-2
>>>>                                                                  |
>>>> __offline_pages():                                               | system_call_fastpath()
>>>> |-> kswapd_stop(node);                                           | |-> ......
>>>>     |-> kthread_stop(kswapd)                                     | |-> cgroup_file_write()
>>>>         |-> set_bit(KTHREAD_SHOULD_STOP, &kthread->flags);       | |-> ......
>>>>         |-> wake_up_process(k)                                   | |-> attach_task_by_pid()
>>>>             |                                                    |     |-> threadgroup_lock(tsk)
>>>> |<----------|                                                    |        // Here, got the lock.
>>>> |-> kswapd()                                                     |    |-> ...
>>>>     |-> if (kthread_should_stop())                               |     |-> cgroup_attach_task()
>>>>             return;                                              |         |-> flex_array_alloc()
>>>>             |                                                    |             |-> kzalloc()
>>>> |<----------|                                                    |                |-> wait for kswapd to reclaim memory
>>>> |-> kthread()                                                    |
>>>>     |-> do_exit(ret)                                             |
>>>>         |-> exit_signals()                                       |
>>>>             |-> threadgroup_change_begin(tsk)                    |
>>>>                 |-> down_read(&tsk->signal->group_rwsem)         |
>>>>                     // Here, acquire the lock. 
>>>>
>>>> If my analysis is correct, the latest kernel may have the same issue: although the flex_array was replaced
>>>> by a list, we still need to allocate memory (e.g. in find_css_set()), so the race may still occur.
>>>> Any comments on this? If I missed something, please correct me. :)
>>>
>>> Not sure whether this can actually happen, but if so the right fix
>>> would be making thread-2 not wait for a kswapd which is exiting and can
>>> no longer serve as a memory reclaimer.
> 
> There is never a direct wait for a specific kswapd thread in the
> waitqueue sense.  The allocator wakes up the kswapds for all nodes
> allowed in the allocation, then retries the allocation a few times in
> the hope that kswapd does something before entering reclaim itself.
> 
> How far back do we need this in stable?
> 
> ---
> 
> From c3d76e3c208bc90b64b804ffefa114b920cab47e Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@cmpxchg.org>
> Date: Thu, 5 Jun 2014 08:37:01 -0400
> Subject: [patch] mm: vmscan: clear kswapd's special reclaim powers before
>  exiting
> 
> When kswapd exits, it can end up taking locks that were previously
> held by allocating tasks while they waited for reclaim.  Lockdep
> currently warns about this:
> 
> On Wed, May 28, 2014 at 06:06:34PM +0800, Gu Zheng wrote:
>> [ 2457.683370] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-R} usage.
>> [ 2457.761540] kswapd2/1151 [HC0[0]:SC0[0]:HE1:SE1] takes:
>> [ 2457.824102]  (&sig->group_rwsem){+++++?}, at: [<ffffffff81071864>] exit_signals+0x24/0x130
>> [ 2457.923538] {RECLAIM_FS-ON-W} state was registered at:
>> [ 2457.985055]   [<ffffffff810bfc99>] mark_held_locks+0xb9/0x140
>> [ 2458.053976]   [<ffffffff810c1e3a>] lockdep_trace_alloc+0x7a/0xe0
>> [ 2458.126015]   [<ffffffff81194f47>] kmem_cache_alloc_trace+0x37/0x240
>> [ 2458.202214]   [<ffffffff812c6e89>] flex_array_alloc+0x99/0x1a0
>> [ 2458.272175]   [<ffffffff810da563>] cgroup_attach_task+0x63/0x430
>> [ 2458.344214]   [<ffffffff810dcca0>] attach_task_by_pid+0x210/0x280
>> [ 2458.417294]   [<ffffffff810dcd26>] cgroup_procs_write+0x16/0x20
>> [ 2458.488287]   [<ffffffff810d8410>] cgroup_file_write+0x120/0x2c0
>> [ 2458.560320]   [<ffffffff811b21a0>] vfs_write+0xc0/0x1f0
>> [ 2458.622994]   [<ffffffff811b2bac>] SyS_write+0x4c/0xa0
>> [ 2458.684618]   [<ffffffff815ec3c0>] tracesys+0xdd/0xe2
>> [ 2458.745214] irq event stamp: 49
>> [ 2458.782794] hardirqs last  enabled at (49): [<ffffffff815e2b56>] _raw_spin_unlock_irqrestore+0x36/0x70
>> [ 2458.894388] hardirqs last disabled at (48): [<ffffffff815e337b>] _raw_spin_lock_irqsave+0x2b/0xa0
>> [ 2459.000771] softirqs last  enabled at (0): [<ffffffff81059247>] copy_process.part.24+0x627/0x15f0
>> [ 2459.107161] softirqs last disabled at (0): [<          (null)>]           (null)
>> [ 2459.195852]
>> [ 2459.195852] other info that might help us debug this:
>> [ 2459.274024]  Possible unsafe locking scenario:
>> [ 2459.274024]
>> [ 2459.344911]        CPU0
>> [ 2459.374161]        ----
>> [ 2459.403408]   lock(&sig->group_rwsem);
>> [ 2459.448490]   <Interrupt>
>> [ 2459.479825]     lock(&sig->group_rwsem);
>> [ 2459.526979]
>> [ 2459.526979]  *** DEADLOCK ***
>> [ 2459.526979]
>> [ 2459.597866] no locks held by kswapd2/1151.
>> [ 2459.646896]
>> [ 2459.646896] stack backtrace:
>> [ 2459.699049] CPU: 30 PID: 1151 Comm: kswapd2 Not tainted 3.10.39+ #4
>> [ 2459.774098] Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 01.48 05/07/2014
>> [ 2459.895983]  ffffffff82284bf0 ffff88085856bbf8 ffffffff815dbcf6 ffff88085856bc48
>> [ 2459.985003]  ffffffff815d67c6 0000000000000000 ffff880800000001 ffff880800000001
>> [ 2460.074024]  000000000000000a ffff88085edc9600 ffffffff810be0e0 0000000000000009
>> [ 2460.163087] Call Trace:
>> [ 2460.192345]  [<ffffffff815dbcf6>] dump_stack+0x19/0x1b
>> [ 2460.253874]  [<ffffffff815d67c6>] print_usage_bug+0x1f7/0x208
>> [ 2460.399807]  [<ffffffff810bfb5d>] mark_lock+0x21d/0x2a0
>> [ 2460.462369]  [<ffffffff810c076a>] __lock_acquire+0x52a/0xb60
>> [ 2460.735516]  [<ffffffff810c1592>] lock_acquire+0xa2/0x140
>> [ 2460.935691]  [<ffffffff815e01e1>] down_read+0x51/0xa0
>> [ 2461.062888]  [<ffffffff81071864>] exit_signals+0x24/0x130
>> [ 2461.127536]  [<ffffffff81060d55>] do_exit+0xb5/0xa50
>> [ 2461.320433]  [<ffffffff8108303b>] kthread+0xdb/0x100
>> [ 2461.532049]  [<ffffffff815ec0ec>] ret_from_fork+0x7c/0xb0
> 
> This is because the kswapd thread is still marked as a reclaimer at
> the time of exit.  But because it is exiting, nobody is actually
> waiting on it to make reclaim progress anymore, and it's nothing but a
> regular thread at this point.  Be tidy and strip it of all its powers
> (PF_MEMALLOC, PF_SWAPWRITE, PF_KSWAPD, and the lockdep reclaim state)
> before returning from the thread function.
> 
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/vmscan.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 9a63d13739a6..4ac2eab860d2 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3425,7 +3425,10 @@ static int kswapd(void *p)
>  		}
>  	}
>  
> +	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
>  	current->reclaim_state = NULL;
> +	lockdep_clear_current_reclaim_state();
> +
>  	return 0;
>  }
>  
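
For readers following the patch above, here is a minimal user-space sketch
of the bookkeeping it performs. It is not kernel code: struct mock_task,
mock_kswapd_enter(), mock_kswapd_exit() and the lockdep_reclaim_gfp field
are hypothetical stand-ins, and the PF_* values are illustrative bits
rather than the real constants from include/linux/sched.h. It only shows
that the three reclaim flags, the reclaim_state pointer and the (modelled)
lockdep reclaim state are dropped together while unrelated flags survive.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative bit values only (assumption); the kernel defines the real
 * PF_* constants in include/linux/sched.h. */
#define PF_MEMALLOC  0x1u
#define PF_SWAPWRITE 0x2u
#define PF_KSWAPD    0x4u
#define PF_KTHREAD   0x8u	/* unrelated flag that must survive */

struct reclaim_state {
	unsigned long reclaimed_slab;
};

/* Hypothetical stand-in for the task_struct fields the patch touches. */
struct mock_task {
	unsigned int flags;
	struct reclaim_state *reclaim_state;
	int lockdep_reclaim_gfp;	/* models the lockdep reclaim state */
};

/* Models what kswapd() sets up when it starts acting as a reclaimer. */
static void mock_kswapd_enter(struct mock_task *tsk, struct reclaim_state *rs)
{
	tsk->reclaim_state = rs;
	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
	tsk->lockdep_reclaim_gfp = 1;
}

/* Models the exit path after the patch: strip every reclaim power so the
 * thread looks like a regular task before it reaches do_exit(). */
static void mock_kswapd_exit(struct mock_task *tsk)
{
	tsk->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD);
	tsk->reclaim_state = NULL;
	tsk->lockdep_reclaim_gfp = 0;
}
```

Calling mock_kswapd_enter() and then mock_kswapd_exit() on a task that
started with only PF_KTHREAD set should leave exactly PF_KTHREAD in flags,
which mirrors the intent stated in the changelog.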



end of thread, other threads:[~2014-06-12  7:18 UTC | newest]

Thread overview: 8+ messages
2014-05-28 10:06 [stable-3.10.y] possible unsafe locking warning Gu Zheng
2014-05-28 14:26 ` Greg KH
2014-05-29  2:53   ` Gu Zheng
2014-05-29  2:53   ` Gu Zheng
2014-05-28 15:48 ` Tejun Heo
2014-06-05  5:44   ` Gu Zheng
2014-06-05 13:24     ` Johannes Weiner
2014-06-12  7:18       ` Gu Zheng
