* [linux-next] khugepaged inconsistent lock state
@ 2015-09-21 4:46 Sergey Senozhatsky
2015-09-21 15:01 ` Kirill A. Shutemov
0 siblings, 1 reply; 6+ messages in thread
From: Sergey Senozhatsky @ 2015-09-21 4:46 UTC (permalink / raw)
To: Andrew Morton; +Cc: Michal Hocko, linux-mm, linux-kernel, Sergey Senozhatsky
Hi,
4.3.0-rc1-next-20150918
[18344.236625] =================================
[18344.236628] [ INFO: inconsistent lock state ]
[18344.236633] 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361 Not tainted
[18344.236636] ---------------------------------
[18344.236640] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
[18344.236645] khugepaged/32 [HC0[0]:SC0[0]:HE1:SE1] takes:
[18344.236648] (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987
[18344.236662] {IN-RECLAIM_FS-W} state was registered at:
[18344.236666] [<ffffffff8107d747>] __lock_acquire+0x8e2/0x1183
[18344.236673] [<ffffffff8107e7ac>] lock_acquire+0x10b/0x1a6
[18344.236678] [<ffffffff8150a367>] down_write+0x3b/0x6a
[18344.236686] [<ffffffff811360d8>] split_huge_page_to_list+0x5b/0x61f
[18344.236689] [<ffffffff811224b3>] add_to_swap+0x37/0x78
[18344.236691] [<ffffffff810fd650>] shrink_page_list+0x4c2/0xb9a
[18344.236694] [<ffffffff810fe47c>] shrink_inactive_list+0x371/0x5d9
[18344.236696] [<ffffffff810fee2f>] shrink_lruvec+0x410/0x5ae
[18344.236698] [<ffffffff810ff024>] shrink_zone+0x57/0x140
[18344.236700] [<ffffffff810ffc79>] kswapd+0x6a5/0x91b
[18344.236702] [<ffffffff81059588>] kthread+0x107/0x10f
[18344.236706] [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70
[18344.236708] irq event stamp: 6517947
[18344.236709] hardirqs last enabled at (6517947): [<ffffffff810f2d0c>] get_page_from_freelist+0x362/0x59e
[18344.236713] hardirqs last disabled at (6517946): [<ffffffff8150ba41>] _raw_spin_lock_irqsave+0x18/0x51
[18344.236715] softirqs last enabled at (6507072): [<ffffffff81041cb0>] __do_softirq+0x2df/0x3f5
[18344.236719] softirqs last disabled at (6507055): [<ffffffff81041fb5>] irq_exit+0x40/0x94
[18344.236722]
other info that might help us debug this:
[18344.236723] Possible unsafe locking scenario:
[18344.236724] CPU0
[18344.236725] ----
[18344.236726] lock(&anon_vma->rwsem);
[18344.236728] <Interrupt>
[18344.236729] lock(&anon_vma->rwsem);
[18344.236731]
*** DEADLOCK ***
[18344.236733] 2 locks held by khugepaged/32:
[18344.236733] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff81134122>] khugepaged+0x5cf/0x1987
[18344.236738] #1: (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987
[18344.236741]
stack backtrace:
[18344.236744] CPU: 3 PID: 32 Comm: khugepaged Not tainted 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361
[18344.236747] 0000000000000000 ffff880132827a00 ffffffff81230867 ffffffff8237ba90
[18344.236750] ffff880132827a38 ffffffff810ea9b9 000000000000000a ffff8801333b52e0
[18344.236753] ffff8801333b4c00 ffffffff8107b3ce 000000000000000a ffff880132827a78
[18344.236755] Call Trace:
[18344.236758] [<ffffffff81230867>] dump_stack+0x4e/0x79
[18344.236761] [<ffffffff810ea9b9>] print_usage_bug.part.24+0x259/0x268
[18344.236763] [<ffffffff8107b3ce>] ? print_shortest_lock_dependencies+0x180/0x180
[18344.236765] [<ffffffff8107c7fc>] mark_lock+0x381/0x567
[18344.236766] [<ffffffff8107ca40>] mark_held_locks+0x5e/0x74
[18344.236768] [<ffffffff8107ee9f>] lockdep_trace_alloc+0xb0/0xb3
[18344.236771] [<ffffffff810f30cc>] __alloc_pages_nodemask+0x99/0x856
[18344.236772] [<ffffffff810ebaf9>] ? find_get_entry+0x14b/0x17a
[18344.236774] [<ffffffff810ebb16>] ? find_get_entry+0x168/0x17a
[18344.236777] [<ffffffff811226d9>] __read_swap_cache_async+0x7b/0x1aa
[18344.236778] [<ffffffff8112281d>] read_swap_cache_async+0x15/0x2d
[18344.236780] [<ffffffff8112294f>] swapin_readahead+0x11a/0x16a
[18344.236783] [<ffffffff81112791>] do_swap_page+0xa7/0x36b
[18344.236784] [<ffffffff81112791>] ? do_swap_page+0xa7/0x36b
[18344.236787] [<ffffffff8113444c>] khugepaged+0x8f9/0x1987
[18344.236790] [<ffffffff810772f3>] ? wait_woken+0x88/0x88
[18344.236792] [<ffffffff81133b53>] ? maybe_pmd_mkwrite+0x1a/0x1a
[18344.236794] [<ffffffff81059588>] kthread+0x107/0x10f
[18344.236797] [<ffffffff81059481>] ? kthread_create_on_node+0x1ea/0x1ea
[18344.236799] [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70
[18344.236801] [<ffffffff81059481>] ? kthread_create_on_node+0x1ea/0x1ea
-ss
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
^ permalink raw reply [flat|nested] 6+ messages in thread* Re: [linux-next] khugepaged inconsistent lock state 2015-09-21 4:46 [linux-next] khugepaged inconsistent lock state Sergey Senozhatsky @ 2015-09-21 15:01 ` Kirill A. Shutemov 2015-09-21 23:57 ` Hugh Dickins 0 siblings, 1 reply; 6+ messages in thread From: Kirill A. Shutemov @ 2015-09-21 15:01 UTC (permalink / raw) To: Sergey Senozhatsky, Michal Hocko, Ebru Akagunduz, Hugh Dickins Cc: Andrew Morton, linux-mm, linux-kernel, Sergey Senozhatsky On Mon, Sep 21, 2015 at 01:46:00PM +0900, Sergey Senozhatsky wrote: > Hi, > > 4.3.0-rc1-next-20150918 > > [18344.236625] ================================= > [18344.236628] [ INFO: inconsistent lock state ] > [18344.236633] 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361 Not tainted > [18344.236636] --------------------------------- > [18344.236640] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. > [18344.236645] khugepaged/32 [HC0[0]:SC0[0]:HE1:SE1] takes: > [18344.236648] (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987 > [18344.236662] {IN-RECLAIM_FS-W} state was registered at: > [18344.236666] [<ffffffff8107d747>] __lock_acquire+0x8e2/0x1183 > [18344.236673] [<ffffffff8107e7ac>] lock_acquire+0x10b/0x1a6 > [18344.236678] [<ffffffff8150a367>] down_write+0x3b/0x6a > [18344.236686] [<ffffffff811360d8>] split_huge_page_to_list+0x5b/0x61f > [18344.236689] [<ffffffff811224b3>] add_to_swap+0x37/0x78 > [18344.236691] [<ffffffff810fd650>] shrink_page_list+0x4c2/0xb9a > [18344.236694] [<ffffffff810fe47c>] shrink_inactive_list+0x371/0x5d9 > [18344.236696] [<ffffffff810fee2f>] shrink_lruvec+0x410/0x5ae > [18344.236698] [<ffffffff810ff024>] shrink_zone+0x57/0x140 > [18344.236700] [<ffffffff810ffc79>] kswapd+0x6a5/0x91b > [18344.236702] [<ffffffff81059588>] kthread+0x107/0x10f > [18344.236706] [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70 > [18344.236708] irq event stamp: 6517947 > [18344.236709] hardirqs last enabled at (6517947): 
[<ffffffff810f2d0c>] get_page_from_freelist+0x362/0x59e > [18344.236713] hardirqs last disabled at (6517946): [<ffffffff8150ba41>] _raw_spin_lock_irqsave+0x18/0x51 > [18344.236715] softirqs last enabled at (6507072): [<ffffffff81041cb0>] __do_softirq+0x2df/0x3f5 > [18344.236719] softirqs last disabled at (6507055): [<ffffffff81041fb5>] irq_exit+0x40/0x94 > [18344.236722] > other info that might help us debug this: > [18344.236723] Possible unsafe locking scenario: > > [18344.236724] CPU0 > [18344.236725] ---- > [18344.236726] lock(&anon_vma->rwsem); > [18344.236728] <Interrupt> > [18344.236729] lock(&anon_vma->rwsem); > [18344.236731] > *** DEADLOCK *** > > [18344.236733] 2 locks held by khugepaged/32: > [18344.236733] #0: (&mm->mmap_sem){++++++}, at: [<ffffffff81134122>] khugepaged+0x5cf/0x1987 > [18344.236738] #1: (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987 > [18344.236741] > stack backtrace: > [18344.236744] CPU: 3 PID: 32 Comm: khugepaged Not tainted 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361 > [18344.236747] 0000000000000000 ffff880132827a00 ffffffff81230867 ffffffff8237ba90 > [18344.236750] ffff880132827a38 ffffffff810ea9b9 000000000000000a ffff8801333b52e0 > [18344.236753] ffff8801333b4c00 ffffffff8107b3ce 000000000000000a ffff880132827a78 > [18344.236755] Call Trace: > [18344.236758] [<ffffffff81230867>] dump_stack+0x4e/0x79 > [18344.236761] [<ffffffff810ea9b9>] print_usage_bug.part.24+0x259/0x268 > [18344.236763] [<ffffffff8107b3ce>] ? print_shortest_lock_dependencies+0x180/0x180 > [18344.236765] [<ffffffff8107c7fc>] mark_lock+0x381/0x567 > [18344.236766] [<ffffffff8107ca40>] mark_held_locks+0x5e/0x74 > [18344.236768] [<ffffffff8107ee9f>] lockdep_trace_alloc+0xb0/0xb3 > [18344.236771] [<ffffffff810f30cc>] __alloc_pages_nodemask+0x99/0x856 > [18344.236772] [<ffffffff810ebaf9>] ? find_get_entry+0x14b/0x17a > [18344.236774] [<ffffffff810ebb16>] ? 
find_get_entry+0x168/0x17a > [18344.236777] [<ffffffff811226d9>] __read_swap_cache_async+0x7b/0x1aa > [18344.236778] [<ffffffff8112281d>] read_swap_cache_async+0x15/0x2d > [18344.236780] [<ffffffff8112294f>] swapin_readahead+0x11a/0x16a > [18344.236783] [<ffffffff81112791>] do_swap_page+0xa7/0x36b > [18344.236784] [<ffffffff81112791>] ? do_swap_page+0xa7/0x36b > [18344.236787] [<ffffffff8113444c>] khugepaged+0x8f9/0x1987 > [18344.236790] [<ffffffff810772f3>] ? wait_woken+0x88/0x88 > [18344.236792] [<ffffffff81133b53>] ? maybe_pmd_mkwrite+0x1a/0x1a > [18344.236794] [<ffffffff81059588>] kthread+0x107/0x10f > [18344.236797] [<ffffffff81059481>] ? kthread_create_on_node+0x1ea/0x1ea > [18344.236799] [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70 > [18344.236801] [<ffffffff81059481>] ? kthread_create_on_node+0x1ea/0x1ea Hm. If I read this correctly, we see following scenario: - khugepaged tries to swap in a page under mmap_sem and anon_vma lock; - do_swap_page() calls swapin_readahead() with GFP_HIGHUSER_MOVABLE; - __read_swap_cache_async() tries to allocate the page for swap in; - lockdep_trace_alloc() in __alloc_pages_nodemask() notices that with given gfp_mask we could end up in direct relaim. - Lockdep already knows that reclaim sometimes (e.g. in case of split_huge_page()) wants to take anon_vma lock on its own. Therefore deadlock is possible. I see two ways to fix this: - take anon_vma lock *after* __collapse_huge_page_swapin() in collapse_huge_page(): I don't really see why we need the lock during swapin; - respect FAULT_FLAG_RETRY_NOWAIT in do_swap_page(): add GFP_NOWAIT to gfp_mask for swapin_readahead() in this case. I guess it could be beneficial to do both. Any comments? -- Kirill A. Shutemov -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 6+ messages in thread

* Re: [linux-next] khugepaged inconsistent lock state
  2015-09-21 15:01 ` Kirill A. Shutemov
@ 2015-09-21 23:57   ` Hugh Dickins
  2015-09-23 13:22     ` Kirill A. Shutemov
  0 siblings, 1 reply; 6+ messages in thread
From: Hugh Dickins @ 2015-09-21 23:57 UTC (permalink / raw)
To: Kirill A. Shutemov
Cc: Sergey Senozhatsky, Michal Hocko, Ebru Akagunduz, Hugh Dickins,
    Andrew Morton, linux-mm, linux-kernel, Sergey Senozhatsky

On Mon, 21 Sep 2015, Kirill A. Shutemov wrote:
> On Mon, Sep 21, 2015 at 01:46:00PM +0900, Sergey Senozhatsky wrote:
> > [ full lockdep report trimmed; see the original report above ]
>
> Hm. If I read this correctly, we see the following scenario:
>
>  - khugepaged tries to swap in a page under mmap_sem and the anon_vma lock;
>  - do_swap_page() calls swapin_readahead() with GFP_HIGHUSER_MOVABLE;
>  - __read_swap_cache_async() tries to allocate the page for swap-in;
>  - lockdep_trace_alloc() in __alloc_pages_nodemask() notices that with
>    the given gfp_mask we could end up in direct reclaim;
>  - lockdep already knows that reclaim sometimes (e.g. in the case of
>    split_huge_page()) wants to take the anon_vma lock on its own.
>
> Therefore a deadlock is possible.

Oh, thank you for working that out.  As usual with a lockdep trace,
I knew it was telling me something important, but in a language I
just couldn't understand without spending much longer to decode it.
Yes, it is wrong to call do_swap_page() while holding the anon_vma lock.

> I see two ways to fix this:
>
>  - take the anon_vma lock *after* __collapse_huge_page_swapin() in
>    collapse_huge_page(): I don't really see why we need the lock
>    during swap-in;

Agreed.

>  - respect FAULT_FLAG_RETRY_NOWAIT in do_swap_page(): add GFP_NOWAIT to
>    the gfp_mask for swapin_readahead() in this case.

Sounds like a good idea; though I have some reservations you're welcome
to ignore.  Partly because it goes beyond what's actually needed here,
partly because there's going to be plenty of waiting while the swap-in
is done, and partly because I think such a change may better belong to a
larger effort, extending FAULT_FLAG_RETRY somehow to cover the memory
allocation as well as the I/O phase (but we might be hoping instead for
a deeper attack on mmap_sem which would make FAULT_FLAG_RETRY redundant).

And the down_write of mmap_sem here, across all of those (63? 511?)
swap-ins, worries me.  Should the swap-ins be done higher up, under
just a down_read of mmap_sem (required to guard the vma)?  Or should
mmap_sem be dropped and retaken repeatedly, and the various things
(including the vma itself) be checked repeatedly?  I don't know.

Hugh

> I guess it could be beneficial to do both.
>
> Any comments?
>
> --
> Kirill A. Shutemov

* Re: [linux-next] khugepaged inconsistent lock state
  2015-09-21 23:57 ` Hugh Dickins
@ 2015-09-23 13:22   ` Kirill A. Shutemov
  2015-09-23 20:01     ` Ebru Akagündüz
  2015-09-24  4:22     ` Sergey Senozhatsky
  0 siblings, 2 replies; 6+ messages in thread
From: Kirill A. Shutemov @ 2015-09-23 13:22 UTC (permalink / raw)
To: Hugh Dickins, Ebru Akagunduz
Cc: Sergey Senozhatsky, Michal Hocko, Andrew Morton, linux-mm,
    linux-kernel, Sergey Senozhatsky

On Mon, Sep 21, 2015 at 04:57:05PM -0700, Hugh Dickins wrote:
> On Mon, 21 Sep 2015, Kirill A. Shutemov wrote:
> > [ analysis of the lockdep report trimmed; see above ]
> >
> > I see two ways to fix this:
> >
> >  - take the anon_vma lock *after* __collapse_huge_page_swapin() in
> >    collapse_huge_page(): I don't really see why we need the lock
> >    during swap-in;
>
> Agreed.

Okay. Patch for this is below.

Ebru, could you test it?

> >  - respect FAULT_FLAG_RETRY_NOWAIT in do_swap_page(): add GFP_NOWAIT to
> >    the gfp_mask for swapin_readahead() in this case.
>
> Sounds like a good idea; though I have some reservations you're welcome
> to ignore. [...]

I agree, we need something more coherent here than what I wanted to do at
first. I don't have time for this :-/

Anyone?

> And the down_write of mmap_sem here, across all of those (63? 511?)
> swap-ins, worries me.  Should the swap-ins be done higher up, under
> just a down_read of mmap_sem (required to guard the vma)?  Or should
> mmap_sem be dropped and retaken repeatedly, and the various things
> (including the vma itself) be checked repeatedly?  I don't know.

Ebru, would you be willing to rework collapse_huge_page() to call
__collapse_huge_page_swapin() under down_read(mmap_sem)?
* Re: [linux-next] khugepaged inconsistent lock state 2015-09-23 13:22 ` Kirill A. Shutemov @ 2015-09-23 20:01 ` Ebru Akagündüz 2015-09-24 4:22 ` Sergey Senozhatsky 1 sibling, 0 replies; 6+ messages in thread From: Ebru Akagündüz @ 2015-09-23 20:01 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Hugh Dickins, Sergey Senozhatsky, Michal Hocko, Andrew Morton, linux-mm, linux-kernel, Sergey Senozhatsky 2015-09-23 15:22 GMT+02:00 Kirill A. Shutemov <kirill@shutemov.name>: > On Mon, Sep 21, 2015 at 04:57:05PM -0700, Hugh Dickins wrote: >> On Mon, 21 Sep 2015, Kirill A. Shutemov wrote: >> > On Mon, Sep 21, 2015 at 01:46:00PM +0900, Sergey Senozhatsky wrote: >> > > Hi, >> > > >> > > 4.3.0-rc1-next-20150918 >> > > >> > > [18344.236625] ================================= >> > > [18344.236628] [ INFO: inconsistent lock state ] >> > > [18344.236633] 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361 Not tainted >> > > [18344.236636] --------------------------------- >> > > [18344.236640] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage. 
>> > > [18344.236645] khugepaged/32 [HC0[0]:SC0[0]:HE1:SE1] takes: >> > > [18344.236648] (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987 >> > > [18344.236662] {IN-RECLAIM_FS-W} state was registered at: >> > > [18344.236666] [<ffffffff8107d747>] __lock_acquire+0x8e2/0x1183 >> > > [18344.236673] [<ffffffff8107e7ac>] lock_acquire+0x10b/0x1a6 >> > > [18344.236678] [<ffffffff8150a367>] down_write+0x3b/0x6a >> > > [18344.236686] [<ffffffff811360d8>] split_huge_page_to_list+0x5b/0x61f >> > > [18344.236689] [<ffffffff811224b3>] add_to_swap+0x37/0x78 >> > > [18344.236691] [<ffffffff810fd650>] shrink_page_list+0x4c2/0xb9a >> > > [18344.236694] [<ffffffff810fe47c>] shrink_inactive_list+0x371/0x5d9 >> > > [18344.236696] [<ffffffff810fee2f>] shrink_lruvec+0x410/0x5ae >> > > [18344.236698] [<ffffffff810ff024>] shrink_zone+0x57/0x140 >> > > [18344.236700] [<ffffffff810ffc79>] kswapd+0x6a5/0x91b >> > > [18344.236702] [<ffffffff81059588>] kthread+0x107/0x10f >> > > [18344.236706] [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70 >> > > [18344.236708] irq event stamp: 6517947 >> > > [18344.236709] hardirqs last enabled at (6517947): [<ffffffff810f2d0c>] get_page_from_freelist+0x362/0x59e >> > > [18344.236713] hardirqs last disabled at (6517946): [<ffffffff8150ba41>] _raw_spin_lock_irqsave+0x18/0x51 >> > > [18344.236715] softirqs last enabled at (6507072): [<ffffffff81041cb0>] __do_softirq+0x2df/0x3f5 >> > > [18344.236719] softirqs last disabled at (6507055): [<ffffffff81041fb5>] irq_exit+0x40/0x94 >> > > [18344.236722] >> > > other info that might help us debug this: >> > > [18344.236723] Possible unsafe locking scenario: >> > > >> > > [18344.236724] CPU0 >> > > [18344.236725] ---- >> > > [18344.236726] lock(&anon_vma->rwsem); >> > > [18344.236728] <Interrupt> >> > > [18344.236729] lock(&anon_vma->rwsem); >> > > [18344.236731] >> > > *** DEADLOCK *** >> > > >> > > [18344.236733] 2 locks held by khugepaged/32: >> > > [18344.236733] #0: (&mm->mmap_sem){++++++}, 
at: [<ffffffff81134122>] khugepaged+0x5cf/0x1987 >> > > [18344.236738] #1: (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987 >> > > [18344.236741] >> > > stack backtrace: >> > > [18344.236744] CPU: 3 PID: 32 Comm: khugepaged Not tainted 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361 >> > > [18344.236747] 0000000000000000 ffff880132827a00 ffffffff81230867 ffffffff8237ba90 >> > > [18344.236750] ffff880132827a38 ffffffff810ea9b9 000000000000000a ffff8801333b52e0 >> > > [18344.236753] ffff8801333b4c00 ffffffff8107b3ce 000000000000000a ffff880132827a78 >> > > [18344.236755] Call Trace: >> > > [18344.236758] [<ffffffff81230867>] dump_stack+0x4e/0x79 >> > > [18344.236761] [<ffffffff810ea9b9>] print_usage_bug.part.24+0x259/0x268 >> > > [18344.236763] [<ffffffff8107b3ce>] ? print_shortest_lock_dependencies+0x180/0x180 >> > > [18344.236765] [<ffffffff8107c7fc>] mark_lock+0x381/0x567 >> > > [18344.236766] [<ffffffff8107ca40>] mark_held_locks+0x5e/0x74 >> > > [18344.236768] [<ffffffff8107ee9f>] lockdep_trace_alloc+0xb0/0xb3 >> > > [18344.236771] [<ffffffff810f30cc>] __alloc_pages_nodemask+0x99/0x856 >> > > [18344.236772] [<ffffffff810ebaf9>] ? find_get_entry+0x14b/0x17a >> > > [18344.236774] [<ffffffff810ebb16>] ? find_get_entry+0x168/0x17a >> > > [18344.236777] [<ffffffff811226d9>] __read_swap_cache_async+0x7b/0x1aa >> > > [18344.236778] [<ffffffff8112281d>] read_swap_cache_async+0x15/0x2d >> > > [18344.236780] [<ffffffff8112294f>] swapin_readahead+0x11a/0x16a >> > > [18344.236783] [<ffffffff81112791>] do_swap_page+0xa7/0x36b >> > > [18344.236784] [<ffffffff81112791>] ? do_swap_page+0xa7/0x36b >> > > [18344.236787] [<ffffffff8113444c>] khugepaged+0x8f9/0x1987 >> > > [18344.236790] [<ffffffff810772f3>] ? wait_woken+0x88/0x88 >> > > [18344.236792] [<ffffffff81133b53>] ? maybe_pmd_mkwrite+0x1a/0x1a >> > > [18344.236794] [<ffffffff81059588>] kthread+0x107/0x10f >> > > [18344.236797] [<ffffffff81059481>] ? 
kthread_create_on_node+0x1ea/0x1ea >> > > [18344.236799] [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70 >> > > [18344.236801] [<ffffffff81059481>] ? kthread_create_on_node+0x1ea/0x1ea >> > >> > Hm. If I read this correctly, we see following scenario: >> > >> > - khugepaged tries to swap in a page under mmap_sem and anon_vma lock; >> > - do_swap_page() calls swapin_readahead() with GFP_HIGHUSER_MOVABLE; >> > - __read_swap_cache_async() tries to allocate the page for swap in; >> > - lockdep_trace_alloc() in __alloc_pages_nodemask() notices that with >> > given gfp_mask we could end up in direct relaim. >> > - Lockdep already knows that reclaim sometimes (e.g. in case of >> > split_huge_page()) wants to take anon_vma lock on its own. >> > >> > Therefore deadlock is possible. >> >> Oh, thank you for working that out. As usual with a lockdep trace, >> I knew it was telling me something important, but in a language I >> just couldn't understand without spending much longer to decode it. >> Yes, wrong to call do_swap_page() while holding anon_vma lock. >> >> > >> > I see two ways to fix this: >> > >> > - take anon_vma lock *after* __collapse_huge_page_swapin() in >> > collapse_huge_page(): I don't really see why we need the lock >> > during swapin; >> >> Agreed. > > Okay. Patch for this is below. > > Ebru, could you test it? > I did not test yet. I attend WomENcourage conference for a couple of days. I will test changes after when I get back home. >> > - respect FAULT_FLAG_RETRY_NOWAIT in do_swap_page(): add GFP_NOWAIT to >> > gfp_mask for swapin_readahead() in this case. >> >> Sounds like a good idea; though I have some reservations you're welcome >> to ignore. 
Partly because it goes beyond what's actually needed here, >> partly because there's going to be plenty of waiting while the swapin >> is done, partly because I think such a change may better belong to a >> larger effort, extending FAULT_FLAG_RETRY somehow to cover the memory >> allocation as well as the I/O phase (but we might be hoping instead for >> a deeper attack on mmap_sem which would make FAULT_FLAG_RETRY redundant). > > I agree, we need something more coherent here, than what I wanted to do at > first. I don't have time for this :-/ > > Anyone? > >> And the down_write of mmap_sem here, across all of those (63? 511?) >> swapins, worries me. Should the swapins be done higher up, under >> just a down_read of mmap_sem (required to guard vma)? Or should >> mmap_sem be dropped and retaken repeatedly, and the various things >> (including vma itself) be checked repeatedly? I don't know. > > Ebru, would you willing to rework collapse_huge_page() to call > __collapse_huge_page_swapin() under down_read(mmap_sem)? > > From 6d5eba0e7be517b5c0ee1d5492737c17d02f5202 Mon Sep 17 00:00:00 2001 > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> > Date: Wed, 23 Sep 2015 16:01:02 +0300 > Subject: [PATCH] thp: do not hold anon_vma lock during swap in > > khugepaged does swap in during collapse under anon_vma lock. It causes > complain from lockdep. The trace below shows following scenario: > > - khugepaged tries to swap in a page under mmap_sem and anon_vma lock; > - do_swap_page() calls swapin_readahead() with GFP_HIGHUSER_MOVABLE; > - __read_swap_cache_async() tries to allocate the page for swap in; > - lockdep_trace_alloc() in __alloc_pages_nodemask() notices that with > given gfp_mask we could end up in direct relaim. > - Lockdep already knows that reclaim sometimes (e.g. in case of > split_huge_page()) wants to take anon_vma lock on its own. > > Therefore deadlock is possible. > > The fix is to take anon_vma lock after swap in. 
>
> [18344.236625] =================================
> [18344.236628] [ INFO: inconsistent lock state ]
> [18344.236633] 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361 Not tainted
> [18344.236636] ---------------------------------
> [18344.236640] inconsistent {IN-RECLAIM_FS-W} -> {RECLAIM_FS-ON-W} usage.
> [18344.236645] khugepaged/32 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [18344.236648]  (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987
> [18344.236662] {IN-RECLAIM_FS-W} state was registered at:
> [18344.236666]   [<ffffffff8107d747>] __lock_acquire+0x8e2/0x1183
> [18344.236673]   [<ffffffff8107e7ac>] lock_acquire+0x10b/0x1a6
> [18344.236678]   [<ffffffff8150a367>] down_write+0x3b/0x6a
> [18344.236686]   [<ffffffff811360d8>] split_huge_page_to_list+0x5b/0x61f
> [18344.236689]   [<ffffffff811224b3>] add_to_swap+0x37/0x78
> [18344.236691]   [<ffffffff810fd650>] shrink_page_list+0x4c2/0xb9a
> [18344.236694]   [<ffffffff810fe47c>] shrink_inactive_list+0x371/0x5d9
> [18344.236696]   [<ffffffff810fee2f>] shrink_lruvec+0x410/0x5ae
> [18344.236698]   [<ffffffff810ff024>] shrink_zone+0x57/0x140
> [18344.236700]   [<ffffffff810ffc79>] kswapd+0x6a5/0x91b
> [18344.236702]   [<ffffffff81059588>] kthread+0x107/0x10f
> [18344.236706]   [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70
> [18344.236708] irq event stamp: 6517947
> [18344.236709] hardirqs last enabled at (6517947): [<ffffffff810f2d0c>] get_page_from_freelist+0x362/0x59e
> [18344.236713] hardirqs last disabled at (6517946): [<ffffffff8150ba41>] _raw_spin_lock_irqsave+0x18/0x51
> [18344.236715] softirqs last enabled at (6507072): [<ffffffff81041cb0>] __do_softirq+0x2df/0x3f5
> [18344.236719] softirqs last disabled at (6507055): [<ffffffff81041fb5>] irq_exit+0x40/0x94
> [18344.236722]
>                other info that might help us debug this:
> [18344.236723]  Possible unsafe locking scenario:
>
> [18344.236724]        CPU0
> [18344.236725]        ----
> [18344.236726]   lock(&anon_vma->rwsem);
> [18344.236728]   <Interrupt>
> [18344.236729]     lock(&anon_vma->rwsem);
> [18344.236731]
>                 *** DEADLOCK ***
>
> [18344.236733] 2 locks held by khugepaged/32:
> [18344.236733]  #0:  (&mm->mmap_sem){++++++}, at: [<ffffffff81134122>] khugepaged+0x5cf/0x1987
> [18344.236738]  #1:  (&anon_vma->rwsem){++++?.}, at: [<ffffffff81134403>] khugepaged+0x8b0/0x1987
> [18344.236741]
>                stack backtrace:
> [18344.236744] CPU: 3 PID: 32 Comm: khugepaged Not tainted 4.3.0-rc1-next-20150918-dbg-00014-ge5128d0-dirty #361
> [18344.236747]  0000000000000000 ffff880132827a00 ffffffff81230867 ffffffff8237ba90
> [18344.236750]  ffff880132827a38 ffffffff810ea9b9 000000000000000a ffff8801333b52e0
> [18344.236753]  ffff8801333b4c00 ffffffff8107b3ce 000000000000000a ffff880132827a78
> [18344.236755] Call Trace:
> [18344.236758]  [<ffffffff81230867>] dump_stack+0x4e/0x79
> [18344.236761]  [<ffffffff810ea9b9>] print_usage_bug.part.24+0x259/0x268
> [18344.236763]  [<ffffffff8107b3ce>] ? print_shortest_lock_dependencies+0x180/0x180
> [18344.236765]  [<ffffffff8107c7fc>] mark_lock+0x381/0x567
> [18344.236766]  [<ffffffff8107ca40>] mark_held_locks+0x5e/0x74
> [18344.236768]  [<ffffffff8107ee9f>] lockdep_trace_alloc+0xb0/0xb3
> [18344.236771]  [<ffffffff810f30cc>] __alloc_pages_nodemask+0x99/0x856
> [18344.236772]  [<ffffffff810ebaf9>] ? find_get_entry+0x14b/0x17a
> [18344.236774]  [<ffffffff810ebb16>] ? find_get_entry+0x168/0x17a
> [18344.236777]  [<ffffffff811226d9>] __read_swap_cache_async+0x7b/0x1aa
> [18344.236778]  [<ffffffff8112281d>] read_swap_cache_async+0x15/0x2d
> [18344.236780]  [<ffffffff8112294f>] swapin_readahead+0x11a/0x16a
> [18344.236783]  [<ffffffff81112791>] do_swap_page+0xa7/0x36b
> [18344.236784]  [<ffffffff81112791>] ? do_swap_page+0xa7/0x36b
> [18344.236787]  [<ffffffff8113444c>] khugepaged+0x8f9/0x1987
> [18344.236790]  [<ffffffff810772f3>] ? wait_woken+0x88/0x88
> [18344.236792]  [<ffffffff81133b53>] ? maybe_pmd_mkwrite+0x1a/0x1a
> [18344.236794]  [<ffffffff81059588>] kthread+0x107/0x10f
> [18344.236797]  [<ffffffff81059481>] ? kthread_create_on_node+0x1ea/0x1ea
> [18344.236799]  [<ffffffff8150c7bf>] ret_from_fork+0x3f/0x70
> [18344.236801]  [<ffffffff81059481>] ? kthread_create_on_node+0x1ea/0x1ea
>
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
> ---
>  mm/huge_memory.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index dd58ecfcafe6..06c8f6d8fee2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2725,10 +2725,10 @@ static void collapse_huge_page(struct mm_struct *mm,
>  		goto out;
>  	}
>
> -	anon_vma_lock_write(vma->anon_vma);
> -
>  	__collapse_huge_page_swapin(mm, vma, address, pmd);
>
> +	anon_vma_lock_write(vma->anon_vma);
> +
>  	pte = pte_offset_map(pmd, address);
>  	pte_ptl = pte_lockptr(mm, pmd);
>

thanks,
Ebru

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org
* Re: [linux-next] khugepaged inconsistent lock state
  2015-09-23 13:22     ` Kirill A. Shutemov
  2015-09-23 20:01       ` Ebru Akagündüz
@ 2015-09-24  4:22     ` Sergey Senozhatsky
  1 sibling, 0 replies; 6+ messages in thread
From: Sergey Senozhatsky @ 2015-09-24  4:22 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Hugh Dickins, Ebru Akagunduz, Sergey Senozhatsky, Michal Hocko,
	Andrew Morton, linux-mm, linux-kernel, Sergey Senozhatsky

On (09/23/15 16:22), Kirill A. Shutemov wrote:
[..]
> khugepaged does swap in during collapse under the anon_vma lock.  It
> causes a complaint from lockdep.  The trace below shows the following
> scenario:
>
>  - khugepaged tries to swap in a page under mmap_sem and anon_vma lock;
>  - do_swap_page() calls swapin_readahead() with GFP_HIGHUSER_MOVABLE;
>  - __read_swap_cache_async() tries to allocate the page for swap in;
>  - lockdep_trace_alloc() in __alloc_pages_nodemask() notices that with
>    the given gfp_mask we could end up in direct reclaim;
>  - lockdep already knows that reclaim sometimes (e.g. in case of
>    split_huge_page()) wants to take the anon_vma lock on its own.
>
> Therefore deadlock is possible.
[..]

Gave it some testing on my box.  Works fine on my side.  I guess you can
add (if needed)

Tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>

	-ss

> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
> ---
>  mm/huge_memory.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index dd58ecfcafe6..06c8f6d8fee2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2725,10 +2725,10 @@ static void collapse_huge_page(struct mm_struct *mm,
>  		goto out;
>  	}
>
> -	anon_vma_lock_write(vma->anon_vma);
> -
>  	__collapse_huge_page_swapin(mm, vma, address, pmd);
>
> +	anon_vma_lock_write(vma->anon_vma);
> +
>  	pte = pte_offset_map(pmd, address);
>  	pte_ptl = pte_lockptr(mm, pmd);
>
> --
>  Kirill A. Shutemov
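The other direction raised earlier in the thread, respecting FAULT_FLAG_RETRY_NOWAIT by keeping the swapin allocation out of direct reclaim, would amount to masking the reclaim bit out of the gfp_mask before calling swapin_readahead(). The sketch below shows that idea as a userspace C helper; the constants are made-up stand-ins (the real kernel flag values differ), and `swapin_gfp()` is a hypothetical name, not an existing kernel function.

```c
#include <assert.h>

/* Stand-in constants -- illustrative only, not the kernel's values. */
#define __GFP_DIRECT_RECLAIM 0x400u
#define __GFP_IO             0x040u
#define __GFP_MOVABLE        0x008u
#define GFP_HIGHUSER_MOVABLE (__GFP_DIRECT_RECLAIM | __GFP_IO | __GFP_MOVABLE)

#define FAULT_FLAG_RETRY_NOWAIT 0x08u

/*
 * If the fault was told not to wait, drop __GFP_DIRECT_RECLAIM so the
 * swapin allocation can never recurse into reclaim (and thus never
 * into split_huge_page() -> anon_vma lock).
 */
static unsigned swapin_gfp(unsigned gfp, unsigned fault_flags)
{
    if (fault_flags & FAULT_FLAG_RETRY_NOWAIT)
        gfp &= ~__GFP_DIRECT_RECLAIM;
    return gfp;
}
```

As Hugh notes above, this trades the lockdep problem for allocations that can fail under memory pressure, which is why the thread settled on reordering the locking instead.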
end of thread, other threads:[~2015-09-24  4:21 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2015-09-21  4:46 [linux-next] khugepaged inconsistent lock state Sergey Senozhatsky
2015-09-21 15:01 ` Kirill A. Shutemov
2015-09-21 23:57   ` Hugh Dickins
2015-09-23 13:22     ` Kirill A. Shutemov
2015-09-23 20:01       ` Ebru Akagündüz
2015-09-24  4:22     ` Sergey Senozhatsky