[linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

* [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
       [not found] <20160601131122.7dbb0a65@canb.auug.org.au>
@ 2016-06-02  1:48 ` Sergey Senozhatsky
  2016-06-02  9:21   ` Michal Hocko
  2016-06-02 13:24   ` Vlastimil Babka
  0 siblings, 2 replies; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-02  1:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Vlastimil Babka, Michal Hocko, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel

On (06/01/16 13:11), Stephen Rothwell wrote:
> Hi all,
> 
> Changes since 20160531:
> 
> My fixes tree contains:
> 
>   of: silence warnings due to max() usage
> 
> The arm tree gained a conflict against Linus' tree.
> 
> Non-merge commits (relative to Linus' tree): 1100
>  936 files changed, 38159 insertions(+), 17475 deletions(-)

Hello,

the cc1 process ended up in DN state during kernel -j4 compilation.

...
[ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
[ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
[ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
[ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
[ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
[ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
[ 2856.323068] Call Trace:
[ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
[ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
[ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
[ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
[ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
[ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
[ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
[ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
[ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
[ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
[ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
[ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
[ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f

[ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds.
[ 2877.322858]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
[ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2877.322861] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
[ 2877.322865]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
[ 2877.322867]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
[ 2877.322867]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
[ 2877.322870] Call Trace:
[ 2877.322875]  [<ffffffff81441e33>] schedule+0x83/0x98
[ 2877.322878]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
[ 2877.322881]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
[ 2877.322884]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
[ 2877.322885]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
[ 2877.322887]  [<ffffffff81443630>] down_write+0x1f/0x2e
[ 2877.322890]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
[ 2877.322892]  [<ffffffff8103702a>] mmput+0x29/0xc5
[ 2877.322894]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
[ 2877.322896]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
[ 2877.322898]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
[ 2877.322900]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
[ 2877.322902]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
...

ps aux | grep cc1
ss        4582  0.0  0.0      0     0 pts/23   DN+  10:10   0:01 [cc1]

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02  1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky
@ 2016-06-02  9:21   ` Michal Hocko
  2016-06-02 12:08     ` Sergey Senozhatsky
  2016-06-03  7:15     ` Sergey Senozhatsky
  2016-06-02 13:24   ` Vlastimil Babka
  1 sibling, 2 replies; 27+ messages in thread
From: Michal Hocko @ 2016-06-02  9:21 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel,
	Andrea Arcangeli

[CCing Andrea]

On Thu 02-06-16 10:48:35, Sergey Senozhatsky wrote:
> On (06/01/16 13:11), Stephen Rothwell wrote:
> > Hi all,
> > 
> > Changes since 20160531:
> > 
> > My fixes tree contains:
> > 
> >   of: silence warnings due to max() usage
> > 
> > The arm tree gained a conflict against Linus' tree.
> > 
> > Non-merge commits (relative to Linus' tree): 1100
> >  936 files changed, 38159 insertions(+), 17475 deletions(-)
> 
> Hello,
> 
> the cc1 process ended up in DN state during kernel -j4 compilation.
> 
> ...
> [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> [ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> [ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> [ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> [ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> [ 2856.323068] Call Trace:
> [ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> [ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> [ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> [ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> [ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> [ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> [ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> [ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> [ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> [ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> [ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f

down_write in the exit path is certainly not nice. It is hard to tell
who is blocking the mmap_sem but it is clear that __khugepaged_exit
waits for the khugepaged to release its mmap_sem. Do you hapen to have a
trace of khugepaged? Note that the lock holder might be another writer
which just hasn't pinned mm_users so khugepaged might be blocked on read
lock as well. Or khugepaged might be just stuck somewhere...

I am trying to wrap my head around the synchronization here and I
suspect it is unnecessarily complex. We should be able to go without
down_write in the exit path... The following patch would only workaround
the issue you are seeing but I guess it is worth considering this
approach.

Andrea, does the following look reasonable to you? I haven't tested it
and I might be missing some subtle details. The code is really not
trivial...
---

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02  9:21   ` Michal Hocko
@ 2016-06-02 12:08     ` Sergey Senozhatsky
  2016-06-02 12:21       ` Michal Hocko
  2016-06-03  7:15     ` Sergey Senozhatsky
  1 sibling, 1 reply; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-02 12:08 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

Hello Michal,

On (06/02/16 11:21), Michal Hocko wrote:
[..]
> > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> > [ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> > [ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> > [ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> > [ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> > [ 2856.323068] Call Trace:
> > [ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> > [ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> > [ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> > [ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> > [ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> > [ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> > [ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> > [ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> > [ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> > [ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> > [ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> > [ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> > [ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
> 
> down_write in the exit path is certainly not nice. It is hard to tell
> who is blocking the mmap_sem but it is clear that __khugepaged_exit
> waits for the khugepaged to release its mmap_sem. Do you hapen to have a
> trace of khugepaged? Note that the lock holder might be another writer
> which just hasn't pinned mm_users so khugepaged might be blocked on read
> lock as well. Or khugepaged might be just stuck somewhere...

sorry, no. this is all I have. the kernel was compiled with almost no
debugging functionality enabled (no lockdep, no lock debug, nothing)
for zram performance testing purposes.

I'll try to reproduce the problem; and give your patch some testing.
thanks.

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02 12:08     ` Sergey Senozhatsky
@ 2016-06-02 12:21       ` Michal Hocko
  2016-06-03 13:51         ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2016-06-02 12:21 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

On Thu 02-06-16 21:08:57, Sergey Senozhatsky wrote:
> Hello Michal,
> 
> On (06/02/16 11:21), Michal Hocko wrote:
> [..]
> > > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> > > [ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> > > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> > > [ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> > > [ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> > > [ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> > > [ 2856.323068] Call Trace:
> > > [ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> > > [ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> > > [ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> > > [ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> > > [ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> > > [ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> > > [ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> > > [ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> > > [ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> > > [ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> > > [ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> > > [ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> > > [ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
> > 
> > down_write in the exit path is certainly not nice. It is hard to tell
> > who is blocking the mmap_sem but it is clear that __khugepaged_exit
> > waits for the khugepaged to release its mmap_sem. Do you hapen to have a
> > trace of khugepaged? Note that the lock holder might be another writer
> > which just hasn't pinned mm_users so khugepaged might be blocked on read
> > lock as well. Or khugepaged might be just stuck somewhere...
> 
> sorry, no. this is all I have. the kernel was compiled with almost no
> debugging functionality enabled (no lockdep, no lock debug, nothing)
> for zram performance testing purposes.
> 
> I'll try to reproduce the problem; and give your patch some testing.
> thanks.

The patch will drop the down_write from the exit path which is, I
believe the right thing to do, so it would paper over an existing
problem when khugepaged could get stuck with mmap_sem held for read (if
that is really a problem). So reproducing without the patch still makes
some sense.

Testing with the patch makes some sense as well, but I would like to
hear from Andrea whether the approach is good because I am wondering why
he hasn't done that before - it feels so much simpler than the current
code.

Anyway, thanks a lot for testing!

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02 12:21       ` Michal Hocko
@ 2016-06-03 13:51         ` Andrea Arcangeli
  2016-06-03 14:46           ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Andrea Arcangeli @ 2016-06-03 13:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel

On Thu, Jun 02, 2016 at 02:21:10PM +0200, Michal Hocko wrote:
> Testing with the patch makes some sense as well, but I would like to
> hear from Andrea whether the approach is good because I am wondering why
> he hasn't done that before - it feels so much simpler than the current
> code.

The down_write in the exit path comes from __ksm_exit. If you don't
like it there I'd suggest to also remove it from __ksm_exit.

This is a proposed cleanup correct?

The first thing that I can notice is that khugepaged_test_exit() then
can only be called and provide the expected retval, after
atomic_inc_not_zero(mm_users). Also note mmget_not_zero() should be
used instead.

However the code still uses khugepaged_test_exit in __khugepage_enter
that won't increase the mm_users, so then the patch relaxes that check
too much, albeit only for a debug check not strictly a bug.

The cons of this change purely that it'll decrease the responsiveness
in releasing the RAM of a killed task a bit.

To me the fewer time we hold the mm_users the better and I don't see
an obvious runtime improvement coming from this change. It's a bit
simpler yes, but the down_write in the exit path is well understood,
ksm does the same thing and it's in a slow path (it only happens if
the mm that exited is the current one under scan by either ksmd or
khugepaged, so normally the down_write is not executed in the exit
path and the "mm" is collected right away both as a mm_users and
mm_count).

In short I think it's a tradeoff: pros) removes down_write in a slow
path of the the mm exit which may simplify the code a bit, cons) it
could increase the latency in freeing memory as result of a task
exiting or being killed during the khugepaged scan, for example while
the THP is being allocated. While compaction runs to allocate the THP
in collapse_huge_page, if the task is killed currently the memory is
released right away, without waiting for the allocation to succeed or
fail.

I don't see a big enough problem with the down_write in a slow path of
khugepaged_exit to justify the increased latency in releasing memory.

I was very happy by Oleg's patch reducing the mm_users holding of
userfaultfd too. That was controlled by userland so it would only be
an issue for non-cooperative usage which isn't upstream yet, and it
was also much wider than this one would become with the patch applied,
but I liked the direction.

If prefer instead to remove the down_write, you probably could move
the test_exit before the down_read/write to bail out before taking the
lock: you don't need the mmap_sem to do test_exit anymore. The only
reason the text_exit would remain in fact is just to reduce the
latency of the memory freeing, it then becomes a voluntary preempt
cond_resched() to release the memory to make a parallel ;), but unable
to let the kernel free the memory while the THP allocation runs.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 13:51         ` Andrea Arcangeli
@ 2016-06-03 14:46           ` Michal Hocko
  2016-06-03 15:10             ` Andrea Arcangeli
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2016-06-03 14:46 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel

On Fri 03-06-16 15:51:54, Andrea Arcangeli wrote:
> On Thu, Jun 02, 2016 at 02:21:10PM +0200, Michal Hocko wrote:
> > Testing with the patch makes some sense as well, but I would like to
> > hear from Andrea whether the approach is good because I am wondering why
> > he hasn't done that before - it feels so much simpler than the current
> > code.
> 
> The down_write in the exit path comes from __ksm_exit. If you don't
> like it there I'd suggest to also remove it from __ksm_exit.

I see

> This is a proposed cleanup correct?

yes this is a cleanup but also a robustness thing, see below.
 
> The first thing that I can notice is that khugepaged_test_exit() then
> can only be called and provide the expected retval, after
> atomic_inc_not_zero(mm_users). Also note mmget_not_zero() should be
> used instead.

I didn't get used to mmget_not_zero yet, but true a helper would be
better.

[...]

> To me the fewer time we hold the mm_users the better and I don't see
> an obvious runtime improvement coming from this change. It's a bit
> simpler yes, but the down_write in the exit path is well understood,
> ksm does the same thing and it's in a slow path (it only happens if
> the mm that exited is the current one under scan by either ksmd or
> khugepaged, so normally the down_write is not executed in the exit
> path and the "mm" is collected right away both as a mm_users and
> mm_count).

OK, I see your point. I wasn't aware that the mmap_sem is dropped
before the allocation request. Then the original code indeed might
get into exit_mmap earlier wrt. to the patch.

The reason I dislike taking write lock in the __mmput is basically
for the same reason you have pointed out. exit_mmap might be delayed
for an unbounded amount of time. khugepaged resp. ksmd might be well
behaved and release their read lock for costly operations or when they
detect the mm is dead but it is hard to guarantee that all potential
kernel users/drivers are behaving the same way. It is not really trivial
to check whether we have such users (there are 100+ users outside of mm/
as per my quick git grep).

The exit path should be as simple as possible with the amount of
external dependencies reduced to the bare minimum.

> In short I think it's a tradeoff: pros) removes down_write in a slow
> path of the the mm exit which may simplify the code a bit, cons) it
> could increase the latency in freeing memory as result of a task
> exiting or being killed during the khugepaged scan, for example while
> the THP is being allocated. While compaction runs to allocate the THP
> in collapse_huge_page, if the task is killed currently the memory is
> released right away, without waiting for the allocation to succeed or
> fail.

Are those latencies a real problem. The allocation itself shouldn't
really take a long time.

> I don't see a big enough problem with the down_write in a slow path of
> khugepaged_exit to justify the increased latency in releasing memory.

What do you think about the external dependencies mentioned above. Do
you think this is a sufficient argument wrt. occasional higher
latencies?
[...]

> If prefer instead to remove the down_write, you probably could move
> the test_exit before the down_read/write to bail out before taking the
> lock: you don't need the mmap_sem to do test_exit anymore. The only
> reason the text_exit would remain in fact is just to reduce the
> latency of the memory freeing, it then becomes a voluntary preempt
> cond_resched() to release the memory to make a parallel ;), but unable
> to let the kernel free the memory while the THP allocation runs.

OK, I will think about that as well.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 14:46           ` Michal Hocko
@ 2016-06-03 15:10             ` Andrea Arcangeli
  2016-06-07  7:34               ` Michal Hocko
  2016-06-08  8:19               ` Vlastimil Babka
  0 siblings, 2 replies; 27+ messages in thread
From: Andrea Arcangeli @ 2016-06-03 15:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel, Hugh Dickins

Hello Michal,

CC'ed Hugh,

On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote:
> What do you think about the external dependencies mentioned above. Do
> you think this is a sufficient argument wrt. occasional higher
> latencies?

It's a tradeoff and both latencies would be short and uncommon so it's
hard to tell.

There's also mmput_async for paths that may care about mmput
latencies. Exit itself cannot use it, it's mostly for people taking
the mm_users pin that may not want to wait for mmput to run. It also
shouldn't happen that often, it's a slow path.

The whole model inherited from KSM is to deliberately depend only on
the mmap_sem + test_exit + mm_count, and never on mm_users, which to
me in principle doesn't sound bad. I consider KSM version a
"finegrined" implementation but I never thought it would be a problem
to wait a bit in exit() in case the slow path hits. I thought it was
more of a problem if exit() runs, the parent then start a new task but
the memory wasn't freed yet.

So I would suggest Hugh to share his view on the down_write/up_write
that may temporarily block mmput (until the next test_exit bailout
point) vs higher latency in reaching exit_mmap for a real exit(2) that
would happen with the proposed change.

Thanks!
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 15:10             ` Andrea Arcangeli
@ 2016-06-07  7:34               ` Michal Hocko
  2016-06-08  8:19               ` Vlastimil Babka
  1 sibling, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2016-06-07  7:34 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel, Hugh Dickins

On Fri 03-06-16 17:10:01, Andrea Arcangeli wrote:
> Hello Michal,
> 
> CC'ed Hugh,
> 
> On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote:
> > What do you think about the external dependencies mentioned above. Do
> > you think this is a sufficient argument wrt. occasional higher
> > latencies?
> 
> It's a tradeoff and both latencies would be short and uncommon so it's
> hard to tell.
> 
> There's also mmput_async for paths that may care about mmput
> latencies. Exit itself cannot use it, it's mostly for people taking
> the mm_users pin that may not want to wait for mmput to run. It also
> shouldn't happen that often, it's a slow path.
> 
> The whole model inherited from KSM is to deliberately depend only on
> the mmap_sem + test_exit + mm_count, and never on mm_users, which to
> me in principle doesn't sound bad.

I do agree that this model is quite clever (albeit convoluted). It just
assumes that all other mmap_sem users are behaving the same. Now most
in-kernel users will do get_task_mm() and then lock mmap_sem, but I
haven't checked all of them and it is quite possible that some of those
would like to optimize in a similar way and only increment mm_count.
I might be too pessimistic about the out of mm code but I would feel
much better if the exit path didn't depend on them.

Anyway, if the current model sounds better I will definitely not insist
on my patch. It is more of an idea for simplification than a fix for
anything I have seen happening in the real life.
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 15:10             ` Andrea Arcangeli
  2016-06-07  7:34               ` Michal Hocko
@ 2016-06-08  8:19               ` Vlastimil Babka
  1 sibling, 0 replies; 27+ messages in thread
From: Vlastimil Babka @ 2016-06-08  8:19 UTC (permalink / raw)
  To: Andrea Arcangeli, Michal Hocko
  Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Hugh Dickins

On 06/03/2016 05:10 PM, Andrea Arcangeli wrote:
> Hello Michal,
>
> CC'ed Hugh,
>
> On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote:
>> What do you think about the external dependencies mentioned above. Do
>> you think this is a sufficient argument wrt. occasional higher
>> latencies?
>
> It's a tradeoff and both latencies would be short and uncommon so it's
> hard to tell.

Shouldn't it be possible to do a mmput() before the hugepage allocation, 
and then again mmget_not_zero()? That way it's no longer a tradeoff?

> There's also mmput_async for paths that may care about mmput
> latencies. Exit itself cannot use it, it's mostly for people taking
> the mm_users pin that may not want to wait for mmput to run. It also
> shouldn't happen that often, it's a slow path.
>
> The whole model inherited from KSM is to deliberately depend only on
> the mmap_sem + test_exit + mm_count, and never on mm_users, which to
> me in principle doesn't sound bad. I consider KSM version a
> "finegrined" implementation but I never thought it would be a problem
> to wait a bit in exit() in case the slow path hits. I thought it was
> more of a problem if exit() runs, the parent then start a new task but
> the memory wasn't freed yet.
>
> So I would suggest Hugh to share his view on the down_write/up_write
> that may temporarily block mmput (until the next test_exit bailout
> point) vs higher latency in reaching exit_mmap for a real exit(2) that
> would happen with the proposed change.
>
> Thanks!
> Andrea
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02  9:21   ` Michal Hocko
  2016-06-02 12:08     ` Sergey Senozhatsky
@ 2016-06-03  7:15     ` Sergey Senozhatsky
  2016-06-03  7:25       ` Michal Hocko
  1 sibling, 1 reply; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  7:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

Hello,

On (06/02/16 11:21), Michal Hocko wrote:
[..]
> @@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
>  
>  		collect_mm_slot(mm_slot);
>  	}
> +	mmput(mm);
>  
>  	return progress;
>  }

this possibly sleeping mmput() is called from
under the spin_lock(&khugepaged_mm_lock).

there is also a trivial build fixup needed
(move collect_mm_slot() before __khugepaged_exit()).


it's quite hard to trigger the bug (somehow), so I can't
follow up with more information as of now.

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  7:15     ` Sergey Senozhatsky
@ 2016-06-03  7:25       ` Michal Hocko
  2016-06-03  8:43         ` Sergey Senozhatsky
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2016-06-03  7:25 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel,
	Andrea Arcangeli

On Fri 03-06-16 16:15:51, Sergey Senozhatsky wrote:
> Hello,
> 
> On (06/02/16 11:21), Michal Hocko wrote:
> [..]
> > @@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
> >  
> >  		collect_mm_slot(mm_slot);
> >  	}
> > +	mmput(mm);
> >  
> >  	return progress;
> >  }
> 
> this possibly sleeping mmput() is called from
> under the spin_lock(&khugepaged_mm_lock).

You are right. khugepaged_scan_mm_slot returns with the lock held.
mmput_async would deal with it.
 
> there is also a trivial build fixup needed
> (move collect_mm_slot() before __khugepaged_exit()).

will fix that. Thanks!

> it's quite hard to trigger the bug (somehow), so I can't
> follow up with more information as of now.

Thanks anyway!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  7:25       ` Michal Hocko
@ 2016-06-03  8:43         ` Sergey Senozhatsky
  2016-06-03  9:55           ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  8:43 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

On (06/03/16 09:25), Michal Hocko wrote:
> > it's quite hard to trigger the bug (somehow), so I can't
> > follow up with more information as of now.

either I did something very silly fixing up the patch, or the
patch may be causing general protection faults on my system.

RIP collect_mm_slot() + 0x42/0x84
	khugepaged
	prepare_to_wait_event
	maybe_pmd_mkwrite
	kthread
	_raw_sin_unlock_irq
	ret_from_fork
	kthread_create_on_node

collect_mm_slot() + 0x42/0x84 is

0000000000000328 <collect_mm_slot>:
     328:       55                      push   %rbp
     329:       48 89 e5                mov    %rsp,%rbp
     32c:       53                      push   %rbx
     32d:       48 8b 5f 20             mov    0x20(%rdi),%rbx
     331:       8b 43 48                mov    0x48(%rbx),%eax
     334:       ff c8                   dec    %eax
     336:       7f 71                   jg     3a9 <collect_mm_slot+0x81>
     338:       48 8b 57 08             mov    0x8(%rdi),%rdx
     33c:       48 85 d2                test   %rdx,%rdx
     33f:       74 1e                   je     35f <collect_mm_slot+0x37>
     341:       48 8b 07                mov    (%rdi),%rax
     344:       48 85 c0                test   %rax,%rax
     347:       48 89 02                mov    %rax,(%rdx)
     34a:       74 04                   je     350 <collect_mm_slot+0x28>
     34c:       48 89 50 08             mov    %rdx,0x8(%rax)
     350:       48 c7 07 00 00 00 00    movq   $0x0,(%rdi)
     357:       48 c7 47 08 00 00 00    movq   $0x0,0x8(%rdi)
     35e:       00 
     35f:       48 8b 57 10             mov    0x10(%rdi),%rdx
     363:       48 8b 47 18             mov    0x18(%rdi),%rax
     367:       48 89 fe                mov    %rdi,%rsi
     36a:       48 89 42 08             mov    %rax,0x8(%rdx)
		^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     36e:       48 89 10                mov    %rdx,(%rax)
     371:       48 b8 00 01 00 00 00    movabs $0xdead000000000100,%rax
     378:       00 ad de 
     37b:       48 89 47 10             mov    %rax,0x10(%rdi)
     37f:       48 b8 00 02 00 00 00    movabs $0xdead000000000200,%rax
     386:       00 ad de 
     389:       48 89 47 18             mov    %rax,0x18(%rdi)
     38d:       48 8b 3d 00 00 00 00    mov    0x0(%rip),%rdi        # 394 <collect_mm_slot+0x6c>
     394:       e8 00 00 00 00          callq  399 <collect_mm_slot+0x71>
     399:       f0 ff 4b 4c             lock decl 0x4c(%rbx)
     39d:       74 02                   je     3a1 <collect_mm_slot+0x79>
     39f:       eb 08                   jmp    3a9 <collect_mm_slot+0x81>
     3a1:       48 89 df                mov    %rbx,%rdi
     3a4:       e8 00 00 00 00          callq  3a9 <collect_mm_slot+0x81>
     3a9:       5b                      pop    %rbx
     3aa:       5d                      pop    %rbp
     3ab:       c3                      retq

which is list_del(&mm_slot->mm_node), I believe.


I attached the patch (just in case).

---
 mm/huge_memory.c | 87 +++++++++++++++++++++++++-------------------------------
 1 file changed, 39 insertions(+), 48 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 292cedd..1c82fa4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1938,7 +1938,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm,
 
 static inline int khugepaged_test_exit(struct mm_struct *mm)
 {
-	return atomic_read(&mm->mm_users) == 0;
+	/* the only pin is from khugepaged_scan_mm_slot */
+	return atomic_read(&mm->mm_users) <= 1;
 }
 
 int __khugepaged_enter(struct mm_struct *mm)
@@ -1950,8 +1951,6 @@ int __khugepaged_enter(struct mm_struct *mm)
 	if (!mm_slot)
 		return -ENOMEM;
 
-	/* __khugepaged_exit() must not run from under us */
-	VM_BUG_ON_MM(khugepaged_test_exit(mm), mm);
 	if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) {
 		free_mm_slot(mm_slot);
 		return 0;
@@ -1994,36 +1993,40 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma,
 	return 0;
 }
 
-void __khugepaged_exit(struct mm_struct *mm)
+static void collect_mm_slot(struct mm_slot *mm_slot)
 {
-	struct mm_slot *mm_slot;
-	int free = 0;
+	struct mm_struct *mm = mm_slot->mm;
 
-	spin_lock(&khugepaged_mm_lock);
-	mm_slot = get_mm_slot(mm);
-	if (mm_slot && khugepaged_scan.mm_slot != mm_slot) {
+	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
+
+	if (khugepaged_test_exit(mm)) {
+		/* free mm_slot */
 		hash_del(&mm_slot->hash);
 		list_del(&mm_slot->mm_node);
-		free = 1;
-	}
-	spin_unlock(&khugepaged_mm_lock);
 
-	if (free) {
-		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	} else if (mm_slot) {
 		/*
-		 * This is required to serialize against
-		 * khugepaged_test_exit() (which is guaranteed to run
-		 * under mmap sem read mode). Stop here (after we
-		 * return all pagetables will be destroyed) until
-		 * khugepaged has finished working on the pagetables
-		 * under the mmap_sem.
+		 * Not strictly needed because the mm exited already.
+		 *
+		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 		 */
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
+
+		/* khugepaged_mm_lock actually not necessary for the below */
+		free_mm_slot(mm_slot);
+		mmdrop(mm);
+	}
+}
+
+void __khugepaged_exit(struct mm_struct *mm)
+{
+	struct mm_slot *mm_slot;
+
+	spin_lock(&khugepaged_mm_lock);
+	mm_slot = get_mm_slot(mm);
+	if (mm_slot) {
+		collect_mm_slot(mm_slot);
+		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 	}
+	spin_unlock(&khugepaged_mm_lock);
 }
 
 static void release_pte_page(struct page *page)
@@ -2738,29 +2741,6 @@ out:
 	return ret;
 }
 
-static void collect_mm_slot(struct mm_slot *mm_slot)
-{
-	struct mm_struct *mm = mm_slot->mm;
-
-	VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock));
-
-	if (khugepaged_test_exit(mm)) {
-		/* free mm_slot */
-		hash_del(&mm_slot->hash);
-		list_del(&mm_slot->mm_node);
-
-		/*
-		 * Not strictly needed because the mm exited already.
-		 *
-		 * clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
-		 */
-
-		/* khugepaged_mm_lock actually not necessary for the below */
-		free_mm_slot(mm_slot);
-		mmdrop(mm);
-	}
-}
-
 static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 					    struct page **hpage)
 	__releases(&khugepaged_mm_lock)
@@ -2782,6 +2762,16 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages,
 		khugepaged_scan.address = 0;
 		khugepaged_scan.mm_slot = mm_slot;
 	}
+
+	/*
+	 * Do not even try to do anything if the current mm is already
+	 * dead. khugepaged_mm_lock will make sure only this or
+	 * __khugepaged_exit does the unhasing.
+	 */
+	if (!atomic_inc_not_zero(&mm_slot->mm->mm_users)) {
+		collect_mm_slot(mm_slot);
+		return progress;
+	}
 	spin_unlock(&khugepaged_mm_lock);
 
 	mm = mm_slot->mm;
@@ -2865,6 +2855,7 @@ breakouterloop_mmap_sem:
 
 		collect_mm_slot(mm_slot);
 	}
+	mmput_async(mm);
 
 	return progress;
 }
-- 
2.9.0.rc1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  8:43         ` Sergey Senozhatsky
@ 2016-06-03  9:55           ` Michal Hocko
  2016-06-03 10:05             ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2016-06-03  9:55 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel,
	Andrea Arcangeli

On Fri 03-06-16 17:43:47, Sergey Senozhatsky wrote:
> On (06/03/16 09:25), Michal Hocko wrote:
> > > it's quite hard to trigger the bug (somehow), so I can't
> > > follow up with more information as of now.
> 
> either I did something very silly fixing up the patch, or the
> patch may be causing general protection faults on my system.
> 
> RIP collect_mm_slot() + 0x42/0x84
> 	khugepaged

So is this really collect_mm_slot called directly from khugepaged or is
some inlining going on there?

> 	prepare_to_wait_event
> 	maybe_pmd_mkwrite
> 	kthread
> 	_raw_sin_unlock_irq
> 	ret_from_fork
> 	kthread_create_on_node
> 
> collect_mm_slot() + 0x42/0x84 is

I guess that the problem is that I have missed that __khugepaged_exit
doesn't clear the cached khugepaged_scan.mm_slot. Does the following on
top fixes that?
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6574c62ca4a3..e6f4e6fd587a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2021,6 +2021,8 @@ void __khugepaged_exit(struct mm_struct *mm)
 	spin_lock(&khugepaged_mm_lock);
 	mm_slot = get_mm_slot(mm);
 	if (mm_slot) {
+		if (khugepaged_scan.mm_slot == mm_slot)
+			khugepaged_scan.mm_slot = NULL;
 		collect_mm_slot(mm_slot);
 		clear_bit(MMF_VM_HUGEPAGE, &mm->flags);
 	}
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  9:55           ` Michal Hocko
@ 2016-06-03 10:05             ` Michal Hocko
  2016-06-03 13:38               ` Sergey Senozhatsky
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2016-06-03 10:05 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel,
	Andrea Arcangeli

On Fri 03-06-16 11:55:49, Michal Hocko wrote:
> On Fri 03-06-16 17:43:47, Sergey Senozhatsky wrote:
> > On (06/03/16 09:25), Michal Hocko wrote:
> > > > it's quite hard to trigger the bug (somehow), so I can't
> > > > follow up with more information as of now.
> > 
> > either I did something very silly fixing up the patch, or the
> > patch may be causing general protection faults on my system.
> > 
> > RIP collect_mm_slot() + 0x42/0x84
> > 	khugepaged
> 
> So is this really collect_mm_slot called directly from khugepaged or is
> some inlining going on there?
> 
> > 	prepare_to_wait_event
> > 	maybe_pmd_mkwrite
> > 	kthread
> > 	_raw_sin_unlock_irq
> > 	ret_from_fork
> > 	kthread_create_on_node
> > 
> > collect_mm_slot() + 0x42/0x84 is
> 
> I guess that the problem is that I have missed that __khugepaged_exit
> doesn't clear the cached khugepaged_scan.mm_slot. Does the following on
> top fixes that?

That wouldn't be sufficient after a closer look. We need to do the same
from khugepaged_scan_mm_slot when atomic_inc_not_zero fails. So I guess
it would be better to stick it into collect_mm_slot.

Thanks for your testing!
---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6574c62ca4a3..0432581fb87c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2011,6 +2011,9 @@ static void collect_mm_slot(struct mm_slot *mm_slot)
 		/* khugepaged_mm_lock actually not necessary for the below */
 		free_mm_slot(mm_slot);
 		mmdrop(mm);
+
+		if (khugepaged_scan.mm_slot == mm_slot)
+			khugepaged_scan.mm_slot = NULL;
 	}
 }
 
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 10:05             ` Michal Hocko
@ 2016-06-03 13:38               ` Sergey Senozhatsky
  2016-06-03 13:45                 ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03 13:38 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

On (06/03/16 12:05), Michal Hocko wrote:
> > > RIP collect_mm_slot() + 0x42/0x84
> > > 	khugepaged
> > 
> > So is this really collect_mm_slot called directly from khugepaged or is
> > some inlining going on there?

inlining I suppose.

> > > 	prepare_to_wait_event
> > > 	maybe_pmd_mkwrite
> > > 	kthread
> > > 	_raw_sin_unlock_irq
> > > 	ret_from_fork
> > > 	kthread_create_on_node
> > > 
> > > collect_mm_slot() + 0x42/0x84 is
> > 
> > I guess that the problem is that I have missed that __khugepaged_exit
> > doesn't clear the cached khugepaged_scan.mm_slot. Does the following on
> > top fixes that?
> 
> That wouldn't be sufficient after a closer look. We need to do the same
> from khugepaged_scan_mm_slot when atomic_inc_not_zero fails. So I guess
> it would be better to stick it into collect_mm_slot.

Michal, I'll try to test during the weekend (away from the affected box
now), but in the worst case it can as late as next Thursday (gonna travel
next week).

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 13:38               ` Sergey Senozhatsky
@ 2016-06-03 13:45                 ` Michal Hocko
  2016-06-03 13:49                   ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2016-06-03 13:45 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel, Andrea Arcangeli

On Fri 03-06-16 22:38:13, Sergey Senozhatsky wrote:
[...]
> Michal, I'll try to test during the weekend (away from the affected box
> now), but in the worst case it can as late as next Thursday (gonna travel
> next week).

No problem. I would really like to hear from Andrea before we give this
a serious try anyway.

Thanks!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 13:45                 ` Michal Hocko
@ 2016-06-03 13:49                   ` Michal Hocko
  2016-06-04  7:51                     ` Sergey Senozhatsky
  0 siblings, 1 reply; 27+ messages in thread
From: Michal Hocko @ 2016-06-03 13:49 UTC (permalink / raw)
  To: Sergey Senozhatsky, Andrea Arcangeli
  Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka,
	Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next,
	linux-kernel

On Fri 03-06-16 15:45:09, Michal Hocko wrote:
> On Fri 03-06-16 22:38:13, Sergey Senozhatsky wrote:
> [...]
> > Michal, I'll try to test during the weekend (away from the affected box
> > now), but in the worst case it can as late as next Thursday (gonna travel
> > next week).
> 
> No problem. I would really like to hear from Andrea before we give this
> a serious try anyway.

And just for an easier review, here is what I have right now:
---

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03 13:49                   ` Michal Hocko
@ 2016-06-04  7:51                     ` Sergey Senozhatsky
  2016-06-06  8:39                       ` Michal Hocko
  0 siblings, 1 reply; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-04  7:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Sergey Senozhatsky, Andrea Arcangeli, Sergey Senozhatsky,
	Andrew Morton, Vlastimil Babka, Kirill A. Shutemov,
	Stephen Rothwell, linux-mm, linux-next, linux-kernel

Hello,

On (06/03/16 15:49), Michal Hocko wrote:
> __khugepaged_exit is called during the final __mmput and it employs a
> complex synchronization dances to make sure it doesn't race with the
> khugepaged which might be scanning this mm at the same time. This is
> all caused by the fact that khugepaged doesn't pin mm_users. Things
> would simplify considerably if we simply check the mm at
> khugepaged_scan_mm_slot and if mm_users was already 0 then we know it
> is dead and we can unhash the mm_slot and move on to another one. This
> will also guarantee that __khugepaged_exit cannot race with khugepaged
> and so we can free up the slot if it is still hashed.
> 
> Signed-off-by: Michal Hocko <mhocko@suse.com>

with this patch and
http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem-fix-2.patch

I saw no problems during my tests (well, may be didn't test hard
enough).

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-04  7:51                     ` Sergey Senozhatsky
@ 2016-06-06  8:39                       ` Michal Hocko
  0 siblings, 0 replies; 27+ messages in thread
From: Michal Hocko @ 2016-06-06  8:39 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Sergey Senozhatsky, Andrea Arcangeli, Andrew Morton,
	Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel

On Sat 04-06-16 16:51:14, Sergey Senozhatsky wrote:
> Hello,
> 
> On (06/03/16 15:49), Michal Hocko wrote:
> > __khugepaged_exit is called during the final __mmput and it employs a
> > complex synchronization dances to make sure it doesn't race with the
> > khugepaged which might be scanning this mm at the same time. This is
> > all caused by the fact that khugepaged doesn't pin mm_users. Things
> > would simplify considerably if we simply check the mm at
> > khugepaged_scan_mm_slot and if mm_users was already 0 then we know it
> > is dead and we can unhash the mm_slot and move on to another one. This
> > will also guarantee that __khugepaged_exit cannot race with khugepaged
> > and so we can free up the slot if it is still hashed.
> > 
> > Signed-off-by: Michal Hocko <mhocko@suse.com>
> 
> with this patch and
> http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem-fix-2.patch
> 
> I saw no problems during my tests (well, may be didn't test hard
> enough).

Thanks for the testing!
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02  1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky
  2016-06-02  9:21   ` Michal Hocko
@ 2016-06-02 13:24   ` Vlastimil Babka
  2016-06-02 18:58     ` Ebru Akagunduz
  2016-06-03 12:28     ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz
  1 sibling, 2 replies; 27+ messages in thread
From: Vlastimil Babka @ 2016-06-02 13:24 UTC (permalink / raw)
  To: Sergey Senozhatsky, Andrew Morton, Ebru Akagunduz
  Cc: Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, linux-mm,
	linux-next, linux-kernel, Rik van Riel, Andrea Arcangeli

[+CC's]

On 06/02/2016 03:48 AM, Sergey Senozhatsky wrote:
> On (06/01/16 13:11), Stephen Rothwell wrote:
>> Hi all,
>>
>> Changes since 20160531:
>>
>> My fixes tree contains:
>>
>>   of: silence warnings due to max() usage
>>
>> The arm tree gained a conflict against Linus' tree.
>>
>> Non-merge commits (relative to Linus' tree): 1100
>>  936 files changed, 38159 insertions(+), 17475 deletions(-)
>
> Hello,
>
> the cc1 process ended up in DN state during kernel -j4 compilation.
>
> ...
> [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> [ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> [ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> [ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> [ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> [ 2856.323068] Call Trace:
> [ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> [ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> [ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> [ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> [ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> [ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> [ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> [ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> [ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> [ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> [ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> [ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
>
> [ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds.
> [ 2877.322858]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> [ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 2877.322861] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> [ 2877.322865]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> [ 2877.322867]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> [ 2877.322867]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> [ 2877.322870] Call Trace:
> [ 2877.322875]  [<ffffffff81441e33>] schedule+0x83/0x98
> [ 2877.322878]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> [ 2877.322881]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> [ 2877.322884]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> [ 2877.322885]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> [ 2877.322887]  [<ffffffff81443630>] down_write+0x1f/0x2e
> [ 2877.322890]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> [ 2877.322892]  [<ffffffff8103702a>] mmput+0x29/0xc5
> [ 2877.322894]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> [ 2877.322896]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> [ 2877.322898]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> [ 2877.322900]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> [ 2877.322902]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f

I think it's this patch:

http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch

Some parts of the code in collapse_huge_page() that were under 
down_write(mmap_sem) are under down_read() after the patch. But there's 
"goto out" which continues via "goto out_up_write" which does 
up_write(mmap_sem) so there's an imbalance. One path seems to go via 
both up_read() and up_write(). I can imagine this can cause a stuck 
down_write() among other things?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02 13:24   ` Vlastimil Babka
@ 2016-06-02 18:58     ` Ebru Akagunduz
  2016-06-03  1:00       ` Sergey Senozhatsky
  2016-06-03 12:28     ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz
  1 sibling, 1 reply; 27+ messages in thread
From: Ebru Akagunduz @ 2016-06-02 18:58 UTC (permalink / raw)
  To: vbabka, sergey.senozhatsky.work, akpm
  Cc: mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel,
	riel, aarcange

On Thu, Jun 02, 2016 at 03:24:05PM +0200, Vlastimil Babka wrote:
> [+CC's]
> 
> On 06/02/2016 03:48 AM, Sergey Senozhatsky wrote:
> >On (06/01/16 13:11), Stephen Rothwell wrote:
> >>Hi all,
> >>
> >>Changes since 20160531:
> >>
> >>My fixes tree contains:
> >>
> >>  of: silence warnings due to max() usage
> >>
> >>The arm tree gained a conflict against Linus' tree.
> >>
> >>Non-merge commits (relative to Linus' tree): 1100
> >> 936 files changed, 38159 insertions(+), 17475 deletions(-)
> >
> >Hello,
> >
> >the cc1 process ended up in DN state during kernel -j4 compilation.
> >
> >...
> >[ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds.
> >[ 2856.323055]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> >[ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >[ 2856.323059] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> >[ 2856.323062]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> >[ 2856.323065]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> >[ 2856.323067]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> >[ 2856.323068] Call Trace:
> >[ 2856.323074]  [<ffffffff81441e33>] schedule+0x83/0x98
> >[ 2856.323077]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> >[ 2856.323080]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> >[ 2856.323083]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> >[ 2856.323084]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> >[ 2856.323086]  [<ffffffff81443630>] down_write+0x1f/0x2e
> >[ 2856.323089]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> >[ 2856.323091]  [<ffffffff8103702a>] mmput+0x29/0xc5
> >[ 2856.323093]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> >[ 2856.323095]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> >[ 2856.323097]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> >[ 2856.323099]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> >[ 2856.323101]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
> >
> >[ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds.
> >[ 2877.322858]       Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453
> >[ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >[ 2877.322861] cc1             D ffff880057e9fd78     0  4582   4575 0x00000000
> >[ 2877.322865]  ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000
> >[ 2877.322867]  ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80
> >[ 2877.322867]  ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00
> >[ 2877.322870] Call Trace:
> >[ 2877.322875]  [<ffffffff81441e33>] schedule+0x83/0x98
> >[ 2877.322878]  [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3
> >[ 2877.322881]  [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d
> >[ 2877.322884]  [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30
> >[ 2877.322885]  [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30
> >[ 2877.322887]  [<ffffffff81443630>] down_write+0x1f/0x2e
> >[ 2877.322890]  [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a
> >[ 2877.322892]  [<ffffffff8103702a>] mmput+0x29/0xc5
> >[ 2877.322894]  [<ffffffff8103bbd8>] do_exit+0x34c/0x894
> >[ 2877.322896]  [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399
> >[ 2877.322898]  [<ffffffff8103c188>] do_group_exit+0x3c/0x98
> >[ 2877.322900]  [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf
> >[ 2877.322902]  [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f
> 
> I think it's this patch:
> 
> http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch
> 
> Some parts of the code in collapse_huge_page() that were under
> down_write(mmap_sem) are under down_read() after the patch. But
> there's "goto out" which continues via "goto out_up_write" which
> does up_write(mmap_sem) so there's an imbalance. One path seems to
> go via both up_read() and up_write(). I can imagine this can cause a
> stuck down_write() among other things?
Recently, I realized the same imbalance, it is an obvious
inconsistency. I don't know, this issue can be related with
mine. I'll send a fix patch.

Kind regards.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-02 18:58     ` Ebru Akagunduz
@ 2016-06-03  1:00       ` Sergey Senozhatsky
  2016-06-03  1:29         ` Sergey Senozhatsky
  0 siblings, 1 reply; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  1:00 UTC (permalink / raw)
  To: Ebru Akagunduz
  Cc: Vlastimil Babka, sergey.senozhatsky.work, Andrew Morton,
	Michal Hocko, Kirill A. Shutemov, Stephen Rothwell,
	Andrea Arcangeli, Rik van Riel, linux-mm, linux-next,
	linux-kernel

On (06/02/16 21:58), Ebru Akagunduz wrote:
[..]
> > I think it's this patch:
> > 
> > http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch
> > 
> > Some parts of the code in collapse_huge_page() that were under
> > down_write(mmap_sem) are under down_read() after the patch. But
> > there's "goto out" which continues via "goto out_up_write" which
> > does up_write(mmap_sem) so there's an imbalance. One path seems to
> > go via both up_read() and up_write(). I can imagine this can cause a
> > stuck down_write() among other things?
> Recently, I realized the same imbalance, it is an obvious
> inconsistency. I don't know, this issue can be related with
> mine. I'll send a fix patch.

a good find by Vlastimil.

Ebru, can you also re-visit __collapse_huge_page_swapin()? it's called
from collapse_huge_page() under the down_read(&mm->mmap_sem), is there
any reason to do the nested down_read(&mm->mmap_sem)?

collapse_huge_page()
...
	down_read(&mm->mmap_sem);
	result = hugepage_vma_revalidate(mm, vma, address);
	if (result)
		goto out;

	pmd = mm_find_pmd(mm, address);
	if (!pmd) {
		result = SCAN_PMD_NULL;
		goto out;
	}

	if (allocstall == curr_allocstall && swap != 0) {
		if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
			{
			:	if (ret & VM_FAULT_RETRY) {
			:		down_read(&mm->mmap_sem);
			:		^^^^^^^^^
			:		if (hugepage_vma_revalidate(mm, vma, address))
			:			return false;
			:	}
			}

			up_read(&mm->mmap_sem);
			goto out;
		}
	}

	up_read(&mm->mmap_sem);



so if __collapse_huge_page_swapin() retruns true we have:
	- down_read() twice, up_read() once?

the locking rules here are a bit confusing. (I didn't have my morning coffee yet).

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  1:00       ` Sergey Senozhatsky
@ 2016-06-03  1:29         ` Sergey Senozhatsky
  2016-06-03  4:14           ` Sergey Senozhatsky
  0 siblings, 1 reply; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  1:29 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Ebru Akagunduz, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli,
	Rik van Riel, linux-mm, linux-next, linux-kernel

On (06/03/16 10:00), Sergey Senozhatsky wrote:
> a good find by Vlastimil.
> 
> Ebru, can you also re-visit __collapse_huge_page_swapin()? it's called
> from collapse_huge_page() under the down_read(&mm->mmap_sem), is there
> any reason to do the nested down_read(&mm->mmap_sem)?
> 
> collapse_huge_page()
> ...
> 	down_read(&mm->mmap_sem);
> 	result = hugepage_vma_revalidate(mm, vma, address);
> 	if (result)
> 		goto out;
> 
> 	pmd = mm_find_pmd(mm, address);
> 	if (!pmd) {
> 		result = SCAN_PMD_NULL;
> 		goto out;
> 	}
> 
> 	if (allocstall == curr_allocstall && swap != 0) {
> 		if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
> 			{
> 			:	if (ret & VM_FAULT_RETRY) {
> 			:		down_read(&mm->mmap_sem);
> 			:		^^^^^^^^^

oh... it's in a loop

		for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
						pte++, _address += PAGE_SIZE) {
			ret = do_swap_page()
			if (ret & VM_FAULT_RETRY) {
				down_read(&mm->mmap_sem);
				^^^^^^^^^
				...
			}
		}

so there can be multiple sem->count++ in __collapse_huge_page_swapin(),
and you don't know how many sem->count-- you need to do later? is this
correct or am I hallucinating?

	-ss

> 			:		if (hugepage_vma_revalidate(mm, vma, address))
> 			:			return false;
> 			:	}
> 			}
> 
> 			up_read(&mm->mmap_sem);
> 			goto out;
> 		}
> 	}
> 
> 	up_read(&mm->mmap_sem);
> 
> 
> 
> so if __collapse_huge_page_swapin() retruns true we have:
> 	- down_read() twice, up_read() once?
> 
> the locking rules here are a bit confusing. (I didn't have my morning coffee yet).
> 
> 	-ss
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup
  2016-06-03  1:29         ` Sergey Senozhatsky
@ 2016-06-03  4:14           ` Sergey Senozhatsky
  0 siblings, 0 replies; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-03  4:14 UTC (permalink / raw)
  To: Sergey Senozhatsky
  Cc: Ebru Akagunduz, Vlastimil Babka, Andrew Morton, Michal Hocko,
	Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli,
	Rik van Riel, linux-mm, linux-next, linux-kernel

On (06/03/16 10:29), Sergey Senozhatsky wrote:
> > 	if (allocstall == curr_allocstall && swap != 0) {
> > 		if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
> > 			{
> > 			:	if (ret & VM_FAULT_RETRY) {
> > 			:		down_read(&mm->mmap_sem);
> > 			:		^^^^^^^^^
> 
> oh... it's in a loop
> 
> 		for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE;
> 						pte++, _address += PAGE_SIZE) {
> 			ret = do_swap_page()
> 			if (ret & VM_FAULT_RETRY) {
> 				down_read(&mm->mmap_sem);
> 				^^^^^^^^^
> 				...
> 			}
> 		}
> 
> so there can be multiple sem->count++ in __collapse_huge_page_swapin(),
> and you don't know how many sem->count-- you need to do later? is this
> correct or am I hallucinating?

No, I was wrong, sorry for the noise.

it's getting unlocked in

__collapse_huge_page_swapin()
	do_swap_page()
		lock_page_or_retry()
			if (flags & FAULT_FLAG_ALLOW_RETRY)
				up_read(&mm->mmap_sem);
	return VM_FAULT_RETRY

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page
  2016-06-02 13:24   ` Vlastimil Babka
  2016-06-02 18:58     ` Ebru Akagunduz
@ 2016-06-03 12:28     ` Ebru Akagunduz
  2016-06-06 13:05       ` Vlastimil Babka
  1 sibling, 1 reply; 27+ messages in thread
From: Ebru Akagunduz @ 2016-06-03 12:28 UTC (permalink / raw)
  To: akpm
  Cc: vbabka, sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr,
	linux-mm, linux-next, linux-kernel, riel, aarcange,
	Ebru Akagunduz

After creating revalidate vma function, locking inconsistency occured
due to directing the code path to wrong label. This patch directs
to correct label and fix the inconsistency.

Related commit that caused inconsistency:
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0

Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>
---
 mm/huge_memory.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 292cedd..8043d91 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2493,13 +2493,18 @@ static void collapse_huge_page(struct mm_struct *mm,
 	curr_allocstall = sum_vm_event(ALLOCSTALL);
 	down_read(&mm->mmap_sem);
 	result = hugepage_vma_revalidate(mm, vma, address);
-	if (result)
-		goto out;
+	if (result) {
+		mem_cgroup_cancel_charge(new_page, memcg, true);
+		up_read(&mm->mmap_sem);
+		goto out_nolock;
+	}
 
 	pmd = mm_find_pmd(mm, address);
 	if (!pmd) {
 		result = SCAN_PMD_NULL;
-		goto out;
+		mem_cgroup_cancel_charge(new_page, memcg, true);
+		up_read(&mm->mmap_sem);
+		goto out_nolock;
 	}
 
 	/*
@@ -2513,8 +2518,9 @@ static void collapse_huge_page(struct mm_struct *mm,
 		 * label out. Continuing to collapse causes inconsistency.
 		 */
 		if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) {
+			mem_cgroup_cancel_charge(new_page, memcg, true);
 			up_read(&mm->mmap_sem);
-			goto out;
+			goto out_nolock;
 		}
 	}
 
-- 
1.9.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page
  2016-06-03 12:28     ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz
@ 2016-06-06 13:05       ` Vlastimil Babka
  2016-06-09  3:51         ` Sergey Senozhatsky
  0 siblings, 1 reply; 27+ messages in thread
From: Vlastimil Babka @ 2016-06-06 13:05 UTC (permalink / raw)
  To: Ebru Akagunduz, akpm
  Cc: sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr, linux-mm,
	linux-next, linux-kernel, riel, aarcange

On 06/03/2016 02:28 PM, Ebru Akagunduz wrote:
> After creating revalidate vma function, locking inconsistency occured
> due to directing the code path to wrong label. This patch directs
> to correct label and fix the inconsistency.
>
> Related commit that caused inconsistency:
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0
>
> Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com>

I think this does fix the inconsistency, thanks.

But looking at collapse_huge_page() as of latest -next, I wonder if 
there's another problem:

pmd = mm_find_pmd(mm, address);
...
up_read(&mm->mmap_sem);
down_write(&mm->mmap_sem);
hugepage_vma_revalidate(mm, address);
...
pte = pte_offset_map(pmd, address);

What guarantees that 'pmd' is still valid?

Vlastimil

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page
  2016-06-06 13:05       ` Vlastimil Babka
@ 2016-06-09  3:51         ` Sergey Senozhatsky
  0 siblings, 0 replies; 27+ messages in thread
From: Sergey Senozhatsky @ 2016-06-09  3:51 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Ebru Akagunduz, akpm, sergey.senozhatsky.work, mhocko,
	kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel,
	aarcange

On (06/06/16 15:05), Vlastimil Babka wrote:
[..]
> I think this does fix the inconsistency, thanks.
> 
> But looking at collapse_huge_page() as of latest -next, I wonder if there's
> another problem:
> 
> pmd = mm_find_pmd(mm, address);
> ...
> up_read(&mm->mmap_sem);
> down_write(&mm->mmap_sem);
> hugepage_vma_revalidate(mm, address);
> ...
> pte = pte_offset_map(pmd, address);
> 
> What guarantees that 'pmd' is still valid?

the same question applied to __collapse_huge_page_swapin(), I think.

__collapse_huge_page_swapin(pmd)
	pte = pte_offset_map(pmd, address);
	do_swap_page(mm, vma, _address, pte, pmd...)
		up_read(&mm->mmap_sem);
	down_read(&mm->mmap_sem);
	pte = pte_offset_map(pmd, _address);

	-ss

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2016-06-09  3:51 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20160601131122.7dbb0a65@canb.auug.org.au>
2016-06-02  1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky
2016-06-02  9:21   ` Michal Hocko
2016-06-02 12:08     ` Sergey Senozhatsky
2016-06-02 12:21       ` Michal Hocko
2016-06-03 13:51         ` Andrea Arcangeli
2016-06-03 14:46           ` Michal Hocko
2016-06-03 15:10             ` Andrea Arcangeli
2016-06-07  7:34               ` Michal Hocko
2016-06-08  8:19               ` Vlastimil Babka
2016-06-03  7:15     ` Sergey Senozhatsky
2016-06-03  7:25       ` Michal Hocko
2016-06-03  8:43         ` Sergey Senozhatsky
2016-06-03  9:55           ` Michal Hocko
2016-06-03 10:05             ` Michal Hocko
2016-06-03 13:38               ` Sergey Senozhatsky
2016-06-03 13:45                 ` Michal Hocko
2016-06-03 13:49                   ` Michal Hocko
2016-06-04  7:51                     ` Sergey Senozhatsky
2016-06-06  8:39                       ` Michal Hocko
2016-06-02 13:24   ` Vlastimil Babka
2016-06-02 18:58     ` Ebru Akagunduz
2016-06-03  1:00       ` Sergey Senozhatsky
2016-06-03  1:29         ` Sergey Senozhatsky
2016-06-03  4:14           ` Sergey Senozhatsky
2016-06-03 12:28     ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz
2016-06-06 13:05       ` Vlastimil Babka
2016-06-09  3:51         ` Sergey Senozhatsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox