* [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup [not found] <20160601131122.7dbb0a65@canb.auug.org.au> @ 2016-06-02 1:48 ` Sergey Senozhatsky 2016-06-02 9:21 ` Michal Hocko 2016-06-02 13:24 ` Vlastimil Babka 0 siblings, 2 replies; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-02 1:48 UTC (permalink / raw) To: Andrew Morton Cc: Vlastimil Babka, Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel On (06/01/16 13:11), Stephen Rothwell wrote: > Hi all, > > Changes since 20160531: > > My fixes tree contains: > > of: silence warnings due to max() usage > > The arm tree gained a conflict against Linus' tree. > > Non-merge commits (relative to Linus' tree): 1100 > 936 files changed, 38159 insertions(+), 17475 deletions(-) Hello, the cc1 process ended up in DN state during kernel -j4 compilation. ... [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds. [ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 [ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 [ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 [ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 [ 2856.323068] Call Trace: [ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98 [ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 [ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d [ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 [ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 [ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e [ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a [ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5 [ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 [ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 [ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 [ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf [ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f [ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds. [ 2877.322858] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 [ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 2877.322861] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 [ 2877.322865] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 [ 2877.322867] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 [ 2877.322867] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 [ 2877.322870] Call Trace: [ 2877.322875] [<ffffffff81441e33>] schedule+0x83/0x98 [ 2877.322878] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 [ 2877.322881] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d [ 2877.322884] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 [ 2877.322885] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 [ 2877.322887] [<ffffffff81443630>] down_write+0x1f/0x2e [ 2877.322890] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a [ 2877.322892] [<ffffffff8103702a>] mmput+0x29/0xc5 [ 2877.322894] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 [ 2877.322896] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 [ 2877.322898] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 [ 2877.322900] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf [ 2877.322902] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f ... ps aux | grep cc1 ss 4582 0.0 0.0 0 0 pts/23 DN+ 10:10 0:01 [cc1] -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky @ 2016-06-02 9:21 ` Michal Hocko 2016-06-02 12:08 ` Sergey Senozhatsky 2016-06-03 7:15 ` Sergey Senozhatsky 2016-06-02 13:24 ` Vlastimil Babka 1 sibling, 2 replies; 27+ messages in thread From: Michal Hocko @ 2016-06-02 9:21 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli [CCing Andrea] On Thu 02-06-16 10:48:35, Sergey Senozhatsky wrote: > On (06/01/16 13:11), Stephen Rothwell wrote: > > Hi all, > > > > Changes since 20160531: > > > > My fixes tree contains: > > > > of: silence warnings due to max() usage > > > > The arm tree gained a conflict against Linus' tree. > > > > Non-merge commits (relative to Linus' tree): 1100 > > 936 files changed, 38159 insertions(+), 17475 deletions(-) > > Hello, > > the cc1 process ended up in DN state during kernel -j4 compilation. > > ... > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds. > [ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > [ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > [ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > [ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > [ 2856.323068] Call Trace: > [ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98 > [ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > [ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d > [ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > [ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > [ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e > [ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > [ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5 > [ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > [ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 > [ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > [ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > [ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f down_write in the exit path is certainly not nice. It is hard to tell who is blocking the mmap_sem but it is clear that __khugepaged_exit waits for the khugepaged to release its mmap_sem. Do you hapen to have a trace of khugepaged? Note that the lock holder might be another writer which just hasn't pinned mm_users so khugepaged might be blocked on read lock as well. Or khugepaged might be just stuck somewhere... I am trying to wrap my head around the synchronization here and I suspect it is unnecessarily complex. We should be able to go without down_write in the exit path... The following patch would only workaround the issue you are seeing but I guess it is worth considering this approach. Andrea, does the following look reasonable to you? I haven't tested it and I might be missing some subtle details. The code is really not trivial... --- ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 9:21 ` Michal Hocko @ 2016-06-02 12:08 ` Sergey Senozhatsky 2016-06-02 12:21 ` Michal Hocko 2016-06-03 7:15 ` Sergey Senozhatsky 1 sibling, 1 reply; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-02 12:08 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli Hello Michal, On (06/02/16 11:21), Michal Hocko wrote: [..] > > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds. > > [ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > [ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > > [ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > > [ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > > [ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > > [ 2856.323068] Call Trace: > > [ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98 > > [ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > > [ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d > > [ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > > [ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > > [ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e > > [ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > > [ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5 > > [ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > > [ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 > > [ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > > [ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > > [ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f > > down_write in the exit path is certainly not nice. It is hard to tell > who is blocking the mmap_sem but it is clear that __khugepaged_exit > waits for the khugepaged to release its mmap_sem. Do you hapen to have a > trace of khugepaged? Note that the lock holder might be another writer > which just hasn't pinned mm_users so khugepaged might be blocked on read > lock as well. Or khugepaged might be just stuck somewhere... sorry, no. this is all I have. the kernel was compiled with almost no debugging functionality enabled (no lockdep, no lock debug, nothing) for zram performance testing purposes. I'll try to reproduce the problem; and give your patch some testing. thanks. -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 12:08 ` Sergey Senozhatsky @ 2016-06-02 12:21 ` Michal Hocko 2016-06-03 13:51 ` Andrea Arcangeli 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2016-06-02 12:21 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On Thu 02-06-16 21:08:57, Sergey Senozhatsky wrote: > Hello Michal, > > On (06/02/16 11:21), Michal Hocko wrote: > [..] > > > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds. > > > [ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > > > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > [ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > > > [ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > > > [ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > > > [ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > > > [ 2856.323068] Call Trace: > > > [ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98 > > > [ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > > > [ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d > > > [ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > > > [ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > > > [ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e > > > [ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > > > [ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5 > > > [ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > > > [ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 > > > [ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > > > [ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > > > [ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f > > > > down_write in the exit path is certainly not nice. It is hard to tell > > who is blocking the mmap_sem but it is clear that __khugepaged_exit > > waits for the khugepaged to release its mmap_sem. Do you hapen to have a > > trace of khugepaged? Note that the lock holder might be another writer > > which just hasn't pinned mm_users so khugepaged might be blocked on read > > lock as well. Or khugepaged might be just stuck somewhere... > > sorry, no. this is all I have. the kernel was compiled with almost no > debugging functionality enabled (no lockdep, no lock debug, nothing) > for zram performance testing purposes. > > I'll try to reproduce the problem; and give your patch some testing. > thanks. The patch will drop the down_write from the exit path which is, I believe the right thing to do, so it would paper over an existing problem when khugepaged could get stuck with mmap_sem held for read (if that is really a problem). So reproducing without the patch still makes some sense. Testing with the patch makes some sense as well, but I would like to hear from Andrea whether the approach is good because I am wondering why he hasn't done that before - it feels so much simpler than the current code. Anyway, thanks a lot for testing! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 12:21 ` Michal Hocko @ 2016-06-03 13:51 ` Andrea Arcangeli 2016-06-03 14:46 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Andrea Arcangeli @ 2016-06-03 13:51 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel On Thu, Jun 02, 2016 at 02:21:10PM +0200, Michal Hocko wrote: > Testing with the patch makes some sense as well, but I would like to > hear from Andrea whether the approach is good because I am wondering why > he hasn't done that before - it feels so much simpler than the current > code. The down_write in the exit path comes from __ksm_exit. If you don't like it there I'd suggest to also remove it from __ksm_exit. This is a proposed cleanup correct? The first thing that I can notice is that khugepaged_test_exit() then can only be called and provide the expected retval, after atomic_inc_not_zero(mm_users). Also note mmget_not_zero() should be used instead. However the code still uses khugepaged_test_exit in __khugepage_enter that won't increase the mm_users, so then the patch relaxes that check too much, albeit only for a debug check not strictly a bug. The cons of this change purely that it'll decrease the responsiveness in releasing the RAM of a killed task a bit. To me the fewer time we hold the mm_users the better and I don't see an obvious runtime improvement coming from this change. It's a bit simpler yes, but the down_write in the exit path is well understood, ksm does the same thing and it's in a slow path (it only happens if the mm that exited is the current one under scan by either ksmd or khugepaged, so normally the down_write is not executed in the exit path and the "mm" is collected right away both as a mm_users and mm_count). In short I think it's a tradeoff: pros) removes down_write in a slow path of the the mm exit which may simplify the code a bit, cons) it could increase the latency in freeing memory as result of a task exiting or being killed during the khugepaged scan, for example while the THP is being allocated. While compaction runs to allocate the THP in collapse_huge_page, if the task is killed currently the memory is released right away, without waiting for the allocation to succeed or fail. I don't see a big enough problem with the down_write in a slow path of khugepaged_exit to justify the increased latency in releasing memory. I was very happy by Oleg's patch reducing the mm_users holding of userfaultfd too. That was controlled by userland so it would only be an issue for non-cooperative usage which isn't upstream yet, and it was also much wider than this one would become with the patch applied, but I liked the direction. If prefer instead to remove the down_write, you probably could move the test_exit before the down_read/write to bail out before taking the lock: you don't need the mmap_sem to do test_exit anymore. The only reason the text_exit would remain in fact is just to reduce the latency of the memory freeing, it then becomes a voluntary preempt cond_resched() to release the memory to make a parallel ;), but unable to let the kernel free the memory while the THP allocation runs. Thanks, Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 13:51 ` Andrea Arcangeli @ 2016-06-03 14:46 ` Michal Hocko 2016-06-03 15:10 ` Andrea Arcangeli 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2016-06-03 14:46 UTC (permalink / raw) To: Andrea Arcangeli Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel On Fri 03-06-16 15:51:54, Andrea Arcangeli wrote: > On Thu, Jun 02, 2016 at 02:21:10PM +0200, Michal Hocko wrote: > > Testing with the patch makes some sense as well, but I would like to > > hear from Andrea whether the approach is good because I am wondering why > > he hasn't done that before - it feels so much simpler than the current > > code. > > The down_write in the exit path comes from __ksm_exit. If you don't > like it there I'd suggest to also remove it from __ksm_exit. I see > This is a proposed cleanup correct? yes this is a cleanup but also a robustness thing, see below. > The first thing that I can notice is that khugepaged_test_exit() then > can only be called and provide the expected retval, after > atomic_inc_not_zero(mm_users). Also note mmget_not_zero() should be > used instead. I didn't get used to mmget_not_zero yet, but true a helper would be better. [...] > To me the fewer time we hold the mm_users the better and I don't see > an obvious runtime improvement coming from this change. It's a bit > simpler yes, but the down_write in the exit path is well understood, > ksm does the same thing and it's in a slow path (it only happens if > the mm that exited is the current one under scan by either ksmd or > khugepaged, so normally the down_write is not executed in the exit > path and the "mm" is collected right away both as a mm_users and > mm_count). OK, I see your point. I wasn't aware that the mmap_sem is dropped before the allocation request. Then the original code indeed might get into exit_mmap earlier wrt. to the patch. The reason I dislike taking write lock in the __mmput is basically for the same reason you have pointed out. exit_mmap might be delayed for an unbounded amount of time. khugepaged resp. ksmd might be well behaved and release their read lock for costly operations or when they detect the mm is dead but it is hard to guarantee that all potential kernel users/drivers are behaving the same way. It is not really trivial to check whether we have such users (there are 100+ users outside of mm/ as per my quick git grep). The exit path should be as simple as possible with the amount of external dependencies reduced to the bare minimum. > In short I think it's a tradeoff: pros) removes down_write in a slow > path of the the mm exit which may simplify the code a bit, cons) it > could increase the latency in freeing memory as result of a task > exiting or being killed during the khugepaged scan, for example while > the THP is being allocated. While compaction runs to allocate the THP > in collapse_huge_page, if the task is killed currently the memory is > released right away, without waiting for the allocation to succeed or > fail. Are those latencies a real problem. The allocation itself shouldn't really take a long time. > I don't see a big enough problem with the down_write in a slow path of > khugepaged_exit to justify the increased latency in releasing memory. What do you think about the external dependencies mentioned above. Do you think this is a sufficient argument wrt. occasional higher latencies? [...] > If prefer instead to remove the down_write, you probably could move > the test_exit before the down_read/write to bail out before taking the > lock: you don't need the mmap_sem to do test_exit anymore. The only > reason the text_exit would remain in fact is just to reduce the > latency of the memory freeing, it then becomes a voluntary preempt > cond_resched() to release the memory to make a parallel ;), but unable > to let the kernel free the memory while the THP allocation runs. OK, I will think about that as well. Thanks! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 14:46 ` Michal Hocko @ 2016-06-03 15:10 ` Andrea Arcangeli 2016-06-07 7:34 ` Michal Hocko 2016-06-08 8:19 ` Vlastimil Babka 0 siblings, 2 replies; 27+ messages in thread From: Andrea Arcangeli @ 2016-06-03 15:10 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Hugh Dickins Hello Michal, CC'ed Hugh, On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote: > What do you think about the external dependencies mentioned above. Do > you think this is a sufficient argument wrt. occasional higher > latencies? It's a tradeoff and both latencies would be short and uncommon so it's hard to tell. There's also mmput_async for paths that may care about mmput latencies. Exit itself cannot use it, it's mostly for people taking the mm_users pin that may not want to wait for mmput to run. It also shouldn't happen that often, it's a slow path. The whole model inherited from KSM is to deliberately depend only on the mmap_sem + test_exit + mm_count, and never on mm_users, which to me in principle doesn't sound bad. I consider KSM version a "finegrined" implementation but I never thought it would be a problem to wait a bit in exit() in case the slow path hits. I thought it was more of a problem if exit() runs, the parent then start a new task but the memory wasn't freed yet. So I would suggest Hugh to share his view on the down_write/up_write that may temporarily block mmput (until the next test_exit bailout point) vs higher latency in reaching exit_mmap for a real exit(2) that would happen with the proposed change. Thanks! Andrea -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 15:10 ` Andrea Arcangeli @ 2016-06-07 7:34 ` Michal Hocko 2016-06-08 8:19 ` Vlastimil Babka 1 sibling, 0 replies; 27+ messages in thread From: Michal Hocko @ 2016-06-07 7:34 UTC (permalink / raw) To: Andrea Arcangeli Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Hugh Dickins On Fri 03-06-16 17:10:01, Andrea Arcangeli wrote: > Hello Michal, > > CC'ed Hugh, > > On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote: > > What do you think about the external dependencies mentioned above. Do > > you think this is a sufficient argument wrt. occasional higher > > latencies? > > It's a tradeoff and both latencies would be short and uncommon so it's > hard to tell. > > There's also mmput_async for paths that may care about mmput > latencies. Exit itself cannot use it, it's mostly for people taking > the mm_users pin that may not want to wait for mmput to run. It also > shouldn't happen that often, it's a slow path. > > The whole model inherited from KSM is to deliberately depend only on > the mmap_sem + test_exit + mm_count, and never on mm_users, which to > me in principle doesn't sound bad. I do agree that this model is quite clever (albeit convoluted). It just assumes that all other mmap_sem users are behaving the same. Now most in-kernel users will do get_task_mm() and then lock mmap_sem, but I haven't checked all of them and it is quite possible that some of those would like to optimize in a similar way and only increment mm_count. I might be too pessimistic about the out of mm code but I would feel much better if the exit path didn't depend on them. Anyway, if the current model sounds better I will definitely not insist on my patch. It is more of an idea for simplification than a fix for anything I have seen happening in the real life. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 15:10 ` Andrea Arcangeli 2016-06-07 7:34 ` Michal Hocko @ 2016-06-08 8:19 ` Vlastimil Babka 1 sibling, 0 replies; 27+ messages in thread From: Vlastimil Babka @ 2016-06-08 8:19 UTC (permalink / raw) To: Andrea Arcangeli, Michal Hocko Cc: Sergey Senozhatsky, Sergey Senozhatsky, Andrew Morton, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Hugh Dickins On 06/03/2016 05:10 PM, Andrea Arcangeli wrote: > Hello Michal, > > CC'ed Hugh, > > On Fri, Jun 03, 2016 at 04:46:00PM +0200, Michal Hocko wrote: >> What do you think about the external dependencies mentioned above. Do >> you think this is a sufficient argument wrt. occasional higher >> latencies? > > It's a tradeoff and both latencies would be short and uncommon so it's > hard to tell. Shouldn't it be possible to do a mmput() before the hugepage allocation, and then again mmget_not_zero()? That way it's no longer a tradeoff? > There's also mmput_async for paths that may care about mmput > latencies. Exit itself cannot use it, it's mostly for people taking > the mm_users pin that may not want to wait for mmput to run. It also > shouldn't happen that often, it's a slow path. > > The whole model inherited from KSM is to deliberately depend only on > the mmap_sem + test_exit + mm_count, and never on mm_users, which to > me in principle doesn't sound bad. I consider KSM version a > "finegrined" implementation but I never thought it would be a problem > to wait a bit in exit() in case the slow path hits. I thought it was > more of a problem if exit() runs, the parent then start a new task but > the memory wasn't freed yet. > > So I would suggest Hugh to share his view on the down_write/up_write > that may temporarily block mmput (until the next test_exit bailout > point) vs higher latency in reaching exit_mmap for a real exit(2) that > would happen with the proposed change. > > Thanks! > Andrea > > -- > To unsubscribe, send a message with 'unsubscribe linux-mm' in > the body to majordomo@kvack.org. For more info on Linux MM, > see: http://www.linux-mm.org/ . > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 9:21 ` Michal Hocko 2016-06-02 12:08 ` Sergey Senozhatsky @ 2016-06-03 7:15 ` Sergey Senozhatsky 2016-06-03 7:25 ` Michal Hocko 1 sibling, 1 reply; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-03 7:15 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli Hello, On (06/02/16 11:21), Michal Hocko wrote: [..] > @@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, > > collect_mm_slot(mm_slot); > } > + mmput(mm); > > return progress; > } this possibly sleeping mmput() is called from under the spin_lock(&khugepaged_mm_lock). there is also a trivial build fixup needed (move collect_mm_slot() before __khugepaged_exit()). it's quite hard to trigger the bug (somehow), so I can't follow up with more information as of now. -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 7:15 ` Sergey Senozhatsky @ 2016-06-03 7:25 ` Michal Hocko 2016-06-03 8:43 ` Sergey Senozhatsky 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2016-06-03 7:25 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On Fri 03-06-16 16:15:51, Sergey Senozhatsky wrote: > Hello, > > On (06/02/16 11:21), Michal Hocko wrote: > [..] > > @@ -2863,6 +2854,7 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, > > > > collect_mm_slot(mm_slot); > > } > > + mmput(mm); > > > > return progress; > > } > > this possibly sleeping mmput() is called from > under the spin_lock(&khugepaged_mm_lock). You are right. khugepaged_scan_mm_slot returns with the lock held. mmput_async would deal with it. > there is also a trivial build fixup needed > (move collect_mm_slot() before __khugepaged_exit()). will fix that. Thanks! > it's quite hard to trigger the bug (somehow), so I can't > follow up with more information as of now. Thanks anyway! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 7:25 ` Michal Hocko @ 2016-06-03 8:43 ` Sergey Senozhatsky 2016-06-03 9:55 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-03 8:43 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On (06/03/16 09:25), Michal Hocko wrote: > > it's quite hard to trigger the bug (somehow), so I can't > > follow up with more information as of now. either I did something very silly fixing up the patch, or the patch may be causing general protection faults on my system. RIP collect_mm_slot() + 0x42/0x84 khugepaged prepare_to_wait_event maybe_pmd_mkwrite kthread _raw_sin_unlock_irq ret_from_fork kthread_create_on_node collect_mm_slot() + 0x42/0x84 is 0000000000000328 <collect_mm_slot>: 328: 55 push %rbp 329: 48 89 e5 mov %rsp,%rbp 32c: 53 push %rbx 32d: 48 8b 5f 20 mov 0x20(%rdi),%rbx 331: 8b 43 48 mov 0x48(%rbx),%eax 334: ff c8 dec %eax 336: 7f 71 jg 3a9 <collect_mm_slot+0x81> 338: 48 8b 57 08 mov 0x8(%rdi),%rdx 33c: 48 85 d2 test %rdx,%rdx 33f: 74 1e je 35f <collect_mm_slot+0x37> 341: 48 8b 07 mov (%rdi),%rax 344: 48 85 c0 test %rax,%rax 347: 48 89 02 mov %rax,(%rdx) 34a: 74 04 je 350 <collect_mm_slot+0x28> 34c: 48 89 50 08 mov %rdx,0x8(%rax) 350: 48 c7 07 00 00 00 00 movq $0x0,(%rdi) 357: 48 c7 47 08 00 00 00 movq $0x0,0x8(%rdi) 35e: 00 35f: 48 8b 57 10 mov 0x10(%rdi),%rdx 363: 48 8b 47 18 mov 0x18(%rdi),%rax 367: 48 89 fe mov %rdi,%rsi 36a: 48 89 42 08 mov %rax,0x8(%rdx) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 36e: 48 89 10 mov %rdx,(%rax) 371: 48 b8 00 01 00 00 00 movabs $0xdead000000000100,%rax 378: 00 ad de 37b: 48 89 47 10 mov %rax,0x10(%rdi) 37f: 48 b8 00 02 00 00 00 movabs $0xdead000000000200,%rax 386: 00 ad de 389: 48 89 47 18 mov %rax,0x18(%rdi) 38d: 48 8b 3d 00 00 00 00 mov 0x0(%rip),%rdi # 394 <collect_mm_slot+0x6c> 394: e8 00 00 00 00 callq 399 <collect_mm_slot+0x71> 399: f0 ff 4b 4c lock decl 0x4c(%rbx) 39d: 74 02 je 3a1 <collect_mm_slot+0x79> 39f: eb 08 jmp 3a9 <collect_mm_slot+0x81> 3a1: 48 89 df mov %rbx,%rdi 3a4: e8 00 00 00 00 callq 3a9 <collect_mm_slot+0x81> 3a9: 5b pop %rbx 3aa: 5d pop %rbp 3ab: c3 retq which is list_del(&mm_slot->mm_node), I believe. I attached the patch (just in case). --- mm/huge_memory.c | 87 +++++++++++++++++++++++++------------------------------- 1 file changed, 39 insertions(+), 48 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 292cedd..1c82fa4 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1938,7 +1938,8 @@ static void insert_to_mm_slots_hash(struct mm_struct *mm, static inline int khugepaged_test_exit(struct mm_struct *mm) { - return atomic_read(&mm->mm_users) == 0; + /* the only pin is from khugepaged_scan_mm_slot */ + return atomic_read(&mm->mm_users) <= 1; } int __khugepaged_enter(struct mm_struct *mm) @@ -1950,8 +1951,6 @@ int __khugepaged_enter(struct mm_struct *mm) if (!mm_slot) return -ENOMEM; - /* __khugepaged_exit() must not run from under us */ - VM_BUG_ON_MM(khugepaged_test_exit(mm), mm); if (unlikely(test_and_set_bit(MMF_VM_HUGEPAGE, &mm->flags))) { free_mm_slot(mm_slot); return 0; @@ -1994,36 +1993,40 @@ int khugepaged_enter_vma_merge(struct vm_area_struct *vma, return 0; } -void __khugepaged_exit(struct mm_struct *mm) +static void collect_mm_slot(struct mm_slot *mm_slot) { - struct mm_slot *mm_slot; - int free = 0; + struct mm_struct *mm = mm_slot->mm; - spin_lock(&khugepaged_mm_lock); - mm_slot = get_mm_slot(mm); - if (mm_slot && khugepaged_scan.mm_slot != mm_slot) { + VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock)); + + if (khugepaged_test_exit(mm)) { + /* free mm_slot */ hash_del(&mm_slot->hash); list_del(&mm_slot->mm_node); - free = 1; - } - spin_unlock(&khugepaged_mm_lock); - if (free) { - clear_bit(MMF_VM_HUGEPAGE, &mm->flags); - free_mm_slot(mm_slot); - mmdrop(mm); - } else if (mm_slot) { /* - * This is required to serialize against - * khugepaged_test_exit() (which is guaranteed to run - * under mmap sem read mode). Stop here (after we - * return all pagetables will be destroyed) until - * khugepaged has finished working on the pagetables - * under the mmap_sem. + * Not strictly needed because the mm exited already. + * + * clear_bit(MMF_VM_HUGEPAGE, &mm->flags); */ - down_write(&mm->mmap_sem); - up_write(&mm->mmap_sem); + + /* khugepaged_mm_lock actually not necessary for the below */ + free_mm_slot(mm_slot); + mmdrop(mm); + } +} + +void __khugepaged_exit(struct mm_struct *mm) +{ + struct mm_slot *mm_slot; + + spin_lock(&khugepaged_mm_lock); + mm_slot = get_mm_slot(mm); + if (mm_slot) { + collect_mm_slot(mm_slot); + clear_bit(MMF_VM_HUGEPAGE, &mm->flags); } + spin_unlock(&khugepaged_mm_lock); } static void release_pte_page(struct page *page) @@ -2738,29 +2741,6 @@ out: return ret; } -static void collect_mm_slot(struct mm_slot *mm_slot) -{ - struct mm_struct *mm = mm_slot->mm; - - VM_BUG_ON(NR_CPUS != 1 && !spin_is_locked(&khugepaged_mm_lock)); - - if (khugepaged_test_exit(mm)) { - /* free mm_slot */ - hash_del(&mm_slot->hash); - list_del(&mm_slot->mm_node); - - /* - * Not strictly needed because the mm exited already. - * - * clear_bit(MMF_VM_HUGEPAGE, &mm->flags); - */ - - /* khugepaged_mm_lock actually not necessary for the below */ - free_mm_slot(mm_slot); - mmdrop(mm); - } -} - static unsigned int khugepaged_scan_mm_slot(unsigned int pages, struct page **hpage) __releases(&khugepaged_mm_lock) @@ -2782,6 +2762,16 @@ static unsigned int khugepaged_scan_mm_slot(unsigned int pages, khugepaged_scan.address = 0; khugepaged_scan.mm_slot = mm_slot; } + + /* + * Do not even try to do anything if the current mm is already + * dead. khugepaged_mm_lock will make sure only this or + * __khugepaged_exit does the unhasing. + */ + if (!atomic_inc_not_zero(&mm_slot->mm->mm_users)) { + collect_mm_slot(mm_slot); + return progress; + } spin_unlock(&khugepaged_mm_lock); mm = mm_slot->mm; @@ -2865,6 +2855,7 @@ breakouterloop_mmap_sem: collect_mm_slot(mm_slot); } + mmput_async(mm); return progress; } -- 2.9.0.rc1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 8:43 ` Sergey Senozhatsky @ 2016-06-03 9:55 ` Michal Hocko 2016-06-03 10:05 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2016-06-03 9:55 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On Fri 03-06-16 17:43:47, Sergey Senozhatsky wrote: > On (06/03/16 09:25), Michal Hocko wrote: > > > it's quite hard to trigger the bug (somehow), so I can't > > > follow up with more information as of now. > > either I did something very silly fixing up the patch, or the > patch may be causing general protection faults on my system. > > RIP collect_mm_slot() + 0x42/0x84 > khugepaged So is this really collect_mm_slot called directly from khugepaged or is some inlining going on there? > prepare_to_wait_event > maybe_pmd_mkwrite > kthread > _raw_sin_unlock_irq > ret_from_fork > kthread_create_on_node > > collect_mm_slot() + 0x42/0x84 is I guess that the problem is that I have missed that __khugepaged_exit doesn't clear the cached khugepaged_scan.mm_slot. Does the following on top fixes that? --- diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6574c62ca4a3..e6f4e6fd587a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2021,6 +2021,8 @@ void __khugepaged_exit(struct mm_struct *mm) spin_lock(&khugepaged_mm_lock); mm_slot = get_mm_slot(mm); if (mm_slot) { + if (khugepaged_scan.mm_slot == mm_slot) + khugepaged_scan.mm_slot = NULL; collect_mm_slot(mm_slot); clear_bit(MMF_VM_HUGEPAGE, &mm->flags); } -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 9:55 ` Michal Hocko @ 2016-06-03 10:05 ` Michal Hocko 2016-06-03 13:38 ` Sergey Senozhatsky 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2016-06-03 10:05 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On Fri 03-06-16 11:55:49, Michal Hocko wrote: > On Fri 03-06-16 17:43:47, Sergey Senozhatsky wrote: > > On (06/03/16 09:25), Michal Hocko wrote: > > > > it's quite hard to trigger the bug (somehow), so I can't > > > > follow up with more information as of now. > > > > either I did something very silly fixing up the patch, or the > > patch may be causing general protection faults on my system. > > > > RIP collect_mm_slot() + 0x42/0x84 > > khugepaged > > So is this really collect_mm_slot called directly from khugepaged or is > some inlining going on there? > > > prepare_to_wait_event > > maybe_pmd_mkwrite > > kthread > > _raw_sin_unlock_irq > > ret_from_fork > > kthread_create_on_node > > > > collect_mm_slot() + 0x42/0x84 is > > I guess that the problem is that I have missed that __khugepaged_exit > doesn't clear the cached khugepaged_scan.mm_slot. Does the following on > top fixes that? That wouldn't be sufficient after a closer look. We need to do the same from khugepaged_scan_mm_slot when atomic_inc_not_zero fails. So I guess it would be better to stick it into collect_mm_slot. Thanks for your testing! --- diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 6574c62ca4a3..0432581fb87c 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2011,6 +2011,9 @@ static void collect_mm_slot(struct mm_slot *mm_slot) /* khugepaged_mm_lock actually not necessary for the below */ free_mm_slot(mm_slot); mmdrop(mm); + + if (khugepaged_scan.mm_slot == mm_slot) + khugepaged_scan.mm_slot = NULL; } } -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 10:05 ` Michal Hocko @ 2016-06-03 13:38 ` Sergey Senozhatsky 2016-06-03 13:45 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-03 13:38 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On (06/03/16 12:05), Michal Hocko wrote: > > > RIP collect_mm_slot() + 0x42/0x84 > > > khugepaged > > > > So is this really collect_mm_slot called directly from khugepaged or is > > some inlining going on there? inlining I suppose. > > > prepare_to_wait_event > > > maybe_pmd_mkwrite > > > kthread > > > _raw_sin_unlock_irq > > > ret_from_fork > > > kthread_create_on_node > > > > > > collect_mm_slot() + 0x42/0x84 is > > > > I guess that the problem is that I have missed that __khugepaged_exit > > doesn't clear the cached khugepaged_scan.mm_slot. Does the following on > > top fixes that? > > That wouldn't be sufficient after a closer look. We need to do the same > from khugepaged_scan_mm_slot when atomic_inc_not_zero fails. So I guess > it would be better to stick it into collect_mm_slot. Michal, I'll try to test during the weekend (away from the affected box now), but in the worst case it can as late as next Thursday (gonna travel next week). -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 13:38 ` Sergey Senozhatsky @ 2016-06-03 13:45 ` Michal Hocko 2016-06-03 13:49 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2016-06-03 13:45 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Andrea Arcangeli On Fri 03-06-16 22:38:13, Sergey Senozhatsky wrote: [...] > Michal, I'll try to test during the weekend (away from the affected box > now), but in the worst case it can as late as next Thursday (gonna travel > next week). No problem. I would really like to hear from Andrea before we give this a serious try anyway. Thanks! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 13:45 ` Michal Hocko @ 2016-06-03 13:49 ` Michal Hocko 2016-06-04 7:51 ` Sergey Senozhatsky 0 siblings, 1 reply; 27+ messages in thread From: Michal Hocko @ 2016-06-03 13:49 UTC (permalink / raw) To: Sergey Senozhatsky, Andrea Arcangeli Cc: Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel On Fri 03-06-16 15:45:09, Michal Hocko wrote: > On Fri 03-06-16 22:38:13, Sergey Senozhatsky wrote: > [...] > > Michal, I'll try to test during the weekend (away from the affected box > > now), but in the worst case it can as late as next Thursday (gonna travel > > next week). > > No problem. I would really like to hear from Andrea before we give this > a serious try anyway. And just for an easier review, here is what I have right now: --- ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 13:49 ` Michal Hocko @ 2016-06-04 7:51 ` Sergey Senozhatsky 2016-06-06 8:39 ` Michal Hocko 0 siblings, 1 reply; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-04 7:51 UTC (permalink / raw) To: Michal Hocko Cc: Sergey Senozhatsky, Andrea Arcangeli, Sergey Senozhatsky, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel Hello, On (06/03/16 15:49), Michal Hocko wrote: > __khugepaged_exit is called during the final __mmput and it employs a > complex synchronization dances to make sure it doesn't race with the > khugepaged which might be scanning this mm at the same time. This is > all caused by the fact that khugepaged doesn't pin mm_users. Things > would simplify considerably if we simply check the mm at > khugepaged_scan_mm_slot and if mm_users was already 0 then we know it > is dead and we can unhash the mm_slot and move on to another one. This > will also guarantee that __khugepaged_exit cannot race with khugepaged > and so we can free up the slot if it is still hashed. > > Signed-off-by: Michal Hocko <mhocko@suse.com> with this patch and http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem-fix-2.patch I saw no problems during my tests (well, may be didn't test hard enough). -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-04 7:51 ` Sergey Senozhatsky @ 2016-06-06 8:39 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2016-06-06 8:39 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Sergey Senozhatsky, Andrea Arcangeli, Andrew Morton, Vlastimil Babka, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel On Sat 04-06-16 16:51:14, Sergey Senozhatsky wrote: > Hello, > > On (06/03/16 15:49), Michal Hocko wrote: > > __khugepaged_exit is called during the final __mmput and it employs a > > complex synchronization dances to make sure it doesn't race with the > > khugepaged which might be scanning this mm at the same time. This is > > all caused by the fact that khugepaged doesn't pin mm_users. Things > > would simplify considerably if we simply check the mm at > > khugepaged_scan_mm_slot and if mm_users was already 0 then we know it > > is dead and we can unhash the mm_slot and move on to another one. This > > will also guarantee that __khugepaged_exit cannot race with khugepaged > > and so we can free up the slot if it is still hashed. > > > > Signed-off-by: Michal Hocko <mhocko@suse.com> > > with this patch and > http://ozlabs.org/~akpm/mmotm/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem-fix-2.patch > > I saw no problems during my tests (well, may be didn't test hard > enough). Thanks for the testing! -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky 2016-06-02 9:21 ` Michal Hocko @ 2016-06-02 13:24 ` Vlastimil Babka 2016-06-02 18:58 ` Ebru Akagunduz 2016-06-03 12:28 ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz 1 sibling, 2 replies; 27+ messages in thread From: Vlastimil Babka @ 2016-06-02 13:24 UTC (permalink / raw) To: Sergey Senozhatsky, Andrew Morton, Ebru Akagunduz Cc: Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, linux-mm, linux-next, linux-kernel, Rik van Riel, Andrea Arcangeli [+CC's] On 06/02/2016 03:48 AM, Sergey Senozhatsky wrote: > On (06/01/16 13:11), Stephen Rothwell wrote: >> Hi all, >> >> Changes since 20160531: >> >> My fixes tree contains: >> >> of: silence warnings due to max() usage >> >> The arm tree gained a conflict against Linus' tree. >> >> Non-merge commits (relative to Linus' tree): 1100 >> 936 files changed, 38159 insertions(+), 17475 deletions(-) > > Hello, > > the cc1 process ended up in DN state during kernel -j4 compilation. > > ... > [ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds. > [ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > [ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > [ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > [ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > [ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > [ 2856.323068] Call Trace: > [ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98 > [ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > [ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d > [ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > [ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > [ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e > [ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > [ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5 > [ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > [ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 > [ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > [ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > [ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f > > [ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds. > [ 2877.322858] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > [ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 2877.322861] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > [ 2877.322865] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > [ 2877.322867] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > [ 2877.322867] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > [ 2877.322870] Call Trace: > [ 2877.322875] [<ffffffff81441e33>] schedule+0x83/0x98 > [ 2877.322878] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > [ 2877.322881] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d > [ 2877.322884] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > [ 2877.322885] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > [ 2877.322887] [<ffffffff81443630>] down_write+0x1f/0x2e > [ 2877.322890] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > [ 2877.322892] [<ffffffff8103702a>] mmput+0x29/0xc5 > [ 2877.322894] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > [ 2877.322896] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 > [ 2877.322898] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > [ 2877.322900] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > [ 2877.322902] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f I think it's this patch: http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch Some parts of the code in collapse_huge_page() that were under down_write(mmap_sem) are under down_read() after the patch. But there's "goto out" which continues via "goto out_up_write" which does up_write(mmap_sem) so there's an imbalance. One path seems to go via both up_read() and up_write(). I can imagine this can cause a stuck down_write() among other things? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 13:24 ` Vlastimil Babka @ 2016-06-02 18:58 ` Ebru Akagunduz 2016-06-03 1:00 ` Sergey Senozhatsky 2016-06-03 12:28 ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz 1 sibling, 1 reply; 27+ messages in thread From: Ebru Akagunduz @ 2016-06-02 18:58 UTC (permalink / raw) To: vbabka, sergey.senozhatsky.work, akpm Cc: mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel, aarcange On Thu, Jun 02, 2016 at 03:24:05PM +0200, Vlastimil Babka wrote: > [+CC's] > > On 06/02/2016 03:48 AM, Sergey Senozhatsky wrote: > >On (06/01/16 13:11), Stephen Rothwell wrote: > >>Hi all, > >> > >>Changes since 20160531: > >> > >>My fixes tree contains: > >> > >> of: silence warnings due to max() usage > >> > >>The arm tree gained a conflict against Linus' tree. > >> > >>Non-merge commits (relative to Linus' tree): 1100 > >> 936 files changed, 38159 insertions(+), 17475 deletions(-) > > > >Hello, > > > >the cc1 process ended up in DN state during kernel -j4 compilation. > > > >... > >[ 2856.323052] INFO: task cc1:4582 blocked for more than 21 seconds. > >[ 2856.323055] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > >[ 2856.323056] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >[ 2856.323059] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > >[ 2856.323062] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > >[ 2856.323065] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > >[ 2856.323067] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > >[ 2856.323068] Call Trace: > >[ 2856.323074] [<ffffffff81441e33>] schedule+0x83/0x98 > >[ 2856.323077] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > >[ 2856.323080] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d > >[ 2856.323083] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > >[ 2856.323084] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > >[ 2856.323086] [<ffffffff81443630>] down_write+0x1f/0x2e > >[ 2856.323089] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > >[ 2856.323091] [<ffffffff8103702a>] mmput+0x29/0xc5 > >[ 2856.323093] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > >[ 2856.323095] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 > >[ 2856.323097] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > >[ 2856.323099] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > >[ 2856.323101] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f > > > >[ 2877.322853] INFO: task cc1:4582 blocked for more than 21 seconds. > >[ 2877.322858] Not tainted 4.7.0-rc1-next-20160601-dbg-00012-g52c180e-dirty #453 > >[ 2877.322858] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > >[ 2877.322861] cc1 D ffff880057e9fd78 0 4582 4575 0x00000000 > >[ 2877.322865] ffff880057e9fd78 ffff880057e08000 ffff880057e9fd90 ffff880057ea0000 > >[ 2877.322867] ffff88005dc3dc68 ffffffff00000001 ffff880057e09500 ffff88005dc3dc80 > >[ 2877.322867] ffff880057e9fd90 ffffffff81441e33 ffff88005dc3dc68 ffff880057e9fe00 > >[ 2877.322870] Call Trace: > >[ 2877.322875] [<ffffffff81441e33>] schedule+0x83/0x98 > >[ 2877.322878] [<ffffffff81443d9b>] rwsem_down_write_failed+0x18e/0x1d3 > >[ 2877.322881] [<ffffffff810a87cf>] ? unlock_page+0x2b/0x2d > >[ 2877.322884] [<ffffffff811bdb77>] call_rwsem_down_write_failed+0x17/0x30 > >[ 2877.322885] [<ffffffff811bdb77>] ? call_rwsem_down_write_failed+0x17/0x30 > >[ 2877.322887] [<ffffffff81443630>] down_write+0x1f/0x2e > >[ 2877.322890] [<ffffffff810ea4f3>] __khugepaged_exit+0x104/0x11a > >[ 2877.322892] [<ffffffff8103702a>] mmput+0x29/0xc5 > >[ 2877.322894] [<ffffffff8103bbd8>] do_exit+0x34c/0x894 > >[ 2877.322896] [<ffffffff8102f9e0>] ? __do_page_fault+0x2f7/0x399 > >[ 2877.322898] [<ffffffff8103c188>] do_group_exit+0x3c/0x98 > >[ 2877.322900] [<ffffffff8103c1f3>] SyS_exit_group+0xf/0xf > >[ 2877.322902] [<ffffffff81444cdb>] entry_SYSCALL_64_fastpath+0x13/0x8f > > I think it's this patch: > > http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch > > Some parts of the code in collapse_huge_page() that were under > down_write(mmap_sem) are under down_read() after the patch. But > there's "goto out" which continues via "goto out_up_write" which > does up_write(mmap_sem) so there's an imbalance. One path seems to > go via both up_read() and up_write(). I can imagine this can cause a > stuck down_write() among other things? Recently, I realized the same imbalance, it is an obvious inconsistency. I don't know, this issue can be related with mine. I'll send a fix patch. Kind regards. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-02 18:58 ` Ebru Akagunduz @ 2016-06-03 1:00 ` Sergey Senozhatsky 2016-06-03 1:29 ` Sergey Senozhatsky 0 siblings, 1 reply; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-03 1:00 UTC (permalink / raw) To: Ebru Akagunduz Cc: Vlastimil Babka, sergey.senozhatsky.work, Andrew Morton, Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli, Rik van Riel, linux-mm, linux-next, linux-kernel On (06/02/16 21:58), Ebru Akagunduz wrote: [..] > > I think it's this patch: > > > > http://ozlabs.org/~akpm/mmots/broken-out/mm-thp-make-swapin-readahead-under-down_read-of-mmap_sem.patch > > > > Some parts of the code in collapse_huge_page() that were under > > down_write(mmap_sem) are under down_read() after the patch. But > > there's "goto out" which continues via "goto out_up_write" which > > does up_write(mmap_sem) so there's an imbalance. One path seems to > > go via both up_read() and up_write(). I can imagine this can cause a > > stuck down_write() among other things? > Recently, I realized the same imbalance, it is an obvious > inconsistency. I don't know, this issue can be related with > mine. I'll send a fix patch. a good find by Vlastimil. Ebru, can you also re-visit __collapse_huge_page_swapin()? it's called from collapse_huge_page() under the down_read(&mm->mmap_sem), is there any reason to do the nested down_read(&mm->mmap_sem)? collapse_huge_page() ... down_read(&mm->mmap_sem); result = hugepage_vma_revalidate(mm, vma, address); if (result) goto out; pmd = mm_find_pmd(mm, address); if (!pmd) { result = SCAN_PMD_NULL; goto out; } if (allocstall == curr_allocstall && swap != 0) { if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) { { : if (ret & VM_FAULT_RETRY) { : down_read(&mm->mmap_sem); : ^^^^^^^^^ : if (hugepage_vma_revalidate(mm, vma, address)) : return false; : } } up_read(&mm->mmap_sem); goto out; } } up_read(&mm->mmap_sem); so if __collapse_huge_page_swapin() retruns true we have: - down_read() twice, up_read() once? the locking rules here are a bit confusing. (I didn't have my morning coffee yet). -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 1:00 ` Sergey Senozhatsky @ 2016-06-03 1:29 ` Sergey Senozhatsky 2016-06-03 4:14 ` Sergey Senozhatsky 0 siblings, 1 reply; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-03 1:29 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Ebru Akagunduz, Vlastimil Babka, Andrew Morton, Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli, Rik van Riel, linux-mm, linux-next, linux-kernel On (06/03/16 10:00), Sergey Senozhatsky wrote: > a good find by Vlastimil. > > Ebru, can you also re-visit __collapse_huge_page_swapin()? it's called > from collapse_huge_page() under the down_read(&mm->mmap_sem), is there > any reason to do the nested down_read(&mm->mmap_sem)? > > collapse_huge_page() > ... > down_read(&mm->mmap_sem); > result = hugepage_vma_revalidate(mm, vma, address); > if (result) > goto out; > > pmd = mm_find_pmd(mm, address); > if (!pmd) { > result = SCAN_PMD_NULL; > goto out; > } > > if (allocstall == curr_allocstall && swap != 0) { > if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) { > { > : if (ret & VM_FAULT_RETRY) { > : down_read(&mm->mmap_sem); > : ^^^^^^^^^ oh... it's in a loop for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE; pte++, _address += PAGE_SIZE) { ret = do_swap_page() if (ret & VM_FAULT_RETRY) { down_read(&mm->mmap_sem); ^^^^^^^^^ ... } } so there can be multiple sem->count++ in __collapse_huge_page_swapin(), and you don't know how many sem->count-- you need to do later? is this correct or am I hallucinating? -ss > : if (hugepage_vma_revalidate(mm, vma, address)) > : return false; > : } > } > > up_read(&mm->mmap_sem); > goto out; > } > } > > up_read(&mm->mmap_sem); > > > > so if __collapse_huge_page_swapin() retruns true we have: > - down_read() twice, up_read() once? > > the locking rules here are a bit confusing. (I didn't have my morning coffee yet). > > -ss > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup 2016-06-03 1:29 ` Sergey Senozhatsky @ 2016-06-03 4:14 ` Sergey Senozhatsky 0 siblings, 0 replies; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-03 4:14 UTC (permalink / raw) To: Sergey Senozhatsky Cc: Ebru Akagunduz, Vlastimil Babka, Andrew Morton, Michal Hocko, Kirill A. Shutemov, Stephen Rothwell, Andrea Arcangeli, Rik van Riel, linux-mm, linux-next, linux-kernel On (06/03/16 10:29), Sergey Senozhatsky wrote: > > if (allocstall == curr_allocstall && swap != 0) { > > if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) { > > { > > : if (ret & VM_FAULT_RETRY) { > > : down_read(&mm->mmap_sem); > > : ^^^^^^^^^ > > oh... it's in a loop > > for (_address = address; _address < address + HPAGE_PMD_NR*PAGE_SIZE; > pte++, _address += PAGE_SIZE) { > ret = do_swap_page() > if (ret & VM_FAULT_RETRY) { > down_read(&mm->mmap_sem); > ^^^^^^^^^ > ... > } > } > > so there can be multiple sem->count++ in __collapse_huge_page_swapin(), > and you don't know how many sem->count-- you need to do later? is this > correct or am I hallucinating? No, I was wrong, sorry for the noise. it's getting unlocked in __collapse_huge_page_swapin() do_swap_page() lock_page_or_retry() if (flags & FAULT_FLAG_ALLOW_RETRY) up_read(&mm->mmap_sem); return VM_FAULT_RETRY -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page 2016-06-02 13:24 ` Vlastimil Babka 2016-06-02 18:58 ` Ebru Akagunduz @ 2016-06-03 12:28 ` Ebru Akagunduz 2016-06-06 13:05 ` Vlastimil Babka 1 sibling, 1 reply; 27+ messages in thread From: Ebru Akagunduz @ 2016-06-03 12:28 UTC (permalink / raw) To: akpm Cc: vbabka, sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel, aarcange, Ebru Akagunduz After creating revalidate vma function, locking inconsistency occured due to directing the code path to wrong label. This patch directs to correct label and fix the inconsistency. Related commit that caused inconsistency: http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0 Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> --- mm/huge_memory.c | 14 ++++++++++---- 1 file changed, 10 insertions(+), 4 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 292cedd..8043d91 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -2493,13 +2493,18 @@ static void collapse_huge_page(struct mm_struct *mm, curr_allocstall = sum_vm_event(ALLOCSTALL); down_read(&mm->mmap_sem); result = hugepage_vma_revalidate(mm, vma, address); - if (result) - goto out; + if (result) { + mem_cgroup_cancel_charge(new_page, memcg, true); + up_read(&mm->mmap_sem); + goto out_nolock; + } pmd = mm_find_pmd(mm, address); if (!pmd) { result = SCAN_PMD_NULL; - goto out; + mem_cgroup_cancel_charge(new_page, memcg, true); + up_read(&mm->mmap_sem); + goto out_nolock; } /* @@ -2513,8 +2518,9 @@ static void collapse_huge_page(struct mm_struct *mm, * label out. Continuing to collapse causes inconsistency. */ if (!__collapse_huge_page_swapin(mm, vma, address, pmd)) { + mem_cgroup_cancel_charge(new_page, memcg, true); up_read(&mm->mmap_sem); - goto out; + goto out_nolock; } } -- 1.9.1 -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page 2016-06-03 12:28 ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz @ 2016-06-06 13:05 ` Vlastimil Babka 2016-06-09 3:51 ` Sergey Senozhatsky 0 siblings, 1 reply; 27+ messages in thread From: Vlastimil Babka @ 2016-06-06 13:05 UTC (permalink / raw) To: Ebru Akagunduz, akpm Cc: sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel, aarcange On 06/03/2016 02:28 PM, Ebru Akagunduz wrote: > After creating revalidate vma function, locking inconsistency occured > due to directing the code path to wrong label. This patch directs > to correct label and fix the inconsistency. > > Related commit that caused inconsistency: > http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0 > > Signed-off-by: Ebru Akagunduz <ebru.akagunduz@gmail.com> I think this does fix the inconsistency, thanks. But looking at collapse_huge_page() as of latest -next, I wonder if there's another problem: pmd = mm_find_pmd(mm, address); ... up_read(&mm->mmap_sem); down_write(&mm->mmap_sem); hugepage_vma_revalidate(mm, address); ... pte = pte_offset_map(pmd, address); What guarantees that 'pmd' is still valid? Vlastimil -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page 2016-06-06 13:05 ` Vlastimil Babka @ 2016-06-09 3:51 ` Sergey Senozhatsky 0 siblings, 0 replies; 27+ messages in thread From: Sergey Senozhatsky @ 2016-06-09 3:51 UTC (permalink / raw) To: Vlastimil Babka Cc: Ebru Akagunduz, akpm, sergey.senozhatsky.work, mhocko, kirill.shutemov, sfr, linux-mm, linux-next, linux-kernel, riel, aarcange On (06/06/16 15:05), Vlastimil Babka wrote: [..] > I think this does fix the inconsistency, thanks. > > But looking at collapse_huge_page() as of latest -next, I wonder if there's > another problem: > > pmd = mm_find_pmd(mm, address); > ... > up_read(&mm->mmap_sem); > down_write(&mm->mmap_sem); > hugepage_vma_revalidate(mm, address); > ... > pte = pte_offset_map(pmd, address); > > What guarantees that 'pmd' is still valid? the same question applied to __collapse_huge_page_swapin(), I think. __collapse_huge_page_swapin(pmd) pte = pte_offset_map(pmd, address); do_swap_page(mm, vma, _address, pte, pmd...) up_read(&mm->mmap_sem); down_read(&mm->mmap_sem); pte = pte_offset_map(pmd, _address); -ss -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2016-06-09 3:51 UTC | newest] Thread overview: 27+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- [not found] <20160601131122.7dbb0a65@canb.auug.org.au> 2016-06-02 1:48 ` [linux-next: Tree for Jun 1] __khugepaged_exit rwsem_down_write_failed lockup Sergey Senozhatsky 2016-06-02 9:21 ` Michal Hocko 2016-06-02 12:08 ` Sergey Senozhatsky 2016-06-02 12:21 ` Michal Hocko 2016-06-03 13:51 ` Andrea Arcangeli 2016-06-03 14:46 ` Michal Hocko 2016-06-03 15:10 ` Andrea Arcangeli 2016-06-07 7:34 ` Michal Hocko 2016-06-08 8:19 ` Vlastimil Babka 2016-06-03 7:15 ` Sergey Senozhatsky 2016-06-03 7:25 ` Michal Hocko 2016-06-03 8:43 ` Sergey Senozhatsky 2016-06-03 9:55 ` Michal Hocko 2016-06-03 10:05 ` Michal Hocko 2016-06-03 13:38 ` Sergey Senozhatsky 2016-06-03 13:45 ` Michal Hocko 2016-06-03 13:49 ` Michal Hocko 2016-06-04 7:51 ` Sergey Senozhatsky 2016-06-06 8:39 ` Michal Hocko 2016-06-02 13:24 ` Vlastimil Babka 2016-06-02 18:58 ` Ebru Akagunduz 2016-06-03 1:00 ` Sergey Senozhatsky 2016-06-03 1:29 ` Sergey Senozhatsky 2016-06-03 4:14 ` Sergey Senozhatsky 2016-06-03 12:28 ` [PATCH] mm, thp: fix locking inconsistency in collapse_huge_page Ebru Akagunduz 2016-06-06 13:05 ` Vlastimil Babka 2016-06-09 3:51 ` Sergey Senozhatsky
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).