* [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-17 8:05 UTC
To: Andrew Morton
Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro,
    riel, linux-mm, linux-kernel, Wanpeng Li

objrmap doesn't work for nonlinear VMAs because the assumption that
offset-into-file correlates with offset-into-virtual-addresses does not
hold. Hence what try_to_unmap_cluster() does is a mini "virtual scan" of
each nonlinear VMA which maps the file to which the target page belongs.
If the VMA is locked, mlock the pages in the cluster rather than unmapping
them. However, only the check page is guaranteed to be locked; the other
pages in the scan window may not be. Fix the BUG by asserting the page
lock only for the check page, rather than for every page in the virtual
scan window of a nonlinear VMA.

[ 253.869145] kernel BUG at mm/mlock.c:82!
[ 253.869549] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 253.870098] Dumping ftrace buffer:
[ 253.870098]    (ftrace buffer empty)
[ 253.870098] Modules linked in:
[ 253.870098] CPU: 10 PID: 9162 Comm: trinity-child75 Tainted: G W 3.13.0-rc4-next-20131216-sasha-00011-g5f105ec-dirty #4137
[ 253.873310] task: ffff8800c98cb000 ti: ffff8804d34e8000 task.ti: ffff8804d34e8000
[ 253.873310] RIP: 0010:[<ffffffff81281f28>] [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0
[ 253.873310] RSP: 0000:ffff8804d34e99e8 EFLAGS: 00010246
[ 253.873310] RAX: 006fffff8038002c RBX: ffffea00474944c0 RCX: ffff880807636000
[ 253.873310] RDX: ffffea0000000000 RSI: 00007f17a9bca000 RDI: ffffea00474944c0
[ 253.873310] RBP: ffff8804d34e99f8 R08: ffff880807020000 R09: 0000000000000000
[ 253.873310] R10: 0000000000000001 R11: 0000000000002000 R12: 00007f17a9bca000
[ 253.873310] R13: ffffea00474944c0 R14: 00007f17a9be0000 R15: ffff880807020000
[ 253.873310] FS: 00007f17aa31a700(0000) GS:ffff8801c9c00000(0000) knlGS:0000000000000000
[ 253.873310] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 253.873310] CR2: 00007f17a94fa000 CR3: 00000004d3b02000 CR4: 00000000000006e0
[ 253.873310] DR0: 00007f17a74ca000 DR1: 0000000000000000 DR2: 0000000000000000
[ 253.873310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 253.873310] Stack:
[ 253.873310]  0000000b3de28067 ffff880b3de28e50 ffff8804d34e9aa8 ffffffff8128bc31
[ 253.873310]  0000000000000301 ffffea0011850220 ffff8809a4039000 ffffea0011850238
[ 253.873310]  ffff8804d34e9aa8 ffff880807636060 0000000000000001 ffff880807636348
[ 253.873310] Call Trace:
[ 253.873310]  [<ffffffff8128bc31>] try_to_unmap_cluster+0x1c1/0x340
[ 253.873310]  [<ffffffff8128c60a>] try_to_unmap_file+0x20a/0x2e0
[ 253.873310]  [<ffffffff8128c7b3>] try_to_unmap+0x73/0x90
[ 253.873310]  [<ffffffff812b526d>] __unmap_and_move+0x18d/0x250
[ 253.873310]  [<ffffffff812b53e9>] unmap_and_move+0xb9/0x180
[ 253.873310]  [<ffffffff812b559b>] migrate_pages+0xeb/0x2f0
[ 253.873310]  [<ffffffff812a0660>] ? queue_pages_pte_range+0x1a0/0x1a0
[ 253.873310]  [<ffffffff812a193c>] migrate_to_node+0x9c/0xc0
[ 253.873310]  [<ffffffff812a30b8>] do_migrate_pages+0x1b8/0x240
[ 253.873310]  [<ffffffff812a3456>] SYSC_migrate_pages+0x316/0x380
[ 253.873310]  [<ffffffff812a31ec>] ? SYSC_migrate_pages+0xac/0x380
[ 253.873310]  [<ffffffff811763c6>] ? vtime_account_user+0x96/0xb0
[ 253.873310]  [<ffffffff812a34ce>] SyS_migrate_pages+0xe/0x10
[ 253.873310]  [<ffffffff843c4990>] tracesys+0xdd/0xe2
[ 253.873310] Code: 0f 1f 00 65 48 ff 04 25 10 25 1d 00 48 83 c4 08 5b c9 c3 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 07 48 89 fb a8 01 75 10 <0f> 0b 66 0f 1f 44 00 00 eb fe 66 0f 1f 44 00 00 f0 0f ba 2f 15
[ 253.873310] RIP [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0
[ 253.873310] RSP <ffff8804d34e99e8>
[ 253.904194] ---[ end trace be59c4a7f8edab3f ]---

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
---
 mm/huge_memory.c | 2 +-
 mm/internal.h    | 4 ++--
 mm/ksm.c         | 2 +-
 mm/memory.c      | 2 +-
 mm/mlock.c       | 5 +++--
 mm/rmap.c        | 4 ++--
 6 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33a5dc4..7a15b04 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1264,7 +1264,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		if (page->mapping && trylock_page(page)) {
 			lru_add_drain();
 			if (page->mapping)
-				mlock_vma_page(page);
+				mlock_vma_page(page, true);
 			unlock_page(page);
 		}
 	}
diff --git a/mm/internal.h b/mm/internal.h
index a85a3ab..4ea2d4e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -192,7 +192,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma,
 /*
  * must be called with vma's mmap_sem held for read or write, and page locked.
  */
-extern void mlock_vma_page(struct page *page);
+extern void mlock_vma_page(struct page *page, bool check_page);
 extern unsigned int munlock_vma_page(struct page *page);

 /*
@@ -236,7 +236,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
 	return 0;
 }
 static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_page(struct page *page, bool check_page) { }
 static inline void mlock_migrate_page(struct page *new, struct page *old) { }

 #endif /* !CONFIG_MMU */
diff --git a/mm/ksm.c b/mm/ksm.c
index 175fff7..ec36f04 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1064,7 +1064,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 		if (!PageMlocked(kpage)) {
 			unlock_page(page);
 			lock_page(kpage);
-			mlock_vma_page(kpage);
+			mlock_vma_page(kpage, true);
 			page = kpage;	/* for final unlock */
 		}
 	}
diff --git a/mm/memory.c b/mm/memory.c
index cf6098c..a41df6a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1602,7 +1602,7 @@ split_fallthrough:
 			 * know the page is still mapped, we don't even
 			 * need to check for file-cache page truncation.
 			 */
-			mlock_vma_page(page);
+			mlock_vma_page(page, true);
 			unlock_page(page);
 		}
 	}
diff --git a/mm/mlock.c b/mm/mlock.c
index d480cd6..c395ec5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -77,9 +77,10 @@ void clear_page_mlock(struct page *page)
  * Mark page as mlocked if not already.
  * If page on LRU, isolate and putback to move to unevictable list.
  */
-void mlock_vma_page(struct page *page)
+void mlock_vma_page(struct page *page, bool check_page)
 {
-	BUG_ON(!PageLocked(page));
+	if (check_page)
+		BUG_ON(!PageLocked(page));

 	if (!TestSetPageMlocked(page)) {
 		mod_zone_page_state(page_zone(page), NR_MLOCK,
diff --git a/mm/rmap.c b/mm/rmap.c
index 55c8b8d..79d456f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1297,7 +1297,7 @@ out_mlock:
 	 */
 	if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
 		if (vma->vm_flags & VM_LOCKED) {
-			mlock_vma_page(page);
+			mlock_vma_page(page, true);
 			ret = SWAP_MLOCK;
 		}
 		up_read(&vma->vm_mm->mmap_sem);
@@ -1385,7 +1385,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		BUG_ON(!page || PageAnon(page));

 		if (locked_vma) {
-			mlock_vma_page(page);	/* no-op if already mlocked */
+			mlock_vma_page(page, page == check_page); /* no-op if already mlocked */
 			if (page == check_page)
 				ret = SWAP_MLOCK;
 			continue;	/* don't unmap */
--
1.8.3.2
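Editorial note: for orientation, here is a simplified sketch of the scan in
try_to_unmap_cluster() that trips the assertion, reconstructed from the
mm/rmap.c hunk above. It approximates the 3.13-era code and is not verbatim
source. Only check_page arrives here with its page lock held (taken by the
caller), yet mlock_vma_page(), which asserts PageLocked(), is called on
every page found in the window of a VM_LOCKED nonlinear VMA:

	for (; address < end; pte++, address += PAGE_SIZE) {
		struct page *page;

		if (!pte_present(*pte))
			continue;
		page = vm_normal_page(vma, address, *pte);
		BUG_ON(!page || PageAnon(page));

		if (locked_vma) {
			/*
			 * Any page here other than check_page may be
			 * unlocked, so the BUG_ON(!PageLocked(page))
			 * inside mlock_vma_page() can fire.
			 */
			mlock_vma_page(page);
			if (page == check_page)
				ret = SWAP_MLOCK;
			continue;	/* don't unmap */
		}
		/* ...otherwise the pte is unmapped here... */
	}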
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 3:14 UTC
To: Andrew Morton
Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro,
    riel, David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

On Tue, Dec 17, 2013 at 04:05:50PM +0800, Wanpeng Li wrote:

Another alternative method to fix this bug.
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 3:16 UTC
To: Andrew Morton
Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro,
    riel, David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 105 bytes --]

On Tue, Dec 17, 2013 at 04:05:50PM +0800, Wanpeng Li wrote:

Another alternative method to fix this bug.

[-- Attachment #2: 0001-2.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 3:23 UTC
To: Andrew Morton
Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro,
    riel, David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 199 bytes --]

On Wed, Dec 18, 2013 at 11:16:19AM +0800, Wanpeng Li wrote:
>On Tue, Dec 17, 2013 at 04:05:50PM +0800, Wanpeng Li wrote:
>
>Another alternative method to fix this bug.

Sorry for the wrong version.

[-- Attachment #2: 0001-2.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Sasha Levin @ 2013-12-18 3:32 UTC
To: Wanpeng Li, Andrew Morton
Cc: Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel,
    David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

On 12/17/2013 10:23 PM, Wanpeng Li wrote:
> -			mlock_vma_page(page);	/* no-op if already mlocked */
> -			if (page == check_page)
> +			if (page != check_page && trylock_page(page)) {
> +				mlock_vma_page(page);	/* no-op if already mlocked */
> +				unlock_page(page);
> +			} else if (page == check_page) {
> +				mlock_vma_page(page);	/* no-op if already mlocked */
>  				ret = SWAP_MLOCK;
> +			}

Previously, if page != check_page and the page was locked, we'd call
mlock_vma_page() anyway. With this change, we don't. In fact, we'll just
skip that entire block without doing anything.

If that's something that's never supposed to happen, can we add a

	VM_BUG_ON(page != check_page && PageLocked(page))

just to cover this new code path?

Thanks,
Sasha
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 4:12 UTC
To: Sasha Levin, Andrew Morton
Cc: Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel,
    David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 888 bytes --]

Hi Sasha,

On Tue, Dec 17, 2013 at 10:32:53PM -0500, Sasha Levin wrote:
>On 12/17/2013 10:23 PM, Wanpeng Li wrote:
[... quoted diff snipped; same as the previous message ...]
>
>Previously, if page != check_page and the page was locked, we'd call
>mlock_vma_page() anyway. With this change, we don't. In fact, we'll just
>skip that entire block without doing anything.

Thanks for pointing out. ;-)

>If that's something that's never supposed to happen, can we add a
>
>	VM_BUG_ON(page != check_page && PageLocked(page))
>
>just to cover this new code path?

How about this one?

[-- Attachment #2: 0001-3.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Vlastimil Babka @ 2013-12-18 9:11 UTC
To: Wanpeng Li, Sasha Levin, Andrew Morton
Cc: Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel,
    David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins,
    Johannes Weiner, linux-mm, linux-kernel

On 12/18/2013 05:12 AM, Wanpeng Li wrote:
> Hi Sasha,
[...]
> How about this one?
>
> 0001-3.patch
>
> From eab57d94c82fb3ab74b607a3ede0f5ce765ff2cc Mon Sep 17 00:00:00 2001
> From: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> Date: Wed, 18 Dec 2013 12:05:59 +0800
> Subject: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
>
> objrmap doesn't work for nonlinear VMAs because the assumption that
> offset-into-file correlates with offset-into-virtual-addresses does not
> hold. Hence what try_to_unmap_cluster() does is a mini "virtual scan"
> of each nonlinear VMA which maps the file to which the target page
> belongs. If the VMA is locked, mlock the pages in the cluster rather
> than unmapping them. However, only the check page is guaranteed to be
> locked; the other pages in the scan window may not be. Fix the BUG by
> trying to lock the pages other than check_page if they are unlocked.
>
> [... oops dump snipped; same trace as in the original report ...]
>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> ---
>  mm/rmap.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 55c8b8d..1e24813 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1347,6 +1347,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  	unsigned long end;
>  	int ret = SWAP_AGAIN;
>  	int locked_vma = 0;
> +	int we_locked = 0;
>
>  	address = (vma->vm_start + cursor) & CLUSTER_MASK;
>  	end = address + CLUSTER_SIZE;
> @@ -1385,9 +1386,15 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  		BUG_ON(!page || PageAnon(page));
>
>  		if (locked_vma) {
> -			mlock_vma_page(page);	/* no-op if already mlocked */
> -			if (page == check_page)
> +			if (page != check_page) {
> +				we_locked = trylock_page(page);

If it's not us who already has the page locked, but somebody else, they
might unlock it at this point and then the BUG_ON in mlock_vma_page()
will trigger again.

> +				mlock_vma_page(page);	/* no-op if already mlocked */
> +				if (we_locked)
> +					unlock_page(page);
> +			} else if (page == check_page) {
> +				mlock_vma_page(page);	/* no-op if already mlocked */
>  				ret = SWAP_MLOCK;
> +			}
>  			continue;	/* don't unmap */
>  		}
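Editorial note: one way to close the window Vlastimil describes is to call
mlock_vma_page() only while actually holding the page lock, and to skip any
page that cannot be trylocked (it simply stays on the LRU and will be
encountered again later). The fragment below is an illustrative sketch of
that direction, not a patch posted in this thread:

	if (locked_vma) {
		if (page == check_page) {
			/* the caller already holds check_page's lock */
			mlock_vma_page(page);
			ret = SWAP_MLOCK;
		} else if (trylock_page(page)) {
			/*
			 * The lock is held across the call, so the
			 * assertion in mlock_vma_page() cannot race
			 * with a concurrent unlock_page().
			 */
			mlock_vma_page(page);
			unlock_page(page);
		}
		/* pages that could not be trylocked are left alone */
		continue;	/* don't unmap */
	}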
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 9:23 UTC
To: Vlastimil Babka, Sasha Levin, Andrew Morton
Cc: Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel,
    David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins,
    Johannes Weiner, linux-mm, linux-kernel

On Wed, Dec 18, 2013 at 10:11:43AM +0100, Vlastimil Babka wrote:
>On 12/18/2013 05:12 AM, Wanpeng Li wrote:
[... quoted v3 patch and oops dump snipped; see the previous message ...]
>>		if (locked_vma) {
>>-			mlock_vma_page(page);	/* no-op if already mlocked */
>>-			if (page == check_page)
>>+			if (page != check_page) {
>>+				we_locked = trylock_page(page);
>
>If it's not us who already has the page locked, but somebody else, they
>might unlock it at this point and then the BUG_ON in mlock_vma_page()
>will trigger again.

Any better idea is appreciated. ;-)

Regards,
Wanpeng Li
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Andrew Morton @ 2013-12-18 21:43 UTC
To: Wanpeng Li
Cc: Vlastimil Babka, Sasha Levin, Michel Lespinasse, Bob Liu, npiggin,
    kosaki.motohiro, riel, David Rientjes, Mel Gorman, Minchan Kim,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

On Wed, 18 Dec 2013 17:23:03 +0800 Wanpeng Li <liwanp@linux.vnet.ibm.com> wrote:

> >>		if (locked_vma) {
> >>-			mlock_vma_page(page);	/* no-op if already mlocked */
> >>-			if (page == check_page)
> >>+			if (page != check_page) {
> >>+				we_locked = trylock_page(page);
> >
> >If it's not us who already has the page locked, but somebody else, they
> >might unlock it at this point and then the BUG_ON in mlock_vma_page()
> >will trigger again.

yes, this patch is pretty weak.

> Any better idea is appreciated. ;-)

Remove the BUG_ON() from mlock_vma_page()? Why was it added?
isolate_lru_page() and putback_lru_page() *might* require the page to be
locked, but I don't immediately see issues?
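Editorial note: for readers following along, the function in question is
small. This is a lightly abridged reconstruction of the 3.13-era
mlock_vma_page(), pieced together from the mm/mlock.c hunk in the original
patch and the isolate/putback calls quoted later in the thread; treat the
details as approximate:

	void mlock_vma_page(struct page *page)
	{
		BUG_ON(!PageLocked(page));  /* mm/mlock.c:82, the assertion at issue */

		if (!TestSetPageMlocked(page)) {
			mod_zone_page_state(page_zone(page), NR_MLOCK,
					    hpage_nr_pages(page));
			count_vm_event(UNEVICTABLE_PGMLOCKED);
			if (!isolate_lru_page(page))
				putback_lru_page(page);
		}
	}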
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Sasha Levin @ 2014-01-03 20:17 UTC
To: Andrew Morton, Wanpeng Li
Cc: Vlastimil Babka, Michel Lespinasse, Bob Liu, npiggin,
    kosaki.motohiro, riel, David Rientjes, Mel Gorman, Minchan Kim,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel, Linus Torvalds

On 12/18/2013 04:43 PM, Andrew Morton wrote:
[...]
> Remove the BUG_ON() from mlock_vma_page()? Why was it added?
> isolate_lru_page() and putback_lru_page() *might* require the page to be
> locked, but I don't immediately see issues?

Ping? This BUG() is triggerable in 3.13-rc6 right now.

Thanks,
Sasha
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Linus Torvalds @ 2014-01-03 20:52 UTC
To: Sasha Levin
Cc: Andrew Morton, Wanpeng Li, Vlastimil Babka, Michel Lespinasse,
    Bob Liu, Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> Ping? This BUG() is triggerable in 3.13-rc6 right now.

So Andrew suggested just removing the BUG_ON(), but it's been there for a
*long* time. And I detest the patch that was sent out that said "Should I
check?"

Maybe we should just remove that mlock_vma_page() thing instead in
try_to_unmap_cluster()? Or maybe actually lock the page around calling it?

		Linus
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Vlastimil Babka @ 2014-01-03 23:36 UTC
To: Linus Torvalds, Sasha Levin
Cc: Andrew Morton, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin,
    KOSAKI Motohiro, Rik van Riel, David Rientjes, Mel Gorman,
    Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On 01/03/2014 09:52 PM, Linus Torvalds wrote:
> On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>>
>> Ping? This BUG() is triggerable in 3.13-rc6 right now.
>
> So Andrew suggested just removing the BUG_ON(), but it's been there
> for a *long* time.

Yes, Andrew also merged this patch for that:
http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch

But there wasn't enough confidence in the fix to send it to you yet, I
guess. The related thread: http://www.spinics.net/lists/linux-mm/msg66972.html

> And I detest the patch that was sent out that said "Should I check?"
>
> Maybe we should just remove that mlock_vma_page() thing instead in

You mean that it's already nondeterministic because it can already be
skipped when mmap_sem can't be acquired for read? I think the assumption
for this case is that mmap_sem is already held for write, which means
VM_LOCKED is unset anyway (per the comments at try_to_unmap_file(), which
calls try_to_unmap_cluster()). I'm however not sure how it is protected
from somebody else holding the semaphore...

> try_to_unmap_cluster()? Or maybe actually lock the page around calling
> it?

check_page is already locked, see try_to_munlock(), which calls
try_to_unmap_file(). So this might smell of potential deadlock?

I'm for going with the removal of the BUG_ON. The TestSetPageMlocked
should provide enough race protection.
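Editorial note: the deadlock worry follows from the call chain for the
munlock case, reconstructed here from the function names mentioned in the
thread (a sketch, not verbatim source):

	munlock_vma_page(page)        /* caller already holds page's lock */
	    -> try_to_munlock(page)
	        -> try_to_unmap_file(page, TTU_MUNLOCK)
	            -> try_to_unmap_cluster(cursor, mapcount, vma, check_page = page)

A blocking lock_page(check_page) inside try_to_unmap_cluster() would wait
on a lock the same task already holds, so only pages other than check_page
could be locked there, and only via trylock_page().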
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Andrew Morton @ 2014-01-03 23:56 UTC
To: Vlastimil Babka
Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu,
    Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On Sat, 04 Jan 2014 00:36:18 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:

> Yes, Andrew also merged this patch for that:
> http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch
>
> But there wasn't enough confidence in the fix to send it to you yet, I
> guess. The related thread: http://www.spinics.net/lists/linux-mm/msg66972.html

Yes, I'd taken the cowardly approach of scheduling it for 3.14, with a
3.13.x backport.

Nobody answered my question! Is this a new bug or is it a five-year-old
bug which we only just discovered?

I guess it doesn't matter much - we should fix it in 3.13. I'll include
it in the next for-3.13 batch.
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Sasha Levin @ 2014-01-04 3:03 UTC
To: Andrew Morton, Vlastimil Babka
Cc: Linus Torvalds, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin,
    KOSAKI Motohiro, Rik van Riel, David Rientjes, Mel Gorman,
    Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On 01/03/2014 06:56 PM, Andrew Morton wrote:
> Nobody answered my question! Is this a new bug or is it a five-year-old
> bug which we only just discovered?

I've rolled trinity back and was unable to reproduce this issue anymore.
This seems to be a five-year-old bug.

Thanks,
Sasha
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Linus Torvalds @ 2014-01-04 0:18 UTC
To: Vlastimil Babka
Cc: Sasha Levin, Andrew Morton, Wanpeng Li, Michel Lespinasse, Bob Liu,
    Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> I'm for going with the removal of the BUG_ON. The TestSetPageMlocked
> should provide enough race protection.

Maybe. But dammit, that's subtle, and I don't think you're even right.

It basically depends on mlock_vma_page() and munlock_vma_page() being
able to run CONCURRENTLY on the same page. In particular, you could have
a mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
immediately clearing it on another, and then the rest of those functions
could run with a totally arbitrary interleaving when working with the
exact same page.

They both do basically

	if (!isolate_lru_page(page))
		putback_lru_page(page);

but one or the other would randomly win the race (it's internally
protected by the lru lock), and *if* the munlock_vma_page() wins it, it
would also do

	try_to_munlock(page);

but if mlock_vma_page() wins it, that wouldn't happen. That looks
entirely broken - you end up with the PageMlocked bit clear, but
try_to_munlock() was never called on that page, because mlock_vma_page()
got to the page isolation before the "subsequent" munlock_vma_page().

And this is very much what the page lock serialization would prevent. So
no, the PageMlocked bit in *no* way gives serialization. It's an atomic
bit op, yes, but that only "serializes" in one direction, not when you
can have a mix of bit setting and clearing.

So quite frankly, I think you're wrong. The BUG_ON() is correct, or at
least enforces some kind of ordering. And try_to_unmap_cluster() is just
broken in calling that without the page being locked. That's my opinion.
There may be some *other* reason why it all happens to work, but no,
"TestSetPageMlocked should provide enough race protection" is simply not
true, and even if it were, it's way too subtle and odd to be a good rule.

So I really object to just removing the BUG_ON(). Not with a *lot* more
explanation as to why these kinds of issues wouldn't matter.

		Linus
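Editorial note: written out as a timeline, the interleaving Linus describes
(assuming two CPUs operating on the same page with neither holding the page
lock) would look like this:

	CPU 0: mlock_vma_page()          CPU 1: munlock_vma_page()
	------------------------         -------------------------
	TestSetPageMlocked(page)
	                                 TestClearPageMlocked(page)
	isolate_lru_page(page)
	  /* wins the lru lock */
	                                 isolate_lru_page(page) /* fails */
	putback_lru_page(page)
	                                 /* try_to_munlock(page) never runs;
	                                    it follows a successful isolation */

The end state is the one he objects to: the PageMlocked bit is clear, but
try_to_munlock() was never called on the page.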
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Vlastimil Babka @ 2014-01-04 8:09 UTC
To: Linus Torvalds
Cc: Sasha Levin, Andrew Morton, Wanpeng Li, Michel Lespinasse, Bob Liu,
    Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On 01/04/2014 01:18 AM, Linus Torvalds wrote:
> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> I'm for going with the removal of the BUG_ON. The TestSetPageMlocked
>> should provide enough race protection.
>
> Maybe. But dammit, that's subtle, and I don't think you're even right.
[...]
> but if mlock_vma_page() wins it, that wouldn't happen. That looks
> entirely broken - you end up with the PageMlocked bit clear, but
> try_to_munlock() was never called on that page, because
> mlock_vma_page() got to the page isolation before the "subsequent"
> munlock_vma_page().

I got the impression (see e.g. the munlock_vma_page() comments) that the
whole thing is designed with this possibility in mind. isolate_lru_page()
may fail (presumably also in other scenarios than this), and if
try_to_munlock() was not called here, then yes, the page might lose the
PageMlocked bit and go to the LRU instead of the unevictable list, but
try_to_unmap() should catch and fix this. That would also explain why
mlock_vma_page() is called from try_to_unmap_cluster().

So if I understand correctly, the PageMlocked bit is not something that
has to be set correctly 100% of the time; rather, when it's set correctly
most of the time, most of these pages will go to the unevictable list and
spare vmscan's time.

> And this is very much what the page lock serialization would prevent.
[...]
> So I really object to just removing the BUG_ON(). Not with a *lot*
> more explanation as to why these kinds of issues wouldn't matter.

Right, it was stupid of me to write such a strong statement without any
details. I wanted to review that patch when back at work next week, but
since it came up now, I just wanted to point out that it's in the
pipeline for this bug.
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2014-01-05 0:27 UTC
To: Andrew Morton, Vlastimil Babka
Cc: Linus Torvalds, Sasha Levin, Michel Lespinasse, Bob Liu, Nick Piggin,
    KOSAKI Motohiro, Rik van Riel, David Rientjes, Mel Gorman,
    Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

Hi Andrew, Vlastimil,

On Fri, Jan 03, 2014 at 04:18:05PM -0800, Linus Torvalds wrote:
[...]
>So I really object to just removing the BUG_ON(). Not with a *lot*
>more explanation as to why these kinds of issues wouldn't matter.

Any better ideas to improve my patch, which goes in the "lock the page
around calling it" direction? ;-)

http://marc.info/?l=linux-mm&m=138733994417230&w=2

Regards,
Wanpeng Li
* RE: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Motohiro Kosaki @ 2014-01-06 16:47 UTC
To: Linus Torvalds, Vlastimil Babka
Cc: Sasha Levin, Andrew Morton, Wanpeng Li, Michel Lespinasse, Bob Liu,
    Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

> -----Original Message-----
> From: Linus Torvalds
> Sent: Friday, January 03, 2014 7:18 PM
[...]
> So I really object to just removing the BUG_ON(). Not with a *lot* more
> explanation as to why these kinds of issues wouldn't matter.

I don't have a perfect answer, but I can explain a bit of the history.
Let me try.

First off, five years ago, Lee's original putback_lru_page()
implementation required the page lock, but I removed that restriction
months later. That's why we see this strange BUG_ON here.

Five years ago, both mlock(2) and munlock(2) called do_mlock(), and it
was protected by mmap_sem (write mode). So mlock and munlock had no race.
Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem
in read mode; however, that is still enough to protect against munlock.

Next, in the case of mlock vs reclaim, the key is that mlock(2) is a
two-step operation: 1) turn on VM_LOCKED under mmap_sem write mode, and
2) turn on PG_mlocked under mmap_sem read mode. If reclaim races against
step (1), reclaim must lose because it uses trylock. On the other hand,
if reclaim races against step (2), reclaim must detect VM_LOCKED because
both the VM_LOCKED modifier and observer take mmap_sem.

By the way, page isolation is still necessary because we need to protect
against other page modifications like page migration.

My memory has mostly been flushed, and I might have lost some technical
concerns and past discussion. Please point it out if I am overlooking
something.

Thanks.
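Editorial note: as a reference for the two steps, this is roughly how the
3.13-era mlock(2) entry point is structured. It is a simplified sketch
from memory of that code (alignment and RLIMIT_MEMLOCK handling elided),
not a quotation from the thread:

	SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
	{
		int error;

		/* ...alignment and rlimit checks elided... */

		down_write(&current->mm->mmap_sem);
		error = do_mlock(start, len, 1);  /* step 1: set VM_LOCKED */
		up_write(&current->mm->mmap_sem);

		/*
		 * Step 2: fault the pages in and set PG_mlocked,
		 * taking mmap_sem only in read mode.
		 */
		if (!error)
			error = __mm_populate(start, len, 0);
		return error;
	}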
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: KOSAKI Motohiro @ 2014-01-06 22:01 UTC
To: Motohiro Kosaki
Cc: Linus Torvalds, Vlastimil Babka, Sasha Levin, Andrew Morton,
    Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Rik van Riel,
    David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins,
    Johannes Weiner, linux-mm, Linux Kernel Mailing List

On Mon, Jan 6, 2014 at 11:47 AM, Motohiro Kosaki
<Motohiro.Kosaki@us.fujitsu.com> wrote:
[...]
> My memory has mostly been flushed, and I might have lost some technical
> concerns and past discussion. Please point it out if I am overlooking
> something.

No, I was talking about a completely different issue. My memory is
completely broken, as I said. I need to read the latest code and dig up
the past discussion. Sorry again; please ignore my last mail.
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-06 16:47 ` Motohiro Kosaki 2014-01-06 22:01 ` KOSAKI Motohiro @ 2014-01-07 5:27 ` Wanpeng Li 2014-01-07 15:01 ` Vlastimil Babka 2 siblings, 0 replies; 29+ messages in thread From: Wanpeng Li @ 2014-01-07 5:27 UTC (permalink / raw) To: Motohiro Kosaki Cc: Linus Torvalds, Vlastimil Babka, Sasha Levin, Andrew Morton, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List Hi Motohiro, On Mon, Jan 06, 2014 at 08:47:23AM -0800, Motohiro Kosaki wrote: > > >> -----Original Message----- >> From: linus971@gmail.com [mailto:linus971@gmail.com] On Behalf Of Linus >> Torvalds >> Sent: Friday, January 03, 2014 7:18 PM >> To: Vlastimil Babka >> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu; >> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman; >> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing >> List >> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear >> VMAs >> >> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote: >> > >> > I'm for going with the removal of BUG_ON. The TestSetPageMlocked >> > should provide enough race protection. >> >> Maybe. But dammit, that's subtle, and I don't think you're even right. >> >> It basically depends on mlock_vma_page() and munlock_vma_page() being >> able to run CONCURRENTLY on the same page. In particular, you could have a >> mlock_vma_page() set the bit on one CPU, and munlock_vma_page() >> immediately clearing it on another, and then the rest of those functions >> could run with a totally arbitrary interleaving when working with the exact >> same page. >> >> They both do basically >> >> if (!isolate_lru_page(page)) >> putback_lru_page(page); >> >> but one or the other would randomly win the race (it's internally protected >> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do >> >> try_to_munlock(page); >> >> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely >> broken - you end up with the PageMlocked bit clear, but >> try_to_munlock() was never called on that page, because >> mlock_vma_page() got to the page isolation before the "subsequent" >> munlock_vma_page(). >> >> And this is very much what the page lock serialization would prevent. >> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op, >> yes, but that only "serializes" in one direction, not when you can have a mix >> of bit setting and clearing. >> >> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least >> enforces some kind of ordering. And try_to_unmap_cluster() is just broken >> in calling that without the page being locked. That's my opinion. There may >> be some *other* reason why it all happens to work, but no, >> "TestSetPageMlocked should provide enough race protection" is simply not >> true, and even if it were, it's way too subtle and odd to be a good rule. >> >> So I really object to just removing the BUG_ON(). Not with a *lot* more >> explanation as to why these kinds of issues wouldn't matter. > >I don't have a perfect answer. But I can explain a bit history. Let's me try. > >First off, 5 years ago, Lee's original putback_lru_page() implementation required >page-lock, but I removed the restriction months later. That's why we can see >strange BUG_ON here. 
> >5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by >mmap_sem (write mdoe). Then, mlock and munlock had no race. >Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However it is enough to >protect against munlock. > >Next, In case of mlock vs reclaim, the key is that mlock(2) has two step operation. 1) turn on VM_LOCKED under >mmap_sem write-mode, 2) turn on Page_Mlocked under mmap_sem read-mode. | If reclaim race against step (1), reclaim must lose because it uses trylock. Could you point out when page is locked during 1)? Regards, Wanpeng Li >On the other hand, if reclaim race against step (2), reclaim must detect >VM_LOCKED because both VM_LOCKED modifier and observer take mmap-sem. > >By the way, page isolation is still necessary because we need to protect another page modification like page migration. > > >My memory was alomostly flushed and I might lost some technical concern and past discussion. Please point me out, >If I am overlooking something. > >Thanks.
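For context on the question being raised here: during step (1) mlock itself holds no page lock at all, so one reading of the quoted sentence is that the trylock belongs to the reclaim side, which only ever trylocks the pages it scans. A rough sketch of that reclaim-side pattern follows; shrink_page_list() in mm/vmscan.c behaves along these lines, but this is a simplified paraphrase with an invented function name, not the verbatim code.

#include <linux/swap.h>
#include <linux/rmap.h>

static bool reclaim_one_page(struct page *page)
{
	/* the page was already isolated from the LRU before this point */
	if (!trylock_page(page))
		return false;		/* locked elsewhere: keep it, retry later */

	if (page_mapped(page) &&
	    try_to_unmap(page, TTU_UNMAP) == SWAP_MLOCK) {
		/* the rmap walk found a VM_LOCKED vma: cull the page
		 * to the unevictable list instead of reclaiming it */
		unlock_page(page);
		putback_lru_page(page);
		return false;
	}

	unlock_page(page);
	return true;			/* candidate for freeing */
}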
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-06 16:47 ` Motohiro Kosaki 2014-01-06 22:01 ` KOSAKI Motohiro 2014-01-07 5:27 ` Wanpeng Li @ 2014-01-07 15:01 ` Vlastimil Babka 2014-01-08 1:06 ` Bob Liu ` (2 more replies) 2 siblings, 3 replies; 29+ messages in thread From: Vlastimil Babka @ 2014-01-07 15:01 UTC (permalink / raw) To: Motohiro Kosaki, Andrew Morton Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On 01/06/2014 05:47 PM, Motohiro Kosaki wrote: > > >> -----Original Message----- >> From: linus971@gmail.com [mailto:linus971@gmail.com] On Behalf Of Linus >> Torvalds >> Sent: Friday, January 03, 2014 7:18 PM >> To: Vlastimil Babka >> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu; >> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman; >> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing >> List >> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear >> VMAs >> >> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote: >>> >>> I'm for going with the removal of BUG_ON. The TestSetPageMlocked >>> should provide enough race protection. >> >> Maybe. But dammit, that's subtle, and I don't think you're even right. >> >> It basically depends on mlock_vma_page() and munlock_vma_page() being >> able to run CONCURRENTLY on the same page. In particular, you could have a >> mlock_vma_page() set the bit on one CPU, and munlock_vma_page() >> immediately clearing it on another, and then the rest of those functions >> could run with a totally arbitrary interleaving when working with the exact >> same page. >> >> They both do basically >> >> if (!isolate_lru_page(page)) >> putback_lru_page(page); >> >> but one or the other would randomly win the race (it's internally protected >> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do >> >> try_to_munlock(page); >> >> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely >> broken - you end up with the PageMlocked bit clear, but >> try_to_munlock() was never called on that page, because >> mlock_vma_page() got to the page isolation before the "subsequent" >> munlock_vma_page(). >> >> And this is very much what the page lock serialization would prevent. >> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op, >> yes, but that only "serializes" in one direction, not when you can have a mix >> of bit setting and clearing. >> >> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least >> enforces some kind of ordering. And try_to_unmap_cluster() is just broken >> in calling that without the page being locked. That's my opinion. There may >> be some *other* reason why it all happens to work, but no, >> "TestSetPageMlocked should provide enough race protection" is simply not >> true, and even if it were, it's way too subtle and odd to be a good rule. >> >> So I really object to just removing the BUG_ON(). Not with a *lot* more >> explanation as to why these kinds of issues wouldn't matter. > > I don't have a perfect answer. But I can explain a bit history. Let's me try. > > First off, 5 years ago, Lee's original putback_lru_page() implementation required > page-lock, but I removed the restriction months later. That's why we can see > strange BUG_ON here. 
> > 5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by > mmap_sem (write mdoe). Then, mlock and munlock had no race. > Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However it is enough to > protect against munlock. > > Next, In case of mlock vs reclaim, the key is that mlock(2) has two step operation. 1) turn on VM_LOCKED under > mmap_sem write-mode, 2) turn on Page_Mlocked under mmap_sem read-mode. If reclaim race against step (1), > reclaim must lose because it uses trylock. On the other hand, if reclaim race against step (2), reclaim must detect > VM_LOCKED because both VM_LOCKED modifier and observer take mmap-sem. > > By the way, page isolation is still necessary because we need to protect another page modification like page migration. I guess you meant page locking, not (lru) isolation. Indeed, the documentation at Documentation/vm/unevictable-lru.txt also discusses races with page migration. I've checked and it seems really the case that mlock_migrate_page() could race with mlock_vma_page() so that PG_mlocked is set on the old page again after deciding that the new page will be without the flag. (or after the flag was transferred to the new page) That would not be fatal, but distort accounting anyway. So here's a patch that if accepted should replace the removal of BUG_ON patch in -mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch The idea is that try_to_unmap_cluster() will try locking the page for mlock, and just leave it alone if lock cannot be obtained. Again that's not fatal, as eventually something will encounter and mlock the page. -----8<----- From: Vlastimil Babka <vbabka@suse.cz> Date: Tue, 7 Jan 2014 14:59:58 +0100 Subject: [PATCH] mm: try_to_unmap_cluster() should lock_page() before mlocking A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing with trinity. The call site try_to_unmap_cluster() does not lock the pages other than its check_page parameter (which is already locked). The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat unclear, but apparently it serializes against page migration, which could otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the page would be eventually encountered again, but NR_MLOCK accounting would become distorted nevertheless. This patch adds a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that effect. The call site try_to_unmap_cluster() is fixed so that for page != check_page, trylock_page() is attempted (to avoid possible deadlocks as we already have check_page locked) and mlock_vma_page() is performed only upon success. If the page lock cannot be obtained, the page is left without PG_mlocked, which is again not a problem in the whole unevictable memory design. 
Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michel Lespinasse <walken@google.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/mlock.c | 2 ++ mm/rmap.c | 14 ++++++++++++-- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index 192e6ee..1b12dfa 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -79,6 +79,7 @@ void clear_page_mlock(struct page *page) */ void mlock_vma_page(struct page *page) { + /* Serialize with page migration */ BUG_ON(!PageLocked(page)); if (!TestSetPageMlocked(page)) { @@ -153,6 +154,7 @@ unsigned int munlock_vma_page(struct page *page) { unsigned int nr_pages; + /* For try_to_munlock() and to serialize with page migration */ BUG_ON(!PageLocked(page)); if (TestClearPageMlocked(page)) { diff --git a/mm/rmap.c b/mm/rmap.c index 068522d..b99c742 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, BUG_ON(!page || PageAnon(page)); if (locked_vma) { - mlock_vma_page(page); /* no-op if already mlocked */ - if (page == check_page) + if (page == check_page) { + /* we know we have check_page locked */ + mlock_vma_page(page); ret = SWAP_MLOCK; + } else if (trylock_page(page)) { + /* + * If we can lock the page, perform mlock. + * Otherwise leave the page alone, it will be + * eventually encountered again later. + */ + mlock_vma_page(page); + unlock_page(page); + } continue; /* don't unmap */ } -- 1.8.4
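A note on the design choice in the patch above: it trylocks rather than calling lock_page() outright because try_to_unmap_cluster() is entered with check_page already locked, so sleeping on a second page lock could invert lock ordering against another task. A schematic of the hazard being avoided, written as a C comment for illustration only (an assumed scenario, not an observed trace):

/*
 * Task A (try_to_unmap_cluster)       Task B
 * ------------------------------      ------------------------------
 * holds the lock on check_page        holds the lock on other_page
 * lock_page(other_page);  <- sleeps   lock_page(check_page);  <- sleeps
 *
 * Neither task can make progress: an ABBA deadlock on two page locks.
 * trylock_page(other_page) lets Task A skip the page instead of
 * sleeping, so no global ordering of page locks is required.
 */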
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-07 15:01 ` Vlastimil Babka @ 2014-01-08 1:06 ` Bob Liu 2014-01-08 2:44 ` Linus Torvalds 2014-01-10 17:48 ` Motohiro Kosaki 2 siblings, 0 replies; 29+ messages in thread From: Bob Liu @ 2014-01-08 1:06 UTC (permalink / raw) To: Vlastimil Babka Cc: Motohiro Kosaki, Andrew Morton, Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On 01/07/2014 11:01 PM, Vlastimil Babka wrote: > On 01/06/2014 05:47 PM, Motohiro Kosaki wrote: >> >> >>> -----Original Message----- >>> From: linus971@gmail.com [mailto:linus971@gmail.com] On Behalf Of Linus >>> Torvalds >>> Sent: Friday, January 03, 2014 7:18 PM >>> To: Vlastimil Babka >>> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu; >>> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman; >>> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing >>> List >>> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear >>> VMAs >>> >>> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote: >>>> >>>> I'm for going with the removal of BUG_ON. The TestSetPageMlocked >>>> should provide enough race protection. >>> >>> Maybe. But dammit, that's subtle, and I don't think you're even right. >>> >>> It basically depends on mlock_vma_page() and munlock_vma_page() being >>> able to run CONCURRENTLY on the same page. In particular, you could have a >>> mlock_vma_page() set the bit on one CPU, and munlock_vma_page() >>> immediately clearing it on another, and then the rest of those functions >>> could run with a totally arbitrary interleaving when working with the exact >>> same page. >>> >>> They both do basically >>> >>> if (!isolate_lru_page(page)) >>> putback_lru_page(page); >>> >>> but one or the other would randomly win the race (it's internally protected >>> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do >>> >>> try_to_munlock(page); >>> >>> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely >>> broken - you end up with the PageMlocked bit clear, but >>> try_to_munlock() was never called on that page, because >>> mlock_vma_page() got to the page isolation before the "subsequent" >>> munlock_vma_page(). >>> >>> And this is very much what the page lock serialization would prevent. >>> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op, >>> yes, but that only "serializes" in one direction, not when you can have a mix >>> of bit setting and clearing. >>> >>> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least >>> enforces some kind of ordering. And try_to_unmap_cluster() is just broken >>> in calling that without the page being locked. That's my opinion. There may >>> be some *other* reason why it all happens to work, but no, >>> "TestSetPageMlocked should provide enough race protection" is simply not >>> true, and even if it were, it's way too subtle and odd to be a good rule. >>> >>> So I really object to just removing the BUG_ON(). Not with a *lot* more >>> explanation as to why these kinds of issues wouldn't matter. >> >> I don't have a perfect answer. But I can explain a bit history. Let's me try. 
>> >> First off, 5 years ago, Lee's original putback_lru_page() implementation required >> page-lock, but I removed the restriction months later. That's why we can see >> strange BUG_ON here. >> >> 5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by >> mmap_sem (write mdoe). Then, mlock and munlock had no race. >> Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However it is enough to >> protect against munlock. >> >> Next, In case of mlock vs reclaim, the key is that mlock(2) has two step operation. 1) turn on VM_LOCKED under >> mmap_sem write-mode, 2) turn on Page_Mlocked under mmap_sem read-mode. If reclaim race against step (1), >> reclaim must lose because it uses trylock. On the other hand, if reclaim race against step (2), reclaim must detect >> VM_LOCKED because both VM_LOCKED modifier and observer take mmap-sem. >> >> By the way, page isolation is still necessary because we need to protect another page modification like page migration. > > I guess you meant page locking, not (lru) isolation. Indeed, the documentation > at Documentation/vm/unevictable-lru.txt also discusses races with page migration. > > I've checked and it seems really the case that mlock_migrate_page() > could race with mlock_vma_page() so that PG_mlocked is set on the > old page again after deciding that the new page will be without the flag. > (or after the flag was transferred to the new page) > That would not be fatal, but distort accounting anyway. > > So here's a patch that if accepted should replace the removal of BUG_ON patch in > -mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch > Make sense to me. > The idea is that try_to_unmap_cluster() will try locking the page > for mlock, and just leave it alone if lock cannot be obtained. Again > that's not fatal, as eventually something will encounter and mlock the page. > > -----8<----- > > From: Vlastimil Babka <vbabka@suse.cz> > Date: Tue, 7 Jan 2014 14:59:58 +0100 > Subject: [PATCH] mm: try_to_unmap_cluster() should lock_page() before mlocking > > A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing > with trinity. The call site try_to_unmap_cluster() does not lock the pages > other than its check_page parameter (which is already locked). > > The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat > unclear, but apparently it serializes against page migration, which could > otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the > page would be eventually encountered again, but NR_MLOCK accounting would > become distorted nevertheless. This patch adds a comment to the BUG_ON in > mlock_vma_page() and munlock_vma_page() to that effect. > > The call site try_to_unmap_cluster() is fixed so that for page != check_page, > trylock_page() is attempted (to avoid possible deadlocks as we already have > check_page locked) and mlock_vma_page() is performed only upon success. If the > page lock cannot be obtained, the page is left without PG_mlocked, which is > again not a problem in the whole unevictable memory design. 
> > Reported-by: Sasha Levin <sasha.levin@oracle.com> > Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> > Cc: Michel Lespinasse <walken@google.com> > Cc: Bob Liu <bob.liu@oracle.com> > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > Cc: Rik van Riel <riel@redhat.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Hugh Dickins <hughd@google.com> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> > Cc: <stable@vger.kernel.org> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Bob Liu <bob.liu@oracle.com> > --- > mm/mlock.c | 2 ++ > mm/rmap.c | 14 ++++++++++++-- > 2 files changed, 14 insertions(+), 2 deletions(-) > > diff --git a/mm/mlock.c b/mm/mlock.c > index 192e6ee..1b12dfa 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -79,6 +79,7 @@ void clear_page_mlock(struct page *page) > */ > void mlock_vma_page(struct page *page) > { > + /* Serialize with page migration */ > BUG_ON(!PageLocked(page)); > > if (!TestSetPageMlocked(page)) { > @@ -153,6 +154,7 @@ unsigned int munlock_vma_page(struct page *page) > { > unsigned int nr_pages; > > + /* For try_to_munlock() and to serialize with page migration */ > BUG_ON(!PageLocked(page)); > > if (TestClearPageMlocked(page)) { > diff --git a/mm/rmap.c b/mm/rmap.c > index 068522d..b99c742 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, > BUG_ON(!page || PageAnon(page)); > > if (locked_vma) { > - mlock_vma_page(page); /* no-op if already mlocked */ > - if (page == check_page) > + if (page == check_page) { > + /* we know we have check_page locked */ > + mlock_vma_page(page); > ret = SWAP_MLOCK; > + } else if (trylock_page(page)) { > + /* > + * If we can lock the page, perform mlock. > + * Otherwise leave the page alone, it will be > + * eventually encountered again later. > + */ > + mlock_vma_page(page); > + unlock_page(page); > + } > continue; /* don't unmap */ > } > > -- Regards, -Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-07 15:01 ` Vlastimil Babka 2014-01-08 1:06 ` Bob Liu @ 2014-01-08 2:44 ` Linus Torvalds 2014-01-10 17:48 ` Motohiro Kosaki 2 siblings, 0 replies; 29+ messages in thread From: Linus Torvalds @ 2014-01-08 2:44 UTC (permalink / raw) To: Vlastimil Babka Cc: Motohiro Kosaki, Andrew Morton, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On Tue, Jan 7, 2014 at 11:01 PM, Vlastimil Babka <vbabka@suse.cz> wrote: > > So here's a patch that if accepted should replace the removal of BUG_ON patch in > -mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch > > The idea is that try_to_unmap_cluster() will try locking the page > for mlock, and just leave it alone if lock cannot be obtained. Again > that's not fatal, as eventually something will encounter and mlock the page. This looks sane to me. Andrew? Linus
* RE: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-07 15:01 ` Vlastimil Babka 2014-01-08 1:06 ` Bob Liu 2014-01-08 2:44 ` Linus Torvalds @ 2014-01-10 17:48 ` Motohiro Kosaki 2014-01-13 14:03 ` Vlastimil Babka 2 siblings, 1 reply; 29+ messages in thread From: Motohiro Kosaki @ 2014-01-10 17:48 UTC (permalink / raw) To: Vlastimil Babka, Andrew Morton Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim > diff --git a/mm/rmap.c b/mm/rmap.c > index 068522d..b99c742 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long > cursor, unsigned int *mapcount, > BUG_ON(!page || PageAnon(page)); > > if (locked_vma) { > - mlock_vma_page(page); /* no-op if already > mlocked */ > - if (page == check_page) > + if (page == check_page) { > + /* we know we have check_page locked */ > + mlock_vma_page(page); > ret = SWAP_MLOCK; > + } else if (trylock_page(page)) { > + /* > + * If we can lock the page, perform mlock. > + * Otherwise leave the page alone, it will be > + * eventually encountered again later. > + */ > + mlock_vma_page(page); > + unlock_page(page); > + } > continue; /* don't unmap */ > } I audited all the related mm code. However, I couldn't find any race that this lock can close. First off, the current munlock code is crazy tricky. munlock down_write(mmap_sem) do_mlock() mlock_fixup munlock_vma_pages_range __munlock_pagevec spin_lock_irq(zone->lru_lock) TestClearPageMlocked(page) del_page_from_lru_list spin_unlock_irq(zone->lru_lock) lock_page __munlock_isolated_page unlock_page up_write(mmap_sem) Look, TestClearPageMlocked(page) is not protected by lock_page. But this is still safe, because PageMlocked means one or more vmas are marked VM_LOCKED, so we only need to care about turning down the final VM_LOCKED. I.e., mmap_sem protects them. And, spin_lock_irq(zone->lru_lock) del_page_from_lru_list spin_unlock_irq(zone->lru_lock) This idiom ensures that either I or someone else isolates the page from the lru, and isolated pages will be put back by putback_lru_page in any case. So the pages will move to the right lru eventually. And so, taking the page lock doesn't help to close the race against munlock. On the other hand, page migration has the following call stack. some-caller [isolate_lru_page] unmap_and_move __unmap_and_move trylock_page try_to_unmap move_to_new_page migrate_page migrate_page_copy unlock_page The critical part (i.e. migrate_page_copy) is protected by both page isolation and page lock. The page fault path takes the page lock and doesn't use page isolation. This is correct. try_to_unmap_cluster doesn't take the page lock, but it ensures the page is isolated. This is correct too. Plus, do_wp_page() has the following comment. But it is wrong: this lock is necessary to protect against page migration, not lru manipulation. if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) { lock_page(old_page); /* LRU manipulation */ munlock_vma_page(old_page); unlock_page(old_page); } But, you know, this is a crazy ugly lock design. I'm OK with changing the rule so that PG_mlocked must be protected by the page lock. If so, I propose adding a PageMlocked() check like this } else if (!PageMlocked(page) && trylock_page(page)) { /* * If we can lock the page, perform mlock. * Otherwise leave the page alone, it will be * eventually encountered again later. */ mlock_vma_page(page); unlock_page(page); This is safe.
Even if the race happens and vmscan fails to mark PG_mlocked, that's no problem. The next vmscan pass may do the right job and fix the mistake. Anyway, I'm also sure this is not a recent issue. It has lived for 5 years. There is no reason to rush.
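For reference, the unlocked PageMlocked() test proposed above would be an optimization only; the authoritative transition stays the atomic test-and-set inside mlock_vma_page(). Roughly, paraphrasing mm/mlock.c of this era with the details trimmed:

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/vmstat.h>
#include <linux/huge_mm.h>

void mlock_vma_page(struct page *page)
{
	BUG_ON(!PageLocked(page));

	if (!TestSetPageMlocked(page)) {
		/* we won the atomic race: fix up the NR_MLOCK counter
		 * and move the page to the unevictable list */
		mod_zone_page_state(page_zone(page), NR_MLOCK,
				    hpage_nr_pages(page));
		count_vm_event(UNEVICTABLE_PGMLOCKED);
		if (!isolate_lru_page(page))
			putback_lru_page(page);
	}
}

Either way a stale read of the flag is harmless in itself: a false negative only costs a redundant trylock and a no-op mlock_vma_page() call, and a false positive merely defers the marking to a later scan, as the mail above argues.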
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-10 17:48 ` Motohiro Kosaki @ 2014-01-13 14:03 ` Vlastimil Babka 2014-01-14 11:05 ` Vlastimil Babka 0 siblings, 1 reply; 29+ messages in thread From: Vlastimil Babka @ 2014-01-13 14:03 UTC (permalink / raw) To: Motohiro Kosaki, Andrew Morton Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On 01/10/2014 06:48 PM, Motohiro Kosaki wrote: >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 068522d..b99c742 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long >> cursor, unsigned int *mapcount, >> BUG_ON(!page || PageAnon(page)); >> >> if (locked_vma) { >> - mlock_vma_page(page); /* no-op if already >> mlocked */ >> - if (page == check_page) >> + if (page == check_page) { >> + /* we know we have check_page locked */ >> + mlock_vma_page(page); >> ret = SWAP_MLOCK; >> + } else if (trylock_page(page)) { >> + /* >> + * If we can lock the page, perform mlock. >> + * Otherwise leave the page alone, it will be >> + * eventually encountered again later. >> + */ >> + mlock_vma_page(page); >> + unlock_page(page); >> + } >> continue; /* don't unmap */ >> } > > I audited all related mm code. However I couldn't find any race that it can close. Well, I would say the lock here closes the race with page migration, no? (As discussed below) > First off, current munlock code is crazy tricky. Oops, that's actually a result of my patches to speed up munlock by batching pages (since 3.12). > munlock > down_write(mmap_sem) > do_mlock() > mlock_fixup > munlock_vma_pages_range > __munlock_pagevec > spin_lock_irq(zone->lru_lock) > TestClearPageMlocked(page) > del_page_from_lru_list > spin_unlock_irq(zone->lru_lock) > lock_page > __munlock_isolated_page > unlock_page > > up_write(mmap_sem) > > Look, TestClearPageMlocked(page) is not protected by lock_page. Right :( That's my fault, when developing the patch series I didn't see the page migration race, and it seemed that lock is only needed to protect the rmap operations in __munlock_isolated_page() > But this is still > safe because Page_mocked mean one or more vma marked VM_LOCKED then we > only need care about turning down final VM_LOCKED. I.e. mmap_sem protect them. > > And, > > spin_lock_irq(zone->lru_lock) > del_page_from_lru_list > spin_unlock_irq(zone->lru_lock) > > This idiom ensures I or someone else isolate the page from lru and isolated pages > will be put back by putback_lru_page in anycase. So, the pages will move the right > lru eventually. > > And then, taking page-lock doesn't help to close vs munlock race. > > On the other hands, page migration has the following call stack. > > some-caller [isolate_lru_page] > unmap_and_move > __unmap_and_move > trylock_page > try_to_unmap > move_to_new_page > migrate_page > migrate_page_copy > unlock_page > > The critical part (i.e. migrate_page_copy) is protected by both page isolation and page lock. However, none of that protects against setting PG_mlocked in try_to_unmap_cluster() -> mlock_vma_page(), or clearing PG_mlocked in __munlock_pagevec(). 
The race I have in mind for munlock is: CPU1 does page migration some-caller [isolate_lru_page] unmap_and_move __unmap_and_move trylock_page try_to_unmap move_to_new_page migrate_page migrate_page_copy mlock_migrate_page - transfers PG_mlocked from old page to new page CPU2 does munlock: munlock down_write(mmap_sem) do_mlock() mlock_fixup munlock_vma_pages_range __munlock_pagevec spin_lock_irq(zone->lru_lock) TestClearPageMlocked(page) - here it finds PG_mlocked already cleared, so it stops, but meanwhile the new page already has PG_mlocked set and will stay unevictable > Page fault path take lock page and doesn't use page isolation. This is correct. > try_to_unmap_cluster doesn't take page lock, but it ensure the page is isolated. This is correct too. I don't see where try_to_unmap_cluster() guarantees that pages other than check_page are isolated (I might just be missing that). So in the race example above, CPU2 could be doing try_to_unmap_cluster() and set PG_mlocked on the old page, with no effect on the new page. Not fatal for the design of lazy mlocking, but it distorts accounting anyway. > Plus, do_wp_page() has the following comment. But it is wrong. This lock is necessary to protect against > page migration, but not lru manipulation. > > if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) { > lock_page(old_page); /* LRU manipulation */ > munlock_vma_page(old_page); > unlock_page(old_page); > } > > > But, you know, this is crazy ugly lock design. I'm ok to change the rule to that PG_mlocked must be protected > page lock. If so, I propose to add PageMlocked() like this > > } else if (!PageMlocked() && trylock_page(page)) { > /* > * If we can lock the page, perform mlock. > * Otherwise leave the page alone, it will be > * eventually encountered again later. > */ > mlock_vma_page(page); > unlock_page(page); > > This is safe. Even if race is happen and vmscan failed to mark PG_mlocked, that's no problem. Next vmscan may > do the right job and will fix the mistake. > > Anyway, I'm also sure this is not recent issue. It lived 5 years. It is no reason to rush. Yeah, but there's the new issue with __munlock_pagevec() :( Perhaps a more serious one, as it could leave pages unevictable when they shouldn't be. I'm not sure if the performance benefits of that could be preserved with full page locking. Another option would be that page migration would somehow deal with the race, or just leave the target pages without PG_mlocked and let them be dealt with later. But if we go with the rule that page lock must protect PG_mlocked, then there's also clear_page_mlock() to consider. And finally, we could then at least replace the atomic test and set with faster non-atomic variants.
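The transfer step in the CPU1 column above is mlock_migrate_page(); roughly, paraphrasing mm/internal.h of this era:

static inline void mlock_migrate_page(struct page *newpage, struct page *page)
{
	if (TestClearPageMlocked(page)) {
		unsigned long flags;
		int nr_pages = hpage_nr_pages(page);

		/* move both the flag and the NR_MLOCK accounting from
		 * the old page to the migration target */
		local_irq_save(flags);
		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
		SetPageMlocked(newpage);
		__mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
		local_irq_restore(flags);
	}
}

Once this has run, the munlock on CPU2 finds PG_mlocked already clear on the old page and stops there, which is exactly how the new page can be left stranded with the flag set.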
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-13 14:03 ` Vlastimil Babka @ 2014-01-14 11:05 ` Vlastimil Babka 0 siblings, 0 replies; 29+ messages in thread From: Vlastimil Babka @ 2014-01-14 11:05 UTC (permalink / raw) To: Motohiro Kosaki, Andrew Morton Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On 01/13/2014 03:03 PM, Vlastimil Babka wrote: > On 01/10/2014 06:48 PM, Motohiro Kosaki wrote: >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 068522d..b99c742 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long >>> cursor, unsigned int *mapcount, >>> BUG_ON(!page || PageAnon(page)); >>> >>> if (locked_vma) { >>> - mlock_vma_page(page); /* no-op if already >>> mlocked */ >>> - if (page == check_page) >>> + if (page == check_page) { >>> + /* we know we have check_page locked */ >>> + mlock_vma_page(page); >>> ret = SWAP_MLOCK; >>> + } else if (trylock_page(page)) { >>> + /* >>> + * If we can lock the page, perform mlock. >>> + * Otherwise leave the page alone, it will be >>> + * eventually encountered again later. >>> + */ >>> + mlock_vma_page(page); >>> + unlock_page(page); >>> + } >>> continue; /* don't unmap */ >>> } >> >> I audited all related mm code. However I couldn't find any race that it can close. > > Well, I would say the lock here closes the race with page migration, no? (As discussed below) > >> First off, current munlock code is crazy tricky. > > Oops, that's actually a result of my patches to speed up munlock by batching pages (since 3.12). > >> munlock >> down_write(mmap_sem) >> do_mlock() >> mlock_fixup >> munlock_vma_pages_range >> __munlock_pagevec >> spin_lock_irq(zone->lru_lock) >> TestClearPageMlocked(page) >> del_page_from_lru_list >> spin_unlock_irq(zone->lru_lock) >> lock_page >> __munlock_isolated_page >> unlock_page >> >> up_write(mmap_sem) >> >> Look, TestClearPageMlocked(page) is not protected by lock_page. > > Right :( That's my fault, when developing the patch series I didn't see the page > migration race, and it seemed that lock is only needed to protect the rmap operations > in __munlock_isolated_page() > >> But this is still >> safe because Page_mocked mean one or more vma marked VM_LOCKED then we >> only need care about turning down final VM_LOCKED. I.e. mmap_sem protect them. >> >> And, >> >> spin_lock_irq(zone->lru_lock) >> del_page_from_lru_list >> spin_unlock_irq(zone->lru_lock) >> >> This idiom ensures I or someone else isolate the page from lru and isolated pages >> will be put back by putback_lru_page in anycase. So, the pages will move the right >> lru eventually. >> >> And then, taking page-lock doesn't help to close vs munlock race. >> >> On the other hands, page migration has the following call stack. >> >> some-caller [isolate_lru_page] >> unmap_and_move >> __unmap_and_move >> trylock_page >> try_to_unmap >> move_to_new_page >> migrate_page >> migrate_page_copy >> unlock_page >> >> The critical part (i.e. migrate_page_copy) is protected by both page isolation and page lock. > > However, none of that protects against setting PG_mlocked in try_to_unmap_cluster() -> mlock_vma_page(), > or clearing PG_mlocked in __munlock_pagevec(). 
> > The race I have in mind for munlock is: > > CPU1 does page migration > some-caller [isolate_lru_page] > unmap_and_move > __unmap_and_move > trylock_page > try_to_unmap > move_to_new_page > migrate_page > migrate_page_copy > mlock_migrate_page - transfers PG_mlocked from old page to new page > > CPU2 does munlock: > munlock > down_write(mmap_sem) > do_mlock() > mlock_fixup > munlock_vma_pages_range > __munlock_pagevec > spin_lock_irq(zone->lru_lock) > TestClearPageMlocked(page) - here it finds PG_mlocked already cleared > so it stops, but meanwhile new page already has PG_mlocked set and will > stay inevictable As Mel pointed out to me, page lock in munlock alone would not help here anyway, because munlock would just wait for migration to unlock the old page and then still fail to clear a flag that has been transferred by migration to the new page. So if there was a race, it would be older than __munlock_pagevec() removing TestClearPageMlocked(page) from the page_lock protected section. But then I noticed that migration bails out for pages that have elevated pins, and this is done after making the page unreachable for mlock/munlock via pte migration entries. So this alone should prevent a race with m(un)lock, with three possible scenarios: - m(un)lock sees the migration entry and waits for migration to complete (via follow_page_mask), operates on the new page - m(un)lock gets the page reference before migration entry is set, finishes before migration checks the page pin count, migration transfers the new PG_mlocked state to the new page - m(un)lock doesn't finish that quickly, migration bails out, m(un)lock changes the old page that is mapped back >> Page fault path take lock page and doesn't use page isolation. This is correct. >> try_to_unmap_cluster doesn't take page lock, but it ensure the page is isolated. This is correct too. > > I don't see where try_to_unmap_cluster() has guaranteed that pages other than check_page are isolated? > (I might just be missing that). So in the race example above, CPU2 could be doing try_to_unmap_cluster() > and set PG_mlocked on old page, with no effect on the new page. Not fatal for the design of lazy mlocking, > but a distorted accounting anyway. > >> Plus, do_wp_page() has the following comment. But it is wrong. This lock is necessary to protect against >> page migration, but not lru manipulation. >> >> if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) { >> lock_page(old_page); /* LRU manipulation */ >> munlock_vma_page(old_page); >> unlock_page(old_page); >> } >> So the protection is actually for rmap operations in try_to_munlock. The comment could be removed from the caller here and appear just at the PageLocked assertion in munlock_vma_page(). >> But, you know, this is crazy ugly lock design. I'm ok to change the rule to that PG_mlocked must be protected >> page lock. If so, I propose to add PageMlocked() like this >> >> } else if (!PageMlocked() && trylock_page(page)) { >> /* >> * If we can lock the page, perform mlock. >> * Otherwise leave the page alone, it will be >> * eventually encountered again later. >> */ >> mlock_vma_page(page); >> unlock_page(page); >> >> This is safe. Even if race is happen and vmscan failed to mark PG_mlocked, that's no problem. Next vmscan may >> do the right job and will fix the mistake. >> >> Anyway, I'm also sure this is not recent issue. It lived 5 years. It is no reason to rush. 
If we agree that page migration is safe as explained above, then removing the PageLocked assertion from mlock_vma_page is again an option, instead of making sure that all callers lock? > Yeah but there's the new issue with __munlock_pagevec() :( Perhaps a more serious one as it could leave pages inevictable > when they shouldn't be. I'm not sure if the performance benefits of that could be preserved with full page locking. > Another option would be that page migration would somehow deal with the race, or just leave the target pages without > PG_mlocked and let them be dealt with later. > But if we go with the rule that page lock must protect PG_mlocked, then there's also clear_page_mlock() to consider. > And finally, we could then at least replace the atomic test and set with faster non-atomic variants.
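The first scenario in the mail above relies on the wait-for-migration-entry behaviour of the page-table walk. A rough paraphrase of that check follows; the real logic lives in the follow_page path and mm/migrate.c, and the function name here is invented for the illustration.

#include <linux/swapops.h>

static bool wait_if_migration_entry(struct mm_struct *mm,
				    unsigned long address,
				    pmd_t *pmd, pte_t pte)
{
	swp_entry_t entry;

	if (pte_present(pte))
		return false;

	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry))
		return false;

	/* sleep until the migrating page is unlocked; the caller then
	 * retries the lookup and finds either the old page (migration
	 * bailed out) or the new one */
	migration_entry_wait(mm, pmd, address);
	return true;
}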
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-03 20:52 ` Linus Torvalds 2014-01-03 23:36 ` Vlastimil Babka @ 2014-01-04 3:31 ` Bob Liu 1 sibling, 0 replies; 29+ messages in thread From: Bob Liu @ 2014-01-04 3:31 UTC (permalink / raw) To: Linus Torvalds Cc: Sasha Levin, Andrew Morton, Wanpeng Li, Vlastimil Babka, Michel Lespinasse, Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List On 01/04/2014 04:52 AM, Linus Torvalds wrote: > On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <sasha.levin@oracle.com> wrote: >> >> Ping? This BUG() is triggerable in 3.13-rc6 right now. > > So Andrew suggested just removing the BUG_ON(), but it's been there > for a *long* time. > > And I detest the patch that was sent out that said "Should I check?" > > Maybe we should just remove that mlock_vma_page() thing instead in > try_to_unmap_cluster()? Or maybe actually lock the page around calling I didn't get the reason why we have to call mlock_vma_page() in try_to_unmap_cluster() and I agree to just remove it. > it? > > Linus > -- Regards, -Bob
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2013-12-17 8:05 [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs Wanpeng Li 2013-12-18 3:14 ` Wanpeng Li 2013-12-18 3:16 ` Wanpeng Li @ 2013-12-18 8:54 ` Vlastimil Babka 2013-12-18 9:01 ` Wanpeng Li 2 siblings, 1 reply; 29+ messages in thread From: Vlastimil Babka @ 2013-12-18 8:54 UTC (permalink / raw) To: Wanpeng Li, Andrew Morton Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel, linux-mm, linux-kernel On 12/17/2013 09:05 AM, Wanpeng Li wrote: > objrmap doesn't work for nonlinear VMAs because the assumption that offset-into-file > correlates with offset-into-virtual-addresses does not hold. Hence what > try_to_unmap_cluster does is a mini "virtual scan" of each nonlinear VMA which maps > the file to which the target page belongs. If vma locked, mlock the pages in the > cluster, rather than unmapping them. However, not all pages are guarantee page > locked instead of the check page. This patch fix the BUG by just confirm check page > hold page lock instead of all pages in the virtual scan window against nolinear VMAs. This may fix the symptom, but I don't understand from the description why in this case is it ok not to have page locked for mlock_vma_page(), while in the other cases it's not ok. > [ 253.869145] kernel BUG at mm/mlock.c:82! > [ 253.869549] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 253.870098] Dumping ftrace buffer: > [ 253.870098] (ftrace buffer empty) > [ 253.870098] Modules linked in: > [ 253.870098] CPU: 10 PID: 9162 Comm: trinity-child75 Tainted: G W 3.13.0-rc4-next-20131216-sasha-00011-g5f105ec-dirty #4137 > [ 253.873310] task: ffff8800c98cb000 ti: ffff8804d34e8000 task.ti: ffff8804d34e8000 > [ 253.873310] RIP: 0010:[<ffffffff81281f28>] [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0 > [ 253.873310] RSP: 0000:ffff8804d34e99e8 EFLAGS: 00010246 > [ 253.873310] RAX: 006fffff8038002c RBX: ffffea00474944c0 RCX: ffff880807636000 > [ 253.873310] RDX: ffffea0000000000 RSI: 00007f17a9bca000 RDI: ffffea00474944c0 > [ 253.873310] RBP: ffff8804d34e99f8 R08: ffff880807020000 R09: 0000000000000000 > [ 253.873310] R10: 0000000000000001 R11: 0000000000002000 R12: 00007f17a9bca000 > [ 253.873310] R13: ffffea00474944c0 R14: 00007f17a9be0000 R15: ffff880807020000 > [ 253.873310] FS: 00007f17aa31a700(0000) GS:ffff8801c9c00000(0000) knlGS:0000000000000000 > [ 253.873310] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 253.873310] CR2: 00007f17a94fa000 CR3: 00000004d3b02000 CR4: 00000000000006e0 > [ 253.873310] DR0: 00007f17a74ca000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 253.873310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > [ 253.873310] Stack: > [ 253.873310] 0000000b3de28067 ffff880b3de28e50 ffff8804d34e9aa8 ffffffff8128bc31 > [ 253.873310] 0000000000000301 ffffea0011850220 ffff8809a4039000 ffffea0011850238 > [ 253.873310] ffff8804d34e9aa8 ffff880807636060 0000000000000001 ffff880807636348 > [ 253.873310] Call Trace: > [ 253.873310] [<ffffffff8128bc31>] try_to_unmap_cluster+0x1c1/0x340 > [ 253.873310] [<ffffffff8128c60a>] try_to_unmap_file+0x20a/0x2e0 > [ 253.873310] [<ffffffff8128c7b3>] try_to_unmap+0x73/0x90 > [ 253.873310] [<ffffffff812b526d>] __unmap_and_move+0x18d/0x250 > [ 253.873310] [<ffffffff812b53e9>] unmap_and_move+0xb9/0x180 > [ 253.873310] [<ffffffff812b559b>] migrate_pages+0xeb/0x2f0 > [ 253.873310] [<ffffffff812a0660>] ? 
queue_pages_pte_range+0x1a0/0x1a0 > [ 253.873310] [<ffffffff812a193c>] migrate_to_node+0x9c/0xc0 > [ 253.873310] [<ffffffff812a30b8>] do_migrate_pages+0x1b8/0x240 > [ 253.873310] [<ffffffff812a3456>] SYSC_migrate_pages+0x316/0x380 > [ 253.873310] [<ffffffff812a31ec>] ? SYSC_migrate_pages+0xac/0x380 > [ 253.873310] [<ffffffff811763c6>] ? vtime_account_user+0x96/0xb0 > [ 253.873310] [<ffffffff812a34ce>] SyS_migrate_pages+0xe/0x10 > [ 253.873310] [<ffffffff843c4990>] tracesys+0xdd/0xe2 > [ 253.873310] Code: 0f 1f 00 65 48 ff 04 25 10 25 1d 00 48 83 c4 08 > 5b c9 c3 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 07 48 89 fb > a8 01 75 10 <0f> 0b 66 0f 1f 44 00 00 eb fe 66 0f 1f 44 00 00 f0 0f ba > 2f 15 > [ 253.873310] RIP [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0 > [ 253.873310] RSP <ffff8804d34e99e8> > [ 253.904194] ---[ end trace be59c4a7f8edab3f ]--- > > Reported-by: Sasha Levin <sasha.levin@oracle.com> > Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> > --- > mm/huge_memory.c | 2 +- > mm/internal.h | 4 ++-- > mm/ksm.c | 2 +- > mm/memory.c | 2 +- > mm/mlock.c | 5 +++-- > mm/rmap.c | 4 ++-- > 6 files changed, 10 insertions(+), 9 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 33a5dc4..7a15b04 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1264,7 +1264,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, > if (page->mapping && trylock_page(page)) { > lru_add_drain(); > if (page->mapping) > - mlock_vma_page(page); > + mlock_vma_page(page, true); > unlock_page(page); > } > } > diff --git a/mm/internal.h b/mm/internal.h > index a85a3ab..4ea2d4e 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -192,7 +192,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma, > /* > * must be called with vma's mmap_sem held for read or write, and page locked. > */ > -extern void mlock_vma_page(struct page *page); > +extern void mlock_vma_page(struct page *page, bool check_page); > extern unsigned int munlock_vma_page(struct page *page); > > /* > @@ -236,7 +236,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p) > return 0; > } > static inline void clear_page_mlock(struct page *page) { } > -static inline void mlock_vma_page(struct page *page) { } > +static inline void mlock_vma_page(struct page *page, bool check_page) { } > static inline void mlock_migrate_page(struct page *new, struct page *old) { } > > #endif /* !CONFIG_MMU */ > diff --git a/mm/ksm.c b/mm/ksm.c > index 175fff7..ec36f04 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -1064,7 +1064,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma, > if (!PageMlocked(kpage)) { > unlock_page(page); > lock_page(kpage); > - mlock_vma_page(kpage); > + mlock_vma_page(kpage, true); > page = kpage; /* for final unlock */ > } > } > diff --git a/mm/memory.c b/mm/memory.c > index cf6098c..a41df6a 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1602,7 +1602,7 @@ split_fallthrough: > * know the page is still mapped, we don't even > * need to check for file-cache page truncation. > */ > - mlock_vma_page(page); > + mlock_vma_page(page, true); > unlock_page(page); > } > } > diff --git a/mm/mlock.c b/mm/mlock.c > index d480cd6..c395ec5 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -77,9 +77,10 @@ void clear_page_mlock(struct page *page) > * Mark page as mlocked if not already. > * If page on LRU, isolate and putback to move to unevictable list. 
> */ > -void mlock_vma_page(struct page *page) > +void mlock_vma_page(struct page *page, bool check_page) > { > - BUG_ON(!PageLocked(page)); > + if (check_page) > + BUG_ON(!PageLocked(page)); > > if (!TestSetPageMlocked(page)) { > mod_zone_page_state(page_zone(page), NR_MLOCK, > diff --git a/mm/rmap.c b/mm/rmap.c > index 55c8b8d..79d456f 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1297,7 +1297,7 @@ out_mlock: > */ > if (down_read_trylock(&vma->vm_mm->mmap_sem)) { > if (vma->vm_flags & VM_LOCKED) { > - mlock_vma_page(page); > + mlock_vma_page(page, true); > ret = SWAP_MLOCK; > } > up_read(&vma->vm_mm->mmap_sem); > @@ -1385,7 +1385,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, > BUG_ON(!page || PageAnon(page)); > > if (locked_vma) { > - mlock_vma_page(page); /* no-op if already mlocked */ > + mlock_vma_page(page, page == check_page); /* no-op if already mlocked */ > if (page == check_page) > ret = SWAP_MLOCK; > continue; /* don't unmap */ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2013-12-18 8:54 ` Vlastimil Babka @ 2013-12-18 9:01 ` Wanpeng Li 0 siblings, 0 replies; 29+ messages in thread From: Wanpeng Li @ 2013-12-18 9:01 UTC (permalink / raw) To: Vlastimil Babka, Andrew Morton Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel, linux-mm, linux-kernel Hi Vlastimil, On Wed, Dec 18, 2013 at 09:54:16AM +0100, Vlastimil Babka wrote: >On 12/17/2013 09:05 AM, Wanpeng Li wrote: >>objrmap doesn't work for nonlinear VMAs because the assumption that offset-into-file >>correlates with offset-into-virtual-addresses does not hold. Hence what >>try_to_unmap_cluster does is a mini "virtual scan" of each nonlinear VMA which maps >>the file to which the target page belongs. If vma locked, mlock the pages in the >>cluster, rather than unmapping them. However, not all pages are guarantee page >>locked instead of the check page. This patch fix the BUG by just confirm check page >>hold page lock instead of all pages in the virtual scan window against nolinear VMAs. > >This may fix the symptom, but I don't understand from the description >why in this case is it ok not to have page locked for >mlock_vma_page(), while in the other cases it's not ok. > Here is a latest version of the bugfix patch. http://marc.info/?l=linux-mm&m=138733994417230&w=2 Regards, Wanpeng Li >>[ 253.869145] kernel BUG at mm/mlock.c:82! >>[ 253.869549] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >>[ 253.870098] Dumping ftrace buffer: >>[ 253.870098] (ftrace buffer empty) >>[ 253.870098] Modules linked in: >>[ 253.870098] CPU: 10 PID: 9162 Comm: trinity-child75 Tainted: G W 3.13.0-rc4-next-20131216-sasha-00011-g5f105ec-dirty #4137 >>[ 253.873310] task: ffff8800c98cb000 ti: ffff8804d34e8000 task.ti: ffff8804d34e8000 >>[ 253.873310] RIP: 0010:[<ffffffff81281f28>] [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0 >>[ 253.873310] RSP: 0000:ffff8804d34e99e8 EFLAGS: 00010246 >>[ 253.873310] RAX: 006fffff8038002c RBX: ffffea00474944c0 RCX: ffff880807636000 >>[ 253.873310] RDX: ffffea0000000000 RSI: 00007f17a9bca000 RDI: ffffea00474944c0 >>[ 253.873310] RBP: ffff8804d34e99f8 R08: ffff880807020000 R09: 0000000000000000 >>[ 253.873310] R10: 0000000000000001 R11: 0000000000002000 R12: 00007f17a9bca000 >>[ 253.873310] R13: ffffea00474944c0 R14: 00007f17a9be0000 R15: ffff880807020000 >>[ 253.873310] FS: 00007f17aa31a700(0000) GS:ffff8801c9c00000(0000) knlGS:0000000000000000 >>[ 253.873310] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >>[ 253.873310] CR2: 00007f17a94fa000 CR3: 00000004d3b02000 CR4: 00000000000006e0 >>[ 253.873310] DR0: 00007f17a74ca000 DR1: 0000000000000000 DR2: 0000000000000000 >>[ 253.873310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 >>[ 253.873310] Stack: >>[ 253.873310] 0000000b3de28067 ffff880b3de28e50 ffff8804d34e9aa8 ffffffff8128bc31 >>[ 253.873310] 0000000000000301 ffffea0011850220 ffff8809a4039000 ffffea0011850238 >>[ 253.873310] ffff8804d34e9aa8 ffff880807636060 0000000000000001 ffff880807636348 >>[ 253.873310] Call Trace: >>[ 253.873310] [<ffffffff8128bc31>] try_to_unmap_cluster+0x1c1/0x340 >>[ 253.873310] [<ffffffff8128c60a>] try_to_unmap_file+0x20a/0x2e0 >>[ 253.873310] [<ffffffff8128c7b3>] try_to_unmap+0x73/0x90 >>[ 253.873310] [<ffffffff812b526d>] __unmap_and_move+0x18d/0x250 >>[ 253.873310] [<ffffffff812b53e9>] unmap_and_move+0xb9/0x180 >>[ 253.873310] [<ffffffff812b559b>] migrate_pages+0xeb/0x2f0 >>[ 253.873310] [<ffffffff812a0660>] ? 
queue_pages_pte_range+0x1a0/0x1a0 >>[ 253.873310] [<ffffffff812a193c>] migrate_to_node+0x9c/0xc0 >>[ 253.873310] [<ffffffff812a30b8>] do_migrate_pages+0x1b8/0x240 >>[ 253.873310] [<ffffffff812a3456>] SYSC_migrate_pages+0x316/0x380 >>[ 253.873310] [<ffffffff812a31ec>] ? SYSC_migrate_pages+0xac/0x380 >>[ 253.873310] [<ffffffff811763c6>] ? vtime_account_user+0x96/0xb0 >>[ 253.873310] [<ffffffff812a34ce>] SyS_migrate_pages+0xe/0x10 >>[ 253.873310] [<ffffffff843c4990>] tracesys+0xdd/0xe2 >>[ 253.873310] Code: 0f 1f 00 65 48 ff 04 25 10 25 1d 00 48 83 c4 08 >>5b c9 c3 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 07 48 89 fb >>a8 01 75 10 <0f> 0b 66 0f 1f 44 00 00 eb fe 66 0f 1f 44 00 00 f0 0f ba >>2f 15 >>[ 253.873310] RIP [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0 >>[ 253.873310] RSP <ffff8804d34e99e8> >>[ 253.904194] ---[ end trace be59c4a7f8edab3f ]--- >> >>Reported-by: Sasha Levin <sasha.levin@oracle.com> >>Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> >>--- >> mm/huge_memory.c | 2 +- >> mm/internal.h | 4 ++-- >> mm/ksm.c | 2 +- >> mm/memory.c | 2 +- >> mm/mlock.c | 5 +++-- >> mm/rmap.c | 4 ++-- >> 6 files changed, 10 insertions(+), 9 deletions(-) >> >>diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>index 33a5dc4..7a15b04 100644 >>--- a/mm/huge_memory.c >>+++ b/mm/huge_memory.c >>@@ -1264,7 +1264,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, >> if (page->mapping && trylock_page(page)) { >> lru_add_drain(); >> if (page->mapping) >>- mlock_vma_page(page); >>+ mlock_vma_page(page, true); >> unlock_page(page); >> } >> } >>diff --git a/mm/internal.h b/mm/internal.h >>index a85a3ab..4ea2d4e 100644 >>--- a/mm/internal.h >>+++ b/mm/internal.h >>@@ -192,7 +192,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma, >> /* >> * must be called with vma's mmap_sem held for read or write, and page locked. >> */ >>-extern void mlock_vma_page(struct page *page); >>+extern void mlock_vma_page(struct page *page, bool check_page); >> extern unsigned int munlock_vma_page(struct page *page); >> >> /* >>@@ -236,7 +236,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p) >> return 0; >> } >> static inline void clear_page_mlock(struct page *page) { } >>-static inline void mlock_vma_page(struct page *page) { } >>+static inline void mlock_vma_page(struct page *page, bool check_page) { } >> static inline void mlock_migrate_page(struct page *new, struct page *old) { } >> >> #endif /* !CONFIG_MMU */ >>diff --git a/mm/ksm.c b/mm/ksm.c >>index 175fff7..ec36f04 100644 >>--- a/mm/ksm.c >>+++ b/mm/ksm.c >>@@ -1064,7 +1064,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma, >> if (!PageMlocked(kpage)) { >> unlock_page(page); >> lock_page(kpage); >>- mlock_vma_page(kpage); >>+ mlock_vma_page(kpage, true); >> page = kpage; /* for final unlock */ >> } >> } >>diff --git a/mm/memory.c b/mm/memory.c >>index cf6098c..a41df6a 100644 >>--- a/mm/memory.c >>+++ b/mm/memory.c >>@@ -1602,7 +1602,7 @@ split_fallthrough: >> * know the page is still mapped, we don't even >> * need to check for file-cache page truncation. >> */ >>- mlock_vma_page(page); >>+ mlock_vma_page(page, true); >> unlock_page(page); >> } >> } >>diff --git a/mm/mlock.c b/mm/mlock.c >>index d480cd6..c395ec5 100644 >>--- a/mm/mlock.c >>+++ b/mm/mlock.c >>@@ -77,9 +77,10 @@ void clear_page_mlock(struct page *page) >> * Mark page as mlocked if not already. >> * If page on LRU, isolate and putback to move to unevictable list. 
>> */ >>-void mlock_vma_page(struct page *page) >>+void mlock_vma_page(struct page *page, bool check_page) >> { >>- BUG_ON(!PageLocked(page)); >>+ if (check_page) >>+ BUG_ON(!PageLocked(page)); >> >> if (!TestSetPageMlocked(page)) { >> mod_zone_page_state(page_zone(page), NR_MLOCK, >>diff --git a/mm/rmap.c b/mm/rmap.c >>index 55c8b8d..79d456f 100644 >>--- a/mm/rmap.c >>+++ b/mm/rmap.c >>@@ -1297,7 +1297,7 @@ out_mlock: >> */ >> if (down_read_trylock(&vma->vm_mm->mmap_sem)) { >> if (vma->vm_flags & VM_LOCKED) { >>- mlock_vma_page(page); >>+ mlock_vma_page(page, true); >> ret = SWAP_MLOCK; >> } >> up_read(&vma->vm_mm->mmap_sem); >>@@ -1385,7 +1385,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, >> BUG_ON(!page || PageAnon(page)); >> >> if (locked_vma) { >>- mlock_vma_page(page); /* no-op if already mlocked */ >>+ mlock_vma_page(page, page == check_page); /* no-op if already mlocked */ >> if (page == check_page) >> ret = SWAP_MLOCK; >> continue; /* don't unmap */ >> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 29+ messages in thread