* [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-17 8:05 UTC
To: Andrew Morton
Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro,
    riel, linux-mm, linux-kernel, Wanpeng Li

objrmap doesn't work for nonlinear VMAs because the assumption that
offset-into-file correlates with offset-into-virtual-addresses does not
hold. Hence what try_to_unmap_cluster() does is a mini "virtual scan" of
each nonlinear VMA which maps the file to which the target page belongs.
If the VMA is locked, mlock the pages in the cluster rather than unmapping
them. However, only the check page is guaranteed to be locked; the other
pages in the scan window may not be. Fix the BUG by asserting the page
lock only for the check page, rather than for every page in the virtual
scan window of a nonlinear VMA.

[ 253.869145] kernel BUG at mm/mlock.c:82!
[ 253.869549] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 253.870098] Dumping ftrace buffer:
[ 253.870098]    (ftrace buffer empty)
[ 253.870098] Modules linked in:
[ 253.870098] CPU: 10 PID: 9162 Comm: trinity-child75 Tainted: G W 3.13.0-rc4-next-20131216-sasha-00011-g5f105ec-dirty #4137
[ 253.873310] task: ffff8800c98cb000 ti: ffff8804d34e8000 task.ti: ffff8804d34e8000
[ 253.873310] RIP: 0010:[<ffffffff81281f28>] [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0
[ 253.873310] RSP: 0000:ffff8804d34e99e8 EFLAGS: 00010246
[ 253.873310] RAX: 006fffff8038002c RBX: ffffea00474944c0 RCX: ffff880807636000
[ 253.873310] RDX: ffffea0000000000 RSI: 00007f17a9bca000 RDI: ffffea00474944c0
[ 253.873310] RBP: ffff8804d34e99f8 R08: ffff880807020000 R09: 0000000000000000
[ 253.873310] R10: 0000000000000001 R11: 0000000000002000 R12: 00007f17a9bca000
[ 253.873310] R13: ffffea00474944c0 R14: 00007f17a9be0000 R15: ffff880807020000
[ 253.873310] FS: 00007f17aa31a700(0000) GS:ffff8801c9c00000(0000) knlGS:0000000000000000
[ 253.873310] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 253.873310] CR2: 00007f17a94fa000 CR3: 00000004d3b02000 CR4: 00000000000006e0
[ 253.873310] DR0: 00007f17a74ca000 DR1: 0000000000000000 DR2: 0000000000000000
[ 253.873310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 253.873310] Stack:
[ 253.873310]  0000000b3de28067 ffff880b3de28e50 ffff8804d34e9aa8 ffffffff8128bc31
[ 253.873310]  0000000000000301 ffffea0011850220 ffff8809a4039000 ffffea0011850238
[ 253.873310]  ffff8804d34e9aa8 ffff880807636060 0000000000000001 ffff880807636348
[ 253.873310] Call Trace:
[ 253.873310]  [<ffffffff8128bc31>] try_to_unmap_cluster+0x1c1/0x340
[ 253.873310]  [<ffffffff8128c60a>] try_to_unmap_file+0x20a/0x2e0
[ 253.873310]  [<ffffffff8128c7b3>] try_to_unmap+0x73/0x90
[ 253.873310]  [<ffffffff812b526d>] __unmap_and_move+0x18d/0x250
[ 253.873310]  [<ffffffff812b53e9>] unmap_and_move+0xb9/0x180
[ 253.873310]  [<ffffffff812b559b>] migrate_pages+0xeb/0x2f0
[ 253.873310]  [<ffffffff812a0660>] ? queue_pages_pte_range+0x1a0/0x1a0
[ 253.873310]  [<ffffffff812a193c>] migrate_to_node+0x9c/0xc0
[ 253.873310]  [<ffffffff812a30b8>] do_migrate_pages+0x1b8/0x240
[ 253.873310]  [<ffffffff812a3456>] SYSC_migrate_pages+0x316/0x380
[ 253.873310]  [<ffffffff812a31ec>] ? SYSC_migrate_pages+0xac/0x380
[ 253.873310]  [<ffffffff811763c6>] ? vtime_account_user+0x96/0xb0
[ 253.873310]  [<ffffffff812a34ce>] SyS_migrate_pages+0xe/0x10
[ 253.873310]  [<ffffffff843c4990>] tracesys+0xdd/0xe2
[ 253.873310] Code: 0f 1f 00 65 48 ff 04 25 10 25 1d 00 48 83 c4 08 5b c9 c3 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 07 48 89 fb a8 01 75 10 <0f> 0b 66 0f 1f 44 00 00 eb fe 66 0f 1f 44 00 00 f0 0f ba 2f 15
[ 253.873310] RIP [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0
[ 253.873310] RSP <ffff8804d34e99e8>
[ 253.904194] ---[ end trace be59c4a7f8edab3f ]---

Reported-by: Sasha Levin <sasha.levin@oracle.com>
Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
---
 mm/huge_memory.c | 2 +-
 mm/internal.h    | 4 ++--
 mm/ksm.c         | 2 +-
 mm/memory.c      | 2 +-
 mm/mlock.c       | 5 +++--
 mm/rmap.c        | 4 ++--
 6 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 33a5dc4..7a15b04 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1264,7 +1264,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 		if (page->mapping && trylock_page(page)) {
 			lru_add_drain();
 			if (page->mapping)
-				mlock_vma_page(page);
+				mlock_vma_page(page, true);
 			unlock_page(page);
 		}
 	}
diff --git a/mm/internal.h b/mm/internal.h
index a85a3ab..4ea2d4e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -192,7 +192,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma,
 /*
  * must be called with vma's mmap_sem held for read or write, and page locked.
  */
-extern void mlock_vma_page(struct page *page);
+extern void mlock_vma_page(struct page *page, bool check_page);
 extern unsigned int munlock_vma_page(struct page *page);

 /*
@@ -236,7 +236,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p)
 	return 0;
 }
 static inline void clear_page_mlock(struct page *page) { }
-static inline void mlock_vma_page(struct page *page) { }
+static inline void mlock_vma_page(struct page *page, bool check_page) { }
 static inline void mlock_migrate_page(struct page *new, struct page *old) { }

 #endif /* !CONFIG_MMU */
diff --git a/mm/ksm.c b/mm/ksm.c
index 175fff7..ec36f04 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -1064,7 +1064,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma,
 		if (!PageMlocked(kpage)) {
 			unlock_page(page);
 			lock_page(kpage);
-			mlock_vma_page(kpage);
+			mlock_vma_page(kpage, true);
 			page = kpage;	/* for final unlock */
 		}
 	}
diff --git a/mm/memory.c b/mm/memory.c
index cf6098c..a41df6a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1602,7 +1602,7 @@ split_fallthrough:
 			 * know the page is still mapped, we don't even
 			 * need to check for file-cache page truncation.
 			 */
-			mlock_vma_page(page);
+			mlock_vma_page(page, true);
 			unlock_page(page);
 		}
 	}
diff --git a/mm/mlock.c b/mm/mlock.c
index d480cd6..c395ec5 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -77,9 +77,10 @@ void clear_page_mlock(struct page *page)
  * Mark page as mlocked if not already.
  * If page on LRU, isolate and putback to move to unevictable list.
  */
-void mlock_vma_page(struct page *page)
+void mlock_vma_page(struct page *page, bool check_page)
 {
-	BUG_ON(!PageLocked(page));
+	if (check_page)
+		BUG_ON(!PageLocked(page));

 	if (!TestSetPageMlocked(page)) {
 		mod_zone_page_state(page_zone(page), NR_MLOCK,
diff --git a/mm/rmap.c b/mm/rmap.c
index 55c8b8d..79d456f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1297,7 +1297,7 @@ out_mlock:
 	 */
 	if (down_read_trylock(&vma->vm_mm->mmap_sem)) {
 		if (vma->vm_flags & VM_LOCKED) {
-			mlock_vma_page(page);
+			mlock_vma_page(page, true);
 			ret = SWAP_MLOCK;
 		}
 		up_read(&vma->vm_mm->mmap_sem);
@@ -1385,7 +1385,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
 		BUG_ON(!page || PageAnon(page));

 		if (locked_vma) {
-			mlock_vma_page(page);	/* no-op if already mlocked */
+			mlock_vma_page(page, page == check_page); /* no-op if already mlocked */
 			if (page == check_page)
 				ret = SWAP_MLOCK;
 			continue;	/* don't unmap */
--
1.8.3.2
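Editorial note: for orientation, here is a simplified sketch of the scan in
try_to_unmap_cluster() that trips the assertion, reconstructed from the
mm/rmap.c hunk above. It approximates the 3.13-era code and is not verbatim
source. Only check_page arrives here with its page lock held (taken by the
caller), yet mlock_vma_page(), which asserts PageLocked(), is called on
every page found in the window of a VM_LOCKED nonlinear VMA:

	for (; address < end; pte++, address += PAGE_SIZE) {
		struct page *page;

		if (!pte_present(*pte))
			continue;
		page = vm_normal_page(vma, address, *pte);
		BUG_ON(!page || PageAnon(page));

		if (locked_vma) {
			/*
			 * Any page here other than check_page may be
			 * unlocked, so the BUG_ON(!PageLocked(page))
			 * inside mlock_vma_page() can fire.
			 */
			mlock_vma_page(page);
			if (page == check_page)
				ret = SWAP_MLOCK;
			continue;	/* don't unmap */
		}
		/* ...otherwise the pte is unmapped here... */
	}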
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 3:14 UTC
To: Andrew Morton
Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro,
    riel, David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

On Tue, Dec 17, 2013 at 04:05:50PM +0800, Wanpeng Li wrote:

Another alternative method to fix this bug.
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 3:16 UTC
To: Andrew Morton
Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro,
    riel, David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 105 bytes --]

On Tue, Dec 17, 2013 at 04:05:50PM +0800, Wanpeng Li wrote:

Another alternative method to fix this bug.

[-- Attachment #2: 0001-2.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 3:23 UTC
To: Andrew Morton
Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro,
    riel, David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 199 bytes --]

On Wed, Dec 18, 2013 at 11:16:19AM +0800, Wanpeng Li wrote:
>On Tue, Dec 17, 2013 at 04:05:50PM +0800, Wanpeng Li wrote:
>
>Another alternative method to fix this bug.

Sorry for the wrong version.

[-- Attachment #2: 0001-2.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Sasha Levin @ 2013-12-18 3:32 UTC
To: Wanpeng Li, Andrew Morton
Cc: Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel,
    David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

On 12/17/2013 10:23 PM, Wanpeng Li wrote:
> -			mlock_vma_page(page);	/* no-op if already mlocked */
> -			if (page == check_page)
> +			if (page != check_page && trylock_page(page)) {
> +				mlock_vma_page(page);	/* no-op if already mlocked */
> +				unlock_page(page);
> +			} else if (page == check_page) {
> +				mlock_vma_page(page);	/* no-op if already mlocked */
>  				ret = SWAP_MLOCK;
> +			}

Previously, if page != check_page and the page was locked, we'd call
mlock_vma_page() anyway. With this change, we don't. In fact, we'll just
skip that entire block without doing anything.

If that's something that's never supposed to happen, can we add a

	VM_BUG_ON(page != check_page && PageLocked(page))

just to cover this new code path?

Thanks,
Sasha
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 4:12 UTC
To: Sasha Levin, Andrew Morton
Cc: Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel,
    David Rientjes, Mel Gorman, Minchan Kim, Vlastimil Babka,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 888 bytes --]

Hi Sasha,

On Tue, Dec 17, 2013 at 10:32:53PM -0500, Sasha Levin wrote:
>On 12/17/2013 10:23 PM, Wanpeng Li wrote:
[... quoted diff snipped; same as the previous message ...]
>
>Previously, if page != check_page and the page was locked, we'd call
>mlock_vma_page() anyway. With this change, we don't. In fact, we'll just
>skip that entire block without doing anything.

Thanks for pointing out. ;-)

>If that's something that's never supposed to happen, can we add a
>
>	VM_BUG_ON(page != check_page && PageLocked(page))
>
>just to cover this new code path?

How about this one?

[-- Attachment #2: 0001-3.patch --]
[-- Type: text/x-diff, Size: 0 bytes --]
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Vlastimil Babka @ 2013-12-18 9:11 UTC
To: Wanpeng Li, Sasha Levin, Andrew Morton
Cc: Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel,
    David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins,
    Johannes Weiner, linux-mm, linux-kernel

On 12/18/2013 05:12 AM, Wanpeng Li wrote:
> Hi Sasha,
[...]
> How about this one?
>
> 0001-3.patch
>
> From eab57d94c82fb3ab74b607a3ede0f5ce765ff2cc Mon Sep 17 00:00:00 2001
> From: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> Date: Wed, 18 Dec 2013 12:05:59 +0800
> Subject: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
>
> objrmap doesn't work for nonlinear VMAs because the assumption that
> offset-into-file correlates with offset-into-virtual-addresses does not
> hold. Hence what try_to_unmap_cluster() does is a mini "virtual scan"
> of each nonlinear VMA which maps the file to which the target page
> belongs. If the VMA is locked, mlock the pages in the cluster rather
> than unmapping them. However, only the check page is guaranteed to be
> locked; the other pages in the scan window may not be. Fix the BUG by
> trying to lock the pages other than check_page if they are unlocked.
>
> [... oops dump snipped; same trace as in the original report ...]
>
> Reported-by: Sasha Levin <sasha.levin@oracle.com>
> Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com>
> ---
>  mm/rmap.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 55c8b8d..1e24813 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1347,6 +1347,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  	unsigned long end;
>  	int ret = SWAP_AGAIN;
>  	int locked_vma = 0;
> +	int we_locked = 0;
>
>  	address = (vma->vm_start + cursor) & CLUSTER_MASK;
>  	end = address + CLUSTER_SIZE;
> @@ -1385,9 +1386,15 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount,
>  		BUG_ON(!page || PageAnon(page));
>
>  		if (locked_vma) {
> -			mlock_vma_page(page);	/* no-op if already mlocked */
> -			if (page == check_page)
> +			if (page != check_page) {
> +				we_locked = trylock_page(page);

If it's not us who already has the page locked, but somebody else, they
might unlock it at this point and then the BUG_ON in mlock_vma_page()
will trigger again.

> +				mlock_vma_page(page);	/* no-op if already mlocked */
> +				if (we_locked)
> +					unlock_page(page);
> +			} else if (page == check_page) {
> +				mlock_vma_page(page);	/* no-op if already mlocked */
>  				ret = SWAP_MLOCK;
> +			}
>  			continue;	/* don't unmap */
>  		}
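Editorial note: one way to close the window Vlastimil describes is to call
mlock_vma_page() only while actually holding the page lock, and to skip any
page that cannot be trylocked (it simply stays on the LRU and will be
encountered again later). The fragment below is an illustrative sketch of
that direction, not a patch posted in this thread:

	if (locked_vma) {
		if (page == check_page) {
			/* the caller already holds check_page's lock */
			mlock_vma_page(page);
			ret = SWAP_MLOCK;
		} else if (trylock_page(page)) {
			/*
			 * The lock is held across the call, so the
			 * assertion in mlock_vma_page() cannot race
			 * with a concurrent unlock_page().
			 */
			mlock_vma_page(page);
			unlock_page(page);
		}
		/* pages that could not be trylocked are left alone */
		continue;	/* don't unmap */
	}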
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2013-12-18 9:23 UTC
To: Vlastimil Babka, Sasha Levin, Andrew Morton
Cc: Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel,
    David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins,
    Johannes Weiner, linux-mm, linux-kernel

On Wed, Dec 18, 2013 at 10:11:43AM +0100, Vlastimil Babka wrote:
>On 12/18/2013 05:12 AM, Wanpeng Li wrote:
[... quoted v3 patch and oops dump snipped; see the previous message ...]
>>		if (locked_vma) {
>>-			mlock_vma_page(page);	/* no-op if already mlocked */
>>-			if (page == check_page)
>>+			if (page != check_page) {
>>+				we_locked = trylock_page(page);
>
>If it's not us who already has the page locked, but somebody else, they
>might unlock it at this point and then the BUG_ON in mlock_vma_page()
>will trigger again.

Any better idea is appreciated. ;-)

Regards,
Wanpeng Li
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Andrew Morton @ 2013-12-18 21:43 UTC
To: Wanpeng Li
Cc: Vlastimil Babka, Sasha Levin, Michel Lespinasse, Bob Liu, npiggin,
    kosaki.motohiro, riel, David Rientjes, Mel Gorman, Minchan Kim,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel

On Wed, 18 Dec 2013 17:23:03 +0800 Wanpeng Li <liwanp@linux.vnet.ibm.com> wrote:

> >>		if (locked_vma) {
> >>-			mlock_vma_page(page);	/* no-op if already mlocked */
> >>-			if (page == check_page)
> >>+			if (page != check_page) {
> >>+				we_locked = trylock_page(page);
> >
> >If it's not us who already has the page locked, but somebody else, they
> >might unlock it at this point and then the BUG_ON in mlock_vma_page()
> >will trigger again.

yes, this patch is pretty weak.

> Any better idea is appreciated. ;-)

Remove the BUG_ON() from mlock_vma_page()? Why was it added?
isolate_lru_page() and putback_lru_page() *might* require the page to be
locked, but I don't immediately see issues?
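Editorial note: for readers following along, the function in question is
small. This is a lightly abridged reconstruction of the 3.13-era
mlock_vma_page(), pieced together from the mm/mlock.c hunk in the original
patch and the isolate/putback calls quoted later in the thread; treat the
details as approximate:

	void mlock_vma_page(struct page *page)
	{
		BUG_ON(!PageLocked(page));  /* mm/mlock.c:82, the assertion at issue */

		if (!TestSetPageMlocked(page)) {
			mod_zone_page_state(page_zone(page), NR_MLOCK,
					    hpage_nr_pages(page));
			count_vm_event(UNEVICTABLE_PGMLOCKED);
			if (!isolate_lru_page(page))
				putback_lru_page(page);
		}
	}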
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Sasha Levin @ 2014-01-03 20:17 UTC
To: Andrew Morton, Wanpeng Li
Cc: Vlastimil Babka, Michel Lespinasse, Bob Liu, npiggin,
    kosaki.motohiro, riel, David Rientjes, Mel Gorman, Minchan Kim,
    Hugh Dickins, Johannes Weiner, linux-mm, linux-kernel, Linus Torvalds

On 12/18/2013 04:43 PM, Andrew Morton wrote:
[...]
> Remove the BUG_ON() from mlock_vma_page()? Why was it added?
> isolate_lru_page() and putback_lru_page() *might* require the page to be
> locked, but I don't immediately see issues?

Ping? This BUG() is triggerable in 3.13-rc6 right now.

Thanks,
Sasha
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Linus Torvalds @ 2014-01-03 20:52 UTC
To: Sasha Levin
Cc: Andrew Morton, Wanpeng Li, Vlastimil Babka, Michel Lespinasse,
    Bob Liu, Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>
> Ping? This BUG() is triggerable in 3.13-rc6 right now.

So Andrew suggested just removing the BUG_ON(), but it's been there for a
*long* time. And I detest the patch that was sent out that said "Should I
check?"

Maybe we should just remove that mlock_vma_page() thing instead in
try_to_unmap_cluster()? Or maybe actually lock the page around calling it?

		Linus
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Vlastimil Babka @ 2014-01-03 23:36 UTC
To: Linus Torvalds, Sasha Levin
Cc: Andrew Morton, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin,
    KOSAKI Motohiro, Rik van Riel, David Rientjes, Mel Gorman,
    Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On 01/03/2014 09:52 PM, Linus Torvalds wrote:
> On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <sasha.levin@oracle.com> wrote:
>>
>> Ping? This BUG() is triggerable in 3.13-rc6 right now.
>
> So Andrew suggested just removing the BUG_ON(), but it's been there
> for a *long* time.

Yes, Andrew also merged this patch for that:
http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch

But there wasn't enough confidence in the fix to send it to you yet, I
guess. The related thread: http://www.spinics.net/lists/linux-mm/msg66972.html

> And I detest the patch that was sent out that said "Should I check?"
>
> Maybe we should just remove that mlock_vma_page() thing instead in

You mean that it's already nondeterministic because it can already be
skipped when mmap_sem can't be acquired for read? I think the assumption
for this case is that mmap_sem is already held for write, which means
VM_LOCKED is unset anyway (per the comments at try_to_unmap_file(), which
calls try_to_unmap_cluster()). I'm however not sure how it is protected
from somebody else holding the semaphore...

> try_to_unmap_cluster()? Or maybe actually lock the page around calling
> it?

check_page is already locked, see try_to_munlock(), which calls
try_to_unmap_file(). So this might smell of potential deadlock?

I'm for going with the removal of the BUG_ON. The TestSetPageMlocked
should provide enough race protection.
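Editorial note: the deadlock worry follows from the call chain for the
munlock case, reconstructed here from the function names mentioned in the
thread (a sketch, not verbatim source):

	munlock_vma_page(page)        /* caller already holds page's lock */
	    -> try_to_munlock(page)
	        -> try_to_unmap_file(page, TTU_MUNLOCK)
	            -> try_to_unmap_cluster(cursor, mapcount, vma, check_page = page)

A blocking lock_page(check_page) inside try_to_unmap_cluster() would wait
on a lock the same task already holds, so only pages other than check_page
could be locked there, and only via trylock_page().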
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Andrew Morton @ 2014-01-03 23:56 UTC
To: Vlastimil Babka
Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu,
    Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On Sat, 04 Jan 2014 00:36:18 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:

> Yes, Andrew also merged this patch for that:
> http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch
>
> But there wasn't enough confidence in the fix to send it to you yet, I
> guess. The related thread: http://www.spinics.net/lists/linux-mm/msg66972.html

Yes, I'd taken the cowardly approach of scheduling it for 3.14, with a
3.13.x backport.

Nobody answered my question! Is this a new bug or is it a five-year-old
bug which we only just discovered?

I guess it doesn't matter much - we should fix it in 3.13. I'll include
it in the next for-3.13 batch.
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Sasha Levin @ 2014-01-04 3:03 UTC
To: Andrew Morton, Vlastimil Babka
Cc: Linus Torvalds, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin,
    KOSAKI Motohiro, Rik van Riel, David Rientjes, Mel Gorman,
    Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On 01/03/2014 06:56 PM, Andrew Morton wrote:
> Nobody answered my question! Is this a new bug or is it a five-year-old
> bug which we only just discovered?

I've rolled trinity back and was unable to reproduce this issue anymore.
This seems to be a five-year-old bug.

Thanks,
Sasha
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Linus Torvalds @ 2014-01-04 0:18 UTC
To: Vlastimil Babka
Cc: Sasha Levin, Andrew Morton, Wanpeng Li, Michel Lespinasse, Bob Liu,
    Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>
> I'm for going with the removal of the BUG_ON. The TestSetPageMlocked
> should provide enough race protection.

Maybe. But dammit, that's subtle, and I don't think you're even right.

It basically depends on mlock_vma_page() and munlock_vma_page() being
able to run CONCURRENTLY on the same page. In particular, you could have
a mlock_vma_page() set the bit on one CPU, and munlock_vma_page()
immediately clearing it on another, and then the rest of those functions
could run with a totally arbitrary interleaving when working with the
exact same page.

They both do basically

	if (!isolate_lru_page(page))
		putback_lru_page(page);

but one or the other would randomly win the race (it's internally
protected by the lru lock), and *if* the munlock_vma_page() wins it, it
would also do

	try_to_munlock(page);

but if mlock_vma_page() wins it, that wouldn't happen. That looks
entirely broken - you end up with the PageMlocked bit clear, but
try_to_munlock() was never called on that page, because mlock_vma_page()
got to the page isolation before the "subsequent" munlock_vma_page().

And this is very much what the page lock serialization would prevent. So
no, the PageMlocked bit in *no* way gives serialization. It's an atomic
bit op, yes, but that only "serializes" in one direction, not when you
can have a mix of bit setting and clearing.

So quite frankly, I think you're wrong. The BUG_ON() is correct, or at
least enforces some kind of ordering. And try_to_unmap_cluster() is just
broken in calling that without the page being locked. That's my opinion.
There may be some *other* reason why it all happens to work, but no,
"TestSetPageMlocked should provide enough race protection" is simply not
true, and even if it were, it's way too subtle and odd to be a good rule.

So I really object to just removing the BUG_ON(). Not with a *lot* more
explanation as to why these kinds of issues wouldn't matter.

		Linus
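Editorial note: written out as a timeline, the interleaving Linus describes
(assuming two CPUs operating on the same page with neither holding the page
lock) would look like this:

	CPU 0: mlock_vma_page()          CPU 1: munlock_vma_page()
	------------------------         -------------------------
	TestSetPageMlocked(page)
	                                 TestClearPageMlocked(page)
	isolate_lru_page(page)
	  /* wins the lru lock */
	                                 isolate_lru_page(page) /* fails */
	putback_lru_page(page)
	                                 /* try_to_munlock(page) never runs;
	                                    it follows a successful isolation */

The end state is the one he objects to: the PageMlocked bit is clear, but
try_to_munlock() was never called on the page.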
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Vlastimil Babka @ 2014-01-04 8:09 UTC
To: Linus Torvalds
Cc: Sasha Levin, Andrew Morton, Wanpeng Li, Michel Lespinasse, Bob Liu,
    Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

On 01/04/2014 01:18 AM, Linus Torvalds wrote:
> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote:
>>
>> I'm for going with the removal of the BUG_ON. The TestSetPageMlocked
>> should provide enough race protection.
>
> Maybe. But dammit, that's subtle, and I don't think you're even right.
[...]
> but if mlock_vma_page() wins it, that wouldn't happen. That looks
> entirely broken - you end up with the PageMlocked bit clear, but
> try_to_munlock() was never called on that page, because
> mlock_vma_page() got to the page isolation before the "subsequent"
> munlock_vma_page().

I got the impression (see e.g. the munlock_vma_page() comments) that the
whole thing is designed with this possibility in mind. isolate_lru_page()
may fail (presumably also in other scenarios than this), and if
try_to_munlock() was not called here, then yes, the page might lose the
PageMlocked bit and go to the LRU instead of the unevictable list, but
try_to_unmap() should catch and fix this. That would also explain why
mlock_vma_page() is called from try_to_unmap_cluster().

So if I understand correctly, the PageMlocked bit is not something that
has to be set correctly 100% of the time; rather, when it's set correctly
most of the time, most of these pages will go to the unevictable list and
spare vmscan's time.

> And this is very much what the page lock serialization would prevent.
[...]
> So I really object to just removing the BUG_ON(). Not with a *lot*
> more explanation as to why these kinds of issues wouldn't matter.

Right, it was stupid of me to write such a strong statement without any
details. I wanted to review that patch when back at work next week, but
since it came up now, I just wanted to point out that it's in the
pipeline for this bug.
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Wanpeng Li @ 2014-01-05 0:27 UTC
To: Andrew Morton, Vlastimil Babka
Cc: Linus Torvalds, Sasha Levin, Michel Lespinasse, Bob Liu, Nick Piggin,
    KOSAKI Motohiro, Rik van Riel, David Rientjes, Mel Gorman,
    Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

Hi Andrew, Vlastimil,

On Fri, Jan 03, 2014 at 04:18:05PM -0800, Linus Torvalds wrote:
[...]
>So I really object to just removing the BUG_ON(). Not with a *lot*
>more explanation as to why these kinds of issues wouldn't matter.

Any better ideas to improve my patch, which goes in the "lock the page
around calling it" direction? ;-)

http://marc.info/?l=linux-mm&m=138733994417230&w=2

Regards,
Wanpeng Li
* RE: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: Motohiro Kosaki @ 2014-01-06 16:47 UTC
To: Linus Torvalds, Vlastimil Babka
Cc: Sasha Levin, Andrew Morton, Wanpeng Li, Michel Lespinasse, Bob Liu,
    Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes,
    Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm,
    Linux Kernel Mailing List

> -----Original Message-----
> From: Linus Torvalds
> Sent: Friday, January 03, 2014 7:18 PM
[...]
> So I really object to just removing the BUG_ON(). Not with a *lot* more
> explanation as to why these kinds of issues wouldn't matter.

I don't have a perfect answer, but I can explain a bit of the history.
Let me try.

First off, five years ago, Lee's original putback_lru_page()
implementation required the page lock, but I removed that restriction
months later. That's why we see this strange BUG_ON here.

Five years ago, both mlock(2) and munlock(2) called do_mlock(), and it
was protected by mmap_sem (write mode). So mlock and munlock had no race.
Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem
in read mode; however, that is still enough to protect against munlock.

Next, in the case of mlock vs reclaim, the key is that mlock(2) is a
two-step operation: 1) turn on VM_LOCKED under mmap_sem write mode, and
2) turn on PG_mlocked under mmap_sem read mode. If reclaim races against
step (1), reclaim must lose because it uses trylock. On the other hand,
if reclaim races against step (2), reclaim must detect VM_LOCKED because
both the VM_LOCKED modifier and observer take mmap_sem.

By the way, page isolation is still necessary because we need to protect
against other page modifications like page migration.

My memory has mostly been flushed, and I might have lost some technical
concerns and past discussion. Please point it out if I am overlooking
something.

Thanks.
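Editorial note: as a reference for the two steps, this is roughly how the
3.13-era mlock(2) entry point is structured. It is a simplified sketch
from memory of that code (alignment and RLIMIT_MEMLOCK handling elided),
not a quotation from the thread:

	SYSCALL_DEFINE2(mlock, unsigned long, start, size_t, len)
	{
		int error;

		/* ...alignment and rlimit checks elided... */

		down_write(&current->mm->mmap_sem);
		error = do_mlock(start, len, 1);  /* step 1: set VM_LOCKED */
		up_write(&current->mm->mmap_sem);

		/*
		 * Step 2: fault the pages in and set PG_mlocked,
		 * taking mmap_sem only in read mode.
		 */
		if (!error)
			error = __mm_populate(start, len, 0);
		return error;
	}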
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs
From: KOSAKI Motohiro @ 2014-01-06 22:01 UTC
To: Motohiro Kosaki
Cc: Linus Torvalds, Vlastimil Babka, Sasha Levin, Andrew Morton,
    Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Rik van Riel,
    David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins,
    Johannes Weiner, linux-mm, Linux Kernel Mailing List

On Mon, Jan 6, 2014 at 11:47 AM, Motohiro Kosaki
<Motohiro.Kosaki@us.fujitsu.com> wrote:
[...]
> My memory has mostly been flushed, and I might have lost some technical
> concerns and past discussion. Please point it out if I am overlooking
> something.

No, I was talking about a completely different issue. My memory is
completely broken, as I said. I need to read the latest code and dig up
the past discussion. Sorry again; please ignore my last mail.
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-06 16:47 ` Motohiro Kosaki 2014-01-06 22:01 ` KOSAKI Motohiro @ 2014-01-07 5:27 ` Wanpeng Li 2014-01-07 15:01 ` Vlastimil Babka 2 siblings, 0 replies; 29+ messages in thread From: Wanpeng Li @ 2014-01-07 5:27 UTC (permalink / raw) To: Motohiro Kosaki Cc: Linus Torvalds, Vlastimil Babka, Sasha Levin, Andrew Morton, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List Hi Motohiro, On Mon, Jan 06, 2014 at 08:47:23AM -0800, Motohiro Kosaki wrote: > > >> -----Original Message----- >> From: linus971@gmail.com [mailto:linus971@gmail.com] On Behalf Of Linus >> Torvalds >> Sent: Friday, January 03, 2014 7:18 PM >> To: Vlastimil Babka >> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu; >> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman; >> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing >> List >> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear >> VMAs >> >> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote: >> > >> > I'm for going with the removal of BUG_ON. The TestSetPageMlocked >> > should provide enough race protection. >> >> Maybe. But dammit, that's subtle, and I don't think you're even right. >> >> It basically depends on mlock_vma_page() and munlock_vma_page() being >> able to run CONCURRENTLY on the same page. In particular, you could have a >> mlock_vma_page() set the bit on one CPU, and munlock_vma_page() >> immediately clearing it on another, and then the rest of those functions >> could run with a totally arbitrary interleaving when working with the exact >> same page. >> >> They both do basically >> >> if (!isolate_lru_page(page)) >> putback_lru_page(page); >> >> but one or the other would randomly win the race (it's internally protected >> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do >> >> try_to_munlock(page); >> >> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely >> broken - you end up with the PageMlocked bit clear, but >> try_to_munlock() was never called on that page, because >> mlock_vma_page() got to the page isolation before the "subsequent" >> munlock_vma_page(). >> >> And this is very much what the page lock serialization would prevent. >> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op, >> yes, but that only "serializes" in one direction, not when you can have a mix >> of bit setting and clearing. >> >> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least >> enforces some kind of ordering. And try_to_unmap_cluster() is just broken >> in calling that without the page being locked. That's my opinion. There may >> be some *other* reason why it all happens to work, but no, >> "TestSetPageMlocked should provide enough race protection" is simply not >> true, and even if it were, it's way too subtle and odd to be a good rule. >> >> So I really object to just removing the BUG_ON(). Not with a *lot* more >> explanation as to why these kinds of issues wouldn't matter. > >I don't have a perfect answer. But I can explain a bit history. Let's me try. > >First off, 5 years ago, Lee's original putback_lru_page() implementation required >page-lock, but I removed the restriction months later. That's why we can see >strange BUG_ON here. 
> >5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by >mmap_sem (write mdoe). Then, mlock and munlock had no race. >Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However it is enough to >protect against munlock. > >Next, In case of mlock vs reclaim, the key is that mlock(2) has two step operation. 1) turn on VM_LOCKED under >mmap_sem write-mode, 2) turn on Page_Mlocked under mmap_sem read-mode. | If reclaim race against step (1), reclaim must lose because it uses trylock. Could you point out when page is locked during 1)? Regards, Wanpeng Li >On the other hand, if reclaim race against step (2), reclaim must detect >VM_LOCKED because both VM_LOCKED modifier and observer take mmap-sem. > >By the way, page isolation is still necessary because we need to protect another page modification like page migration. > > >My memory was alomostly flushed and I might lost some technical concern and past discussion. Please point me out, >If I am overlooking something. > >Thanks.
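For context on the question being raised here: during step (1) mlock itself holds no page lock at all, so one reading of the quoted sentence is that the trylock belongs to the reclaim side, which only ever trylocks the pages it scans. A rough sketch of that reclaim-side pattern follows; shrink_page_list() in mm/vmscan.c behaves along these lines, but this is a simplified paraphrase with an invented function name, not the verbatim code.

#include <linux/swap.h>
#include <linux/rmap.h>

static bool reclaim_one_page(struct page *page)
{
	/* the page was already isolated from the LRU before this point */
	if (!trylock_page(page))
		return false;		/* locked elsewhere: keep it, retry later */

	if (page_mapped(page) &&
	    try_to_unmap(page, TTU_UNMAP) == SWAP_MLOCK) {
		/* the rmap walk found a VM_LOCKED vma: cull the page
		 * to the unevictable list instead of reclaiming it */
		unlock_page(page);
		putback_lru_page(page);
		return false;
	}

	unlock_page(page);
	return true;			/* candidate for freeing */
}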
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-06 16:47 ` Motohiro Kosaki 2014-01-06 22:01 ` KOSAKI Motohiro 2014-01-07 5:27 ` Wanpeng Li @ 2014-01-07 15:01 ` Vlastimil Babka 2014-01-08 1:06 ` Bob Liu ` (2 more replies) 2 siblings, 3 replies; 29+ messages in thread From: Vlastimil Babka @ 2014-01-07 15:01 UTC (permalink / raw) To: Motohiro Kosaki, Andrew Morton Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On 01/06/2014 05:47 PM, Motohiro Kosaki wrote: > > >> -----Original Message----- >> From: linus971@gmail.com [mailto:linus971@gmail.com] On Behalf Of Linus >> Torvalds >> Sent: Friday, January 03, 2014 7:18 PM >> To: Vlastimil Babka >> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu; >> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman; >> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing >> List >> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear >> VMAs >> >> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote: >>> >>> I'm for going with the removal of BUG_ON. The TestSetPageMlocked >>> should provide enough race protection. >> >> Maybe. But dammit, that's subtle, and I don't think you're even right. >> >> It basically depends on mlock_vma_page() and munlock_vma_page() being >> able to run CONCURRENTLY on the same page. In particular, you could have a >> mlock_vma_page() set the bit on one CPU, and munlock_vma_page() >> immediately clearing it on another, and then the rest of those functions >> could run with a totally arbitrary interleaving when working with the exact >> same page. >> >> They both do basically >> >> if (!isolate_lru_page(page)) >> putback_lru_page(page); >> >> but one or the other would randomly win the race (it's internally protected >> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do >> >> try_to_munlock(page); >> >> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely >> broken - you end up with the PageMlocked bit clear, but >> try_to_munlock() was never called on that page, because >> mlock_vma_page() got to the page isolation before the "subsequent" >> munlock_vma_page(). >> >> And this is very much what the page lock serialization would prevent. >> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op, >> yes, but that only "serializes" in one direction, not when you can have a mix >> of bit setting and clearing. >> >> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least >> enforces some kind of ordering. And try_to_unmap_cluster() is just broken >> in calling that without the page being locked. That's my opinion. There may >> be some *other* reason why it all happens to work, but no, >> "TestSetPageMlocked should provide enough race protection" is simply not >> true, and even if it were, it's way too subtle and odd to be a good rule. >> >> So I really object to just removing the BUG_ON(). Not with a *lot* more >> explanation as to why these kinds of issues wouldn't matter. > > I don't have a perfect answer. But I can explain a bit history. Let's me try. > > First off, 5 years ago, Lee's original putback_lru_page() implementation required > page-lock, but I removed the restriction months later. That's why we can see > strange BUG_ON here. 
> > 5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by > mmap_sem (write mdoe). Then, mlock and munlock had no race. > Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However it is enough to > protect against munlock. > > Next, In case of mlock vs reclaim, the key is that mlock(2) has two step operation. 1) turn on VM_LOCKED under > mmap_sem write-mode, 2) turn on Page_Mlocked under mmap_sem read-mode. If reclaim race against step (1), > reclaim must lose because it uses trylock. On the other hand, if reclaim race against step (2), reclaim must detect > VM_LOCKED because both VM_LOCKED modifier and observer take mmap-sem. > > By the way, page isolation is still necessary because we need to protect another page modification like page migration. I guess you meant page locking, not (lru) isolation. Indeed, the documentation at Documentation/vm/unevictable-lru.txt also discusses races with page migration. I've checked and it seems really the case that mlock_migrate_page() could race with mlock_vma_page() so that PG_mlocked is set on the old page again after deciding that the new page will be without the flag. (or after the flag was transferred to the new page) That would not be fatal, but distort accounting anyway. So here's a patch that if accepted should replace the removal of BUG_ON patch in -mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch The idea is that try_to_unmap_cluster() will try locking the page for mlock, and just leave it alone if lock cannot be obtained. Again that's not fatal, as eventually something will encounter and mlock the page. -----8<----- From: Vlastimil Babka <vbabka@suse.cz> Date: Tue, 7 Jan 2014 14:59:58 +0100 Subject: [PATCH] mm: try_to_unmap_cluster() should lock_page() before mlocking A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing with trinity. The call site try_to_unmap_cluster() does not lock the pages other than its check_page parameter (which is already locked). The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat unclear, but apparently it serializes against page migration, which could otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the page would be eventually encountered again, but NR_MLOCK accounting would become distorted nevertheless. This patch adds a comment to the BUG_ON in mlock_vma_page() and munlock_vma_page() to that effect. The call site try_to_unmap_cluster() is fixed so that for page != check_page, trylock_page() is attempted (to avoid possible deadlocks as we already have check_page locked) and mlock_vma_page() is performed only upon success. If the page lock cannot be obtained, the page is left without PG_mlocked, which is again not a problem in the whole unevictable memory design. 
Reported-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Michel Lespinasse <walken@google.com> Cc: Bob Liu <bob.liu@oracle.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: <stable@vger.kernel.org> Signed-off-by: Vlastimil Babka <vbabka@suse.cz> --- mm/mlock.c | 2 ++ mm/rmap.c | 14 ++++++++++++-- 2 files changed, 14 insertions(+), 2 deletions(-) diff --git a/mm/mlock.c b/mm/mlock.c index 192e6ee..1b12dfa 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -79,6 +79,7 @@ void clear_page_mlock(struct page *page) */ void mlock_vma_page(struct page *page) { + /* Serialize with page migration */ BUG_ON(!PageLocked(page)); if (!TestSetPageMlocked(page)) { @@ -153,6 +154,7 @@ unsigned int munlock_vma_page(struct page *page) { unsigned int nr_pages; + /* For try_to_munlock() and to serialize with page migration */ BUG_ON(!PageLocked(page)); if (TestClearPageMlocked(page)) { diff --git a/mm/rmap.c b/mm/rmap.c index 068522d..b99c742 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, BUG_ON(!page || PageAnon(page)); if (locked_vma) { - mlock_vma_page(page); /* no-op if already mlocked */ - if (page == check_page) + if (page == check_page) { + /* we know we have check_page locked */ + mlock_vma_page(page); ret = SWAP_MLOCK; + } else if (trylock_page(page)) { + /* + * If we can lock the page, perform mlock. + * Otherwise leave the page alone, it will be + * eventually encountered again later. + */ + mlock_vma_page(page); + unlock_page(page); + } continue; /* don't unmap */ } -- 1.8.4
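A note on the design choice in the patch above: it trylocks rather than calling lock_page() outright because try_to_unmap_cluster() is entered with check_page already locked, so sleeping on a second page lock could invert lock ordering against another task. A schematic of the hazard being avoided, written as a C comment for illustration only (an assumed scenario, not an observed trace):

/*
 * Task A (try_to_unmap_cluster)       Task B
 * ------------------------------      ------------------------------
 * holds the lock on check_page        holds the lock on other_page
 * lock_page(other_page);  <- sleeps   lock_page(check_page);  <- sleeps
 *
 * Neither task can make progress: an ABBA deadlock on two page locks.
 * trylock_page(other_page) lets Task A skip the page instead of
 * sleeping, so no global ordering of page locks is required.
 */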
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-07 15:01 ` Vlastimil Babka @ 2014-01-08 1:06 ` Bob Liu 2014-01-08 2:44 ` Linus Torvalds 2014-01-10 17:48 ` Motohiro Kosaki 2 siblings, 0 replies; 29+ messages in thread From: Bob Liu @ 2014-01-08 1:06 UTC (permalink / raw) To: Vlastimil Babka Cc: Motohiro Kosaki, Andrew Morton, Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On 01/07/2014 11:01 PM, Vlastimil Babka wrote: > On 01/06/2014 05:47 PM, Motohiro Kosaki wrote: >> >> >>> -----Original Message----- >>> From: linus971@gmail.com [mailto:linus971@gmail.com] On Behalf Of Linus >>> Torvalds >>> Sent: Friday, January 03, 2014 7:18 PM >>> To: Vlastimil Babka >>> Cc: Sasha Levin; Andrew Morton; Wanpeng Li; Michel Lespinasse; Bob Liu; >>> Nick Piggin; Motohiro Kosaki JP; Rik van Riel; David Rientjes; Mel Gorman; >>> Minchan Kim; Hugh Dickins; Johannes Weiner; linux-mm; Linux Kernel Mailing >>> List >>> Subject: Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear >>> VMAs >>> >>> On Fri, Jan 3, 2014 at 3:36 PM, Vlastimil Babka <vbabka@suse.cz> wrote: >>>> >>>> I'm for going with the removal of BUG_ON. The TestSetPageMlocked >>>> should provide enough race protection. >>> >>> Maybe. But dammit, that's subtle, and I don't think you're even right. >>> >>> It basically depends on mlock_vma_page() and munlock_vma_page() being >>> able to run CONCURRENTLY on the same page. In particular, you could have a >>> mlock_vma_page() set the bit on one CPU, and munlock_vma_page() >>> immediately clearing it on another, and then the rest of those functions >>> could run with a totally arbitrary interleaving when working with the exact >>> same page. >>> >>> They both do basically >>> >>> if (!isolate_lru_page(page)) >>> putback_lru_page(page); >>> >>> but one or the other would randomly win the race (it's internally protected >>> by the lru lock), and *if* the munlock_vma_page() wins it, it would also do >>> >>> try_to_munlock(page); >>> >>> but if mlock_vma_page() wins it, that wouldn't happen. That looks entirely >>> broken - you end up with the PageMlocked bit clear, but >>> try_to_munlock() was never called on that page, because >>> mlock_vma_page() got to the page isolation before the "subsequent" >>> munlock_vma_page(). >>> >>> And this is very much what the page lock serialization would prevent. >>> So no, the PageMlocked in *no* way gives serialization. It's an atomic bit op, >>> yes, but that only "serializes" in one direction, not when you can have a mix >>> of bit setting and clearing. >>> >>> So quite frankly, I think you're wrong. The BUG_ON() is correct, or at least >>> enforces some kind of ordering. And try_to_unmap_cluster() is just broken >>> in calling that without the page being locked. That's my opinion. There may >>> be some *other* reason why it all happens to work, but no, >>> "TestSetPageMlocked should provide enough race protection" is simply not >>> true, and even if it were, it's way too subtle and odd to be a good rule. >>> >>> So I really object to just removing the BUG_ON(). Not with a *lot* more >>> explanation as to why these kinds of issues wouldn't matter. >> >> I don't have a perfect answer. But I can explain a bit history. Let's me try. 
>> >> First off, 5 years ago, Lee's original putback_lru_page() implementation required >> page-lock, but I removed the restriction months later. That's why we can see >> strange BUG_ON here. >> >> 5 years ago, both mlock(2) and munlock(2) called do_mlock() and it was protected by >> mmap_sem (write mdoe). Then, mlock and munlock had no race. >> Now, __mm_populate() (called by mlock(2)) is only protected by mmap_sem read-mode. However it is enough to >> protect against munlock. >> >> Next, In case of mlock vs reclaim, the key is that mlock(2) has two step operation. 1) turn on VM_LOCKED under >> mmap_sem write-mode, 2) turn on Page_Mlocked under mmap_sem read-mode. If reclaim race against step (1), >> reclaim must lose because it uses trylock. On the other hand, if reclaim race against step (2), reclaim must detect >> VM_LOCKED because both VM_LOCKED modifier and observer take mmap-sem. >> >> By the way, page isolation is still necessary because we need to protect another page modification like page migration. > > I guess you meant page locking, not (lru) isolation. Indeed, the documentation > at Documentation/vm/unevictable-lru.txt also discusses races with page migration. > > I've checked and it seems really the case that mlock_migrate_page() > could race with mlock_vma_page() so that PG_mlocked is set on the > old page again after deciding that the new page will be without the flag. > (or after the flag was transferred to the new page) > That would not be fatal, but distort accounting anyway. > > So here's a patch that if accepted should replace the removal of BUG_ON patch in > -mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch > Make sense to me. > The idea is that try_to_unmap_cluster() will try locking the page > for mlock, and just leave it alone if lock cannot be obtained. Again > that's not fatal, as eventually something will encounter and mlock the page. > > -----8<----- > > From: Vlastimil Babka <vbabka@suse.cz> > Date: Tue, 7 Jan 2014 14:59:58 +0100 > Subject: [PATCH] mm: try_to_unmap_cluster() should lock_page() before mlocking > > A BUG_ON(!PageLocked) was triggered in mlock_vma_page() by Sasha Levin fuzzing > with trinity. The call site try_to_unmap_cluster() does not lock the pages > other than its check_page parameter (which is already locked). > > The BUG_ON in mlock_vma_page() is not documented and its purpose is somewhat > unclear, but apparently it serializes against page migration, which could > otherwise fail to transfer the PG_mlocked flag. This would not be fatal, as the > page would be eventually encountered again, but NR_MLOCK accounting would > become distorted nevertheless. This patch adds a comment to the BUG_ON in > mlock_vma_page() and munlock_vma_page() to that effect. > > The call site try_to_unmap_cluster() is fixed so that for page != check_page, > trylock_page() is attempted (to avoid possible deadlocks as we already have > check_page locked) and mlock_vma_page() is performed only upon success. If the > page lock cannot be obtained, the page is left without PG_mlocked, which is > again not a problem in the whole unevictable memory design. 
> > Reported-by: Sasha Levin <sasha.levin@oracle.com> > Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> > Cc: Michel Lespinasse <walken@google.com> > Cc: Bob Liu <bob.liu@oracle.com> > Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> > Cc: Rik van Riel <riel@redhat.com> > Cc: David Rientjes <rientjes@google.com> > Cc: Mel Gorman <mgorman@suse.de> > Cc: Hugh Dickins <hughd@google.com> > Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> > Cc: <stable@vger.kernel.org> > Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Bob Liu <bob.liu@oracle.com> > --- > mm/mlock.c | 2 ++ > mm/rmap.c | 14 ++++++++++++-- > 2 files changed, 14 insertions(+), 2 deletions(-) > > diff --git a/mm/mlock.c b/mm/mlock.c > index 192e6ee..1b12dfa 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -79,6 +79,7 @@ void clear_page_mlock(struct page *page) > */ > void mlock_vma_page(struct page *page) > { > + /* Serialize with page migration */ > BUG_ON(!PageLocked(page)); > > if (!TestSetPageMlocked(page)) { > @@ -153,6 +154,7 @@ unsigned int munlock_vma_page(struct page *page) > { > unsigned int nr_pages; > > + /* For try_to_munlock() and to serialize with page migration */ > BUG_ON(!PageLocked(page)); > > if (TestClearPageMlocked(page)) { > diff --git a/mm/rmap.c b/mm/rmap.c > index 068522d..b99c742 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, > BUG_ON(!page || PageAnon(page)); > > if (locked_vma) { > - mlock_vma_page(page); /* no-op if already mlocked */ > - if (page == check_page) > + if (page == check_page) { > + /* we know we have check_page locked */ > + mlock_vma_page(page); > ret = SWAP_MLOCK; > + } else if (trylock_page(page)) { > + /* > + * If we can lock the page, perform mlock. > + * Otherwise leave the page alone, it will be > + * eventually encountered again later. > + */ > + mlock_vma_page(page); > + unlock_page(page); > + } > continue; /* don't unmap */ > } > > -- Regards, -Bob -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-07 15:01 ` Vlastimil Babka 2014-01-08 1:06 ` Bob Liu @ 2014-01-08 2:44 ` Linus Torvalds 2014-01-10 17:48 ` Motohiro Kosaki 2 siblings, 0 replies; 29+ messages in thread From: Linus Torvalds @ 2014-01-08 2:44 UTC (permalink / raw) To: Vlastimil Babka Cc: Motohiro Kosaki, Andrew Morton, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On Tue, Jan 7, 2014 at 11:01 PM, Vlastimil Babka <vbabka@suse.cz> wrote: > > So here's a patch that if accepted should replace the removal of BUG_ON patch in > -mm tree: http://ozlabs.org/~akpm/mmots/broken-out/mm-remove-bug_on-from-mlock_vma_page.patch > > The idea is that try_to_unmap_cluster() will try locking the page > for mlock, and just leave it alone if lock cannot be obtained. Again > that's not fatal, as eventually something will encounter and mlock the page. This looks sane to me. Andrew? Linus
* RE: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-07 15:01 ` Vlastimil Babka 2014-01-08 1:06 ` Bob Liu 2014-01-08 2:44 ` Linus Torvalds @ 2014-01-10 17:48 ` Motohiro Kosaki 2014-01-13 14:03 ` Vlastimil Babka 2 siblings, 1 reply; 29+ messages in thread From: Motohiro Kosaki @ 2014-01-10 17:48 UTC (permalink / raw) To: Vlastimil Babka, Andrew Morton Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim > diff --git a/mm/rmap.c b/mm/rmap.c > index 068522d..b99c742 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long > cursor, unsigned int *mapcount, > BUG_ON(!page || PageAnon(page)); > > if (locked_vma) { > - mlock_vma_page(page); /* no-op if already > mlocked */ > - if (page == check_page) > + if (page == check_page) { > + /* we know we have check_page locked */ > + mlock_vma_page(page); > ret = SWAP_MLOCK; > + } else if (trylock_page(page)) { > + /* > + * If we can lock the page, perform mlock. > + * Otherwise leave the page alone, it will be > + * eventually encountered again later. > + */ > + mlock_vma_page(page); > + unlock_page(page); > + } > continue; /* don't unmap */ > } I audited all the related mm code. However, I couldn't find any race that this lock can close. First off, the current munlock code is crazy tricky. munlock down_write(mmap_sem) do_mlock() mlock_fixup munlock_vma_pages_range __munlock_pagevec spin_lock_irq(zone->lru_lock) TestClearPageMlocked(page) del_page_from_lru_list spin_unlock_irq(zone->lru_lock) lock_page __munlock_isolated_page unlock_page up_write(mmap_sem) Look, TestClearPageMlocked(page) is not protected by lock_page. But this is still safe, because PageMlocked means one or more vmas are marked VM_LOCKED, so we only need to care about turning down the final VM_LOCKED. I.e., mmap_sem protects them. And, spin_lock_irq(zone->lru_lock) del_page_from_lru_list spin_unlock_irq(zone->lru_lock) This idiom ensures that either I or someone else isolates the page from the lru, and isolated pages will be put back by putback_lru_page in any case. So the pages will move to the right lru eventually. And so, taking the page lock doesn't help to close the race against munlock. On the other hand, page migration has the following call stack. some-caller [isolate_lru_page] unmap_and_move __unmap_and_move trylock_page try_to_unmap move_to_new_page migrate_page migrate_page_copy unlock_page The critical part (i.e. migrate_page_copy) is protected by both page isolation and page lock. The page fault path takes the page lock and doesn't use page isolation. This is correct. try_to_unmap_cluster doesn't take the page lock, but it ensures the page is isolated. This is correct too. Plus, do_wp_page() has the following comment. But it is wrong: this lock is necessary to protect against page migration, not lru manipulation. if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) { lock_page(old_page); /* LRU manipulation */ munlock_vma_page(old_page); unlock_page(old_page); } But, you know, this is a crazy ugly lock design. I'm OK with changing the rule so that PG_mlocked must be protected by the page lock. If so, I propose adding a PageMlocked() check like this } else if (!PageMlocked(page) && trylock_page(page)) { /* * If we can lock the page, perform mlock. * Otherwise leave the page alone, it will be * eventually encountered again later. */ mlock_vma_page(page); unlock_page(page); This is safe.
Even if the race happens and vmscan fails to mark PG_mlocked, that's no problem. The next vmscan pass may do the right job and fix the mistake. Anyway, I'm also sure this is not a recent issue. It has lived for 5 years. There is no reason to rush.
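For reference, the unlocked PageMlocked() test proposed above would be an optimization only; the authoritative transition stays the atomic test-and-set inside mlock_vma_page(). Roughly, paraphrasing mm/mlock.c of this era with the details trimmed:

#include <linux/mm.h>
#include <linux/swap.h>
#include <linux/vmstat.h>
#include <linux/huge_mm.h>

void mlock_vma_page(struct page *page)
{
	BUG_ON(!PageLocked(page));

	if (!TestSetPageMlocked(page)) {
		/* we won the atomic race: fix up the NR_MLOCK counter
		 * and move the page to the unevictable list */
		mod_zone_page_state(page_zone(page), NR_MLOCK,
				    hpage_nr_pages(page));
		count_vm_event(UNEVICTABLE_PGMLOCKED);
		if (!isolate_lru_page(page))
			putback_lru_page(page);
	}
}

Either way a stale read of the flag is harmless in itself: a false negative only costs a redundant trylock and a no-op mlock_vma_page() call, and a false positive merely defers the marking to a later scan, as the mail above argues.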
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-10 17:48 ` Motohiro Kosaki @ 2014-01-13 14:03 ` Vlastimil Babka 2014-01-14 11:05 ` Vlastimil Babka 0 siblings, 1 reply; 29+ messages in thread From: Vlastimil Babka @ 2014-01-13 14:03 UTC (permalink / raw) To: Motohiro Kosaki, Andrew Morton Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On 01/10/2014 06:48 PM, Motohiro Kosaki wrote: >> diff --git a/mm/rmap.c b/mm/rmap.c >> index 068522d..b99c742 100644 >> --- a/mm/rmap.c >> +++ b/mm/rmap.c >> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long >> cursor, unsigned int *mapcount, >> BUG_ON(!page || PageAnon(page)); >> >> if (locked_vma) { >> - mlock_vma_page(page); /* no-op if already >> mlocked */ >> - if (page == check_page) >> + if (page == check_page) { >> + /* we know we have check_page locked */ >> + mlock_vma_page(page); >> ret = SWAP_MLOCK; >> + } else if (trylock_page(page)) { >> + /* >> + * If we can lock the page, perform mlock. >> + * Otherwise leave the page alone, it will be >> + * eventually encountered again later. >> + */ >> + mlock_vma_page(page); >> + unlock_page(page); >> + } >> continue; /* don't unmap */ >> } > > I audited all related mm code. However I couldn't find any race that it can close. Well, I would say the lock here closes the race with page migration, no? (As discussed below) > First off, current munlock code is crazy tricky. Oops, that's actually a result of my patches to speed up munlock by batching pages (since 3.12). > munlock > down_write(mmap_sem) > do_mlock() > mlock_fixup > munlock_vma_pages_range > __munlock_pagevec > spin_lock_irq(zone->lru_lock) > TestClearPageMlocked(page) > del_page_from_lru_list > spin_unlock_irq(zone->lru_lock) > lock_page > __munlock_isolated_page > unlock_page > > up_write(mmap_sem) > > Look, TestClearPageMlocked(page) is not protected by lock_page. Right :( That's my fault, when developing the patch series I didn't see the page migration race, and it seemed that lock is only needed to protect the rmap operations in __munlock_isolated_page() > But this is still > safe because Page_mocked mean one or more vma marked VM_LOCKED then we > only need care about turning down final VM_LOCKED. I.e. mmap_sem protect them. > > And, > > spin_lock_irq(zone->lru_lock) > del_page_from_lru_list > spin_unlock_irq(zone->lru_lock) > > This idiom ensures I or someone else isolate the page from lru and isolated pages > will be put back by putback_lru_page in anycase. So, the pages will move the right > lru eventually. > > And then, taking page-lock doesn't help to close vs munlock race. > > On the other hands, page migration has the following call stack. > > some-caller [isolate_lru_page] > unmap_and_move > __unmap_and_move > trylock_page > try_to_unmap > move_to_new_page > migrate_page > migrate_page_copy > unlock_page > > The critical part (i.e. migrate_page_copy) is protected by both page isolation and page lock. However, none of that protects against setting PG_mlocked in try_to_unmap_cluster() -> mlock_vma_page(), or clearing PG_mlocked in __munlock_pagevec(). 
The race I have in mind for munlock is: CPU1 does page migration some-caller [isolate_lru_page] unmap_and_move __unmap_and_move trylock_page try_to_unmap move_to_new_page migrate_page migrate_page_copy mlock_migrate_page - transfers PG_mlocked from old page to new page CPU2 does munlock: munlock down_write(mmap_sem) do_mlock() mlock_fixup munlock_vma_pages_range __munlock_pagevec spin_lock_irq(zone->lru_lock) TestClearPageMlocked(page) - here it finds PG_mlocked already cleared, so it stops, but meanwhile the new page already has PG_mlocked set and will stay unevictable > Page fault path take lock page and doesn't use page isolation. This is correct. > try_to_unmap_cluster doesn't take page lock, but it ensure the page is isolated. This is correct too. I don't see where try_to_unmap_cluster() guarantees that pages other than check_page are isolated (I might just be missing that). So in the race example above, CPU2 could be doing try_to_unmap_cluster() and set PG_mlocked on the old page, with no effect on the new page. Not fatal for the design of lazy mlocking, but it distorts accounting anyway. > Plus, do_wp_page() has the following comment. But it is wrong. This lock is necessary to protect against > page migration, but not lru manipulation. > > if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) { > lock_page(old_page); /* LRU manipulation */ > munlock_vma_page(old_page); > unlock_page(old_page); > } > > > But, you know, this is crazy ugly lock design. I'm ok to change the rule to that PG_mlocked must be protected > page lock. If so, I propose to add PageMlocked() like this > > } else if (!PageMlocked() && trylock_page(page)) { > /* > * If we can lock the page, perform mlock. > * Otherwise leave the page alone, it will be > * eventually encountered again later. > */ > mlock_vma_page(page); > unlock_page(page); > > This is safe. Even if race is happen and vmscan failed to mark PG_mlocked, that's no problem. Next vmscan may > do the right job and will fix the mistake. > > Anyway, I'm also sure this is not recent issue. It lived 5 years. It is no reason to rush. Yeah, but there's the new issue with __munlock_pagevec() :( Perhaps a more serious one, as it could leave pages unevictable when they shouldn't be. I'm not sure if the performance benefits of that could be preserved with full page locking. Another option would be that page migration would somehow deal with the race, or just leave the target pages without PG_mlocked and let them be dealt with later. But if we go with the rule that page lock must protect PG_mlocked, then there's also clear_page_mlock() to consider. And finally, we could then at least replace the atomic test and set with faster non-atomic variants.
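The transfer step in the CPU1 column above is mlock_migrate_page(); roughly, paraphrasing mm/internal.h of this era:

static inline void mlock_migrate_page(struct page *newpage, struct page *page)
{
	if (TestClearPageMlocked(page)) {
		unsigned long flags;
		int nr_pages = hpage_nr_pages(page);

		/* move both the flag and the NR_MLOCK accounting from
		 * the old page to the migration target */
		local_irq_save(flags);
		__mod_zone_page_state(page_zone(page), NR_MLOCK, -nr_pages);
		SetPageMlocked(newpage);
		__mod_zone_page_state(page_zone(newpage), NR_MLOCK, nr_pages);
		local_irq_restore(flags);
	}
}

Once this has run, the munlock on CPU2 finds PG_mlocked already clear on the old page and stops there, which is exactly how the new page can be left stranded with the flag set.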
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-13 14:03 ` Vlastimil Babka @ 2014-01-14 11:05 ` Vlastimil Babka 0 siblings, 0 replies; 29+ messages in thread From: Vlastimil Babka @ 2014-01-14 11:05 UTC (permalink / raw) To: Motohiro Kosaki, Andrew Morton Cc: Linus Torvalds, Sasha Levin, Wanpeng Li, Michel Lespinasse, Bob Liu, Nick Piggin, Motohiro Kosaki JP, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List, Joonsoo Kim On 01/13/2014 03:03 PM, Vlastimil Babka wrote: > On 01/10/2014 06:48 PM, Motohiro Kosaki wrote: >>> diff --git a/mm/rmap.c b/mm/rmap.c >>> index 068522d..b99c742 100644 >>> --- a/mm/rmap.c >>> +++ b/mm/rmap.c >>> @@ -1389,9 +1389,19 @@ static int try_to_unmap_cluster(unsigned long >>> cursor, unsigned int *mapcount, >>> BUG_ON(!page || PageAnon(page)); >>> >>> if (locked_vma) { >>> - mlock_vma_page(page); /* no-op if already >>> mlocked */ >>> - if (page == check_page) >>> + if (page == check_page) { >>> + /* we know we have check_page locked */ >>> + mlock_vma_page(page); >>> ret = SWAP_MLOCK; >>> + } else if (trylock_page(page)) { >>> + /* >>> + * If we can lock the page, perform mlock. >>> + * Otherwise leave the page alone, it will be >>> + * eventually encountered again later. >>> + */ >>> + mlock_vma_page(page); >>> + unlock_page(page); >>> + } >>> continue; /* don't unmap */ >>> } >> >> I audited all related mm code. However I couldn't find any race that it can close. > > Well, I would say the lock here closes the race with page migration, no? (As discussed below) > >> First off, current munlock code is crazy tricky. > > Oops, that's actually a result of my patches to speed up munlock by batching pages (since 3.12). > >> munlock >> down_write(mmap_sem) >> do_mlock() >> mlock_fixup >> munlock_vma_pages_range >> __munlock_pagevec >> spin_lock_irq(zone->lru_lock) >> TestClearPageMlocked(page) >> del_page_from_lru_list >> spin_unlock_irq(zone->lru_lock) >> lock_page >> __munlock_isolated_page >> unlock_page >> >> up_write(mmap_sem) >> >> Look, TestClearPageMlocked(page) is not protected by lock_page. > > Right :( That's my fault, when developing the patch series I didn't see the page > migration race, and it seemed that lock is only needed to protect the rmap operations > in __munlock_isolated_page() > >> But this is still >> safe because Page_mocked mean one or more vma marked VM_LOCKED then we >> only need care about turning down final VM_LOCKED. I.e. mmap_sem protect them. >> >> And, >> >> spin_lock_irq(zone->lru_lock) >> del_page_from_lru_list >> spin_unlock_irq(zone->lru_lock) >> >> This idiom ensures I or someone else isolate the page from lru and isolated pages >> will be put back by putback_lru_page in anycase. So, the pages will move the right >> lru eventually. >> >> And then, taking page-lock doesn't help to close vs munlock race. >> >> On the other hands, page migration has the following call stack. >> >> some-caller [isolate_lru_page] >> unmap_and_move >> __unmap_and_move >> trylock_page >> try_to_unmap >> move_to_new_page >> migrate_page >> migrate_page_copy >> unlock_page >> >> The critical part (i.e. migrate_page_copy) is protected by both page isolation and page lock. > > However, none of that protects against setting PG_mlocked in try_to_unmap_cluster() -> mlock_vma_page(), > or clearing PG_mlocked in __munlock_pagevec(). 
> > The race I have in mind for munlock is: > > CPU1 does page migration > some-caller [isolate_lru_page] > unmap_and_move > __unmap_and_move > trylock_page > try_to_unmap > move_to_new_page > migrate_page > migrate_page_copy > mlock_migrate_page - transfers PG_mlocked from old page to new page > > CPU2 does munlock: > munlock > down_write(mmap_sem) > do_mlock() > mlock_fixup > munlock_vma_pages_range > __munlock_pagevec > spin_lock_irq(zone->lru_lock) > TestClearPageMlocked(page) - here it finds PG_mlocked already cleared > so it stops, but meanwhile new page already has PG_mlocked set and will > stay inevictable As Mel pointed out to me, page lock in munlock alone would not help here anyway, because munlock would just wait for migration to unlock the old page and then still fail to clear a flag that has been transferred by migration to the new page. So if there was a race, it would be older than __munlock_pagevec() removing TestClearPageMlocked(page) from the page_lock protected section. But then I noticed that migration bails out for pages that have elevated pins, and this is done after making the page unreachable for mlock/munlock via pte migration entries. So this alone should prevent a race with m(un)lock, with three possible scenarios: - m(un)lock sees the migration entry and waits for migration to complete (via follow_page_mask), operates on the new page - m(un)lock gets the page reference before migration entry is set, finishes before migration checks the page pin count, migration transfers the new PG_mlocked state to the new page - m(un)lock doesn't finish that quickly, migration bails out, m(un)lock changes the old page that is mapped back >> Page fault path take lock page and doesn't use page isolation. This is correct. >> try_to_unmap_cluster doesn't take page lock, but it ensure the page is isolated. This is correct too. > > I don't see where try_to_unmap_cluster() has guaranteed that pages other than check_page are isolated? > (I might just be missing that). So in the race example above, CPU2 could be doing try_to_unmap_cluster() > and set PG_mlocked on old page, with no effect on the new page. Not fatal for the design of lazy mlocking, > but a distorted accounting anyway. > >> Plus, do_wp_page() has the following comment. But it is wrong. This lock is necessary to protect against >> page migration, but not lru manipulation. >> >> if ((ret & VM_FAULT_WRITE) && (vma->vm_flags & VM_LOCKED)) { >> lock_page(old_page); /* LRU manipulation */ >> munlock_vma_page(old_page); >> unlock_page(old_page); >> } >> So the protection is actually for rmap operations in try_to_munlock. The comment could be removed from the caller here and appear just at the PageLocked assertion in munlock_vma_page(). >> But, you know, this is crazy ugly lock design. I'm ok to change the rule to that PG_mlocked must be protected >> page lock. If so, I propose to add PageMlocked() like this >> >> } else if (!PageMlocked() && trylock_page(page)) { >> /* >> * If we can lock the page, perform mlock. >> * Otherwise leave the page alone, it will be >> * eventually encountered again later. >> */ >> mlock_vma_page(page); >> unlock_page(page); >> >> This is safe. Even if race is happen and vmscan failed to mark PG_mlocked, that's no problem. Next vmscan may >> do the right job and will fix the mistake. >> >> Anyway, I'm also sure this is not recent issue. It lived 5 years. It is no reason to rush. 
If we agree that page migration is safe as explained above, then removing the PageLocked assertion from mlock_vma_page is again an option, instead of making sure that all callers lock? > Yeah but there's the new issue with __munlock_pagevec() :( Perhaps a more serious one as it could leave pages inevictable > when they shouldn't be. I'm not sure if the performance benefits of that could be preserved with full page locking. > Another option would be that page migration would somehow deal with the race, or just leave the target pages without > PG_mlocked and let them be dealt with later. > But if we go with the rule that page lock must protect PG_mlocked, then there's also clear_page_mlock() to consider. > And finally, we could then at least replace the atomic test and set with faster non-atomic variants.
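The first scenario in the mail above relies on the wait-for-migration-entry behaviour of the page-table walk. A rough paraphrase of that check follows; the real logic lives in the follow_page path and mm/migrate.c, and the function name here is invented for the illustration.

#include <linux/swapops.h>

static bool wait_if_migration_entry(struct mm_struct *mm,
				    unsigned long address,
				    pmd_t *pmd, pte_t pte)
{
	swp_entry_t entry;

	if (pte_present(pte))
		return false;

	entry = pte_to_swp_entry(pte);
	if (!is_migration_entry(entry))
		return false;

	/* sleep until the migrating page is unlocked; the caller then
	 * retries the lookup and finds either the old page (migration
	 * bailed out) or the new one */
	migration_entry_wait(mm, pmd, address);
	return true;
}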
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2014-01-03 20:52 ` Linus Torvalds 2014-01-03 23:36 ` Vlastimil Babka @ 2014-01-04 3:31 ` Bob Liu 1 sibling, 0 replies; 29+ messages in thread From: Bob Liu @ 2014-01-04 3:31 UTC (permalink / raw) To: Linus Torvalds Cc: Sasha Levin, Andrew Morton, Wanpeng Li, Vlastimil Babka, Michel Lespinasse, Nick Piggin, KOSAKI Motohiro, Rik van Riel, David Rientjes, Mel Gorman, Minchan Kim, Hugh Dickins, Johannes Weiner, linux-mm, Linux Kernel Mailing List On 01/04/2014 04:52 AM, Linus Torvalds wrote: > On Fri, Jan 3, 2014 at 12:17 PM, Sasha Levin <sasha.levin@oracle.com> wrote: >> >> Ping? This BUG() is triggerable in 3.13-rc6 right now. > > So Andrew suggested just removing the BUG_ON(), but it's been there > for a *long* time. > > And I detest the patch that was sent out that said "Should I check?" > > Maybe we should just remove that mlock_vma_page() thing instead in > try_to_unmap_cluster()? Or maybe actually lock the page around calling I didn't get the reason why we have to call mlock_vma_page() in try_to_unmap_cluster() and I agree to just remove it. > it? > > Linus > -- Regards, -Bob
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2013-12-17 8:05 [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs Wanpeng Li 2013-12-18 3:14 ` Wanpeng Li 2013-12-18 3:16 ` Wanpeng Li @ 2013-12-18 8:54 ` Vlastimil Babka 2013-12-18 9:01 ` Wanpeng Li 2 siblings, 1 reply; 29+ messages in thread From: Vlastimil Babka @ 2013-12-18 8:54 UTC (permalink / raw) To: Wanpeng Li, Andrew Morton Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel, linux-mm, linux-kernel On 12/17/2013 09:05 AM, Wanpeng Li wrote: > objrmap doesn't work for nonlinear VMAs because the assumption that offset-into-file > correlates with offset-into-virtual-addresses does not hold. Hence what > try_to_unmap_cluster does is a mini "virtual scan" of each nonlinear VMA which maps > the file to which the target page belongs. If vma locked, mlock the pages in the > cluster, rather than unmapping them. However, not all pages are guarantee page > locked instead of the check page. This patch fix the BUG by just confirm check page > hold page lock instead of all pages in the virtual scan window against nolinear VMAs. This may fix the symptom, but I don't understand from the description why in this case is it ok not to have page locked for mlock_vma_page(), while in the other cases it's not ok. > [ 253.869145] kernel BUG at mm/mlock.c:82! > [ 253.869549] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC > [ 253.870098] Dumping ftrace buffer: > [ 253.870098] (ftrace buffer empty) > [ 253.870098] Modules linked in: > [ 253.870098] CPU: 10 PID: 9162 Comm: trinity-child75 Tainted: G W 3.13.0-rc4-next-20131216-sasha-00011-g5f105ec-dirty #4137 > [ 253.873310] task: ffff8800c98cb000 ti: ffff8804d34e8000 task.ti: ffff8804d34e8000 > [ 253.873310] RIP: 0010:[<ffffffff81281f28>] [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0 > [ 253.873310] RSP: 0000:ffff8804d34e99e8 EFLAGS: 00010246 > [ 253.873310] RAX: 006fffff8038002c RBX: ffffea00474944c0 RCX: ffff880807636000 > [ 253.873310] RDX: ffffea0000000000 RSI: 00007f17a9bca000 RDI: ffffea00474944c0 > [ 253.873310] RBP: ffff8804d34e99f8 R08: ffff880807020000 R09: 0000000000000000 > [ 253.873310] R10: 0000000000000001 R11: 0000000000002000 R12: 00007f17a9bca000 > [ 253.873310] R13: ffffea00474944c0 R14: 00007f17a9be0000 R15: ffff880807020000 > [ 253.873310] FS: 00007f17aa31a700(0000) GS:ffff8801c9c00000(0000) knlGS:0000000000000000 > [ 253.873310] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b > [ 253.873310] CR2: 00007f17a94fa000 CR3: 00000004d3b02000 CR4: 00000000000006e0 > [ 253.873310] DR0: 00007f17a74ca000 DR1: 0000000000000000 DR2: 0000000000000000 > [ 253.873310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 > [ 253.873310] Stack: > [ 253.873310] 0000000b3de28067 ffff880b3de28e50 ffff8804d34e9aa8 ffffffff8128bc31 > [ 253.873310] 0000000000000301 ffffea0011850220 ffff8809a4039000 ffffea0011850238 > [ 253.873310] ffff8804d34e9aa8 ffff880807636060 0000000000000001 ffff880807636348 > [ 253.873310] Call Trace: > [ 253.873310] [<ffffffff8128bc31>] try_to_unmap_cluster+0x1c1/0x340 > [ 253.873310] [<ffffffff8128c60a>] try_to_unmap_file+0x20a/0x2e0 > [ 253.873310] [<ffffffff8128c7b3>] try_to_unmap+0x73/0x90 > [ 253.873310] [<ffffffff812b526d>] __unmap_and_move+0x18d/0x250 > [ 253.873310] [<ffffffff812b53e9>] unmap_and_move+0xb9/0x180 > [ 253.873310] [<ffffffff812b559b>] migrate_pages+0xeb/0x2f0 > [ 253.873310] [<ffffffff812a0660>] ? 
queue_pages_pte_range+0x1a0/0x1a0 > [ 253.873310] [<ffffffff812a193c>] migrate_to_node+0x9c/0xc0 > [ 253.873310] [<ffffffff812a30b8>] do_migrate_pages+0x1b8/0x240 > [ 253.873310] [<ffffffff812a3456>] SYSC_migrate_pages+0x316/0x380 > [ 253.873310] [<ffffffff812a31ec>] ? SYSC_migrate_pages+0xac/0x380 > [ 253.873310] [<ffffffff811763c6>] ? vtime_account_user+0x96/0xb0 > [ 253.873310] [<ffffffff812a34ce>] SyS_migrate_pages+0xe/0x10 > [ 253.873310] [<ffffffff843c4990>] tracesys+0xdd/0xe2 > [ 253.873310] Code: 0f 1f 00 65 48 ff 04 25 10 25 1d 00 48 83 c4 08 > 5b c9 c3 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 07 48 89 fb > a8 01 75 10 <0f> 0b 66 0f 1f 44 00 00 eb fe 66 0f 1f 44 00 00 f0 0f ba > 2f 15 > [ 253.873310] RIP [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0 > [ 253.873310] RSP <ffff8804d34e99e8> > [ 253.904194] ---[ end trace be59c4a7f8edab3f ]--- > > Reported-by: Sasha Levin <sasha.levin@oracle.com> > Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> > --- > mm/huge_memory.c | 2 +- > mm/internal.h | 4 ++-- > mm/ksm.c | 2 +- > mm/memory.c | 2 +- > mm/mlock.c | 5 +++-- > mm/rmap.c | 4 ++-- > 6 files changed, 10 insertions(+), 9 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index 33a5dc4..7a15b04 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1264,7 +1264,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, > if (page->mapping && trylock_page(page)) { > lru_add_drain(); > if (page->mapping) > - mlock_vma_page(page); > + mlock_vma_page(page, true); > unlock_page(page); > } > } > diff --git a/mm/internal.h b/mm/internal.h > index a85a3ab..4ea2d4e 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -192,7 +192,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma, > /* > * must be called with vma's mmap_sem held for read or write, and page locked. > */ > -extern void mlock_vma_page(struct page *page); > +extern void mlock_vma_page(struct page *page, bool check_page); > extern unsigned int munlock_vma_page(struct page *page); > > /* > @@ -236,7 +236,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p) > return 0; > } > static inline void clear_page_mlock(struct page *page) { } > -static inline void mlock_vma_page(struct page *page) { } > +static inline void mlock_vma_page(struct page *page, bool check_page) { } > static inline void mlock_migrate_page(struct page *new, struct page *old) { } > > #endif /* !CONFIG_MMU */ > diff --git a/mm/ksm.c b/mm/ksm.c > index 175fff7..ec36f04 100644 > --- a/mm/ksm.c > +++ b/mm/ksm.c > @@ -1064,7 +1064,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma, > if (!PageMlocked(kpage)) { > unlock_page(page); > lock_page(kpage); > - mlock_vma_page(kpage); > + mlock_vma_page(kpage, true); > page = kpage; /* for final unlock */ > } > } > diff --git a/mm/memory.c b/mm/memory.c > index cf6098c..a41df6a 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -1602,7 +1602,7 @@ split_fallthrough: > * know the page is still mapped, we don't even > * need to check for file-cache page truncation. > */ > - mlock_vma_page(page); > + mlock_vma_page(page, true); > unlock_page(page); > } > } > diff --git a/mm/mlock.c b/mm/mlock.c > index d480cd6..c395ec5 100644 > --- a/mm/mlock.c > +++ b/mm/mlock.c > @@ -77,9 +77,10 @@ void clear_page_mlock(struct page *page) > * Mark page as mlocked if not already. > * If page on LRU, isolate and putback to move to unevictable list. 
> */ > -void mlock_vma_page(struct page *page) > +void mlock_vma_page(struct page *page, bool check_page) > { > - BUG_ON(!PageLocked(page)); > + if (check_page) > + BUG_ON(!PageLocked(page)); > > if (!TestSetPageMlocked(page)) { > mod_zone_page_state(page_zone(page), NR_MLOCK, > diff --git a/mm/rmap.c b/mm/rmap.c > index 55c8b8d..79d456f 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1297,7 +1297,7 @@ out_mlock: > */ > if (down_read_trylock(&vma->vm_mm->mmap_sem)) { > if (vma->vm_flags & VM_LOCKED) { > - mlock_vma_page(page); > + mlock_vma_page(page, true); > ret = SWAP_MLOCK; > } > up_read(&vma->vm_mm->mmap_sem); > @@ -1385,7 +1385,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, > BUG_ON(!page || PageAnon(page)); > > if (locked_vma) { > - mlock_vma_page(page); /* no-op if already mlocked */ > + mlock_vma_page(page, page == check_page); /* no-op if already mlocked */ > if (page == check_page) > ret = SWAP_MLOCK; > continue; /* don't unmap */ > -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm/mlock: fix BUG_ON unlocked page for nolinear VMAs 2013-12-18 8:54 ` Vlastimil Babka @ 2013-12-18 9:01 ` Wanpeng Li 0 siblings, 0 replies; 29+ messages in thread From: Wanpeng Li @ 2013-12-18 9:01 UTC (permalink / raw) To: Vlastimil Babka, Andrew Morton Cc: Sasha Levin, Michel Lespinasse, Bob Liu, npiggin, kosaki.motohiro, riel, linux-mm, linux-kernel Hi Vlastimil, On Wed, Dec 18, 2013 at 09:54:16AM +0100, Vlastimil Babka wrote: >On 12/17/2013 09:05 AM, Wanpeng Li wrote: >>objrmap doesn't work for nonlinear VMAs because the assumption that offset-into-file >>correlates with offset-into-virtual-addresses does not hold. Hence what >>try_to_unmap_cluster does is a mini "virtual scan" of each nonlinear VMA which maps >>the file to which the target page belongs. If vma locked, mlock the pages in the >>cluster, rather than unmapping them. However, not all pages are guarantee page >>locked instead of the check page. This patch fix the BUG by just confirm check page >>hold page lock instead of all pages in the virtual scan window against nolinear VMAs. > >This may fix the symptom, but I don't understand from the description >why in this case is it ok not to have page locked for >mlock_vma_page(), while in the other cases it's not ok. > Here is a latest version of the bugfix patch. http://marc.info/?l=linux-mm&m=138733994417230&w=2 Regards, Wanpeng Li >>[ 253.869145] kernel BUG at mm/mlock.c:82! >>[ 253.869549] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC >>[ 253.870098] Dumping ftrace buffer: >>[ 253.870098] (ftrace buffer empty) >>[ 253.870098] Modules linked in: >>[ 253.870098] CPU: 10 PID: 9162 Comm: trinity-child75 Tainted: G W 3.13.0-rc4-next-20131216-sasha-00011-g5f105ec-dirty #4137 >>[ 253.873310] task: ffff8800c98cb000 ti: ffff8804d34e8000 task.ti: ffff8804d34e8000 >>[ 253.873310] RIP: 0010:[<ffffffff81281f28>] [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0 >>[ 253.873310] RSP: 0000:ffff8804d34e99e8 EFLAGS: 00010246 >>[ 253.873310] RAX: 006fffff8038002c RBX: ffffea00474944c0 RCX: ffff880807636000 >>[ 253.873310] RDX: ffffea0000000000 RSI: 00007f17a9bca000 RDI: ffffea00474944c0 >>[ 253.873310] RBP: ffff8804d34e99f8 R08: ffff880807020000 R09: 0000000000000000 >>[ 253.873310] R10: 0000000000000001 R11: 0000000000002000 R12: 00007f17a9bca000 >>[ 253.873310] R13: ffffea00474944c0 R14: 00007f17a9be0000 R15: ffff880807020000 >>[ 253.873310] FS: 00007f17aa31a700(0000) GS:ffff8801c9c00000(0000) knlGS:0000000000000000 >>[ 253.873310] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b >>[ 253.873310] CR2: 00007f17a94fa000 CR3: 00000004d3b02000 CR4: 00000000000006e0 >>[ 253.873310] DR0: 00007f17a74ca000 DR1: 0000000000000000 DR2: 0000000000000000 >>[ 253.873310] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600 >>[ 253.873310] Stack: >>[ 253.873310] 0000000b3de28067 ffff880b3de28e50 ffff8804d34e9aa8 ffffffff8128bc31 >>[ 253.873310] 0000000000000301 ffffea0011850220 ffff8809a4039000 ffffea0011850238 >>[ 253.873310] ffff8804d34e9aa8 ffff880807636060 0000000000000001 ffff880807636348 >>[ 253.873310] Call Trace: >>[ 253.873310] [<ffffffff8128bc31>] try_to_unmap_cluster+0x1c1/0x340 >>[ 253.873310] [<ffffffff8128c60a>] try_to_unmap_file+0x20a/0x2e0 >>[ 253.873310] [<ffffffff8128c7b3>] try_to_unmap+0x73/0x90 >>[ 253.873310] [<ffffffff812b526d>] __unmap_and_move+0x18d/0x250 >>[ 253.873310] [<ffffffff812b53e9>] unmap_and_move+0xb9/0x180 >>[ 253.873310] [<ffffffff812b559b>] migrate_pages+0xeb/0x2f0 >>[ 253.873310] [<ffffffff812a0660>] ? 
queue_pages_pte_range+0x1a0/0x1a0 >>[ 253.873310] [<ffffffff812a193c>] migrate_to_node+0x9c/0xc0 >>[ 253.873310] [<ffffffff812a30b8>] do_migrate_pages+0x1b8/0x240 >>[ 253.873310] [<ffffffff812a3456>] SYSC_migrate_pages+0x316/0x380 >>[ 253.873310] [<ffffffff812a31ec>] ? SYSC_migrate_pages+0xac/0x380 >>[ 253.873310] [<ffffffff811763c6>] ? vtime_account_user+0x96/0xb0 >>[ 253.873310] [<ffffffff812a34ce>] SyS_migrate_pages+0xe/0x10 >>[ 253.873310] [<ffffffff843c4990>] tracesys+0xdd/0xe2 >>[ 253.873310] Code: 0f 1f 00 65 48 ff 04 25 10 25 1d 00 48 83 c4 08 >>5b c9 c3 55 48 89 e5 53 48 83 ec 08 66 66 66 66 90 48 8b 07 48 89 fb >>a8 01 75 10 <0f> 0b 66 0f 1f 44 00 00 eb fe 66 0f 1f 44 00 00 f0 0f ba >>2f 15 >>[ 253.873310] RIP [<ffffffff81281f28>] mlock_vma_page+0x18/0xc0 >>[ 253.873310] RSP <ffff8804d34e99e8> >>[ 253.904194] ---[ end trace be59c4a7f8edab3f ]--- >> >>Reported-by: Sasha Levin <sasha.levin@oracle.com> >>Signed-off-by: Wanpeng Li <liwanp@linux.vnet.ibm.com> >>--- >> mm/huge_memory.c | 2 +- >> mm/internal.h | 4 ++-- >> mm/ksm.c | 2 +- >> mm/memory.c | 2 +- >> mm/mlock.c | 5 +++-- >> mm/rmap.c | 4 ++-- >> 6 files changed, 10 insertions(+), 9 deletions(-) >> >>diff --git a/mm/huge_memory.c b/mm/huge_memory.c >>index 33a5dc4..7a15b04 100644 >>--- a/mm/huge_memory.c >>+++ b/mm/huge_memory.c >>@@ -1264,7 +1264,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, >> if (page->mapping && trylock_page(page)) { >> lru_add_drain(); >> if (page->mapping) >>- mlock_vma_page(page); >>+ mlock_vma_page(page, true); >> unlock_page(page); >> } >> } >>diff --git a/mm/internal.h b/mm/internal.h >>index a85a3ab..4ea2d4e 100644 >>--- a/mm/internal.h >>+++ b/mm/internal.h >>@@ -192,7 +192,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *vma, >> /* >> * must be called with vma's mmap_sem held for read or write, and page locked. >> */ >>-extern void mlock_vma_page(struct page *page); >>+extern void mlock_vma_page(struct page *page, bool check_page); >> extern unsigned int munlock_vma_page(struct page *page); >> >> /* >>@@ -236,7 +236,7 @@ static inline int mlocked_vma_newpage(struct vm_area_struct *v, struct page *p) >> return 0; >> } >> static inline void clear_page_mlock(struct page *page) { } >>-static inline void mlock_vma_page(struct page *page) { } >>+static inline void mlock_vma_page(struct page *page, bool check_page) { } >> static inline void mlock_migrate_page(struct page *new, struct page *old) { } >> >> #endif /* !CONFIG_MMU */ >>diff --git a/mm/ksm.c b/mm/ksm.c >>index 175fff7..ec36f04 100644 >>--- a/mm/ksm.c >>+++ b/mm/ksm.c >>@@ -1064,7 +1064,7 @@ static int try_to_merge_one_page(struct vm_area_struct *vma, >> if (!PageMlocked(kpage)) { >> unlock_page(page); >> lock_page(kpage); >>- mlock_vma_page(kpage); >>+ mlock_vma_page(kpage, true); >> page = kpage; /* for final unlock */ >> } >> } >>diff --git a/mm/memory.c b/mm/memory.c >>index cf6098c..a41df6a 100644 >>--- a/mm/memory.c >>+++ b/mm/memory.c >>@@ -1602,7 +1602,7 @@ split_fallthrough: >> * know the page is still mapped, we don't even >> * need to check for file-cache page truncation. >> */ >>- mlock_vma_page(page); >>+ mlock_vma_page(page, true); >> unlock_page(page); >> } >> } >>diff --git a/mm/mlock.c b/mm/mlock.c >>index d480cd6..c395ec5 100644 >>--- a/mm/mlock.c >>+++ b/mm/mlock.c >>@@ -77,9 +77,10 @@ void clear_page_mlock(struct page *page) >> * Mark page as mlocked if not already. >> * If page on LRU, isolate and putback to move to unevictable list. 
>> */ >>-void mlock_vma_page(struct page *page) >>+void mlock_vma_page(struct page *page, bool check_page) >> { >>- BUG_ON(!PageLocked(page)); >>+ if (check_page) >>+ BUG_ON(!PageLocked(page)); >> >> if (!TestSetPageMlocked(page)) { >> mod_zone_page_state(page_zone(page), NR_MLOCK, >>diff --git a/mm/rmap.c b/mm/rmap.c >>index 55c8b8d..79d456f 100644 >>--- a/mm/rmap.c >>+++ b/mm/rmap.c >>@@ -1297,7 +1297,7 @@ out_mlock: >> */ >> if (down_read_trylock(&vma->vm_mm->mmap_sem)) { >> if (vma->vm_flags & VM_LOCKED) { >>- mlock_vma_page(page); >>+ mlock_vma_page(page, true); >> ret = SWAP_MLOCK; >> } >> up_read(&vma->vm_mm->mmap_sem); >>@@ -1385,7 +1385,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, >> BUG_ON(!page || PageAnon(page)); >> >> if (locked_vma) { >>- mlock_vma_page(page); /* no-op if already mlocked */ >>+ mlock_vma_page(page, page == check_page); /* no-op if already mlocked */ >> if (page == check_page) >> ret = SWAP_MLOCK; >> continue; /* don't unmap */ >> -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 29+ messages in thread