linux-mm.kvack.org archive mirror
* [PATCH] mm, thp: close race between mremap() and split_huge_page()
@ 2014-05-05 22:13 Kirill A. Shutemov
  2014-05-06  8:43 ` Kirill A. Shutemov
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Kirill A. Shutemov @ 2014-05-05 22:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, Kirill A. Shutemov, Andrea Arcangeli, Rik van Riel,
	Michel Lespinasse, Dave Jones, stable

It's critical for split_huge_page() (and migration) to catch and freeze
all PMDs on rmap walk. It gets tricky if there's a concurrent fork() or
mremap(), since we usually copy/move page table entries in dup_mm() or
move_page_tables() without the rmap lock taken. To make it work we rely
on the rmap walk order to not miss any entry: we expect to see the
destination VMA after the source one.

But after switching the rmap implementation to an interval tree, it's not
always possible to preserve the expected walk order.

It works fine for dup_mm(), since the new VMA has the same
vma_start_pgoff() / vma_last_pgoff() and we explicitly insert the dst VMA
after the src one with vma_interval_tree_insert_after().

But on move_vma() the destination VMA can be merged into an adjacent one
and, as a result, shifted left in the interval tree. Fortunately, we can
detect the situation and prevent the race with the rmap walk by moving
page table entries under the rmap lock. See commit 38a76013ad80.

The problem is that we miss the lock when we move a transhuge PMD. Most
likely this bug caused the crash[1].

[1] http://thread.gmane.org/gmane.linux.kernel.mm/96473

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Michel Lespinasse <walken@google.com>
Cc: Dave Jones <davej@redhat.com>
Cc: <stable@vger.kernel.org>        [3.7+]
---
 mm/mremap.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/mremap.c b/mm/mremap.c
index 0843feb66f3d..05f1180e9f21 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -194,10 +194,17 @@ unsigned long move_page_tables(struct vm_area_struct *vma,
 			break;
 		if (pmd_trans_huge(*old_pmd)) {
 			int err = 0;
-			if (extent == HPAGE_PMD_SIZE)
+			if (extent == HPAGE_PMD_SIZE) {
+				VM_BUG_ON(vma->vm_file || !vma->anon_vma);
+				/* See comment in move_ptes() */
+				if (need_rmap_locks)
+					anon_vma_lock_write(vma->anon_vma);
 				err = move_huge_pmd(vma, new_vma, old_addr,
 						    new_addr, old_end,
 						    old_pmd, new_pmd);
+				if (need_rmap_locks)
+					anon_vma_unlock_write(vma->anon_vma);
+			}
 			if (err > 0) {
 				need_flush = true;
 				continue;
-- 
2.0.0.rc0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
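
The reliance on walk order described in the patch can be illustrated
with a toy model. This is a user-space sketch, not kernel code: the
function walk_sees_mapping() and its parameters are hypothetical, and
the model ignores the page table locks that serialize access to
individual entries in the real kernel. It only shows why an unlocked
move is safe when the destination sorts after the source in the walk,
and unsafe when it sorts before.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Toy model (hypothetical, not kernel code): an rmap walk visits slots
 * 0..nslots-1 in order. One mapping starts in slot `src` and is moved to
 * slot `dst` just before the walker visits slot `move_before`. No lock
 * is held, so the move can land in the middle of the walk.
 *
 * Returns true if the walker sees the mapping at least once.
 */
static bool walk_sees_mapping(size_t nslots, size_t src, size_t dst,
			      size_t move_before)
{
	size_t where = src;	/* slot currently holding the mapping */
	bool seen = false;

	assert(src < nslots && dst < nslots);

	for (size_t i = 0; i < nslots; i++) {
		if (i == move_before)
			where = dst;	/* the concurrent move happens here */
		if (i == where)
			seen = true;
	}
	return seen;
}
```

With dst after src the walker finds the mapping whenever the move
happens. With dst before src there is a window (a move while the walker
is between dst and src) in which the walker has already passed the
destination and then finds the source empty. That window is what the
rmap lock closes.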


* Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()
  2014-05-05 22:13 [PATCH] mm, thp: close race between mremap() and split_huge_page() Kirill A. Shutemov
@ 2014-05-06  8:43 ` Kirill A. Shutemov
  2014-05-07 20:55   ` David Miller
  2014-05-06 13:06 ` Johannes Weiner
  2014-05-06 14:13 ` Andrea Arcangeli
  2 siblings, 1 reply; 8+ messages in thread
From: Kirill A. Shutemov @ 2014-05-06  8:43 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, linux-mm, Andrea Arcangeli, Rik van Riel,
	Michel Lespinasse, Dave Jones, stable

On Tue, May 06, 2014 at 01:13:31AM +0300, Kirill A. Shutemov wrote:
> It's critical for split_huge_page() (and migration) to catch and freeze
> all PMDs on rmap walk. It gets tricky if there's concurrent fork() or
> mremap() since usually we copy/move page table entries on dup_mm() or
> move_page_tables() without rmap lock taken. To get it work we rely on
> rmap walk order to not miss any entry. We expect to see destination VMA
> after source one to work correctly.
> 
> But after switching rmap implementation to interval tree it's not always
> possible to preserve expected walk order.
> 
> It works fine for dup_mm() since new VMA has the same vma_start_pgoff()
> / vma_last_pgoff() and explicitly insert dst VMA after src one with
> vma_interval_tree_insert_after().
> 
> But on move_vma() destination VMA can be merged into adjacent one and as
> result shifted left in interval tree. Fortunately, we can detect the
> situation and prevent race with rmap walk by moving page table entries
> under rmap lock. See commit 38a76013ad80.
> 
> Problem is that we miss the lock when we move transhuge PMD. Most likely
> this bug caused the crash[1].
> 
> [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473

It took a night, but I was able to trigger the crash that this patch fixes.

Test case:

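/* Reproducer: race mremap() of a THP-backed range against a huge page split. */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define MB (1024UL*1024)
#define SIZE (4*MB)
#define BASE ((void *)0x400000000000)

int main(void)
{
	char *x1, *x2;

	for (;;) {
		/* Map 2 * SIZE of populated anonymous memory at a fixed address. */
		x1 = mmap(BASE, 2 * SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_FIXED,
			-1, 0);
		if (x1 == MAP_FAILED)
			perror("x1"), exit(1);

		/* Move the upper half to just past the end of the mapping. */
		x2 = mremap(x1 + SIZE, SIZE, SIZE,
				MREMAP_FIXED | MREMAP_MAYMOVE,
				x1 + 2 * SIZE);
		if (x2 == MAP_FAILED)
			perror("x2"), exit(1);

		/* Forked children keep extra mappings of the same anonymous pages. */
		if (!fork())
			return 0;

		if (!fork()) {
			if (!fork())
				return 0;

			/* Changing protection on one page splits the VMA (see the trace below). */
			mprotect(x2, 4096, PROT_NONE);
			return 0;
		}

		/* Move the range back next to x1 while a child may be splitting it. */
		x2 = mremap(x2, SIZE, SIZE,
				MREMAP_FIXED | MREMAP_MAYMOVE,
				x1 + SIZE);
		if (x2 == MAP_FAILED)
			perror("x2"), exit(1);
		munmap(x1, SIZE);
		munmap(x2, SIZE);

		/* Reap any children that have already exited, without blocking. */
		while (waitpid(-1, NULL, WNOHANG) > 0)
			;
	}
	return 0;
}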

Crash:

[54438.764230] mapcount 2 page_mapcount 3
[54438.764985] ------------[ cut here ]------------
[54438.765735] kernel BUG at /home/space/kas/git/public/linux/mm/huge_memory.c:1836!
[54438.766926] invalid opcode: 0000 [#1] SMP 
[54438.767637] Modules linked in:
[54438.768078] CPU: 0 PID: 12638 Comm: test_split Not tainted 3.15.0-rc4-00001-gdb77ce6c9fe5-dirty #1282
[54438.768078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
[54438.768078] task: ffff8804633c8410 ti: ffff88046376c000 task.ti: ffff88046376c000
[54438.768078] RIP: 0010:[<ffffffff81140594>]  [<ffffffff81140594>] split_huge_page_to_list+0x434/0x6c0
[54438.768078] RSP: 0018:ffff88046376dcc8  EFLAGS: 00010297
[54438.768078] RAX: 0000000000000003 RBX: ffff88046881c520 RCX: 0000000000000006
[54438.768078] RDX: 0000000000000006 RSI: ffff8804633c8b18 RDI: ffff8804633c8410
[54438.768078] RBP: ffff88046376dd30 R08: 0000000000000001 R09: 0000000000000000
[54438.768078] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[54438.768078] R13: 0000400000800000 R14: ffffea000ede4000 R15: 0000000400000400
[54438.768078] FS:  00007fea6a7be700(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[54438.768078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54438.768078] CR2: 00007fea6a2db7d0 CR3: 0000000469bdf000 CR4: 00000000001407f0
[54438.768078] Stack:
[54438.768078]  ffff8804698f4020 0000400000a00000 0000000000000000 ffff880467a04900
[54438.768078]  0000000000000000 ffff880467a04880 ffff880400000002 ffff880462b5ccf8
[54438.768078]  0000400000800000 ffff88046370ac50 ffff8804698f4020 0000400000a00000
[54438.768078] Call Trace:
[54438.768078]  [<ffffffff81141050>] __split_huge_page_pmd+0xc0/0x1f0
[54438.768078]  [<ffffffff8114196e>] split_huge_page_pmd_mm+0x3e/0x40
[54438.768078]  [<ffffffff81141995>] split_huge_page_address+0x25/0x30
[54438.768078]  [<ffffffff81141a3c>] __vma_adjust_trans_huge+0x9c/0xf0
[54438.768078]  [<ffffffff8132268d>] ? __rb_insert_augmented+0xcd/0x1f0
[54438.768078]  [<ffffffff81116f06>] vma_adjust+0x626/0x6a0
[54438.768078]  [<ffffffff811170ad>] __split_vma.isra.35+0x12d/0x200
[54438.768078]  [<ffffffff81117e94>] split_vma+0x24/0x30
[54438.768078]  [<ffffffff8111a3ca>] mprotect_fixup+0x22a/0x260
[54438.768078]  [<ffffffff8111a542>] SyS_mprotect+0x142/0x230
[54438.768078]  [<ffffffff8173cb62>] system_call_fastpath+0x16/0x1b
[54438.768078] Code: 0f 1f 80 00 00 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 49 8b 16 4c 89 f0 80 
[54438.768078] RIP  [<ffffffff81140594>] split_huge_page_to_list+0x434/0x6c0
[54438.768078]  RSP <ffff88046376dcc8>
[54438.805154] ---[ end trace 12d4dde45cf392c6 ]---


-- 
 Kirill A. Shutemov


* Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()
  2014-05-05 22:13 [PATCH] mm, thp: close race between mremap() and split_huge_page() Kirill A. Shutemov
  2014-05-06  8:43 ` Kirill A. Shutemov
@ 2014-05-06 13:06 ` Johannes Weiner
  2014-05-08  0:13   ` Michel Lespinasse
  2014-05-06 14:13 ` Andrea Arcangeli
  2 siblings, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2014-05-06 13:06 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, linux-mm, Andrea Arcangeli, Rik van Riel,
	Michel Lespinasse, Dave Jones, stable

On Tue, May 06, 2014 at 01:13:31AM +0300, Kirill A. Shutemov wrote:
> It's critical for split_huge_page() (and migration) to catch and freeze
> all PMDs on rmap walk. It gets tricky if there's concurrent fork() or
> mremap() since usually we copy/move page table entries on dup_mm() or
> move_page_tables() without rmap lock taken. To get it work we rely on
> rmap walk order to not miss any entry. We expect to see destination VMA
> after source one to work correctly.
> 
> But after switching rmap implementation to interval tree it's not always
> possible to preserve expected walk order.

Yeah, I think the actual bug was introduced in preparation of the
interval tree, when the optimization of moving the target anon_vma to
the tail of the chain was replaced by explicit locking again.  That
missed the THP case.

> It works fine for dup_mm() since new VMA has the same vma_start_pgoff()
> / vma_last_pgoff() and explicitly insert dst VMA after src one with
> vma_interval_tree_insert_after().
> 
> But on move_vma() destination VMA can be merged into adjacent one and as
> result shifted left in interval tree. Fortunately, we can detect the
> situation and prevent race with rmap walk by moving page table entries
> under rmap lock. See commit 38a76013ad80.
> 
> Problem is that we miss the lock when we move transhuge PMD. Most likely
> this bug caused the crash[1].
> 
> [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: <stable@vger.kernel.org>        [3.7+]
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")

Acked-by: Johannes Weiner <hannes@cmpxchg.org>


* Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()
  2014-05-05 22:13 [PATCH] mm, thp: close race between mremap() and split_huge_page() Kirill A. Shutemov
  2014-05-06  8:43 ` Kirill A. Shutemov
  2014-05-06 13:06 ` Johannes Weiner
@ 2014-05-06 14:13 ` Andrea Arcangeli
  2 siblings, 0 replies; 8+ messages in thread
From: Andrea Arcangeli @ 2014-05-06 14:13 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andrew Morton, linux-mm, Rik van Riel, Michel Lespinasse,
	Dave Jones, stable

On Tue, May 06, 2014 at 01:13:31AM +0300, Kirill A. Shutemov wrote:
> It's critical for split_huge_page() (and migration) to catch and freeze
> all PMDs on rmap walk. It gets tricky if there's concurrent fork() or
> mremap() since usually we copy/move page table entries on dup_mm() or
> move_page_tables() without rmap lock taken. To get it work we rely on
> rmap walk order to not miss any entry. We expect to see destination VMA
> after source one to work correctly.
> 
> But after switching rmap implementation to interval tree it's not always
> possible to preserve expected walk order.
> 
> It works fine for dup_mm() since new VMA has the same vma_start_pgoff()
> / vma_last_pgoff() and explicitly insert dst VMA after src one with
> vma_interval_tree_insert_after().
> 
> But on move_vma() destination VMA can be merged into adjacent one and as
> result shifted left in interval tree. Fortunately, we can detect the
> situation and prevent race with rmap walk by moving page table entries
> under rmap lock. See commit 38a76013ad80.
> 
> Problem is that we miss the lock when we move transhuge PMD. Most likely
> this bug caused the crash[1].
> 
> [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Michel Lespinasse <walken@google.com>
> Cc: Dave Jones <davej@redhat.com>
> Cc: <stable@vger.kernel.org>        [3.7+]
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>  mm/mremap.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>

Glad the interval tree has a stable insert when the index is the same
(so fork was safe) and it already contemplated the out of order
problem in the case of mremap when the index changes. The anon_vma
lock is actually heavy and not so nice to have to take during pte/pmd
mangling, but we already take it for the pte move and it is only
needed in some cases, so it should be ok.
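
The "only needed in some cases" decision can be sketched as a simple
predicate on the page offsets of the two VMAs. This is a user-space
illustration under stated assumptions: struct toy_vma and
toy_need_rmap_locks() are hypothetical names, and the comparison is a
reading of how the mremap path is understood to decide, not a quote of
the kernel source. It applies to the mremap case only; fork relies on
the explicit insert-after instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the one vm_area_struct field used here. */
struct toy_vma {
	unsigned long vm_pgoff;	/* page offset, the interval tree index */
};

/*
 * Assumed rule: the rmap walk is ordered by page offset, so when the
 * destination VMA does not sort strictly after the source, walk order
 * alone cannot guarantee that the moved entries are seen, and the move
 * has to run under the rmap locks.
 */
static bool toy_need_rmap_locks(const struct toy_vma *src,
				const struct toy_vma *dst)
{
	assert(src != NULL && dst != NULL);
	return dst->vm_pgoff <= src->vm_pgoff;
}
```

A destination that was merged into a VMA starting at a lower offset
makes the predicate true, which is the "shifted left" case from the
patch description.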


* Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()
  2014-05-06  8:43 ` Kirill A. Shutemov
@ 2014-05-07 20:55   ` David Miller
  0 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2014-05-07 20:55 UTC (permalink / raw)
  To: kirill
  Cc: kirill.shutemov, akpm, linux-mm, aarcange, riel, walken, davej,
	stable

From: "Kirill A. Shutemov" <kirill@shutemov.name>
Date: Tue, 6 May 2014 11:43:33 +0300

> It took a night but I was able to trigger crash which this patch fixes.

I love test cases like this.

Can we start collecting THP stressers and bug reproducers like this
under tools/testing/selftests/thp or similar?

I find that I'm constantly writing my own THP test cases or copying
the ones from the LTP tree and adjusting them to meet my needs.


* Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()
  2014-05-06 13:06 ` Johannes Weiner
@ 2014-05-08  0:13   ` Michel Lespinasse
  2014-05-08 18:14     ` Johannes Weiner
  0 siblings, 1 reply; 8+ messages in thread
From: Michel Lespinasse @ 2014-05-08  0:13 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andrea Arcangeli,
	Rik van Riel, Dave Jones, stable

My bad for introducing the bug, and thanks Kirill for fixing it.

Acked-by: Michel Lespinasse <walken@google.com>

On Tue, May 6, 2014 at 6:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Tue, May 06, 2014 at 01:13:31AM +0300, Kirill A. Shutemov wrote:
>> But on move_vma() destination VMA can be merged into adjacent one and as
>> result shifted left in interval tree. Fortunately, we can detect the
>> situation and prevent race with rmap walk by moving page table entries
>> under rmap lock. See commit 38a76013ad80.

Yup, forgot to take care of the THP case there...

> Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")

I think 108d6642ad81 on its own was OK (as it always took the locks);
but the attempt to not take them in the common case in 38a76013ad80 is
where I forgot to consider the THP case.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


* Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()
  2014-05-08  0:13   ` Michel Lespinasse
@ 2014-05-08 18:14     ` Johannes Weiner
  2014-05-09  8:44       ` Michel Lespinasse
  0 siblings, 1 reply; 8+ messages in thread
From: Johannes Weiner @ 2014-05-08 18:14 UTC (permalink / raw)
  To: Michel Lespinasse
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andrea Arcangeli,
	Rik van Riel, Dave Jones, stable

On Wed, May 07, 2014 at 05:13:32PM -0700, Michel Lespinasse wrote:
> On Tue, May 6, 2014 at 6:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> > Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")
> 
> I think 108d6642ad81 on its own was OK (as it always took the locks);
> but the attempt to not take them in the common case in 38a76013ad80 is
> where I forgot to consider the THP case.

108d6642ad81 replaced the chain ordering with an explicit lock, but I
see the unconditional locking only in move_ptes(), which isn't called
for THP pmds.
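
The point that locking lived only in the PTE path can be modelled in
user space. This is a rough sketch under assumptions: all toy_* names
are hypothetical, the counters stand in for real lock acquisitions, and
the `patched` flag models whether the huge PMD path takes the lock as
the patch in this thread proposes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy counters; every name in this block is hypothetical. */
struct toy_stats {
	int moves;		/* page table moves performed */
	int locked_moves;	/* moves performed under the rmap lock */
};

/* PTE path: takes the lock whenever need_rmap_locks is set. */
static void toy_move_ptes(struct toy_stats *s, bool need_rmap_locks)
{
	s->moves++;
	if (need_rmap_locks)
		s->locked_moves++;
}

/* Huge PMD path: takes the lock only when `patched` models the fix. */
static void toy_move_huge_pmd(struct toy_stats *s, bool need_rmap_locks,
			      bool patched)
{
	s->moves++;
	if (patched && need_rmap_locks)
		s->locked_moves++;
}

/* Walk `n` extents; is_huge[i] says whether extent i is a huge PMD. */
static struct toy_stats toy_move_page_tables(const bool *is_huge, size_t n,
					     bool need_rmap_locks,
					     bool patched)
{
	struct toy_stats s = { 0, 0 };

	assert(is_huge != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (is_huge[i])
			toy_move_huge_pmd(&s, need_rmap_locks, patched);
		else
			toy_move_ptes(&s, need_rmap_locks);
	}
	return s;
}
```

For a layout of { huge, pte, huge } with need_rmap_locks set, the
unpatched model locks one move out of three and the patched model locks
all three.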


* Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()
  2014-05-08 18:14     ` Johannes Weiner
@ 2014-05-09  8:44       ` Michel Lespinasse
  0 siblings, 0 replies; 8+ messages in thread
From: Michel Lespinasse @ 2014-05-09  8:44 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Kirill A. Shutemov, Andrew Morton, linux-mm, Andrea Arcangeli,
	Rik van Riel, Dave Jones, stable

On Thu, May 8, 2014 at 11:14 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
> On Wed, May 07, 2014 at 05:13:32PM -0700, Michel Lespinasse wrote:
>> On Tue, May 6, 2014 at 6:06 AM, Johannes Weiner <hannes@cmpxchg.org> wrote:
>> > Fixes: 108d6642ad81 ("mm anon rmap: remove anon_vma_moveto_tail")
>>
>> I think 108d6642ad81 on its own was OK (as it always took the locks);
>> but the attempt to not take them in the common case in 38a76013ad80 is
>> where I forgot to consider the THP case.
>
> 108d6642ad81 replaced the chain ordering with an explicit lock, but I
> see the unconditional locking only in move_ptes(), which isn't called
> for THP pmds.

Ah yes, you are right.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

