From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Andrea Arcangeli <aarcange@redhat.com>,
	Rik van Riel <riel@redhat.com>,
	Michel Lespinasse <walken@google.com>,
	Dave Jones <davej@redhat.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()
Date: Tue, 6 May 2014 11:43:33 +0300
Message-ID: <20140506084333.GA5575@node.dhcp.inet.fi>
In-Reply-To: <1399328011-15317-1-git-send-email-kirill.shutemov@linux.intel.com>

On Tue, May 06, 2014 at 01:13:31AM +0300, Kirill A. Shutemov wrote:
> It's critical for split_huge_page() (and migration) to catch and freeze
> all PMDs on rmap walk. It gets tricky if there's concurrent fork() or
> mremap() since usually we copy/move page table entries on dup_mm() or
> move_page_tables() without rmap lock taken. To get it work we rely on
> rmap walk order to not miss any entry. We expect to see destination VMA
> after source one to work correctly.
> 
> But after switching rmap implementation to interval tree it's not always
> possible to preserve expected walk order.
> 
> It works fine for dup_mm() since new VMA has the same vma_start_pgoff()
> / vma_last_pgoff() and explicitly insert dst VMA after src one with
> vma_interval_tree_insert_after().
> 
> But on move_vma() destination VMA can be merged into adjacent one and as
> result shifted left in interval tree. Fortunately, we can detect the
> situation and prevent race with rmap walk by moving page table entries
> under rmap lock. See commit 38a76013ad80.
> 
> Problem is that we miss the lock when we move transhuge PMD. Most likely
> this bug caused the crash[1].
> 
> [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473

It took a night, but I was able to trigger the crash that this patch fixes.

Test case:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define MB (1024UL*1024)
#define SIZE (4*MB)
#define BASE ((void *)0x400000000000)

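int main(void)
{
	char *x1, *x2;

	for (;;) {
		/* Map and populate 8MB at a fixed address. */
		x1 = mmap(BASE, 2 * SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_FIXED,
			-1, 0);
		if (x1 == MAP_FAILED)
			perror("x1"), exit(1);

		/* Move the upper 4MB just past the end of the mapping. */
		x2 = mremap(x1 + SIZE, SIZE, SIZE,
				MREMAP_FIXED | MREMAP_MAYMOVE,
				x1 + 2 * SIZE);
		if (x2 == MAP_FAILED)
			perror("x2"), exit(1);

		/* The child exits at once; fork() alone duplicates the mappings. */
		if (!fork())
			return 0;

		if (!fork()) {
			if (!fork())
				return 0;

			/*
			 * Changing protection on a single 4k page splits the
			 * VMA and, with it, the huge page (see the call trace).
			 */
			mprotect(x2, 4096, PROT_NONE);
			return 0;
		}

		/*
		 * Meanwhile the parent moves the upper half back next to the
		 * lower half, where the destination VMA can be merged into
		 * the adjacent one.
		 */
		x2 = mremap(x2, SIZE, SIZE,
				MREMAP_FIXED | MREMAP_MAYMOVE,
				x1 + SIZE);
		if (x2 == MAP_FAILED)
			perror("x2"), exit(1);
		munmap(x1, SIZE);
		munmap(x2, SIZE);

		/* Reap whichever children have already exited. */
		while (waitpid(-1, NULL, WNOHANG) > 0)
			;
	}
	return 0;
}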
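
The test case never calls madvise(MADV_HUGEPAGE), so it presumably needs transparent huge pages enabled system-wide to get huge PMDs in the first place. Below is a minimal sketch of a pre-flight check, assuming the usual sysfs knob at /sys/kernel/mm/transparent_hugepage/enabled. The helper names thp_selected_mode() and thp_read_mode() are made up for illustration and are not part of the original test.

```c
#include <stdio.h>
#include <string.h>

/*
 * Copy the bracketed token from a line such as "always [madvise] never"
 * into out. Returns 0 on success, -1 if there is no well-formed token
 * or it does not fit into out.
 */
int thp_selected_mode(const char *line, char *out, size_t outlen)
{
	const char *open = strchr(line, '[');
	const char *close = open ? strchr(open, ']') : NULL;
	size_t len;

	if (!open || !close)
		return -1;
	len = (size_t)(close - open - 1);
	if (len == 0 || len >= outlen)
		return -1;
	memcpy(out, open + 1, len);
	out[len] = '\0';
	return 0;
}

/*
 * Read the current THP mode from sysfs (path is an assumption about the
 * running kernel's configuration). Returns 0 on success, -1 otherwise.
 */
int thp_read_mode(char *out, size_t outlen)
{
	char line[128];
	FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
	int ret = -1;

	if (!f)
		return -1;
	if (fgets(line, sizeof(line), f))
		ret = thp_selected_mode(line, out, outlen);
	fclose(f);
	return ret;
}
```

Calling thp_read_mode() before entering the loop and bailing out unless the mode is "always" would avoid spending a night on a run that cannot hit the race.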

Crash:

[54438.764230] mapcount 2 page_mapcount 3
[54438.764985] ------------[ cut here ]------------
[54438.765735] kernel BUG at /home/space/kas/git/public/linux/mm/huge_memory.c:1836!
[54438.766926] invalid opcode: 0000 [#1] SMP 
[54438.767637] Modules linked in:
[54438.768078] CPU: 0 PID: 12638 Comm: test_split Not tainted 3.15.0-rc4-00001-gdb77ce6c9fe5-dirty #1282
[54438.768078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
[54438.768078] task: ffff8804633c8410 ti: ffff88046376c000 task.ti: ffff88046376c000
[54438.768078] RIP: 0010:[<ffffffff81140594>]  [<ffffffff81140594>] split_huge_page_to_list+0x434/0x6c0
[54438.768078] RSP: 0018:ffff88046376dcc8  EFLAGS: 00010297
[54438.768078] RAX: 0000000000000003 RBX: ffff88046881c520 RCX: 0000000000000006
[54438.768078] RDX: 0000000000000006 RSI: ffff8804633c8b18 RDI: ffff8804633c8410
[54438.768078] RBP: ffff88046376dd30 R08: 0000000000000001 R09: 0000000000000000
[54438.768078] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[54438.768078] R13: 0000400000800000 R14: ffffea000ede4000 R15: 0000000400000400
[54438.768078] FS:  00007fea6a7be700(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[54438.768078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54438.768078] CR2: 00007fea6a2db7d0 CR3: 0000000469bdf000 CR4: 00000000001407f0
[54438.768078] Stack:
[54438.768078]  ffff8804698f4020 0000400000a00000 0000000000000000 ffff880467a04900
[54438.768078]  0000000000000000 ffff880467a04880 ffff880400000002 ffff880462b5ccf8
[54438.768078]  0000400000800000 ffff88046370ac50 ffff8804698f4020 0000400000a00000
[54438.768078] Call Trace:
[54438.768078]  [<ffffffff81141050>] __split_huge_page_pmd+0xc0/0x1f0
[54438.768078]  [<ffffffff8114196e>] split_huge_page_pmd_mm+0x3e/0x40
[54438.768078]  [<ffffffff81141995>] split_huge_page_address+0x25/0x30
[54438.768078]  [<ffffffff81141a3c>] __vma_adjust_trans_huge+0x9c/0xf0
[54438.768078]  [<ffffffff8132268d>] ? __rb_insert_augmented+0xcd/0x1f0
[54438.768078]  [<ffffffff81116f06>] vma_adjust+0x626/0x6a0
[54438.768078]  [<ffffffff811170ad>] __split_vma.isra.35+0x12d/0x200
[54438.768078]  [<ffffffff81117e94>] split_vma+0x24/0x30
[54438.768078]  [<ffffffff8111a3ca>] mprotect_fixup+0x22a/0x260
[54438.768078]  [<ffffffff8111a542>] SyS_mprotect+0x142/0x230
[54438.768078]  [<ffffffff8173cb62>] system_call_fastpath+0x16/0x1b
[54438.768078] Code: 0f 1f 80 00 00 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 49 8b 16 4c 89 f0 80 
[54438.768078] RIP  [<ffffffff81140594>] split_huge_page_to_list+0x434/0x6c0
[54438.768078]  RSP <ffff88046376dcc8>
[54438.805154] ---[ end trace 12d4dde45cf392c6 ]---
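
The first line of the log, "mapcount 2 page_mapcount 3", reads as the rmap walk having found one mapping fewer than the page actually has, which is what a missed PMD would look like. The following is a toy userspace model of that walk-order dependency, not kernel code: struct toy_vma, toy_rmap_walk() and toy_count_found() are hypothetical names, and "freezing" is reduced to a flag that blocks a later move.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA that may map one huge page. */
struct toy_vma {
	int mapped;	/* this VMA currently maps the page */
	int frozen;	/* the walk has seen and frozen the mapping */
};

/*
 * Visit vmas[] in the given order and count the mappings found.
 * Just before visit number move_at, a concurrent "mremap" moves the
 * mapping from src to dst unless the walk has already frozen it.
 */
int toy_rmap_walk(struct toy_vma *vmas, const size_t *order, size_t n,
		  size_t move_at, size_t src, size_t dst)
{
	int found = 0;

	assert(src != dst);
	for (size_t i = 0; i < n; i++) {
		struct toy_vma *v = &vmas[order[i]];

		if (i == move_at && vmas[src].mapped && !vmas[src].frozen) {
			vmas[src].mapped = 0;
			vmas[dst].mapped = 1;
		}
		if (v->mapped) {
			v->frozen = 1;
			found++;
		}
	}
	return found;
}

/* One page mapped once in VMA 0 (source); VMA 1 is the destination. */
int toy_count_found(int dst_first, size_t move_at)
{
	struct toy_vma vmas[2] = { { 1, 0 }, { 0, 0 } };
	size_t src_first_order[2] = { 0, 1 };
	size_t dst_first_order[2] = { 1, 0 };

	return toy_rmap_walk(vmas, dst_first ? dst_first_order : src_first_order,
			     2, move_at, 0, 1);
}
```

In this model, visiting source before destination always finds the single mapping, whenever the move happens. Visiting destination first with the move landing between the two visits finds nothing, which is the case the patch is meant to close by doing the move under the rmap lock.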


-- 
 Kirill A. Shutemov


