linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, Andrea Arcangeli <aarcange@redhat.com>,
	Rik van Riel <riel@redhat.com>,
	Michel Lespinasse <walken@google.com>,
	Dave Jones <davej@redhat.com>,
	stable@vger.kernel.org
Subject: Re: [PATCH] mm, thp: close race between mremap() and split_huge_page()
Date: Tue, 6 May 2014 11:43:33 +0300	[thread overview]
Message-ID: <20140506084333.GA5575@node.dhcp.inet.fi> (raw)
In-Reply-To: <1399328011-15317-1-git-send-email-kirill.shutemov@linux.intel.com>

On Tue, May 06, 2014 at 01:13:31AM +0300, Kirill A. Shutemov wrote:
> It's critical for split_huge_page() (and migration) to catch and freeze
> all PMDs on rmap walk. It gets tricky if there's concurrent fork() or
> mremap() since usually we copy/move page table entries on dup_mm() or
> move_page_tables() without rmap lock taken. To get it work we rely on
> rmap walk order to not miss any entry. We expect to see destination VMA
> after source one to work correctly.
> 
> But after switching rmap implementation to interval tree it's not always
> possible to preserve expected walk order.
> 
> It works fine for dup_mm() since new VMA has the same vma_start_pgoff()
> / vma_last_pgoff() and explicitly insert dst VMA after src one with
> vma_interval_tree_insert_after().
> 
> But on move_vma() destination VMA can be merged into adjacent one and as
> result shifted left in interval tree. Fortunately, we can detect the
> situation and prevent race with rmap walk by moving page table entries
> under rmap lock. See commit 38a76013ad80.
> 
> Problem is that we miss the lock when we move transhuge PMD. Most likely
> this bug caused the crash[1].
> 
> [1] http://thread.gmane.org/gmane.linux.kernel.mm/96473

It took a night but I was able to trigger crash which this patch fixes.

Test case:

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/wait.h>

#define MB (1024UL*1024)
#define SIZE (4*MB)
#define BASE ((void *)0x400000000000)

int main()
{
	char *x1, *x2;

	for (;;) {
		x1 = mmap(BASE, 2 * SIZE, PROT_READ | PROT_WRITE,
			MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_FIXED,
			-1, 0);
                if (x1 == MAP_FAILED)
                        perror("x1"), exit(1);
		x2 = mremap(x1 + SIZE, SIZE, SIZE,
				MREMAP_FIXED | MREMAP_MAYMOVE,
				x1 + 2 * SIZE);
                if (x2 == MAP_FAILED)
                        perror("x2"), exit(1);

		if (!fork())
			return 0;

		if (!fork()) {
			if (!fork())
				return 0;

			mprotect(x2, 4096, PROT_NONE);
			return 0;
		}

		x2 = mremap(x2, SIZE, SIZE,
				MREMAP_FIXED | MREMAP_MAYMOVE,
				x1 + SIZE);
		if (x2 == MAP_FAILED)
			perror("x2"), exit(1);
		munmap(x1, SIZE);
		munmap(x2, SIZE);
		while (waitpid(-1, NULL, WNOHANG) > 0);
	}
	return 0;
}

Crash:

[54438.764230] mapcount 2 page_mapcount 3
[54438.764985] ------------[ cut here ]------------
[54438.765735] kernel BUG at /home/space/kas/git/public/linux/mm/huge_memory.c:1836!
[54438.766926] invalid opcode: 0000 [#1] SMP 
[54438.767637] Modules linked in:
[54438.768078] CPU: 0 PID: 12638 Comm: test_split Not tainted 3.15.0-rc4-00001-gdb77ce6c9fe5-dirty #1282
[54438.768078] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Bochs 01/01/2011
[54438.768078] task: ffff8804633c8410 ti: ffff88046376c000 task.ti: ffff88046376c000
[54438.768078] RIP: 0010:[<ffffffff81140594>]  [<ffffffff81140594>] split_huge_page_to_list+0x434/0x6c0
[54438.768078] RSP: 0018:ffff88046376dcc8  EFLAGS: 00010297
[54438.768078] RAX: 0000000000000003 RBX: ffff88046881c520 RCX: 0000000000000006
[54438.768078] RDX: 0000000000000006 RSI: ffff8804633c8b18 RDI: ffff8804633c8410
[54438.768078] RBP: ffff88046376dd30 R08: 0000000000000001 R09: 0000000000000000
[54438.768078] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[54438.768078] R13: 0000400000800000 R14: ffffea000ede4000 R15: 0000000400000400
[54438.768078] FS:  00007fea6a7be700(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[54438.768078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[54438.768078] CR2: 00007fea6a2db7d0 CR3: 0000000469bdf000 CR4: 00000000001407f0
[54438.768078] Stack:
[54438.768078]  ffff8804698f4020 0000400000a00000 0000000000000000 ffff880467a04900
[54438.768078]  0000000000000000 ffff880467a04880 ffff880400000002 ffff880462b5ccf8
[54438.768078]  0000400000800000 ffff88046370ac50 ffff8804698f4020 0000400000a00000
[54438.768078] Call Trace:
[54438.768078]  [<ffffffff81141050>] __split_huge_page_pmd+0xc0/0x1f0
[54438.768078]  [<ffffffff8114196e>] split_huge_page_pmd_mm+0x3e/0x40
[54438.768078]  [<ffffffff81141995>] split_huge_page_address+0x25/0x30
[54438.768078]  [<ffffffff81141a3c>] __vma_adjust_trans_huge+0x9c/0xf0
[54438.768078]  [<ffffffff8132268d>] ? __rb_insert_augmented+0xcd/0x1f0
[54438.768078]  [<ffffffff81116f06>] vma_adjust+0x626/0x6a0
[54438.768078]  [<ffffffff811170ad>] __split_vma.isra.35+0x12d/0x200
[54438.768078]  [<ffffffff81117e94>] split_vma+0x24/0x30
[54438.768078]  [<ffffffff8111a3ca>] mprotect_fixup+0x22a/0x260
[54438.768078]  [<ffffffff8111a542>] SyS_mprotect+0x142/0x230
[54438.768078]  [<ffffffff8173cb62>] system_call_fastpath+0x16/0x1b
[54438.768078] Code: 0f 1f 80 00 00 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 66 0f 1f 44 00 00 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 49 8b 16 4c 89 f0 80 
[54438.768078] RIP  [<ffffffff81140594>] split_huge_page_to_list+0x434/0x6c0
[54438.768078]  RSP <ffff88046376dcc8>
[54438.805154] ---[ end trace 12d4dde45cf392c6 ]---


-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2014-05-06  8:43 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-05-05 22:13 [PATCH] mm, thp: close race between mremap() and split_huge_page() Kirill A. Shutemov
2014-05-06  8:43 ` Kirill A. Shutemov [this message]
2014-05-07 20:55   ` David Miller
2014-05-06 13:06 ` Johannes Weiner
2014-05-08  0:13   ` Michel Lespinasse
2014-05-08 18:14     ` Johannes Weiner
2014-05-09  8:44       ` Michel Lespinasse
2014-05-06 14:13 ` Andrea Arcangeli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20140506084333.GA5575@node.dhcp.inet.fi \
    --to=kirill@shutemov.name \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=davej@redhat.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-mm@kvack.org \
    --cc=riel@redhat.com \
    --cc=stable@vger.kernel.org \
    --cc=walken@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).