All of lore.kernel.org
 help / color / mirror / Atom feed
From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>,
	Stable <stable@vger.kernel.org>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Peter Feiner <pfeiner@google.com>
Subject: [PATCH] proc/pagemap: walk page tables under pte lock
Date: Mon, 26 Jan 2015 17:52:14 +0300	[thread overview]
Message-ID: <20150126145214.11053.5670.stgit@buzz> (raw)

Lockless access to pte in pagemap_pte_range() might race with page migration
and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

CPU A (pagemap)                           CPU B (migration)
                                          lock_page()
                                          try_to_unmap(page, TTU_MIGRATION...)
                                               make_migration_entry()
                                               set_pte_at()
<read *pte>
pte_to_pagemap_entry()
                                          remove_migration_ptes()
                                          unlock_page()
    if(is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))

Also lockless read might be non-atomic if pte is larger than wordsize.
Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
Cc: Stable <stable@vger.kernel.org> (v3.5+)

---

------------[ cut here ]------------
kernel BUG at ../include/linux/swapops.h:131!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 702 Comm: a.out Not tainted 3.19.0-rc1+ #16
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
04/01/2014
task: ffff880001dc0dc0 ti: ffff88003dbd0000 task.ti: ffff88003dbd0000
RIP: pagemap_pte_range (include/linux/swapops.h:131 fs/proc/task_mmu.c:1136)
RSP: 0018:ffff88003dbd3d08  EFLAGS: 00010246
RAX: ffffea0000492b00 RBX: ffff88003dbd3e60 RCX: ffff880013bbe878
RDX: 0000000000000000 RSI: 000000000000001f RDI: 3e00000000000000
RBP: ffff88003dbd3d78 R08: 000000000024959f R09: ffff880001f5d000
R10: 0000000000000000 R11: 007fffffffffffff R12: 00007f6662000000
R13: ffff880001f5d000 R14: 00007f6661e8f000 R15: 0600000000000000
FS:  00007f666381d700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f666333a520 CR3: 0000000013b7f000 CR4: 00000000000407b0
Stack:
0000000000000000 0000000062000000 ffff88003dbd3e78 ffff880013bbe878
007fffffffffffff 000000000003f65d ffffffff810d3710 0000000000000010
ffff88003dbd3d68 ffff88003dbd3e78 00007f6662000000 00007f6661e62000
Call Trace:
? find_vma (mm/mmap.c:2046)
walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:92 mm/pagewalk.c:241)
pagemap_read (fs/proc/task_mmu.c:1297)
? pid_maps_open (fs/proc/task_mmu.c:1068)
? m_stop (kernel/module.c:621)
__vfs_read (fs/read_write.c:430)
vfs_read (fs/read_write.c:455)
SyS_pread64 (fs/read_write.c:619 fs/read_write.c:606)
system_call_fastpath (arch/x86/kernel/entry_64.S:423)
Code: a0 4c 89 f6 48 8b 78 30 e8 a1 10 f9 ff 48 8b 4d b8 49 89 c4 e9
6d fc ff ff 0f 1f 44 00 00 49 39 ce 73 17 48 89 cf e9 7b fc ff ff <0f>
0b 4c 89 c7 e8 a9 bc f9 ff e9 ea fb ff ff 44 8b 55 9c 44 89
All code
========
   0:   a0 4c 89 f6 48 8b 78    movabs 0xe830788b48f6894c,%al
   7:   30 e8
   9:   a1 10 f9 ff 48 8b 4d    movabs 0x49b84d8b48fff910,%eax
  10:   b8 49
  12:   89 c4                   mov    %eax,%esp
  14:   e9 6d fc ff ff          jmpq   0xfffffffffffffc86
  19:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  1e:   49 39 ce                cmp    %rcx,%r14
  21:   73 17                   jae    0x3a
  23:   48 89 cf                mov    %rcx,%rdi
  26:   e9 7b fc ff ff          jmpq   0xfffffffffffffca6
  2b:*  0f 0b                   ud2             <-- trapping instruction
  2d:   4c 89 c7                mov    %r8,%rdi
  30:   e8 a9 bc f9 ff          callq  0xfffffffffff9bcde
  35:   e9 ea fb ff ff          jmpq   0xfffffffffffffc24
  3a:   44 8b 55 9c             mov    -0x64(%rbp),%r10d
  3e:   44                      rex.R
  3f:   89                      .byte 0x89

Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2
   2:   4c 89 c7                mov    %r8,%rdi
   5:   e8 a9 bc f9 ff          callq  0xfffffffffff9bcb3
   a:   e9 ea fb ff ff          jmpq   0xfffffffffffffbf9
   f:   44 8b 55 9c             mov    -0x64(%rbp),%r10d
  13:   44                      rex.R
  14:   89                      .byte 0x89
RIP pagemap_pte_range (include/linux/swapops.h:131 fs/proc/task_mmu.c:1136)
RSP <ffff88003dbd3d08>
------------[ cut here ]------------
---
 fs/proc/task_mmu.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2464367..ff65557 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1070,7 +1070,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct vm_area_struct *vma;
 	struct pagemapread *pm = walk->private;
 	spinlock_t *ptl;
-	pte_t *pte;
+	pte_t *pte, *orig_pte;
 	int err = 0;
 
 	/* find the first VMA at or above 'addr' */
@@ -1131,15 +1131,19 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		BUG_ON(is_vm_hugetlb_page(vma));
 
 		/* Addresses in the VMA. */
-		for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
+		orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+		for (; addr < min(end, vma->vm_end); pte++, addr += PAGE_SIZE) {
 			pagemap_entry_t pme;
-			pte = pte_offset_map(pmd, addr);
+
 			pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
-			pte_unmap(pte);
 			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
-				return err;
+				break;
 		}
+		pte_unmap_unlock(orig_pte, ptl);
+
+		if (err)
+			return err;
 
 		if (addr == end)
 			break;

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
To: linux-mm@kvack.org, Andrew Morton <akpm@linux-foundation.org>,
	linux-kernel@vger.kernel.org
Cc: Andrey Ryabinin <a.ryabinin@samsung.com>,
	Stable <stable@vger.kernel.org>,
	Cyrill Gorcunov <gorcunov@openvz.org>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>,
	Peter Feiner <pfeiner@google.com>
Subject: [PATCH] proc/pagemap: walk page tables under pte lock
Date: Mon, 26 Jan 2015 17:52:14 +0300	[thread overview]
Message-ID: <20150126145214.11053.5670.stgit@buzz> (raw)

Lockless access to pte in pagemap_pte_range() might race with page migration
and trigger BUG_ON(!PageLocked()) in migration_entry_to_page():

CPU A (pagemap)                           CPU B (migration)
                                          lock_page()
                                          try_to_unmap(page, TTU_MIGRATION...)
                                               make_migration_entry()
                                               set_pte_at()
<read *pte>
pte_to_pagemap_entry()
                                          remove_migration_ptes()
                                          unlock_page()
    if(is_migration_entry())
        migration_entry_to_page()
            BUG_ON(!PageLocked(page))

Also lockless read might be non-atomic if pte is larger than wordsize.
Other pte walkers (smaps, numa_maps, clear_refs) already lock ptes.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
Fixes: 052fb0d635df ("proc: report file/anon bit in /proc/pid/pagemap")
Cc: Stable <stable@vger.kernel.org> (v3.5+)

---

------------[ cut here ]------------
kernel BUG at ../include/linux/swapops.h:131!
invalid opcode: 0000 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 702 Comm: a.out Not tainted 3.19.0-rc1+ #16
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org
04/01/2014
task: ffff880001dc0dc0 ti: ffff88003dbd0000 task.ti: ffff88003dbd0000
RIP: pagemap_pte_range (include/linux/swapops.h:131 fs/proc/task_mmu.c:1136)
RSP: 0018:ffff88003dbd3d08  EFLAGS: 00010246
RAX: ffffea0000492b00 RBX: ffff88003dbd3e60 RCX: ffff880013bbe878
RDX: 0000000000000000 RSI: 000000000000001f RDI: 3e00000000000000
RBP: ffff88003dbd3d78 R08: 000000000024959f R09: ffff880001f5d000
R10: 0000000000000000 R11: 007fffffffffffff R12: 00007f6662000000
R13: ffff880001f5d000 R14: 00007f6661e8f000 R15: 0600000000000000
FS:  00007f666381d700(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f666333a520 CR3: 0000000013b7f000 CR4: 00000000000407b0
Stack:
0000000000000000 0000000062000000 ffff88003dbd3e78 ffff880013bbe878
007fffffffffffff 000000000003f65d ffffffff810d3710 0000000000000010
ffff88003dbd3d68 ffff88003dbd3e78 00007f6662000000 00007f6661e62000
Call Trace:
? find_vma (mm/mmap.c:2046)
walk_page_range (mm/pagewalk.c:51 mm/pagewalk.c:92 mm/pagewalk.c:241)
pagemap_read (fs/proc/task_mmu.c:1297)
? pid_maps_open (fs/proc/task_mmu.c:1068)
? m_stop (kernel/module.c:621)
__vfs_read (fs/read_write.c:430)
vfs_read (fs/read_write.c:455)
SyS_pread64 (fs/read_write.c:619 fs/read_write.c:606)
system_call_fastpath (arch/x86/kernel/entry_64.S:423)
Code: a0 4c 89 f6 48 8b 78 30 e8 a1 10 f9 ff 48 8b 4d b8 49 89 c4 e9
6d fc ff ff 0f 1f 44 00 00 49 39 ce 73 17 48 89 cf e9 7b fc ff ff <0f>
0b 4c 89 c7 e8 a9 bc f9 ff e9 ea fb ff ff 44 8b 55 9c 44 89
All code
========
   0:   a0 4c 89 f6 48 8b 78    movabs 0xe830788b48f6894c,%al
   7:   30 e8
   9:   a1 10 f9 ff 48 8b 4d    movabs 0x49b84d8b48fff910,%eax
  10:   b8 49
  12:   89 c4                   mov    %eax,%esp
  14:   e9 6d fc ff ff          jmpq   0xfffffffffffffc86
  19:   0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
  1e:   49 39 ce                cmp    %rcx,%r14
  21:   73 17                   jae    0x3a
  23:   48 89 cf                mov    %rcx,%rdi
  26:   e9 7b fc ff ff          jmpq   0xfffffffffffffca6
  2b:*  0f 0b                   ud2             <-- trapping instruction
  2d:   4c 89 c7                mov    %r8,%rdi
  30:   e8 a9 bc f9 ff          callq  0xfffffffffff9bcde
  35:   e9 ea fb ff ff          jmpq   0xfffffffffffffc24
  3a:   44 8b 55 9c             mov    -0x64(%rbp),%r10d
  3e:   44                      rex.R
  3f:   89                      .byte 0x89

Code starting with the faulting instruction
===========================================
   0:   0f 0b                   ud2
   2:   4c 89 c7                mov    %r8,%rdi
   5:   e8 a9 bc f9 ff          callq  0xfffffffffff9bcb3
   a:   e9 ea fb ff ff          jmpq   0xfffffffffffffbf9
   f:   44 8b 55 9c             mov    -0x64(%rbp),%r10d
  13:   44                      rex.R
  14:   89                      .byte 0x89
RIP pagemap_pte_range (include/linux/swapops.h:131 fs/proc/task_mmu.c:1136)
RSP <ffff88003dbd3d08>
------------[ cut here ]------------
---
 fs/proc/task_mmu.c |   14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2464367..ff65557 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -1070,7 +1070,7 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 	struct vm_area_struct *vma;
 	struct pagemapread *pm = walk->private;
 	spinlock_t *ptl;
-	pte_t *pte;
+	pte_t *pte, *orig_pte;
 	int err = 0;
 
 	/* find the first VMA at or above 'addr' */
@@ -1131,15 +1131,19 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
 		BUG_ON(is_vm_hugetlb_page(vma));
 
 		/* Addresses in the VMA. */
-		for (; addr < min(end, vma->vm_end); addr += PAGE_SIZE) {
+		orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+		for (; addr < min(end, vma->vm_end); pte++, addr += PAGE_SIZE) {
 			pagemap_entry_t pme;
-			pte = pte_offset_map(pmd, addr);
+
 			pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
-			pte_unmap(pte);
 			err = add_to_pagemap(addr, &pme, pm);
 			if (err)
-				return err;
+				break;
 		}
+		pte_unmap_unlock(orig_pte, ptl);
+
+		if (err)
+			return err;
 
 		if (addr == end)
 			break;


             reply	other threads:[~2015-01-26 14:52 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-01-26 14:52 Konstantin Khlebnikov [this message]
2015-01-26 14:52 ` [PATCH] proc/pagemap: walk page tables under pte lock Konstantin Khlebnikov
2015-01-26 15:14 ` Cyrill Gorcunov
2015-01-26 15:14   ` Cyrill Gorcunov
2015-01-27  1:34 ` Kirill A. Shutemov
2015-01-27  1:34   ` Kirill A. Shutemov
2015-01-27  1:51 ` Naoya Horiguchi
2015-01-27  1:51   ` Naoya Horiguchi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20150126145214.11053.5670.stgit@buzz \
    --to=khlebnikov@yandex-team.ru \
    --cc=a.ryabinin@samsung.com \
    --cc=akpm@linux-foundation.org \
    --cc=gorcunov@openvz.org \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=n-horiguchi@ah.jp.nec.com \
    --cc=pfeiner@google.com \
    --cc=stable@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.