From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Zi Yan <zi.yan@sent.com>, Andrea Arcangeli <aarcange@redhat.com>,
Minchan Kim <minchan@kernel.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
kirill.shutemov@linux.intel.com, akpm@linux-foundation.org,
vbabka@suse.cz, mgorman@techsingularity.net,
n-horiguchi@ah.jp.nec.com, khandual@linux.vnet.ibm.com,
Zi Yan <ziy@nvidia.com>
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
Date: Tue, 7 Feb 2017 19:37:34 +0300 [thread overview]
Message-ID: <20170207163734.GA5578@node.shutemov.name> (raw)
In-Reply-To: <5899E389.3040801@cs.rutgers.edu>
On Tue, Feb 07, 2017 at 09:11:05AM -0600, Zi Yan wrote:
> >> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
> >
> > The problem is that numabalancing calls change_huge_pmd() under
> > down_read(mmap_sem), not down_write(mmap_sem) as the rest of users do.
> > It makes numabalancing the only code path beyond page fault that can turn
> > pmd_none() into pmd_trans_huge() under down_read(mmap_sem).
> >
> > This can lead to race when MADV_DONTNEED miss THP. That's not critical for
> > pagefault vs. MADV_DONTNEED race as we will end up with clear page in that
> > case. Not so much for change_huge_pmd().
> >
> > Looks like we need pmdp_modify() or something to modify protection bits
> > inplace, without clearing pmd.
> >
> > Not sure how to get crash scenario.
> >
> > BTW, Zi, have you observed the crash? Or is it based on code inspection?
> > Any backtraces?
>
> The problem should be very rare in the upstream kernel. I discover the
> problem in my customized kernel which does very frequent page migration
> and uses numa_protnone.
>
> The crash scenario I guess is like:
> 1. A huge page pmd entry is in the middle of being changed into either a
> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
>
> 2. At the same time, the application frees the vma this page belongs to.
Em... no.
This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
And we only be able to remove vma under down_write(mmap_sem), so the
scenario should be excluded.
What do I miss?
> 3. zap_pmd_range() only see pmd_none in
> "if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))",
> it might catch pmd_protnone in
> "if (pmd_none_or_trans_huge_or_clear_bad(pmd))". But nothing is done for
> it. So the deposited PTE page table page associated with the huge pmd
> entry is not withdrawn.
>
> 4. free_pmd_range() calls pmd_free_tlb() and in pgtable_pmd_page_dtor(),
> VM_BUG_ON_PAGE(page->pmd_huge_pte, page) is triggered.
>
> The crash log (you will see a pmd_migration_entry is regarded as bad
> pmd, which should not be. I also saw pmd_protnone before.):
>
> [ 1945.978677] mm/pgtable-generic.c:33: bad pmd
> ffff8f07b13c1b90(0000004fed803c00)
> ^^^^^^^^^^^^^^^^ a pmd migration entry
>
> [ 1946.964974] page:fffffd1dd0c4f040 count:1 mapcount:-511 mapping:
> (null) index:0x0
> [ 1946.974265] flags: 0x6ffff0000000000()
> [ 1946.978486] raw: 06ffff0000000000 0000000000000000 0000000000000000
> 00000001fffffe00
> [ 1946.987202] raw: dead000000000100 fffffd1dd0c45c80 ffff8f07aa38e340
> ffff8efdca466678
> [ 1946.995927] page dumped because: VM_BUG_ON_PAGE(page->pmd_huge_pte)
> [ 1947.002984] page->mem_cgroup:ffff8efdca466678
> [ 1947.007927] ------------[ cut here ]------------
> [ 1947.013123] kernel BUG at ./include/linux/mm.h:1733!
> [ 1947.018706] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 1947.024774] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4
> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> xt_conntrack nf_conntrack intel_rapl sb_edac edac_corei
> [ 1947.077814] CPU: 19 PID: 3303 Comm: python Not tainted
> 4.10.0-rc5-page-migration+ #283
> [ 1947.086721] Hardware name: Dell Inc. PowerEdge R530/0HFG24, BIOS
> 1.5.4 10/05/2015
> [ 1947.095140] task: ffff8f07a5870040 task.stack: ffffc37d64adc000
> [ 1947.101796] RIP: 0010:___pmd_free_tlb+0x83/0x90
> [ 1947.106890] RSP: 0018:ffffc37d64adfce8 EFLAGS: 00010282
> [ 1947.112762] RAX: 0000000000000021 RBX: ffffc37d64adfe10 RCX:
> 0000000000000000
> [ 1947.120770] RDX: 0000000000000000 RSI: ffff8f07c224dea8 RDI:
> ffff8f07c224dea8
> [ 1947.128809] RBP: ffffc37d64adfcf8 R08: 0000000000000001 R09:
> 0000000000000000
> [ 1947.136818] R10: 000000000000000f R11: 0000000000000001 R12:
> fffffd1dd0c4f040
> [ 1947.144825] R13: 00007fae2d7fd000 R14: ffff8f07b13c1b60 R15:
> ffffc37d64adfe10
> [ 1947.152832] FS: 00007fafbcce6700(0000) GS:ffff8f07c2240000(0000)
> knlGS:0000000000000000
> [ 1947.161934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1947.168380] CR2: 00007fafeb851188 CR3: 0000001432184000 CR4:
> 00000000001406e0
> [ 1947.176393] Call Trace:
> [ 1947.179160] free_pgd_range+0x487/0x5d0
> [ 1947.183476] free_pgtables+0xc4/0x120
> [ 1947.187593] unmap_region+0xe1/0x130
> [ 1947.191620] do_munmap+0x273/0x400
> [ 1947.195452] SyS_munmap+0x53/0x70
> [ 1947.199190] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1947.204382] RIP: 0033:0x7fafeb59d387
> [ 1947.208406] RSP: 002b:00007fafbcce5358 EFLAGS: 00000207 ORIG_RAX:
> 000000000000000b
> [ 1947.216924] RAX: ffffffffffffffda RBX: 00007faf2c0b6780 RCX:
> 00007fafeb59d387
> [ 1947.224933] RDX: 00007fae247fc030 RSI: 0000000009001000 RDI:
> 00007fae247fc000
> [ 1947.232940] RBP: 00007fafbcce5390 R08: 00007faf48f9ed00 R09:
> 0000000000000100
> [ 1947.240947] R10: 0000000000000020 R11: 0000000000000207 R12:
> 0000000002cf1fc0
> [ 1947.248955] R13: 0000000002cf1fc0 R14: 000000000343d530 R15:
> 00007fafbcce5810
> [ 1947.256965] Code: 4c 89 e6 48 89 df e8 0d b5 1a 00 84 c0 74 08 48 89
> df e8 91 b4 1a 00 5b 41 5c 5d c3 48 c7 c6 d8 73 c7 b8 4c 89 e7 e8 dd 7b
> 1a 00 <0f> 0b 48 8b 3d 34 80 d9 00 eb 99 66 90
> [ 1947.278200] RIP: ___pmd_free_tlb+0x83/0x90 RSP: ffffc37d64adfce8
> [ 1947.285688] ---[ end trace 7864a23976d71e0a ]---
>
> --
> Best Regards,
> Yan Zi
>
--
Kirill A. Shutemov
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
WARNING: multiple messages have this Message-ID (diff)
From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Zi Yan <zi.yan@sent.com>, Andrea Arcangeli <aarcange@redhat.com>,
Minchan Kim <minchan@kernel.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
kirill.shutemov@linux.intel.com, akpm@linux-foundation.org,
vbabka@suse.cz, mgorman@techsingularity.net,
n-horiguchi@ah.jp.nec.com, khandual@linux.vnet.ibm.com,
Zi Yan <ziy@nvidia.com>
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
Date: Tue, 7 Feb 2017 19:37:34 +0300 [thread overview]
Message-ID: <20170207163734.GA5578@node.shutemov.name> (raw)
In-Reply-To: <5899E389.3040801@cs.rutgers.edu>
On Tue, Feb 07, 2017 at 09:11:05AM -0600, Zi Yan wrote:
> >> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
> >
> > The problem is that numabalancing calls change_huge_pmd() under
> > down_read(mmap_sem), not down_write(mmap_sem) as the rest of users do.
> > It makes numabalancing the only code path beyond page fault that can turn
> > pmd_none() into pmd_trans_huge() under down_read(mmap_sem).
> >
> > This can lead to race when MADV_DONTNEED miss THP. That's not critical for
> > pagefault vs. MADV_DONTNEED race as we will end up with clear page in that
> > case. Not so much for change_huge_pmd().
> >
> > Looks like we need pmdp_modify() or something to modify protection bits
> > inplace, without clearing pmd.
> >
> > Not sure how to get crash scenario.
> >
> > BTW, Zi, have you observed the crash? Or is it based on code inspection?
> > Any backtraces?
>
> The problem should be very rare in the upstream kernel. I discover the
> problem in my customized kernel which does very frequent page migration
> and uses numa_protnone.
>
> The crash scenario I guess is like:
> 1. A huge page pmd entry is in the middle of being changed into either a
> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
>
> 2. At the same time, the application frees the vma this page belongs to.
Em... no.
This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
And we only be able to remove vma under down_write(mmap_sem), so the
scenario should be excluded.
What do I miss?
> 3. zap_pmd_range() only see pmd_none in
> "if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))",
> it might catch pmd_protnone in
> "if (pmd_none_or_trans_huge_or_clear_bad(pmd))". But nothing is done for
> it. So the deposited PTE page table page associated with the huge pmd
> entry is not withdrawn.
>
> 4. free_pmd_range() calls pmd_free_tlb() and in pgtable_pmd_page_dtor(),
> VM_BUG_ON_PAGE(page->pmd_huge_pte, page) is triggered.
>
> The crash log (you will see a pmd_migration_entry is regarded as bad
> pmd, which should not be. I also saw pmd_protnone before.):
>
> [ 1945.978677] mm/pgtable-generic.c:33: bad pmd
> ffff8f07b13c1b90(0000004fed803c00)
> ^^^^^^^^^^^^^^^^ a pmd migration entry
>
> [ 1946.964974] page:fffffd1dd0c4f040 count:1 mapcount:-511 mapping:
> (null) index:0x0
> [ 1946.974265] flags: 0x6ffff0000000000()
> [ 1946.978486] raw: 06ffff0000000000 0000000000000000 0000000000000000
> 00000001fffffe00
> [ 1946.987202] raw: dead000000000100 fffffd1dd0c45c80 ffff8f07aa38e340
> ffff8efdca466678
> [ 1946.995927] page dumped because: VM_BUG_ON_PAGE(page->pmd_huge_pte)
> [ 1947.002984] page->mem_cgroup:ffff8efdca466678
> [ 1947.007927] ------------[ cut here ]------------
> [ 1947.013123] kernel BUG at ./include/linux/mm.h:1733!
> [ 1947.018706] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 1947.024774] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4
> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> xt_conntrack nf_conntrack intel_rapl sb_edac edac_corei
> [ 1947.077814] CPU: 19 PID: 3303 Comm: python Not tainted
> 4.10.0-rc5-page-migration+ #283
> [ 1947.086721] Hardware name: Dell Inc. PowerEdge R530/0HFG24, BIOS
> 1.5.4 10/05/2015
> [ 1947.095140] task: ffff8f07a5870040 task.stack: ffffc37d64adc000
> [ 1947.101796] RIP: 0010:___pmd_free_tlb+0x83/0x90
> [ 1947.106890] RSP: 0018:ffffc37d64adfce8 EFLAGS: 00010282
> [ 1947.112762] RAX: 0000000000000021 RBX: ffffc37d64adfe10 RCX:
> 0000000000000000
> [ 1947.120770] RDX: 0000000000000000 RSI: ffff8f07c224dea8 RDI:
> ffff8f07c224dea8
> [ 1947.128809] RBP: ffffc37d64adfcf8 R08: 0000000000000001 R09:
> 0000000000000000
> [ 1947.136818] R10: 000000000000000f R11: 0000000000000001 R12:
> fffffd1dd0c4f040
> [ 1947.144825] R13: 00007fae2d7fd000 R14: ffff8f07b13c1b60 R15:
> ffffc37d64adfe10
> [ 1947.152832] FS: 00007fafbcce6700(0000) GS:ffff8f07c2240000(0000)
> knlGS:0000000000000000
> [ 1947.161934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1947.168380] CR2: 00007fafeb851188 CR3: 0000001432184000 CR4:
> 00000000001406e0
> [ 1947.176393] Call Trace:
> [ 1947.179160] free_pgd_range+0x487/0x5d0
> [ 1947.183476] free_pgtables+0xc4/0x120
> [ 1947.187593] unmap_region+0xe1/0x130
> [ 1947.191620] do_munmap+0x273/0x400
> [ 1947.195452] SyS_munmap+0x53/0x70
> [ 1947.199190] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1947.204382] RIP: 0033:0x7fafeb59d387
> [ 1947.208406] RSP: 002b:00007fafbcce5358 EFLAGS: 00000207 ORIG_RAX:
> 000000000000000b
> [ 1947.216924] RAX: ffffffffffffffda RBX: 00007faf2c0b6780 RCX:
> 00007fafeb59d387
> [ 1947.224933] RDX: 00007fae247fc030 RSI: 0000000009001000 RDI:
> 00007fae247fc000
> [ 1947.232940] RBP: 00007fafbcce5390 R08: 00007faf48f9ed00 R09:
> 0000000000000100
> [ 1947.240947] R10: 0000000000000020 R11: 0000000000000207 R12:
> 0000000002cf1fc0
> [ 1947.248955] R13: 0000000002cf1fc0 R14: 000000000343d530 R15:
> 00007fafbcce5810
> [ 1947.256965] Code: 4c 89 e6 48 89 df e8 0d b5 1a 00 84 c0 74 08 48 89
> df e8 91 b4 1a 00 5b 41 5c 5d c3 48 c7 c6 d8 73 c7 b8 4c 89 e7 e8 dd 7b
> 1a 00 <0f> 0b 48 8b 3d 34 80 d9 00 eb 99 66 90
> [ 1947.278200] RIP: ___pmd_free_tlb+0x83/0x90 RSP: ffffc37d64adfce8
> [ 1947.285688] ---[ end trace 7864a23976d71e0a ]---
>
> --
> Best Regards,
> Yan Zi
>
--
Kirill A. Shutemov
next prev parent reply other threads:[~2017-02-07 16:37 UTC|newest]
Thread overview: 87+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-02-05 16:12 [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-06 6:12 ` Naoya Horiguchi
2017-02-06 6:12 ` Naoya Horiguchi
2017-02-06 12:10 ` Zi Yan
2017-02-06 15:02 ` Matthew Wilcox
2017-02-06 15:02 ` Matthew Wilcox
2017-02-06 15:03 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 02/14] mm: thp: create new __zap_huge_pmd_locked function Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range() Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-06 4:02 ` Hillf Danton
2017-02-06 4:02 ` Hillf Danton
2017-02-06 4:14 ` Zi Yan
2017-02-06 4:14 ` Zi Yan
2017-02-06 7:43 ` Naoya Horiguchi
2017-02-06 7:43 ` Naoya Horiguchi
2017-02-06 13:02 ` Zi Yan
2017-02-06 23:22 ` Naoya Horiguchi
2017-02-06 23:22 ` Naoya Horiguchi
2017-02-06 16:07 ` Kirill A. Shutemov
2017-02-06 16:07 ` Kirill A. Shutemov
2017-02-06 16:32 ` Zi Yan
2017-02-06 17:35 ` Kirill A. Shutemov
2017-02-06 17:35 ` Kirill A. Shutemov
2017-02-07 13:55 ` Aneesh Kumar K.V
2017-02-07 13:55 ` Aneesh Kumar K.V
2017-02-07 14:12 ` Zi Yan
2017-02-07 14:19 ` Kirill A. Shutemov
2017-02-07 14:19 ` Kirill A. Shutemov
2017-02-07 15:11 ` Zi Yan
2017-02-07 15:11 ` Zi Yan
2017-02-07 16:37 ` Kirill A. Shutemov [this message]
2017-02-07 16:37 ` Kirill A. Shutemov
2017-02-07 17:14 ` Zi Yan
2017-02-07 17:14 ` Zi Yan
2017-02-07 17:45 ` Kirill A. Shutemov
2017-02-07 17:45 ` Kirill A. Shutemov
2017-02-13 0:25 ` Zi Yan
2017-02-13 0:25 ` Zi Yan
2017-02-13 10:59 ` Kirill A. Shutemov
2017-02-13 10:59 ` Kirill A. Shutemov
2017-02-13 14:40 ` Andrea Arcangeli
2017-02-13 14:40 ` Andrea Arcangeli
2017-02-05 16:12 ` [PATCH v3 04/14] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-09 9:14 ` Naoya Horiguchi
2017-02-09 9:14 ` Naoya Horiguchi
2017-02-09 15:07 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 05/14] mm: mempolicy: add queue_pages_node_check() Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 06/14] mm: thp: introduce separate TTU flag for thp freezing Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 07/14] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 08/14] mm: thp: enable thp migration in generic path Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-09 9:15 ` Naoya Horiguchi
2017-02-09 9:15 ` Naoya Horiguchi
2017-02-09 15:17 ` Zi Yan
2017-02-09 23:04 ` Naoya Horiguchi
2017-02-09 23:04 ` Naoya Horiguchi
2017-02-14 20:13 ` Zi Yan
2017-02-14 20:13 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 09/14] mm: thp: check pmd migration entry in common path Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-09 9:16 ` Naoya Horiguchi
2017-02-09 9:16 ` Naoya Horiguchi
2017-02-09 17:36 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 10/14] mm: soft-dirty: keep soft-dirty bits over thp migration Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 11/14] mm: hwpoison: soft offline supports " Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 12/14] mm: mempolicy: mbind and migrate_pages support " Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 13/14] mm: migrate: move_pages() supports " Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-09 9:16 ` Naoya Horiguchi
2017-02-09 9:16 ` Naoya Horiguchi
2017-02-09 17:37 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 14/14] mm: memory_hotplug: memory hotremove " Zi Yan
2017-02-05 16:12 ` Zi Yan
2017-02-23 16:12 ` [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
2017-02-23 16:12 ` Zi Yan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170207163734.GA5578@node.shutemov.name \
--to=kirill@shutemov.name \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=khandual@linux.vnet.ibm.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=minchan@kernel.org \
--cc=n-horiguchi@ah.jp.nec.com \
--cc=vbabka@suse.cz \
--cc=zi.yan@cs.rutgers.edu \
--cc=zi.yan@sent.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.