From: "Kirill A. Shutemov" <kirill@shutemov.name>
To: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Zi Yan <zi.yan@sent.com>, Andrea Arcangeli <aarcange@redhat.com>,
Minchan Kim <minchan@kernel.org>,
linux-kernel@vger.kernel.org, linux-mm@kvack.org,
kirill.shutemov@linux.intel.com, akpm@linux-foundation.org,
vbabka@suse.cz, mgorman@techsingularity.net,
n-horiguchi@ah.jp.nec.com, khandual@linux.vnet.ibm.com,
Zi Yan <ziy@nvidia.com>
Subject: Re: [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range()
Date: Tue, 7 Feb 2017 19:37:34 +0300 [thread overview]
Message-ID: <20170207163734.GA5578@node.shutemov.name> (raw)
In-Reply-To: <5899E389.3040801@cs.rutgers.edu>
On Tue, Feb 07, 2017 at 09:11:05AM -0600, Zi Yan wrote:
> >> This causes memory leak or kernel crashing, if VM_BUG_ON() is enabled.
> >
> > The problem is that numabalancing calls change_huge_pmd() under
> > down_read(mmap_sem), not down_write(mmap_sem) as the rest of users do.
> > It makes numabalancing the only code path beyond page fault that can turn
> > pmd_none() into pmd_trans_huge() under down_read(mmap_sem).
> >
> > This can lead to race when MADV_DONTNEED miss THP. That's not critical for
> > pagefault vs. MADV_DONTNEED race as we will end up with clear page in that
> > case. Not so much for change_huge_pmd().
> >
> > Looks like we need pmdp_modify() or something to modify protection bits
> > inplace, without clearing pmd.
> >
> > Not sure how to get crash scenario.
> >
> > BTW, Zi, have you observed the crash? Or is it based on code inspection?
> > Any backtraces?
>
> The problem should be very rare in the upstream kernel. I discover the
> problem in my customized kernel which does very frequent page migration
> and uses numa_protnone.
>
> The crash scenario I guess is like:
> 1. A huge page pmd entry is in the middle of being changed into either a
> pmd_protnone or a pmd_migration_entry. It is cleared to pmd_none.
>
> 2. At the same time, the application frees the vma this page belongs to.
Em... no.
This shouldn't be possible: your 1. must be done under down_read(mmap_sem).
And we only be able to remove vma under down_write(mmap_sem), so the
scenario should be excluded.
What do I miss?
> 3. zap_pmd_range() only see pmd_none in
> "if (pmd_trans_huge(*pmd) || pmd_devmap(*pmd))",
> it might catch pmd_protnone in
> "if (pmd_none_or_trans_huge_or_clear_bad(pmd))". But nothing is done for
> it. So the deposited PTE page table page associated with the huge pmd
> entry is not withdrawn.
>
> 4. free_pmd_range() calls pmd_free_tlb() and in pgtable_pmd_page_dtor(),
> VM_BUG_ON_PAGE(page->pmd_huge_pte, page) is triggered.
>
> The crash log (you will see a pmd_migration_entry is regarded as bad
> pmd, which should not be. I also saw pmd_protnone before.):
>
> [ 1945.978677] mm/pgtable-generic.c:33: bad pmd
> ffff8f07b13c1b90(0000004fed803c00)
> ^^^^^^^^^^^^^^^^ a pmd migration entry
>
> [ 1946.964974] page:fffffd1dd0c4f040 count:1 mapcount:-511 mapping:
> (null) index:0x0
> [ 1946.974265] flags: 0x6ffff0000000000()
> [ 1946.978486] raw: 06ffff0000000000 0000000000000000 0000000000000000
> 00000001fffffe00
> [ 1946.987202] raw: dead000000000100 fffffd1dd0c45c80 ffff8f07aa38e340
> ffff8efdca466678
> [ 1946.995927] page dumped because: VM_BUG_ON_PAGE(page->pmd_huge_pte)
> [ 1947.002984] page->mem_cgroup:ffff8efdca466678
> [ 1947.007927] ------------[ cut here ]------------
> [ 1947.013123] kernel BUG at ./include/linux/mm.h:1733!
> [ 1947.018706] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
> [ 1947.024774] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4
> iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
> xt_conntrack nf_conntrack intel_rapl sb_edac edac_corei
> [ 1947.077814] CPU: 19 PID: 3303 Comm: python Not tainted
> 4.10.0-rc5-page-migration+ #283
> [ 1947.086721] Hardware name: Dell Inc. PowerEdge R530/0HFG24, BIOS
> 1.5.4 10/05/2015
> [ 1947.095140] task: ffff8f07a5870040 task.stack: ffffc37d64adc000
> [ 1947.101796] RIP: 0010:___pmd_free_tlb+0x83/0x90
> [ 1947.106890] RSP: 0018:ffffc37d64adfce8 EFLAGS: 00010282
> [ 1947.112762] RAX: 0000000000000021 RBX: ffffc37d64adfe10 RCX:
> 0000000000000000
> [ 1947.120770] RDX: 0000000000000000 RSI: ffff8f07c224dea8 RDI:
> ffff8f07c224dea8
> [ 1947.128809] RBP: ffffc37d64adfcf8 R08: 0000000000000001 R09:
> 0000000000000000
> [ 1947.136818] R10: 000000000000000f R11: 0000000000000001 R12:
> fffffd1dd0c4f040
> [ 1947.144825] R13: 00007fae2d7fd000 R14: ffff8f07b13c1b60 R15:
> ffffc37d64adfe10
> [ 1947.152832] FS: 00007fafbcce6700(0000) GS:ffff8f07c2240000(0000)
> knlGS:0000000000000000
> [ 1947.161934] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1947.168380] CR2: 00007fafeb851188 CR3: 0000001432184000 CR4:
> 00000000001406e0
> [ 1947.176393] Call Trace:
> [ 1947.179160] free_pgd_range+0x487/0x5d0
> [ 1947.183476] free_pgtables+0xc4/0x120
> [ 1947.187593] unmap_region+0xe1/0x130
> [ 1947.191620] do_munmap+0x273/0x400
> [ 1947.195452] SyS_munmap+0x53/0x70
> [ 1947.199190] entry_SYSCALL_64_fastpath+0x23/0xc6
> [ 1947.204382] RIP: 0033:0x7fafeb59d387
> [ 1947.208406] RSP: 002b:00007fafbcce5358 EFLAGS: 00000207 ORIG_RAX:
> 000000000000000b
> [ 1947.216924] RAX: ffffffffffffffda RBX: 00007faf2c0b6780 RCX:
> 00007fafeb59d387
> [ 1947.224933] RDX: 00007fae247fc030 RSI: 0000000009001000 RDI:
> 00007fae247fc000
> [ 1947.232940] RBP: 00007fafbcce5390 R08: 00007faf48f9ed00 R09:
> 0000000000000100
> [ 1947.240947] R10: 0000000000000020 R11: 0000000000000207 R12:
> 0000000002cf1fc0
> [ 1947.248955] R13: 0000000002cf1fc0 R14: 000000000343d530 R15:
> 00007fafbcce5810
> [ 1947.256965] Code: 4c 89 e6 48 89 df e8 0d b5 1a 00 84 c0 74 08 48 89
> df e8 91 b4 1a 00 5b 41 5c 5d c3 48 c7 c6 d8 73 c7 b8 4c 89 e7 e8 dd 7b
> 1a 00 <0f> 0b 48 8b 3d 34 80 d9 00 eb 99 66 90
> [ 1947.278200] RIP: ___pmd_free_tlb+0x83/0x90 RSP: ffffc37d64adfce8
> [ 1947.285688] ---[ end trace 7864a23976d71e0a ]---
>
> --
> Best Regards,
> Yan Zi
>
--
Kirill A. Shutemov
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2017-02-07 16:37 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-02-05 16:12 [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
2017-02-05 16:12 ` [PATCH v3 01/14] mm: thp: make __split_huge_pmd_locked visible Zi Yan
2017-02-06 6:12 ` Naoya Horiguchi
2017-02-06 12:10 ` Zi Yan
2017-02-06 15:02 ` Matthew Wilcox
2017-02-06 15:03 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 02/14] mm: thp: create new __zap_huge_pmd_locked function Zi Yan
2017-02-05 16:12 ` [PATCH v3 03/14] mm: use pmd lock instead of racy checks in zap_pmd_range() Zi Yan
2017-02-06 4:02 ` Hillf Danton
2017-02-06 4:14 ` Zi Yan
2017-02-06 7:43 ` Naoya Horiguchi
2017-02-06 13:02 ` Zi Yan
2017-02-06 23:22 ` Naoya Horiguchi
2017-02-06 16:07 ` Kirill A. Shutemov
2017-02-06 16:32 ` Zi Yan
2017-02-06 17:35 ` Kirill A. Shutemov
2017-02-07 13:55 ` Aneesh Kumar K.V
2017-02-07 14:12 ` Zi Yan
2017-02-07 14:19 ` Kirill A. Shutemov
2017-02-07 15:11 ` Zi Yan
2017-02-07 16:37 ` Kirill A. Shutemov [this message]
2017-02-07 17:14 ` Zi Yan
2017-02-07 17:45 ` Kirill A. Shutemov
2017-02-13 0:25 ` Zi Yan
2017-02-13 10:59 ` Kirill A. Shutemov
2017-02-13 14:40 ` Andrea Arcangeli
2017-02-05 16:12 ` [PATCH v3 04/14] mm: x86: move _PAGE_SWP_SOFT_DIRTY from bit 7 to bit 1 Zi Yan
2017-02-09 9:14 ` Naoya Horiguchi
2017-02-09 15:07 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 05/14] mm: mempolicy: add queue_pages_node_check() Zi Yan
2017-02-05 16:12 ` [PATCH v3 06/14] mm: thp: introduce separate TTU flag for thp freezing Zi Yan
2017-02-05 16:12 ` [PATCH v3 07/14] mm: thp: introduce CONFIG_ARCH_ENABLE_THP_MIGRATION Zi Yan
2017-02-05 16:12 ` [PATCH v3 08/14] mm: thp: enable thp migration in generic path Zi Yan
2017-02-09 9:15 ` Naoya Horiguchi
2017-02-09 15:17 ` Zi Yan
2017-02-09 23:04 ` Naoya Horiguchi
2017-02-14 20:13 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 09/14] mm: thp: check pmd migration entry in common path Zi Yan
2017-02-09 9:16 ` Naoya Horiguchi
2017-02-09 17:36 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 10/14] mm: soft-dirty: keep soft-dirty bits over thp migration Zi Yan
2017-02-05 16:12 ` [PATCH v3 11/14] mm: hwpoison: soft offline supports " Zi Yan
2017-02-05 16:12 ` [PATCH v3 12/14] mm: mempolicy: mbind and migrate_pages support " Zi Yan
2017-02-05 16:12 ` [PATCH v3 13/14] mm: migrate: move_pages() supports " Zi Yan
2017-02-09 9:16 ` Naoya Horiguchi
2017-02-09 17:37 ` Zi Yan
2017-02-05 16:12 ` [PATCH v3 14/14] mm: memory_hotplug: memory hotremove " Zi Yan
2017-02-23 16:12 ` [PATCH v3 00/14] mm: page migration enhancement for thp Zi Yan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170207163734.GA5578@node.shutemov.name \
--to=kirill@shutemov.name \
--cc=aarcange@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=khandual@linux.vnet.ibm.com \
--cc=kirill.shutemov@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mgorman@techsingularity.net \
--cc=minchan@kernel.org \
--cc=n-horiguchi@ah.jp.nec.com \
--cc=vbabka@suse.cz \
--cc=zi.yan@cs.rutgers.edu \
--cc=zi.yan@sent.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).