From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
MIME-Version: 1.0
In-Reply-To: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com>
References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com>
Date: Mon, 30 Nov 2015 14:08:53 -0800
Message-ID:
Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping
From: Dan Williams
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
To: Toshi Kani
Cc: Andrew Morton, "Kirill A. Shutemov", Matthew Wilcox, Ross Zwisler,
 mauricio.porto@hpe.com, Linux MM, linux-fsdevel,
 "linux-nvdimm@lists.01.org", "linux-kernel@vger.kernel.org"
List-ID:

On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote:
> The following oops was observed when mmap() with MAP_POPULATE
> pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd()
> expects that a target address has a struct page.
>
>   BUG: unable to handle kernel paging request at ffffea0012220000
>   follow_trans_huge_pmd+0xba/0x390
>   follow_page_mask+0x33d/0x420
>   __get_user_pages+0xdc/0x800
>   populate_vma_page_range+0xb5/0xe0
>   __mm_populate+0xc5/0x150
>   vm_mmap_pgoff+0xd5/0xe0
>   SyS_mmap_pgoff+0x1c1/0x290
>   SyS_mmap+0x1b/0x30
>
> Fix it by making the PMD pre-fault handling consistent with PTE.
> After pre-faulted in faultin_page(), follow_page_mask() calls
> follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH
> and returns with -EEXIST.
>
> Reported-by: Mauricio Porto
> Signed-off-by: Toshi Kani
> Cc: Andrew Morton
> Cc: Kirill A. Shutemov
> Cc: Matthew Wilcox
> Cc: Dan Williams
> Cc: Ross Zwisler
> ---

Hey Toshi,

I ended up fixing this differently with follow_pmd_devmap() introduced
in this series:

https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html

Does the latest libnvdimm-pending branch [1] pass your test case?

[1]: git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm libnvdimm-pending

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Message-ID: <1448316903.19320.46.camel@hpe.com>
Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping
From: Toshi Kani
Date: Mon, 23 Nov 2015 15:15:03 -0700
In-Reply-To:
References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com>
Content-Type: multipart/mixed; boundary="=-UwOidSeeVOF63N1Bvr6o"
Mime-Version: 1.0
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Andrew Morton, "Kirill A. Shutemov", Matthew Wilcox, Ross Zwisler,
 mauricio.porto@hpe.com, Linux MM, linux-fsdevel,
 "linux-nvdimm@lists.01.org", "linux-kernel@vger.kernel.org"
List-ID:

--=-UwOidSeeVOF63N1Bvr6o
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit

On Mon, 2015-11-23 at 12:53 -0800, Dan Williams wrote:
> On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote:
> > The following oops was observed when mmap() with MAP_POPULATE
> > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd()
> > expects that a target address has a struct page.
> >
> >   BUG: unable to handle kernel paging request at ffffea0012220000
> >   follow_trans_huge_pmd+0xba/0x390
> >   follow_page_mask+0x33d/0x420
> >   __get_user_pages+0xdc/0x800
> >   populate_vma_page_range+0xb5/0xe0
> >   __mm_populate+0xc5/0x150
> >   vm_mmap_pgoff+0xd5/0xe0
> >   SyS_mmap_pgoff+0x1c1/0x290
> >   SyS_mmap+0x1b/0x30
> >
> > Fix it by making the PMD pre-fault handling consistent with PTE.
> > After pre-faulted in faultin_page(), follow_page_mask() calls
> > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH
> > and returns with -EEXIST.
>
> As of 4.4-rc2 DAX pmd mappings are disabled. So we have time to do
> something more comprehensive in 4.5.

Yes, I noticed during my testing that I could not use pmd...

> > Reported-by: Mauricio Porto
> > Signed-off-by: Toshi Kani
> > Cc: Andrew Morton
> > Cc: Kirill A. Shutemov
> > Cc: Matthew Wilcox
> > Cc: Dan Williams
> > Cc: Ross Zwisler
> > ---
> >  mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++
> >  1 file changed, 34 insertions(+)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index d5b8920..f56e034 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> [..]
> > @@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
> >  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
> >  		goto out;
> >
> > +	/* pfn map does not have a struct page */
> > +	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
> > +		ret = follow_pfn_pmd(vma, addr, pmd, flags);
> > +		page = ERR_PTR(ret);
> > +		goto out;
> > +	}
> > +
> >  	page = pmd_page(*pmd);
> >  	VM_BUG_ON_PAGE(!PageHead(page), page);
> >  	if (flags & FOLL_TOUCH) {
>
> I think it is already problematic that dax pmd mappings are getting
> confused with transparent huge pages.

We had the same issue with dax pte mapping [1], and this change extends
the pfn map handling to pmd. So, this problem is not specific to pmd.

[1] https://lkml.org/lkml/2015/6/23/181

> They're more closely related to
> hugetlbfs pmd mappings in that they are mapping an explicit
> allocation. I have some pending patches to address this dax-pmd vs
> hugetlb-pmd vs thp-pmd classification that I will post shortly.

Not sure which way is better, but I am certainly interested in your
changes.

> By the way, I'm collecting DAX pmd regression tests [1], is this just
> a simple crash upon using MAP_POPULATE?
>
> [1]: https://github.com/pmem/ndctl/blob/master/lib/test-dax-pmd.c

Yes, this issue is easy to reproduce with MAP_POPULATE. In case it
helps, attached is the test I used for testing the patches.
Sorry, the code is messy since it was only intended for my internal use... - The test was originally written for the pte change [1] and comments in test.sh (ex. mlock fail, ok) reflect the results without the pte change. - For the pmd test, I modified test-mmap.c to call posix_memalign() before mmap(). By calling free(), the 2MB-aligned address from posix_memalign() can be used for mmap(). This keeps the mmap'd address aligned on 2MB. - I created test file(s) with dd (i.e. all blocks written) in my test. - The other infinite loop issue (fixed by my other patch) was found by the test case with option "-LMSr". Thanks, -Toshi --=-UwOidSeeVOF63N1Bvr6o Content-Type: application/x-shellscript; name="test.sh" Content-Disposition: attachment; filename="test.sh" Content-Transfer-Encoding: base64 c2V0IC14IAp1bW91bnQgL21udC9wbWVtMAptb3VudCAvbW50L3BtZW0wCgojZWNobyAnZmlsZSBt bS9ndXAuYyArcCcgPiAvc3lzL2tlcm5lbC9kZWJ1Zy9keW5hbWljX2RlYnVnL2NvbnRyb2wKI2Vj aG8gJ2ZpbGUgbW0vaHVnZV9tZW1vcnkuYyArcCcgPiAvc3lzL2tlcm5lbC9kZWJ1Zy9keW5hbWlj X2RlYnVnL2NvbnRyb2wKI2VjaG8gJ2ZpbGUgbW0vbWVtb3J5LmMgK3AnID4gL3N5cy9rZXJuZWwv ZGVidWcvZHluYW1pY19kZWJ1Zy9jb250cm9sCiNlY2hvICdmaWxlIGZzL2RheC5jICtwJyA+IC9z eXMva2VybmVsL2RlYnVnL2R5bmFtaWNfZGVidWcvY29udHJvbAoKIyMjIyAzMksgIyMjIwojIFNI QVJFRAojLi90ZXN0LW1tYXAgLU1yd3BzCSMgbWxvY2ssIHBvcHVsYXRlLCBzaGFyZWQgKG1sb2Nr IGZhaWwpCiMuL3Rlc3QtbW1hcCAtQXJ3cHMJIyBtbG9ja2FsbCwgcG9wdWxhdGUsIHNoYXJlZAoj Li90ZXN0LW1tYXAgLVJNcnBzCSMgcmVhZC1vbmx5LCBtbG9jaywgcG9wdWxhdGUsIHNoYXJlZCAo bWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1yd3BzCSMgcG9wbHVhdGUsIHNoYXJlZCAocG9wbHVh dGUgbm8gZWZmZWN0KQojLi90ZXN0LW1tYXAgLVJycHMJIyByZWFkLW9ubHkgcG9wbHVhdGUsIHNo YXJlZCAocG9wbHVhdGUgbm8gZWZmZWN0KQojLi90ZXN0LW1tYXAgLU1yd3MJIyBtbG9jaywgc2hh cmVkIChtbG9jayBmYWlsKQojLi90ZXN0LW1tYXAgLVJNcnMJIyByZWFkLW9ubHksIG1sb2NrLCBz aGFyZWQgKG1sb2NrIGZhaWwpCiMuL3Rlc3QtbW1hcCAtcndzCSMgc2hhcmVkIChvaykKIy4vdGVz dC1tbWFwIC1ScnMJIyByZWFkLW9ubHksIHNoYXJlZCAob2spCgojIFBSSVZBVEUKIy4vdGVzdC1t 
bWFwIC1NcndwCSMgbWxvY2ssIHBvcHVsYXRlLCBwcml2YXRlIChvaykKIy4vdGVzdC1tbWFwIC1S TXJwCSMgcmVhZC1vbmx5LCBtbG9jaywgcG9wdWxhdGUsIHByaXZhdGUgKG1sb2NrIGZhaWwpCiMu L3Rlc3QtbW1hcCAtcndwCSMgcG9wdWxhdGUsIHByaXZhdGUgKG9rKQojLi90ZXN0LW1tYXAgLVJy cAkjIHJlYWQtb25seSwgcG9wdWxhdGUsIHByaXZhdGUgKHBvcHVsYXRlIG5vIGVmZmVjdCkKIy4v dGVzdC1tbWFwIC1NcncJIyBtbG9jaywgcHJpdmF0ZSAob2spCiMuL3Rlc3QtbW1hcCAtUk1yCSMg cmVhZC1vbmx5LCBtbG9jaywgcHJpdmF0ZSAobWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1NU3IJ IyBwcml2YXRlLCByZWFkIGJlZm9yZSBtbG9jayAob2spCiMuL3Rlc3QtbW1hcCAtcncJIyBwcml2 YXRlIChvaykKIy4vdGVzdC1tbWFwIC1ScgkjIHJlYWQtb25seSwgcHJpdmF0ZSAob2spCgojIyMj IDRHICMjIyMKIyBTSEFSRUQKIy4vdGVzdC1tbWFwIC1MTXJ3cHMJIyBtbG9jaywgcG9wdWxhdGUs IHNoYXJlZCAobWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1MQXJ3cHMJIyBtbG9ja2FsbCwgcG9w dWxhdGUsIHNoYXJlZAojLi90ZXN0LW1tYXAgLUxSTXJwcwkjIHJlYWQtb25seSwgbWxvY2ssIHBv cHVsYXRlLCBzaGFyZWQgKG1sb2NrIGZhaWwpCiMuL3Rlc3QtbW1hcCAtTHJ3cHMJIyBwb3BsdWF0 ZSwgc2hhcmVkIChwb3BsdWF0ZSBubyBlZmZlY3QpCiMuL3Rlc3QtbW1hcCAtTFJycHMJIyByZWFk LW9ubHkgcG9wbHVhdGUsIHNoYXJlZCAocG9wbHVhdGUgbm8gZWZmZWN0KQojLi90ZXN0LW1tYXAg LUxNcndzCSMgbWxvY2ssIHNoYXJlZCAobWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1MUk1ycwkj IHJlYWQtb25seSwgbWxvY2ssIHNoYXJlZCAobWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1Mcndz CSMgc2hhcmVkIChvaykKIy4vdGVzdC1tbWFwIC1MUnJzCSMgcmVhZC1vbmx5LCBzaGFyZWQgKG9r KQoKIyBQUklWQVRFCiMuL3Rlc3QtbW1hcCAtTE1yd3AJIyBtbG9jaywgcG9wdWxhdGUsIHByaXZh dGUgKG9rKQojLi90ZXN0LW1tYXAgLUxSTXJwCSMgcmVhZC1vbmx5LCBtbG9jaywgcG9wdWxhdGUs IHByaXZhdGUgKG1sb2NrIGZhaWwpCiMuL3Rlc3QtbW1hcCAtTHJ3cAkjIHBvcHVsYXRlLCBwcml2 YXRlIChvaykKIy4vdGVzdC1tbWFwIC1MUnJwCSMgcmVhZC1vbmx5LCBwb3B1bGF0ZSwgcHJpdmF0 ZSAocG9wdWxhdGUgbm8gZWZmZWN0KQojLi90ZXN0LW1tYXAgLUxNcncJIyBtbG9jaywgcHJpdmF0 ZSAob2spCiMuL3Rlc3QtbW1hcCAtTFJNcgkjIHJlYWQtb25seSwgbWxvY2ssIHByaXZhdGUgKG1s b2NrIGZhaWwpCiMuL3Rlc3QtbW1hcCAtTE1TcgkjIHByaXZhdGUsIHJlYWQgYmVmb3JlIG1sb2Nr IChvaykKIy4vdGVzdC1tbWFwIC1McncJIyBwcml2YXRlIChvaykKIy4vdGVzdC1tbWFwIC1MUnIJ 
IyByZWFkLW9ubHksIHByaXZhdGUgKG9rKQoKI2VjaG8gJ2ZpbGUgbW0vZ3VwLmMgLXAnID4gL3N5 cy9rZXJuZWwvZGVidWcvZHluYW1pY19kZWJ1Zy9jb250cm9sCiNlY2hvICdmaWxlIG1tL2h1Z2Vf bWVtb3J5LmMgLXAnID4gL3N5cy9rZXJuZWwvZGVidWcvZHluYW1pY19kZWJ1Zy9jb250cm9sCiNl Y2hvICdmaWxlIG1tL21lbW9yeS5jIC1wJyA+IC9zeXMva2VybmVsL2RlYnVnL2R5bmFtaWNfZGVi dWcvY29udHJvbAojZWNobyAnZmlsZSBmcy9kYXguYyAtcCcgPiAvc3lzL2tlcm5lbC9kZWJ1Zy9k eW5hbWljX2RlYnVnL2NvbnRyb2wK --=-UwOidSeeVOF63N1Bvr6o Content-Disposition: attachment; filename="test-mmap.c" Content-Type: text/x-csrc; name="test-mmap.c"; charset="UTF-8" Content-Transfer-Encoding: base64 I2luY2x1ZGUgPHN5cy90eXBlcy5oPgojaW5jbHVkZSA8c3lzL3N0YXQuaD4KI2luY2x1ZGUgPHN5 cy9tbWFuLmg+CiNpbmNsdWRlIDxzeXMvdGltZS5oPgojaW5jbHVkZSA8c3RyaW5nLmg+CiNpbmNs dWRlIDxmY250bC5oPgojaW5jbHVkZSA8c3RkaW8uaD4KI2luY2x1ZGUgPHN0ZGxpYi5oPgojaW5j bHVkZSA8dW5pc3RkLmg+CgojZGVmaW5lIE1CKGEpCQkoKGEpICogMTAyNFVMICogMTAyNFVMKQoK c3RhdGljIHN0cnVjdCB0aW1ldmFsIHN0YXJ0X3R2LCBzdG9wX3R2OwoKLy8gQ2FsY3VsYXRlIHRo ZSBkaWZmZXJlbmNlIGJldHdlZW4gdHdvIHRpbWUgdmFsdWVzLgp2b2lkIHR2c3ViKHN0cnVjdCB0 aW1ldmFsICp0ZGlmZiwgc3RydWN0IHRpbWV2YWwgKnQxLCBzdHJ1Y3QgdGltZXZhbCAqdDApCnsK CXRkaWZmLT50dl9zZWMgPSB0MS0+dHZfc2VjIC0gdDAtPnR2X3NlYzsKCXRkaWZmLT50dl91c2Vj ID0gdDEtPnR2X3VzZWMgLSB0MC0+dHZfdXNlYzsKCWlmICh0ZGlmZi0+dHZfdXNlYyA8IDApCgkJ dGRpZmYtPnR2X3NlYy0tLCB0ZGlmZi0+dHZfdXNlYyArPSAxMDAwMDAwOwp9CgovLyBTdGFydCB0 aW1pbmcgbm93Lgp2b2lkIHN0YXJ0KCkKewoJKHZvaWQpIGdldHRpbWVvZmRheSgmc3RhcnRfdHYs IChzdHJ1Y3QgdGltZXpvbmUgKikgMCk7Cn0KCi8vIFN0b3AgdGltaW5nIGFuZCByZXR1cm4gcmVh bCB0aW1lIGluIG1pY3Jvc2Vjb25kcy4KdW5zaWduZWQgbG9uZyBsb25nIHN0b3AoKQp7CglzdHJ1 Y3QgdGltZXZhbCB0ZGlmZjsKCgkodm9pZCkgZ2V0dGltZW9mZGF5KCZzdG9wX3R2LCAoc3RydWN0 IHRpbWV6b25lICopIDApOwoJdHZzdWIoJnRkaWZmLCAmc3RvcF90diwgJnN0YXJ0X3R2KTsKCXJl dHVybiAodGRpZmYudHZfc2VjICogMTAwMDAwMCArIHRkaWZmLnR2X3VzZWMpOwp9Cgp2b2lkIHRl c3Rfd3JpdGUodW5zaWduZWQgbG9uZyAqcCwgc2l6ZV90IHNpemUpCnsKCWludCBpOwoJdW5zaWdu ZWQgbG9uZyAqd3AsIHRtcDsKCXVuc2lnbmVkIGxvbmcgbG9uZyB0aW1ldmFsOwoKCXN0YXJ0KCk7 
Cglmb3IgKGk9MCwgd3A9cDsgaTwoc2l6ZS9zaXplb2Yod3ApKTsgaSsrKQoJCSp3cCsrID0gMTsK CXRpbWV2YWwgPSBzdG9wKCk7CglwcmludGYoIldyaXRlOiAlMTBsbHUgdXNlY1xuIiwgdGltZXZh bCk7Cn0KCnZvaWQgdGVzdF9yZWFkKHVuc2lnbmVkIGxvbmcgKnAsIHNpemVfdCBzaXplKQp7Cglp bnQgaTsKCXVuc2lnbmVkIGxvbmcgKndwLCB0bXA7Cgl1bnNpZ25lZCBsb25nIGxvbmcgdGltZXZh bDsKCglzdGFydCgpOwoJZm9yIChpPTAsIHdwPXA7IGk8KHNpemUvc2l6ZW9mKHdwKSk7IGkrKykK CQl0bXAgPSAqd3ArKzsKCXRpbWV2YWwgPSBzdG9wKCk7CglwcmludGYoIlJlYWQgOiAlMTBsbHUg dXNlY1xuIiwgdGltZXZhbCk7Cn0KCmludCBtYWluKGludCBhcmdjLCBjaGFyICoqYXJndikKewoJ aW50IGZkLCBpLCBvcHQsIHJldDsKCWludCBvZmxhZ3MsIG1wcm90LCBtZmxhZ3MgPSAwOwoJaW50 IGlzX3JlYWRfb25seSA9IDAsIGlzX21sb2NrID0gMCwgaXNfbWxvY2thbGwgPSAwOwoJaW50IG1s b2NrX3NraXAgPSAwLCByZWFkX3Rlc3QgPSAwLCB3cml0ZV90ZXN0ID0gMDsKCXZvaWQgKm1wdHIg PSBOVUxMOwoJdW5zaWduZWQgbG9uZyAqcDsKCXN0cnVjdCBzdGF0IHN0YXQ7CglzaXplX3Qgc2l6 ZSwgY3B5X3NpemU7Cgljb25zdCBjaGFyICpmaWxlX25hbWUgPSBOVUxMOwoKCXdoaWxlICgob3B0 ID0gZ2V0b3B0KGFyZ2MsIGFyZ3YsICJMUk1TQXBzcnciKSkgIT0gLTEpIHsKCQlzd2l0Y2ggKG9w dCkgewoJCWNhc2UgJ0wnOgoJCQlmaWxlX25hbWUgPSAiL21udC9wbWVtMC80R2ZpbGUiOwoJCQli cmVhazsKCQljYXNlICdSJzoKCQkJcHJpbnRmKCI+IG1tYXA6IHJlYWQtb25seVxuIik7CgkJCWlz X3JlYWRfb25seSA9IDE7CgkJCWJyZWFrOwoJCWNhc2UgJ00nOgoJCQlwcmludGYoIj4gbWxvY2tc biIpOwoJCQlpc19tbG9jayA9IDE7CgkJCWJyZWFrOwoJCWNhc2UgJ1MnOgoJCQlwcmludGYoIj4g bWxvY2sgLSBza2lwIGZpcnN0IGl0ZVxuIik7CgkJCW1sb2NrX3NraXAgPSAxOwoJCQlicmVhazsK CQljYXNlICdBJzoKCQkJcHJpbnRmKCI+IG1sb2NrYWxsXG4iKTsKCQkJaXNfbWxvY2thbGwgPSAx OwoJCQlicmVhazsKCQljYXNlICdwJzoKCQkJcHJpbnRmKCI+IE1BUF9QT1BVTEFURVxuIik7CgkJ CW1mbGFncyB8PSBNQVBfUE9QVUxBVEU7CgkJCWJyZWFrOwoJCWNhc2UgJ3MnOgoJCQlwcmludGYo Ij4gTUFQX1NIQVJFRFxuIik7CgkJCW1mbGFncyB8PSBNQVBfU0hBUkVEOwoJCQlicmVhazsKCQlj YXNlICdyJzoKCQkJcHJpbnRmKCI+IHJlYWQtdGVzdFxuIik7CgkJCXJlYWRfdGVzdCA9IDE7CgkJ CWJyZWFrOwoJCWNhc2UgJ3cnOgoJCQlwcmludGYoIj4gd3JpdGUtdGVzdFxuIik7CgkJCXdyaXRl X3Rlc3QgPSAxOwoJCQlicmVhazsKCQl9Cgl9CgoJaWYgKCFmaWxlX25hbWUpIHsKCQlmaWxlX25h 
bWUgPSAiL21udC9wbWVtMS8zMktmaWxlIjsKCX0KCglpZiAoIShtZmxhZ3MgJiBNQVBfU0hBUkVE KSkgewoJCXByaW50ZigiPiBNQVBfUFJJVkFURVxuIik7CgkJbWZsYWdzIHw9IE1BUF9QUklWQVRF OwoJfQoKCWlmIChpc19yZWFkX29ubHkpIHsKCQlvZmxhZ3MgPSBPX1JET05MWTsKCQltcHJvdCA9 IFBST1RfUkVBRDsKCX0gZWxzZSB7CgkJb2ZsYWdzID0gT19SRFdSOwoJCW1wcm90ID0gUFJPVF9S RUFEfFBST1RfV1JJVEU7Cgl9CgoJZmQgPSBvcGVuKGZpbGVfbmFtZSwgb2ZsYWdzKTsKCWlmIChm ZCA9PSAtMSkgewoJCXBlcnJvcigib3BlbiBmYWlsZWQiKTsKCQlleGl0KDEpOwoJfQoKCXJldCA9 IGZzdGF0KGZkLCAmc3RhdCk7CglpZiAocmV0IDwgMCkgewoJCXBlcnJvcigiZnN0YXQgZmFpbGVk Iik7CgkJZXhpdCgxKTsKCX0KCXNpemUgPSBzdGF0LnN0X3NpemU7CgoJcHJpbnRmKCI+IG9wZW4g JXMgc2l6ZSAweCV4IGZsYWdzIDB4JXhcbiIsIGZpbGVfbmFtZSwgc2l6ZSwgb2ZsYWdzKTsKCgly ZXQgPSBwb3NpeF9tZW1hbGlnbigmbXB0ciwgTUIoMiksIHNpemUpOwoJaWYgKHJldCA9PTApCgkJ ZnJlZShtcHRyKTsKCglwcmludGYoIj4gbW1hcCBtcHJvdCAweCV4IGZsYWdzIDB4JXhcbiIsIG1w cm90LCBtZmxhZ3MpOwoJcCA9IG1tYXAobXB0ciwgc2l6ZSwgbXByb3QsIG1mbGFncywgZmQsIDB4 MCk7CglpZiAoIXApIHsKCQlwZXJyb3IoIm1tYXAgZmFpbGVkIik7CgkJZXhpdCgxKTsKCX0KCWlm ICgobG9uZyB1bnNpZ25lZClwICYgKE1CKDIpLTEpKQoJCXByaW50ZigiPiBtbWFwOiBOT1QgMk1C IGFsaWduZWQ6IDB4JXBcbiIsIHApOwoJZWxzZQoJCXByaW50ZigiPiBtbWFwOiAyTUIgYWxpZ25l ZDogMHglcFxuIiwgcCk7CgojaWYgMAkvKiBTSVpFIExJTUlUICovCglpZiAoc2l6ZSA+PSBNQigy KSkKCQljcHlfc2l6ZSA9IE1CKDMyKTsKCWVsc2UKI2VuZGlmCgkJY3B5X3NpemUgPSBzaXplOwoK CWZvciAoaT0wOyBpPDM7IGkrKykgewoKCQlpZiAoaXNfbWxvY2sgJiYgIW1sb2NrX3NraXApIHsK CQkJcHJpbnRmKCI+IG1sb2NrIDB4JXBcbiIsIHApOwoJCQlyZXQgPSBtbG9jayhwLCBzaXplKTsK CQkJaWYgKHJldCA8IDApIHsKCQkJCXBlcnJvcigibWxvY2sgZmFpbGVkIik7CgkJCQlleGl0KDEp OwoJCQl9CgkJfSBlbHNlIGlmIChpc19tbG9ja2FsbCkgewoJCQlwcmludGYoIj4gbWxvY2thbGxc biIpOwoJCQlyZXQgPSBtbG9ja2FsbChNQ0xfQ1VSUkVOVHxNQ0xfRlVUVVJFKTsKCQkJaWYgKHJl dCA8IDApIHsKCQkJCXBlcnJvcigibWxvY2thbGwgZmFpbGVkIik7CgkJCQlleGl0KDEpOwoJCQl9 CgkJfQoKCQlwcmludGYoIj09PT09ICVkID09PT09XG4iLCBpKzEpOwoJCWlmICh3cml0ZV90ZXN0 KQoJCQl0ZXN0X3dyaXRlKHAsIGNweV9zaXplKTsKCQlpZiAocmVhZF90ZXN0KQoJCQl0ZXN0X3Jl 
YWQocCwgY3B5X3NpemUpOwoKCQlpZiAoaXNfbWxvY2sgJiYgIW1sb2NrX3NraXApIHsKCQkJcHJp bnRmKCI+IG11bmxvY2sgMHglcFxuIiwgcCk7CgkJCXJldCA9IG11bmxvY2socCwgc2l6ZSk7CgkJ CWlmIChyZXQgPCAwKSB7CgkJCQlwZXJyb3IoIm11bmxvY2sgZmFpbGVkIik7CgkJCQlleGl0KDEp OwoJCQl9CgkJfSBlbHNlIGlmIChpc19tbG9ja2FsbCkgewoJCQlwcmludGYoIj4gbXVubG9ja2Fs bFxuIik7CgkJCXJldCA9IG11bmxvY2thbGwoKTsKCQkJaWYgKHJldCA8IDApIHsKCQkJCXBlcnJv cigibXVubG9ja2FsbCBmYWlsZWQiKTsKCQkJCWV4aXQoMSk7CgkJCX0KCQl9CgoJCS8qIHNraXAs IGlmIHJlcXVlc3RlZCwgb25seSB0aGUgZmlyc3QgaXRlcmF0aW9uICovCgkJbWxvY2tfc2tpcCA9 IDA7Cgl9CgoJcHJpbnRmKCI+IG11bm1hcCAweCVwXG4iLCBwKTsKCW11bm1hcChwLCBzaXplKTsK fQo= --=-UwOidSeeVOF63N1Bvr6o-- -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> Date: Mon, 23 Nov 2015 12:53:02 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote: > The following oops was observed when mmap() with MAP_POPULATE > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > expects that a target address has a struct page. 
>
>   BUG: unable to handle kernel paging request at ffffea0012220000
>   follow_trans_huge_pmd+0xba/0x390
>   follow_page_mask+0x33d/0x420
>   __get_user_pages+0xdc/0x800
>   populate_vma_page_range+0xb5/0xe0
>   __mm_populate+0xc5/0x150
>   vm_mmap_pgoff+0xd5/0xe0
>   SyS_mmap_pgoff+0x1c1/0x290
>   SyS_mmap+0x1b/0x30
>
> Fix it by making the PMD pre-fault handling consistent with PTE.
> After pre-faulted in faultin_page(), follow_page_mask() calls
> follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH
> and returns with -EEXIST.

As of 4.4-rc2 DAX pmd mappings are disabled. So we have time to do
something more comprehensive in 4.5.

>
> Reported-by: Mauricio Porto
> Signed-off-by: Toshi Kani
> Cc: Andrew Morton
> Cc: Kirill A. Shutemov
> Cc: Matthew Wilcox
> Cc: Dan Williams
> Cc: Ross Zwisler
> ---
>  mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d5b8920..f56e034 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
[..]
> @@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>  		goto out;
>
> +	/* pfn map does not have a struct page */
> +	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
> +		ret = follow_pfn_pmd(vma, addr, pmd, flags);
> +		page = ERR_PTR(ret);
> +		goto out;
> +	}
> +
>  	page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>  	if (flags & FOLL_TOUCH) {

I think it is already problematic that dax pmd mappings are getting
confused with transparent huge pages. They're more closely related to
hugetlbfs pmd mappings in that they are mapping an explicit
allocation. I have some pending patches to address this dax-pmd vs
hugetlb-pmd vs thp-pmd classification that I will post shortly.

By the way, I'm collecting DAX pmd regression tests [1], is this just
a simple crash upon using MAP_POPULATE?

[1]: https://github.com/pmem/ndctl/blob/master/lib/test-dax-pmd.c

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
From: Toshi Kani
Subject: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping
Date: Mon, 23 Nov 2015 13:04:42 -0700
Message-Id: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com>
Sender: owner-linux-mm@kvack.org
To: akpm@linux-foundation.org
Cc: kirill.shutemov@linux.intel.com, willy@linux.intel.com,
 ross.zwisler@linux.intel.com, dan.j.williams@intel.com,
 mauricio.porto@hpe.com, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-nvdimm@lists.01.org,
 linux-kernel@vger.kernel.org, Toshi Kani
List-ID:

The following oops was observed when mmap() with MAP_POPULATE
pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd()
expects that a target address has a struct page.

  BUG: unable to handle kernel paging request at ffffea0012220000
  follow_trans_huge_pmd+0xba/0x390
  follow_page_mask+0x33d/0x420
  __get_user_pages+0xdc/0x800
  populate_vma_page_range+0xb5/0xe0
  __mm_populate+0xc5/0x150
  vm_mmap_pgoff+0xd5/0xe0
  SyS_mmap_pgoff+0x1c1/0x290
  SyS_mmap+0x1b/0x30

Fix it by making the PMD pre-fault handling consistent with PTE.
After pre-faulted in faultin_page(), follow_page_mask() calls
follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH
and returns with -EEXIST.

Reported-by: Mauricio Porto
Signed-off-by: Toshi Kani
Cc: Andrew Morton
Cc: Kirill A. Shutemov
Cc: Matthew Wilcox
Cc: Dan Williams
Cc: Ross Zwisler
---
 mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5b8920..f56e034 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1267,6 +1267,32 @@ out_unlock:
 	return ret;
 }
 
+/*
+ * Follow a pmd inserted by vmf_insert_pfn_pmd(). See follow_pfn_pte() for pte.
+ */
+static int follow_pfn_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags)
+{
+	/* No page to get reference */
+	if (flags & FOLL_GET)
+		return -EFAULT;
+
+	if (flags & FOLL_TOUCH) {
+		pmd_t entry = *pmd;
+
+		/* Set the dirty bit per follow_trans_huge_pmd() */
+		entry = pmd_mkyoung(pmd_mkdirty(entry));
+
+		if (!pmd_same(*pmd, entry)) {
+			set_pmd_at(vma->vm_mm, address, pmd, entry);
+			update_mmu_cache_pmd(vma, address, pmd);
+		}
+	}
+
+	/* Proper page table entry exists, but no corresponding struct page */
+	return -EEXIST;
+}
+
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr,
 				   pmd_t *pmd,
@@ -1274,6 +1300,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page = NULL;
+	int ret;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
@@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		goto out;
 
+	/* pfn map does not have a struct page */
+	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
+		ret = follow_pfn_pmd(vma, addr, pmd, flags);
+		page = ERR_PTR(ret);
+		goto out;
+	}
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	if (flags & FOLL_TOUCH) {

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Message-ID: <1449248149.9855.85.camel@hpe.com>
Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping
From: Toshi Kani
Date: Fri, 04 Dec 2015 09:55:49 -0700
In-Reply-To:
References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com>
 <1449022764.31589.24.camel@hpe.com>
 <1449078237.31589.30.camel@hpe.com>
 <1449084362.31589.37.camel@hpe.com>
 <1449086521.31589.39.camel@hpe.com>
 <1449087125.31589.45.camel@hpe.com>
 <1449092226.31589.50.camel@hpe.com>
 <1449093339.9855.1.camel@hpe.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Andrew Morton, "Kirill A. Shutemov", Matthew Wilcox, Ross Zwisler,
 mauricio.porto@hpe.com, Linux MM, linux-fsdevel,
 "linux-nvdimm@lists.01.org", "linux-kernel@vger.kernel.org"
List-ID:

On Thu, 2015-12-03 at 15:43 -0800, Dan Williams wrote:
> On Wed, Dec 2, 2015 at 1:55 PM, Toshi Kani wrote:
> > On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote:
> > > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote:
> > > > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote:
> > > [..]
> > > > > The whole point of __get_user_pages_fast() is to avoid the
> > > > > overhead of taking the mm semaphore to access the vma.
> > > > > _PAGE_SPECIAL simply tells __get_user_pages_fast that it needs
> > > > > to fall back to the __get_user_pages slow path.
> > > >
> > > > I see. Then, I think gup_huge_pmd() can simply return 0 when
> > > > !pfn_valid(), instead of VM_BUG_ON.
> > >
> > > Is pfn_valid() a reliable check? It seems to be based on a max_pfn
> > > per node... what happens when pmem is located below that point. I
> > > haven't been able to convince myself that we won't get false
> > > positives, but maybe I'm missing something.
> >
> > I believe we use the version of pfn_valid() in linux/mmzone.h.
>
> Talking this over with Dave we came to the conclusion that it would be
> safer to be explicit about the pmd not being mapped. He points out
> that unless a platform can guarantee that persistent memory is always
> section aligned we might get false positive pfn_valid() indications.
> Given the get_user_pages_fast() path is arch specific we can simply
> have an arch specific pmd bit and not worry about generically enabling
> a "pmd special" bit for now.

Sounds good to me. Thanks!
-Toshi

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
MIME-Version: 1.0
In-Reply-To: <1449093339.9855.1.camel@hpe.com>
References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com>
 <1449022764.31589.24.camel@hpe.com>
 <1449078237.31589.30.camel@hpe.com>
 <1449084362.31589.37.camel@hpe.com>
 <1449086521.31589.39.camel@hpe.com>
 <1449087125.31589.45.camel@hpe.com>
 <1449092226.31589.50.camel@hpe.com>
 <1449093339.9855.1.camel@hpe.com>
Date: Thu, 3 Dec 2015 15:43:46 -0800
Message-ID:
Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping
From: Dan Williams
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
To: Toshi Kani
Cc: Andrew Morton, "Kirill A. Shutemov", Matthew Wilcox, Ross Zwisler,
 mauricio.porto@hpe.com, Linux MM, linux-fsdevel,
 "linux-nvdimm@lists.01.org", "linux-kernel@vger.kernel.org"
List-ID:

On Wed, Dec 2, 2015 at 1:55 PM, Toshi Kani wrote:
> On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote:
>> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote:
>> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote:
>> [..]
>> > > The whole point of __get_user_pages_fast() is to avoid the overhead of
>> > > taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells
>> > > __get_user_pages_fast that it needs to fall back to the
>> > > __get_user_pages slow path.
>> >
>> > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(),
>> > instead of VM_BUG_ON.
>>
>> Is pfn_valid() a reliable check? It seems to be based on a max_pfn
>> per node... what happens when pmem is located below that point. I
>> haven't been able to convince myself that we won't get false
>> positives, but maybe I'm missing something.
>
> I believe we use the version of pfn_valid() in linux/mmzone.h.

Talking this over with Dave we came to the conclusion that it would be
safer to be explicit about the pmd not being mapped. He points out
that unless a platform can guarantee that persistent memory is always
section aligned we might get false positive pfn_valid() indications.
Given the get_user_pages_fast() path is arch specific we can simply
have an arch specific pmd bit and not worry about generically enabling
a "pmd special" bit for now.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
MIME-Version: 1.0
In-Reply-To: <1449102105.9855.15.camel@hpe.com>
References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com>
 <1449022764.31589.24.camel@hpe.com>
 <1449078237.31589.30.camel@hpe.com>
 <1449102105.9855.15.camel@hpe.com>
Date: Wed, 2 Dec 2015 15:33:58 -0800
Message-ID:
Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping
From: Dan Williams
Content-Type: text/plain; charset=UTF-8
Sender: owner-linux-mm@kvack.org
To: Toshi Kani
Cc: Andrew Morton, "Kirill A.
 Shutemov", Matthew Wilcox, Ross Zwisler, mauricio.porto@hpe.com,
 Linux MM, linux-fsdevel, "linux-nvdimm@lists.01.org",
 "linux-kernel@vger.kernel.org"
List-ID:

On Wed, Dec 2, 2015 at 4:21 PM, Toshi Kani wrote:
> On Wed, 2015-12-02 at 10:43 -0700, Toshi Kani wrote:
>> On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote:
>> > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani wrote:
>> > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote:
>  :
>> > > >
>> > > > Hey Toshi,
>> > > >
>> > > > I ended up fixing this differently with follow_pmd_devmap() introduced
>> > > > in this series:
>> > > >
>> > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html
>> > > >
>> > > > Does the latest libnvdimm-pending branch [1] pass your test case?
>> > >
>> > > Hi Dan,
>> > >
>> > > I ran several test cases, and they all hit the case "pfn not in memmap" in
>> > > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn,
>> > > PFN_DEV is set but PFN_MAP is not. I have not looked into why, but I
>> > > thought I'd let you know first. I've also seen the test thread got hung up
>> > > at the end sometimes.
>> >
>> > That PFN_MAP flag will not be set by default for NFIT-defined
>> > persistent memory. See pmem_should_map_pages() for pmem namespaces
>> > that will have it set by default, currently only e820 type-12 memory
>> > ranges.
>> >
>> > NFIT-defined persistent memory can have a memmap array dynamically
>> > allocated by setting up a pfn device (similar to setting up a btt).
>> > We don't map it by default because the NFIT may describe hundreds of
>> > gigabytes of persistent memory and the overhead of the memmap may be
>> > too large to locate the memmap in ram.
>>
>> Oh, I see. I will set up the memmap array and run the tests again.
>
> I set up a pfn device, and ran a few test cases again. Yes, it solved the
> PFN_MAP issue. However, I am no longer able to allocate FS blocks aligned by
> 2MB, so PMD faults fall back to PTE. They are off by 2 pages, which I suspect
> is due to the pfn metadata. If I pass a 2MB-aligned+2pages virtual address to
> mmap(MAP_POPULATE), the mmap() call gets hung up.

Ok, I need to switch over from my memmap=ss!nn config. We just need
to pad the info block reservation to 2M. As for the MAP_POPULATE
hang, I'll take a look. Right now I'm in the process of rebasing the
whole set on top of -mm which has pending THP rework from Kirill.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Message-ID: <1449102105.9855.15.camel@hpe.com>
Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping
From: Toshi Kani
Date: Wed, 02 Dec 2015 17:21:45 -0700
In-Reply-To: <1449078237.31589.30.camel@hpe.com>
References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com>
 <1449022764.31589.24.camel@hpe.com>
 <1449078237.31589.30.camel@hpe.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: owner-linux-mm@kvack.org
To: Dan Williams
Cc: Andrew Morton, "Kirill A. Shutemov", Matthew Wilcox, Ross Zwisler,
 mauricio.porto@hpe.com, Linux MM, linux-fsdevel,
 "linux-nvdimm@lists.01.org", "linux-kernel@vger.kernel.org"
List-ID:

On Wed, 2015-12-02 at 10:43 -0700, Toshi Kani wrote:
> On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote:
> > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani wrote:
> > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote:
 :
> > > >
> > > > Hey Toshi,
> > > >
> > > > I ended up fixing this differently with follow_pmd_devmap() introduced
> > > > in this series:
> > > >
> > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html
> > > >
> > > > Does the latest libnvdimm-pending branch [1] pass your test case?
> > > > > > Hi Dan, > > > > > > I ran several test cases, and they all hit the case "pfn not in memmap" in > > > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, > > > PFN_DEV is set but PFN_MAP is not. I have not looked into why, but I > > > thought I let you know first. I've also seen the test thread got hung up > > > at the end sometime. > > > > That PFN_MAP flag will not be set by default for NFIT-defined > > persistent memory. See pmem_should_map_pages() for pmem namespaces > > that will have it set by default, currently only e820 type-12 memory > > ranges. > > > > NFIT-defined persistent memory can have a memmap array dynamically > > allocated by setting up a pfn device (similar to setting up a btt). > > We don't map it by default because the NFIT may describe hundreds of > > gigabytes of persistent and the overhead of the memmap may be too > > large to locate the memmap in ram. > > Oh, I see. I will setup the memmap array and run the tests again. I setup a pfn device, and ran a few test cases again. Yes, it solved the PFN_MAP issue. However, I am no longer able to allocate FS blocks aligned by 2MB, so PMD faults fall back to PTE. They are off by 2 pages, which I suspect due to the pfn metadata. If I pass a 2MB-aligned+2pages virtual address to mmap(MAP_POPULATE), the mmap() call gets hung up. Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> <565F69FE.601@intel.com> From: Dave Hansen Message-ID: <565F6C06.9060208@intel.com> Date: Wed, 2 Dec 2015 14:09:10 -0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Toshi Kani , Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On 12/02/2015 02:03 PM, Dan Williams wrote: >>> >> Is pfn_valid() a reliable check? It seems to be based on a max_pfn >>> >> per node... what happens when pmem is located below that point. I >>> >> haven't been able to convince myself that we won't get false >>> >> positives, but maybe I'm missing something. >> > >> > With sparsemem at least, it makes sure that you're looking at a valid >> > _section_. See the pfn_valid() at ~include/linux/mmzone.h:1222. > At a minimum we would need to add "depends on SPARSEMEM" to "config FS_DAX_PMD". Yeah, it seems like an awful layering violation. But, sparsemem is turned on everywhere (all the distros/users) that we care about, as far as I know. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <565F69FE.601@intel.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> <565F69FE.601@intel.com> Date: Wed, 2 Dec 2015 14:03:46 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Dave Hansen Cc: Toshi Kani , Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, Dec 2, 2015 at 2:00 PM, Dave Hansen wrote: > On 12/02/2015 12:54 PM, Dan Williams wrote: >> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote: >>> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: >> [..] >>>> >> The whole point of __get_user_page_fast() is to avoid the overhead of >>>> >> taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >>>> >> __get_user_pages_fast that it needs to fallback to the >>>> >> __get_user_pages slow path. >>> > >>> > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), >>> > instead of VM_BUG_ON. >> Is pfn_valid() a reliable check? It seems to be based on a max_pfn >> per node... what happens when pmem is located below that point. I >> haven't been able to convince myself that we won't get false >> positives, but maybe I'm missing something. > > With sparsemem at least, it makes sure that you're looking at a valid > _section_. See the pfn_valid() at ~include/linux/mmzone.h:1222. At a minimum we would need to add "depends on SPARSEMEM" to "config FS_DAX_PMD". 
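The fallback under discussion, and why it stands or falls with the reliability of pfn_valid(), can be sketched as a small userspace model. This is not kernel code: every model_* name, the pfn layout, and the two validity predicates are assumptions made up for illustration. The fast path returns 0 when the predicate says there is no struct page, the caller then drops to a slow path that fails with an -EFAULT-style error, and a max_pfn-style predicate shows the false positive that would defeat the scheme.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_EFAULT   14           /* same value as EFAULT on Linux */
#define MODEL_MAX_PFN  0x180000ull  /* hypothetical highest RAM pfn + 1 */
#define MODEL_PMEM_LO  0x80000ull   /* hypothetical page-less pmem range... */
#define MODEL_PMEM_HI  0x100000ull  /* ...sitting below MODEL_MAX_PFN */

/* Stand-in for pfn_valid(): true if a struct page is believed to exist. */
typedef bool (*model_pfn_valid_fn)(uint64_t pfn);

/* Predicate that only compares against a maximum pfn. */
bool model_valid_by_max_pfn(uint64_t pfn)
{
    return pfn < MODEL_MAX_PFN;
}

/* Predicate that knows the pmem range has no memmap behind it. */
bool model_valid_by_memmap(uint64_t pfn)
{
    if (pfn >= MODEL_PMEM_LO && pfn < MODEL_PMEM_HI)
        return false;
    return pfn < MODEL_MAX_PFN;
}

/* Fast path: no vma available. Return 0 to request a fallback. */
int model_gup_huge_pmd(uint64_t pfn, model_pfn_valid_fn valid)
{
    if (!valid(pfn))
        return 0;   /* bail out quietly instead of asserting */
    return 1;       /* one "page" pinned */
}

/* Slow path: in this model a page-less pfn simply fails with -EFAULT. */
int model_slow_path(uint64_t pfn, model_pfn_valid_fn valid)
{
    return valid(pfn) ? 1 : -MODEL_EFAULT;
}

/* Try the fast path first and fall back only when it pins nothing. */
int model_get_user_pages(uint64_t pfn, model_pfn_valid_fn valid,
                         int *used_slow_path)
{
    int pinned = model_gup_huge_pmd(pfn, valid);

    *used_slow_path = (pinned == 0);
    if (pinned)
        return pinned;
    return model_slow_path(pfn, valid);
}

/* Walks the three cases of interest for a pfn inside the pmem range. */
void model_fallback_demo(void)
{
    int slow = 0;

    /* Accurate predicate: fast path declines, slow path reports the error. */
    assert(model_get_user_pages(0xC0000, model_valid_by_memmap, &slow)
           == -MODEL_EFAULT);
    assert(slow == 1);

    /* Ordinary RAM: the fast path succeeds without falling back. */
    assert(model_get_user_pages(0x1000, model_valid_by_memmap, &slow) == 1);
    assert(slow == 0);

    /* max_pfn-style predicate: false positive, the fast path "pins" pmem. */
    assert(model_get_user_pages(0xC0000, model_valid_by_max_pfn, &slow) == 1);
    assert(slow == 0);
}
```

The last case in model_fallback_demo() is the failure mode being asked about: if the predicate can report a struct page where none exists, returning 0 on !pfn_valid() no longer protects the fast path.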
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> From: Dave Hansen Message-ID: <565F69FE.601@intel.com> Date: Wed, 2 Dec 2015 14:00:30 -0800 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dan Williams , Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On 12/02/2015 12:54 PM, Dan Williams wrote: > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote: >> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > [..] >>> >> The whole point of __get_user_page_fast() is to avoid the overhead of >>> >> taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >>> >> __get_user_pages_fast that it needs to fallback to the >>> >> __get_user_pages slow path. >> > >> > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), >> > instead of VM_BUG_ON. > Is pfn_valid() a reliable check? It seems to be based on a max_pfn > per node... what happens when pmem is located below that point. I > haven't been able to convince myself that we won't get false > positives, but maybe I'm missing something. With sparsemem at least, it makes sure that you're looking at a valid _section_. See the pfn_valid() at ~include/linux/mmzone.h:1222. 
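For readers without the source at hand, a section-granular validity check can be modeled in userspace roughly as follows. This is a simplified sketch, not the code in include/linux/mmzone.h: the model_* names, the constants (4 KiB pages, 128 MiB sections, a 64-section toy address space) and the flat boolean array are all assumptions chosen for illustration. The point it shows is that the answer depends on whether the containing section has a memmap, not on how the pfn compares to the highest RAM pfn.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative constants only; none of these come from the thread. */
#define MODEL_PAGE_SHIFT         12   /* 4 KiB pages */
#define MODEL_SECTION_SIZE_BITS  27   /* 128 MiB sections */
#define MODEL_PFN_SECTION_SHIFT  (MODEL_SECTION_SIZE_BITS - MODEL_PAGE_SHIFT)
#define MODEL_NR_SECTIONS        64   /* toy address space of 8 GiB */

/* One flag per section: true if a struct page array exists for it. */
static bool model_section_has_memmap[MODEL_NR_SECTIONS];

/* Map a page frame number to the index of the section containing it. */
uint64_t model_pfn_to_section_nr(uint64_t pfn)
{
    return pfn >> MODEL_PFN_SECTION_SHIFT;
}

/* Section-granular check: no comparison against a maximum RAM pfn. */
bool model_pfn_valid(uint64_t pfn)
{
    uint64_t nr = model_pfn_to_section_nr(pfn);

    if (nr >= MODEL_NR_SECTIONS)
        return false;
    return model_section_has_memmap[nr];
}

/* Mark every section overlapping [start_pfn, end_pfn) as having a memmap. */
void model_add_memmap(uint64_t start_pfn, uint64_t end_pfn)
{
    if (end_pfn <= start_pfn)
        return;

    uint64_t last = model_pfn_to_section_nr(end_pfn - 1);

    for (uint64_t nr = model_pfn_to_section_nr(start_pfn);
         nr <= last && nr < MODEL_NR_SECTIONS; nr++)
        model_section_has_memmap[nr] = true;
}

/* RAM at 0-2 GiB and 4-6 GiB, page-less pmem in the 2-4 GiB hole. */
void model_section_demo(void)
{
    model_add_memmap(0x0, 0x80000);        /* pfns for 0-2 GiB */
    model_add_memmap(0x100000, 0x180000);  /* pfns for 4-6 GiB */

    assert(model_pfn_valid(0x140000));     /* 5 GiB: RAM, has memmap */
    assert(!model_pfn_valid(0xC0000));     /* 3 GiB: pmem below top of RAM */
    assert(!model_pfn_valid(0x180000));    /* 6 GiB: first pfn past RAM */
}
```

In this model a pfn in the 2-4 GiB hole is rejected even though it lies below the highest RAM pfn. The check is only as fine-grained as a section, so a range sharing a section with RAM would still be reported as valid.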
-- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1449093339.9855.1.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani Date: Wed, 02 Dec 2015 14:55:39 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote: > > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > [..] > > > The whole point of __get_user_page_fast() is to avoid the overhead of > > > taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells > > > __get_user_pages_fast that it needs to fallback to the > > > __get_user_pages slow path. > > > > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), > > instead of VM_BUG_ON. > > Is pfn_valid() a reliable check? It seems to be based on a max_pfn > per node... what happens when pmem is located below that point. I > haven't been able to convince myself that we won't get false > positives, but maybe I'm missing something. I believe we use the version of pfn_valid() in linux/mmzone.h. Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. 
For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <1449092226.31589.50.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> Date: Wed, 2 Dec 2015 12:54:05 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote: > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: [..] >> The whole point of __get_user_page_fast() is to avoid the overhead of >> taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >> __get_user_pages_fast that it needs to fallback to the >> __get_user_pages slow path. > > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), > instead of VM_BUG_ON. Is pfn_valid() a reliable check? It seems to be based on a max_pfn per node... what happens when pmem is located below that point. I haven't been able to convince myself that we won't get false positives, but maybe I'm missing something. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1449092226.31589.50.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani Date: Wed, 02 Dec 2015 14:37:06 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 12:12 PM, Toshi Kani wrote: > > On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote: > > > On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: > > > > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: > > > > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > > > > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams < > > > > > > dan.j.williams@intel.com> > > > > > > wrote: > > > > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani > > > > > > > wrote: > > > > > > > > Oh, I see. I will setup the memmap array and run the tests > > > > > > > > again. > > > > > > > > > > > > > > > > But, why does the PMD mapping depend on the memmap array? We > > > > > > > > have observed major performance improvement with PMD. This > > > > > > > > feature should always be enabled with DAX regardless of the > > > > > > > > option to allocate the memmap array. 
> > > > > > > > > > > > > > > > > > > > > > Several factors drove this decision, I'm open to considering > > > > > > > alternatives but here's the reasoning: > > > > > > > > > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path > > > > > > > leading to commit e82c9ed41e8 "dax: disable pmd mappings". The > > > > > > > reason pte mappings don't crash and instead trigger -EFAULT is due > > > > > > > to the _PAGE_SPECIAL pte bit. > > > > > > > > > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge > > > > > > > -page case, we need a new pte bit _PAGE_DEVMAP. > > > > > > > > > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get > > > > > > > two, i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. > > > > > > > Even if we could get a _PAGE_SPECIAL for pmds I'm not in favor of > > > > > > > pursuing it. > > > > > > > > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we > > > > > > could go add _PMD_SPECIAL if we really wanted to support the limited > > > > > > page-less DAX-pmd case. > > > > > > > > > > > > But I'm still of the opinion that we run away from the page-less > > > > > > case until it can be made a full class citizen with O_DIRECT for pfn > > > > > > support. > > > > > > > > > > I may be missing something, but per vm_normal_page(), I think > > > > > _PAGE_SPECIAL can be substituted by the following check when we do not > > > > > have the memmap. > > > > > > > > > > if ((vma->vm_flags & VM_PFNMAP) || > > > > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > > > > > > > > > This is what I did in this patch for follow_trans_huge_pmd(), although > > > > > I missed the pfn_valid() check. > > > > > > > > That works for __get_user_pages but not __get_user_pages_fast where we > > > > don't have access to the vma. > > > > > > __get_user_page_fast already refers current->mm, so we should be able to > > > get the vma, and pass it down to gup_pud_range(). 
> > > > Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when > > the !pfn_valid() condition is met, so that we do not add the code to the > > main path of __get_user_pages_fast. > > The whole point of __get_user_page_fast() is to avoid the overhead of > taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells > __get_user_pages_fast that it needs to fallback to the > __get_user_pages slow path. I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), instead of VM_BUG_ON. Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <1449087125.31589.45.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> Date: Wed, 2 Dec 2015 11:57:55 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, Dec 2, 2015 at 12:12 PM, Toshi Kani wrote: > On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote: >> On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: >> > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: >> > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: >> > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams >> > > > wrote: >> > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: >> > > > > > Oh, I see. 
I will setup the memmap array and run the tests again. >> > > > > > >> > > > > > But, why does the PMD mapping depend on the memmap array? We have >> > > > > > observed major performance improvement with PMD. This feature >> > > > > > should always be enabled with DAX regardless of the option to >> > > > > > allocate the memmap array. >> > > > > > >> > > > > >> > > > > Several factors drove this decision, I'm open to considering >> > > > > alternatives but here's the reasoning: >> > > > > >> > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading >> > > > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte >> > > > > mappings don't crash and instead trigger -EFAULT is due to the >> > > > > _PAGE_SPECIAL pte bit. >> > > > > >> > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page >> > > > > case, we need a new pte bit _PAGE_DEVMAP. >> > > > > >> > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, >> > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we >> > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. >> > > > >> > > > Actually, Dave says they aren't that hard to come by for pmds, so we >> > > > could go add _PMD_SPECIAL if we really wanted to support the limited >> > > > page-less DAX-pmd case. >> > > > >> > > > But I'm still of the opinion that we run away from the page-less case >> > > > until it can be made a full class citizen with O_DIRECT for pfn >> > > > support. >> > > >> > > I may be missing something, but per vm_normal_page(), I think >> > > _PAGE_SPECIAL can be substituted by the following check when we do not >> > > have the memmap. >> > > >> > > if ((vma->vm_flags & VM_PFNMAP) || >> > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { >> > > >> > > This is what I did in this patch for follow_trans_huge_pmd(), although I >> > > missed the pfn_valid() check. 
>> > >> > That works for __get_user_pages but not __get_user_pages_fast where we >> > don't have access to the vma. >> >> __get_user_page_fast already refers current->mm, so we should be able to get >> the vma, and pass it down to gup_pud_range(). > > Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when the > !pfn_valid() condition is met, so that we do not add the code to the main path > of __get_user_pages_fast. The whole point of __get_user_page_fast() is to avoid the overhead of taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells __get_user_pages_fast that it needs to fallback to the __get_user_pages slow path. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1449087125.31589.45.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani Date: Wed, 02 Dec 2015 13:12:05 -0700 In-Reply-To: <1449086521.31589.39.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Andrew Morton , "Kirill A. 
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote: > On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: > > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: > > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams > > > > wrote: > > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: > > > > > > Oh, I see. I will setup the memmap array and run the tests again. > > > > > > > > > > > > But, why does the PMD mapping depend on the memmap array? We have > > > > > > observed major performance improvement with PMD. This feature > > > > > > should always be enabled with DAX regardless of the option to > > > > > > allocate the memmap array. > > > > > > > > > > > > > > > > Several factors drove this decision, I'm open to considering > > > > > alternatives but here's the reasoning: > > > > > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > > > > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > > > > > mappings don't crash and instead trigger -EFAULT is due to the > > > > > _PAGE_SPECIAL pte bit. > > > > > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > > > > > case, we need a new pte bit _PAGE_DEVMAP. > > > > > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. > > > > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we > > > > could go add _PMD_SPECIAL if we really wanted to support the limited > > > > page-less DAX-pmd case. 
> > > > > > > > But I'm still of the opinion that we run away from the page-less case > > > > until it can be made a full class citizen with O_DIRECT for pfn > > > > support. > > > > > > I may be missing something, but per vm_normal_page(), I think > > > _PAGE_SPECIAL can be substituted by the following check when we do not > > > have the memmap. > > > > > > if ((vma->vm_flags & VM_PFNMAP) || > > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > > > > > This is what I did in this patch for follow_trans_huge_pmd(), although I > > > missed the pfn_valid() check. > > > > That works for __get_user_pages but not __get_user_pages_fast where we > > don't have access to the vma. > > __get_user_page_fast already refers current->mm, so we should be able to get > the vma, and pass it down to gup_pud_range(). Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when the !pfn_valid() condition is met, so that we do not add the code to the main path of __get_user_pages_fast. Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1449086521.31589.39.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani Date: Wed, 02 Dec 2015 13:02:01 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Andrew Morton , "Kirill A. 
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams > > > wrote: > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: > > > > > Oh, I see. I will setup the memmap array and run the tests again. > > > > > > > > > > But, why does the PMD mapping depend on the memmap array? We have > > > > > observed major performance improvement with PMD. This feature should > > > > > always be enabled with DAX regardless of the option to allocate the > > > > > memmap > > > > > array. > > > > > > > > > > > > > Several factors drove this decision, I'm open to considering > > > > alternatives but here's the reasoning: > > > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > > > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > > > > mappings don't crash and instead trigger -EFAULT is due to the > > > > _PAGE_SPECIAL pte bit. > > > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > > > > case, we need a new pte bit _PAGE_DEVMAP. > > > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. > > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we > > > could go add _PMD_SPECIAL if we really wanted to support the limited > > > page-less DAX-pmd case. > > > > > > But I'm still of the opinion that we run away from the page-less case > > > until it can be made a full class citizen with O_DIRECT for pfn > > > support. 
> > > > I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL > > can > > be substituted by the following check when we do not have the memmap. > > > > if ((vma->vm_flags & VM_PFNMAP) || > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > > > This is what I did in this patch for follow_trans_huge_pmd(), although I > > missed > > the pfn_valid() check. > > That works for __get_user_pages but not __get_user_pages_fast where we > don't have access to the vma. __get_user_page_fast already refers current->mm, so we should be able to get the vma, and pass it down to gup_pud_range(). Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <1449084362.31589.37.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> Date: Wed, 2 Dec 2015 11:00:49 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: >> On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams wrote: >> > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: >> > > Oh, I see. I will setup the memmap array and run the tests again. >> > > >> > > But, why does the PMD mapping depend on the memmap array? We have >> > > observed major performance improvement with PMD. 
This feature should >> > > always be enabled with DAX regardless of the option to allocate the memmap >> > > array. >> > > >> > >> > Several factors drove this decision, I'm open to considering >> > alternatives but here's the reasoning: >> > >> > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading >> > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte >> > mappings don't crash and instead trigger -EFAULT is due to the >> > _PAGE_SPECIAL pte bit. >> > >> > 2/ To enable get_user_pages for DAX, in both the page and huge-page >> > case, we need a new pte bit _PAGE_DEVMAP. >> > >> > 3/ Given the pte bits are hard to come I'm assuming we won't get two, >> > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we >> > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. >> >> Actually, Dave says they aren't that hard to come by for pmds, so we >> could go add _PMD_SPECIAL if we really wanted to support the limited >> page-less DAX-pmd case. >> >> But I'm still of the opinion that we run away from the page-less case >> until it can be made a full class citizen with O_DIRECT for pfn >> support. > > I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL can > be substituted by the following check when we do not have the memmap. > > if ((vma->vm_flags & VM_PFNMAP) || > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > This is what I did in this patch for follow_trans_huge_pmd(), although I missed > the pfn_valid() check. That works for __get_user_pages but not __get_user_pages_fast where we don't have access to the vma. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . 
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Message-ID: <1449084362.31589.37.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani Date: Wed, 02 Dec 2015 12:26:02 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams wrote: > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: > > > Oh, I see. I will setup the memmap array and run the tests again. > > > > > > But, why does the PMD mapping depend on the memmap array? We have > > > observed major performance improvement with PMD. This feature should > > > always be enabled with DAX regardless of the option to allocate the memmap > > > array. > > > > > > > Several factors drove this decision, I'm open to considering > > alternatives but here's the reasoning: > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > > mappings don't crash and instead trigger -EFAULT is due to the > > _PAGE_SPECIAL pte bit. > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > > case, we need a new pte bit _PAGE_DEVMAP. > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. 
> > Actually, Dave says they aren't that hard to come by for pmds, so we > could go add _PMD_SPECIAL if we really wanted to support the limited > page-less DAX-pmd case. > > But I'm still of the opinion that we run away from the page-less case > until it can be made a full class citizen with O_DIRECT for pfn > support. I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL can be substituted by the following check when we do not have the memmap. if ((vma->vm_flags & VM_PFNMAP) || ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { This is what I did in this patch for follow_trans_huge_pmd(), although I missed the pfn_valid() check. Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> Date: Wed, 2 Dec 2015 10:06:36 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams wrote: > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: >> Oh, I see. I will setup the memmap array and run the tests again. >> >> But, why does the PMD mapping depend on the memmap array? We have observed >> major performance improvement with PMD. This feature should always be enabled >> with DAX regardless of the option to allocate the memmap array. 
>> > > Several factors drove this decision, I'm open to considering > alternatives but here's the reasoning: > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > mappings don't crash and instead trigger -EFAULT is due to the > _PAGE_SPECIAL pte bit. > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > case, we need a new pte bit _PAGE_DEVMAP. > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. Actually, Dave says they aren't that hard to come by for pmds, so we could go add _PMD_SPECIAL if we really wanted to support the limited page-less DAX-pmd case. But I'm still of the opinion that we run away from the page-less case until it can be made a full class citizen with O_DIRECT for pfn support. -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <1449078237.31589.30.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> Date: Wed, 2 Dec 2015 09:01:36 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: > Oh, I see. I will setup the memmap array and run the tests again. 
>
> But, why does the PMD mapping depend on the memmap array? We have observed
> major performance improvement with PMD. This feature should always be enabled
> with DAX regardless of the option to allocate the memmap array.
>

Several factors drove this decision. I'm open to considering
alternatives, but here's the reasoning:

1/ DAX pmd mappings caused crashes in the get_user_pages path leading
to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte
mappings don't crash and instead trigger -EFAULT is due to the
_PAGE_SPECIAL pte bit.

2/ To enable get_user_pages for DAX, in both the page and huge-page
case, we need a new pte bit _PAGE_DEVMAP.

3/ Given that pte bits are hard to come by, I'm assuming we won't get two,
i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we
could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it.

End result is that DAX pmd mappings must be fully enabled through the
get_user_pages paths with _PAGE_DEVMAP or turned off completely. In
general I think the "page less" DAX implementation was a good starting
point, but we need to shift to page-backed by default until we can
teach more of the kernel to operate on bare pfns. That "default" will
need to be enforced by userspace tooling.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Message-ID: <1449078237.31589.30.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani Date: Wed, 02 Dec 2015 10:43:57 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Andrew Morton , "Kirill A.
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote: > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani wrote: > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: > > > On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote: > > > > The following oops was observed when mmap() with MAP_POPULATE > > > > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > > > > expects that a target address has a struct page. > > > > > > > > BUG: unable to handle kernel paging request at ffffea0012220000 > > > > follow_trans_huge_pmd+0xba/0x390 > > > > follow_page_mask+0x33d/0x420 > > > > __get_user_pages+0xdc/0x800 > > > > populate_vma_page_range+0xb5/0xe0 > > > > __mm_populate+0xc5/0x150 > > > > vm_mmap_pgoff+0xd5/0xe0 > > > > SyS_mmap_pgoff+0x1c1/0x290 > > > > SyS_mmap+0x1b/0x30 > > > > > > > > Fix it by making the PMD pre-fault handling consistent with PTE. > > > > After pre-faulted in faultin_page(), follow_page_mask() calls > > > > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > > > > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > > > > and returns with -EEXIST. > > > > > > > > Reported-by: Mauricio Porto > > > > Signed-off-by: Toshi Kani > > > > Cc: Andrew Morton > > > > Cc: Kirill A. Shutemov > > > > Cc: Matthew Wilcox > > > > Cc: Dan Williams > > > > Cc: Ross Zwisler > > > > --- > > > > > > Hey Toshi, > > > > > > I ended up fixing this differently with follow_pmd_devmap() introduced > > > in this series: > > > > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html > > > > > > Does the latest libnvdimm-pending branch [1] pass your test case? > > > > Hi Dan, > > > > I ran several test cases, and they all hit the case "pfn not in memmap" in > > __dax_pmd_fault() during mmap(MAP_POPULATE). 
Looking at the dax.pfn, > > PFN_DEV is > > set but PFN_MAP is not. I have not looked into why, but I thought I let you > > know first. I've also seen the test thread got hung up at the end sometime. > > That PFN_MAP flag will not be set by default for NFIT-defined > persistent memory. See pmem_should_map_pages() for pmem namespaces > that will have it set by default, currently only e820 type-12 memory > ranges. > > NFIT-defined persistent memory can have a memmap array dynamically > allocated by setting up a pfn device (similar to setting up a btt). > We don't map it by default because the NFIT may describe hundreds of > gigabytes of persistent and the overhead of the memmap may be too > large to locate the memmap in ram. Oh, I see. I will setup the memmap array and run the tests again. But, why does the PMD mapping depend on the memmap array? We have observed major performance improvement with PMD. This feature should always be enabled with DAX regardless of the option to allocate the memmap array. Thanks, -Toshi -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: MIME-Version: 1.0 In-Reply-To: <1449022764.31589.24.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> Date: Tue, 1 Dec 2015 19:45:01 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams Content-Type: text/plain; charset=UTF-8 Sender: owner-linux-mm@kvack.org To: Toshi Kani Cc: Andrew Morton , "Kirill A. 
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID: On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani wrote: > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: >> On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote: >> > The following oops was observed when mmap() with MAP_POPULATE >> > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() >> > expects that a target address has a struct page. >> > >> > BUG: unable to handle kernel paging request at ffffea0012220000 >> > follow_trans_huge_pmd+0xba/0x390 >> > follow_page_mask+0x33d/0x420 >> > __get_user_pages+0xdc/0x800 >> > populate_vma_page_range+0xb5/0xe0 >> > __mm_populate+0xc5/0x150 >> > vm_mmap_pgoff+0xd5/0xe0 >> > SyS_mmap_pgoff+0x1c1/0x290 >> > SyS_mmap+0x1b/0x30 >> > >> > Fix it by making the PMD pre-fault handling consistent with PTE. >> > After pre-faulted in faultin_page(), follow_page_mask() calls >> > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() >> > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH >> > and returns with -EEXIST. >> > >> > Reported-by: Mauricio Porto >> > Signed-off-by: Toshi Kani >> > Cc: Andrew Morton >> > Cc: Kirill A. Shutemov >> > Cc: Matthew Wilcox >> > Cc: Dan Williams >> > Cc: Ross Zwisler >> > --- >> >> Hey Toshi, >> >> I ended up fixing this differently with follow_pmd_devmap() introduced >> in this series: >> >> https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html >> >> Does the latest libnvdimm-pending branch [1] pass your test case? > > Hi Dan, > > I ran several test cases, and they all hit the case "pfn not in memmap" in > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, PFN_DEV is > set but PFN_MAP is not. I have not looked into why, but I thought I let you > know first. I've also seen the test thread got hung up at the end sometime. 
That PFN_MAP flag will not be set by default for NFIT-defined
persistent memory. See pmem_should_map_pages() for pmem namespaces
that will have it set by default, currently only e820 type-12 memory
ranges.

NFIT-defined persistent memory can have a memmap array dynamically
allocated by setting up a pfn device (similar to setting up a btt).
We don't map it by default because the NFIT may describe hundreds of
gigabytes of persistent memory, and the overhead of the memmap may be
too large to locate the memmap in RAM.

I have a pending patch in libnvdimm-pending that allows the capacity
for the memmap to come from pmem instead of RAM:

https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/commit/?h=libnvdimm-pending&id=3117a24e07fe

> I also noticed that reason is not set in the case below.
>
>         if (length < PMD_SIZE
>             || (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) {
>                 dax_unmap_atomic(bdev, &dax);
>                 goto fallback;
>         }

Thanks, I'll fix that up.

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Message-ID: <1449022764.31589.24.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani Date: Tue, 01 Dec 2015 19:19:24 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: owner-linux-mm@kvack.org To: Dan Williams Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" List-ID:

On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote:
> On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote:
> > The following oops was observed when mmap() with MAP_POPULATE
> > pre-faulted pmd mappings of a DAX file.
follow_trans_huge_pmd()
> > expects that a target address has a struct page.
> >
> > BUG: unable to handle kernel paging request at ffffea0012220000
> > follow_trans_huge_pmd+0xba/0x390
> > follow_page_mask+0x33d/0x420
> > __get_user_pages+0xdc/0x800
> > populate_vma_page_range+0xb5/0xe0
> > __mm_populate+0xc5/0x150
> > vm_mmap_pgoff+0xd5/0xe0
> > SyS_mmap_pgoff+0x1c1/0x290
> > SyS_mmap+0x1b/0x30
> >
> > Fix it by making the PMD pre-fault handling consistent with PTE.
> > After pre-faulted in faultin_page(), follow_page_mask() calls
> > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd()
> > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH
> > and returns with -EEXIST.
> >
> > Reported-by: Mauricio Porto
> > Signed-off-by: Toshi Kani
> > Cc: Andrew Morton
> > Cc: Kirill A. Shutemov
> > Cc: Matthew Wilcox
> > Cc: Dan Williams
> > Cc: Ross Zwisler
> > ---
>
> Hey Toshi,
>
> I ended up fixing this differently with follow_pmd_devmap() introduced
> in this series:
>
> https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html
>
> Does the latest libnvdimm-pending branch [1] pass your test case?

Hi Dan,

I ran several test cases, and they all hit the case "pfn not in memmap" in
__dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, PFN_DEV
is set but PFN_MAP is not. I have not looked into why, but I thought I'd let
you know first. I've also seen the test thread hang at the end sometimes.

I also noticed that reason is not set in the case below.

	if (length < PMD_SIZE
	    || (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) {
		dax_unmap_atomic(bdev, &dax);
		goto fallback;
	}

Thanks,
-Toshi

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ .
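For readers following the report above, here is a minimal userspace sketch of the kind of call that exercises the populate path, mmap() of a file with MAP_SHARED | MAP_POPULATE. It is not the attached test program. The helper names `map_populate` and `pmd_aligned` are hypothetical, and the 2 MiB PMD size is an assumption that holds on x86-64 with 4 KiB base pages.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Assumed PMD mapping size: 2 MiB (x86-64, 4 KiB base pages). */
#define PMD_SZ (2UL << 20)

/* Hypothetical helper: nonzero if addr sits on a PMD_SZ boundary. */
static int pmd_aligned(const void *addr)
{
	return ((uintptr_t)addr & (PMD_SZ - 1)) == 0;
}

/*
 * Hypothetical helper: map the whole file shared and pre-faulted.
 * Returns MAP_FAILED on any error; on success stores the mapped
 * length in *len.
 */
static void *map_populate(const char *path, size_t *len)
{
	struct stat st;
	void *p = MAP_FAILED;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return MAP_FAILED;

	if (fstat(fd, &st) == 0 && st.st_size > 0) {
		*len = (size_t)st.st_size;
		/* MAP_POPULATE asks the kernel to fault pages in up front. */
		p = mmap(NULL, *len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_POPULATE, fd, 0);
	}

	close(fd);	/* the mapping remains valid after close() */
	return p;
}
```

On a DAX-mounted file one would call `map_populate()` on a file of at least 2 MiB and check `pmd_aligned()` on the result, since a PMD mapping can only be used when the virtual address is suitably aligned. Whether the kernel actually installs a PMD mapping depends on the kernel version and configuration discussed in this thread.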
Don't email: email@kvack.org From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752773AbbKWUJH (ORCPT ); Mon, 23 Nov 2015 15:09:07 -0500 Received: from g2t2355.austin.hp.com ([15.217.128.54]:58007 "EHLO g2t2355.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751270AbbKWUJF (ORCPT ); Mon, 23 Nov 2015 15:09:05 -0500 From: Toshi Kani To: akpm@linux-foundation.org Cc: kirill.shutemov@linux.intel.com, willy@linux.intel.com, ross.zwisler@linux.intel.com, dan.j.williams@intel.com, mauricio.porto@hpe.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-nvdimm@ml01.01.org, linux-kernel@vger.kernel.org, Toshi Kani Subject: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping Date: Mon, 23 Nov 2015 13:04:42 -0700 Message-Id: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> X-Mailer: git-send-email 2.4.3 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org The following oops was observed when mmap() with MAP_POPULATE pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() expects that a target address has a struct page. BUG: unable to handle kernel paging request at ffffea0012220000 follow_trans_huge_pmd+0xba/0x390 follow_page_mask+0x33d/0x420 __get_user_pages+0xdc/0x800 populate_vma_page_range+0xb5/0xe0 __mm_populate+0xc5/0x150 vm_mmap_pgoff+0xd5/0xe0 SyS_mmap_pgoff+0x1c1/0x290 SyS_mmap+0x1b/0x30 Fix it by making the PMD pre-fault handling consistent with PTE. After pre-faulted in faultin_page(), follow_page_mask() calls follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH and returns with -EEXIST. Reported-by: Mauricio Porto Signed-off-by: Toshi Kani Cc: Andrew Morton Cc: Kirill A. 
Shutemov
Cc: Matthew Wilcox
Cc: Dan Williams
Cc: Ross Zwisler
---
 mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d5b8920..f56e034 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1267,6 +1267,32 @@ out_unlock:
 	return ret;
 }
 
+/*
+ * Follow a pmd inserted by vmf_insert_pfn_pmd(). See follow_pfn_pte() for pte.
+ */
+static int follow_pfn_pmd(struct vm_area_struct *vma, unsigned long address,
+		pmd_t *pmd, unsigned int flags)
+{
+	/* No page to get reference */
+	if (flags & FOLL_GET)
+		return -EFAULT;
+
+	if (flags & FOLL_TOUCH) {
+		pmd_t entry = *pmd;
+
+		/* Set the dirty bit per follow_trans_huge_pmd() */
+		entry = pmd_mkyoung(pmd_mkdirty(entry));
+
+		if (!pmd_same(*pmd, entry)) {
+			set_pmd_at(vma->vm_mm, address, pmd, entry);
+			update_mmu_cache_pmd(vma, address, pmd);
+		}
+	}
+
+	/* Proper page table entry exists, but no corresponding struct page */
+	return -EEXIST;
+}
+
 struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 				   unsigned long addr,
 				   pmd_t *pmd,
@@ -1274,6 +1300,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct page *page = NULL;
+	int ret;
 
 	assert_spin_locked(pmd_lockptr(mm, pmd));
 
@@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
 	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
 		goto out;
 
+	/* pfn map does not have a struct page */
+	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
+		ret = follow_pfn_pmd(vma, addr, pmd, flags);
+		page = ERR_PTR(ret);
+		goto out;
+	}
+
 	page = pmd_page(*pmd);
 	VM_BUG_ON_PAGE(!PageHead(page), page);
 	if (flags & FOLL_TOUCH) {

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753537AbbKWUxI (ORCPT ); Mon, 23 Nov 2015 15:53:08 -0500 Received: from mail-wm0-f43.google.com ([74.125.82.43]:36104 "EHLO mail-wm0-f43.google.com"
rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750708AbbKWUxE (ORCPT ); Mon, 23 Nov 2015 15:53:04 -0500 MIME-Version: 1.0 In-Reply-To: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> Date: Mon, 23 Nov 2015 12:53:02 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote: > The following oops was observed when mmap() with MAP_POPULATE > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > expects that a target address has a struct page. > > BUG: unable to handle kernel paging request at ffffea0012220000 > follow_trans_huge_pmd+0xba/0x390 > follow_page_mask+0x33d/0x420 > __get_user_pages+0xdc/0x800 > populate_vma_page_range+0xb5/0xe0 > __mm_populate+0xc5/0x150 > vm_mmap_pgoff+0xd5/0xe0 > SyS_mmap_pgoff+0x1c1/0x290 > SyS_mmap+0x1b/0x30 > > Fix it by making the PMD pre-fault handling consistent with PTE. > After pre-faulted in faultin_page(), follow_page_mask() calls > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > and returns with -EEXIST. As of 4.4.-rc2 DAX pmd mappings are disabled. So we have time to do something more comprehensive in 4.5. > > Reported-by: Mauricio Porto > Signed-off-by: Toshi Kani > Cc: Andrew Morton > Cc: Kirill A. 
Shutemov
> Cc: Matthew Wilcox
> Cc: Dan Williams
> Cc: Ross Zwisler
> ---
>  mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index d5b8920..f56e034 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
[..]
> @@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma,
>  	if ((flags & FOLL_NUMA) && pmd_protnone(*pmd))
>  		goto out;
>
> +	/* pfn map does not have a struct page */
> +	if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) {
> +		ret = follow_pfn_pmd(vma, addr, pmd, flags);
> +		page = ERR_PTR(ret);
> +		goto out;
> +	}
> +
>  	page = pmd_page(*pmd);
>  	VM_BUG_ON_PAGE(!PageHead(page), page);
>  	if (flags & FOLL_TOUCH) {

I think it is already problematic that dax pmd mappings are getting
confused with transparent huge pages. They're more closely related to
hugetlbfs pmd mappings in that they are mapping an explicit
allocation. I have some pending patches to address this dax-pmd vs
hugetlb-pmd vs thp-pmd classification that I will post shortly.

By the way, I'm collecting DAX pmd regression tests [1]. Is this just
a simple crash upon using MAP_POPULATE?

[1]: https://github.com/pmem/ndctl/blob/master/lib/test-dax-pmd.c

From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755430AbbKWWT1 (ORCPT ); Mon, 23 Nov 2015 17:19:27 -0500 Received: from g2t2355.austin.hp.com ([15.217.128.54]:20022 "EHLO g2t2355.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752122AbbKWWTZ (ORCPT ); Mon, 23 Nov 2015 17:19:25 -0500 Message-ID: <1448316903.19320.46.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A.
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Mon, 23 Nov 2015 15:15:03 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> Content-Type: multipart/mixed; boundary="=-UwOidSeeVOF63N1Bvr6o" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org --=-UwOidSeeVOF63N1Bvr6o Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit On Mon, 2015-11-23 at 12:53 -0800, Dan Williams wrote: > On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote: > > The following oops was observed when mmap() with MAP_POPULATE > > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > > expects that a target address has a struct page. > > > > BUG: unable to handle kernel paging request at ffffea0012220000 > > follow_trans_huge_pmd+0xba/0x390 > > follow_page_mask+0x33d/0x420 > > __get_user_pages+0xdc/0x800 > > populate_vma_page_range+0xb5/0xe0 > > __mm_populate+0xc5/0x150 > > vm_mmap_pgoff+0xd5/0xe0 > > SyS_mmap_pgoff+0x1c1/0x290 > > SyS_mmap+0x1b/0x30 > > > > Fix it by making the PMD pre-fault handling consistent with PTE. > > After pre-faulted in faultin_page(), follow_page_mask() calls > > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > > and returns with -EEXIST. > > As of 4.4.-rc2 DAX pmd mappings are disabled. So we have time to do > something more comprehensive in 4.5. Yes, I noticed during my testing that I could not use pmd... > > Reported-by: Mauricio Porto > > Signed-off-by: Toshi Kani > > Cc: Andrew Morton > > Cc: Kirill A. 
Shutemov > > Cc: Matthew Wilcox > > Cc: Dan Williams > > Cc: Ross Zwisler > > --- > > mm/huge_memory.c | 34 ++++++++++++++++++++++++++++++++++ > > 1 file changed, 34 insertions(+) > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index d5b8920..f56e034 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > [..] > > @@ -1288,6 +1315,13 @@ struct page *follow_trans_huge_pmd(struct > > vm_area_struct *vma, > > if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) > > goto out; > > > > + /* pfn map does not have a struct page */ > > + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) { > > + ret = follow_pfn_pmd(vma, addr, pmd, flags); > > + page = ERR_PTR(ret); > > + goto out; > > + } > > + > > page = pmd_page(*pmd); > > VM_BUG_ON_PAGE(!PageHead(page), page); > > if (flags & FOLL_TOUCH) { > > I think it is already problematic that dax pmd mappings are getting > confused with transparent huge pages. We had the same issue with dax pte mapping [1], and this change extends the pfn map handling to pmd. So, this problem is not specific to pmd. [1] https://lkml.org/lkml/2015/6/23/181 > They're more closely related to > a hugetlbfs pmd mappings in that they are mapping an explicit > allocation. I have some pending patches to address this dax-pmd vs > hugetlb-pmd vs thp-pmd classification that I will post shortly. Not sure which way is better, but I am certainly interested in your changes. > By the way, I'm collecting DAX pmd regression tests [1], is this just > a simple crash upon using MAP_POPULATE? > > [1]: https://github.com/pmem/ndctl/blob/master/lib/test-dax-pmd.c Yes, this issue is easy to reproduce with MAP_POPULATE. In case it helps, attached are the test I used for testing the patches. Sorry, the code is messy since it was only intended for my internal use... - The test was originally written for the pte change [1] and comments in test.sh (ex. mlock fail, ok) reflect the results without the pte change. 
- For the pmd test, I modified test-mmap.c to call posix_memalign() before mmap(). By calling free(), the 2MB-aligned address from posix_memalign() can be used for mmap(). This keeps the mmap'd address aligned on 2MB. - I created test file(s) with dd (i.e. all blocks written) in my test. - The other infinite loop issue (fixed by my other patch) was found by the test case with option "-LMSr". Thanks, -Toshi --=-UwOidSeeVOF63N1Bvr6o Content-Type: application/x-shellscript; name="test.sh" Content-Disposition: attachment; filename="test.sh" Content-Transfer-Encoding: base64 c2V0IC14IAp1bW91bnQgL21udC9wbWVtMAptb3VudCAvbW50L3BtZW0wCgojZWNobyAnZmlsZSBt bS9ndXAuYyArcCcgPiAvc3lzL2tlcm5lbC9kZWJ1Zy9keW5hbWljX2RlYnVnL2NvbnRyb2wKI2Vj aG8gJ2ZpbGUgbW0vaHVnZV9tZW1vcnkuYyArcCcgPiAvc3lzL2tlcm5lbC9kZWJ1Zy9keW5hbWlj X2RlYnVnL2NvbnRyb2wKI2VjaG8gJ2ZpbGUgbW0vbWVtb3J5LmMgK3AnID4gL3N5cy9rZXJuZWwv ZGVidWcvZHluYW1pY19kZWJ1Zy9jb250cm9sCiNlY2hvICdmaWxlIGZzL2RheC5jICtwJyA+IC9z eXMva2VybmVsL2RlYnVnL2R5bmFtaWNfZGVidWcvY29udHJvbAoKIyMjIyAzMksgIyMjIwojIFNI QVJFRAojLi90ZXN0LW1tYXAgLU1yd3BzCSMgbWxvY2ssIHBvcHVsYXRlLCBzaGFyZWQgKG1sb2Nr IGZhaWwpCiMuL3Rlc3QtbW1hcCAtQXJ3cHMJIyBtbG9ja2FsbCwgcG9wdWxhdGUsIHNoYXJlZAoj Li90ZXN0LW1tYXAgLVJNcnBzCSMgcmVhZC1vbmx5LCBtbG9jaywgcG9wdWxhdGUsIHNoYXJlZCAo bWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1yd3BzCSMgcG9wbHVhdGUsIHNoYXJlZCAocG9wbHVh dGUgbm8gZWZmZWN0KQojLi90ZXN0LW1tYXAgLVJycHMJIyByZWFkLW9ubHkgcG9wbHVhdGUsIHNo YXJlZCAocG9wbHVhdGUgbm8gZWZmZWN0KQojLi90ZXN0LW1tYXAgLU1yd3MJIyBtbG9jaywgc2hh cmVkIChtbG9jayBmYWlsKQojLi90ZXN0LW1tYXAgLVJNcnMJIyByZWFkLW9ubHksIG1sb2NrLCBz aGFyZWQgKG1sb2NrIGZhaWwpCiMuL3Rlc3QtbW1hcCAtcndzCSMgc2hhcmVkIChvaykKIy4vdGVz dC1tbWFwIC1ScnMJIyByZWFkLW9ubHksIHNoYXJlZCAob2spCgojIFBSSVZBVEUKIy4vdGVzdC1t bWFwIC1NcndwCSMgbWxvY2ssIHBvcHVsYXRlLCBwcml2YXRlIChvaykKIy4vdGVzdC1tbWFwIC1S TXJwCSMgcmVhZC1vbmx5LCBtbG9jaywgcG9wdWxhdGUsIHByaXZhdGUgKG1sb2NrIGZhaWwpCiMu L3Rlc3QtbW1hcCAtcndwCSMgcG9wdWxhdGUsIHByaXZhdGUgKG9rKQojLi90ZXN0LW1tYXAgLVJy 
cAkjIHJlYWQtb25seSwgcG9wdWxhdGUsIHByaXZhdGUgKHBvcHVsYXRlIG5vIGVmZmVjdCkKIy4v dGVzdC1tbWFwIC1NcncJIyBtbG9jaywgcHJpdmF0ZSAob2spCiMuL3Rlc3QtbW1hcCAtUk1yCSMg cmVhZC1vbmx5LCBtbG9jaywgcHJpdmF0ZSAobWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1NU3IJ IyBwcml2YXRlLCByZWFkIGJlZm9yZSBtbG9jayAob2spCiMuL3Rlc3QtbW1hcCAtcncJIyBwcml2 YXRlIChvaykKIy4vdGVzdC1tbWFwIC1ScgkjIHJlYWQtb25seSwgcHJpdmF0ZSAob2spCgojIyMj IDRHICMjIyMKIyBTSEFSRUQKIy4vdGVzdC1tbWFwIC1MTXJ3cHMJIyBtbG9jaywgcG9wdWxhdGUs IHNoYXJlZCAobWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1MQXJ3cHMJIyBtbG9ja2FsbCwgcG9w dWxhdGUsIHNoYXJlZAojLi90ZXN0LW1tYXAgLUxSTXJwcwkjIHJlYWQtb25seSwgbWxvY2ssIHBv cHVsYXRlLCBzaGFyZWQgKG1sb2NrIGZhaWwpCiMuL3Rlc3QtbW1hcCAtTHJ3cHMJIyBwb3BsdWF0 ZSwgc2hhcmVkIChwb3BsdWF0ZSBubyBlZmZlY3QpCiMuL3Rlc3QtbW1hcCAtTFJycHMJIyByZWFk LW9ubHkgcG9wbHVhdGUsIHNoYXJlZCAocG9wbHVhdGUgbm8gZWZmZWN0KQojLi90ZXN0LW1tYXAg LUxNcndzCSMgbWxvY2ssIHNoYXJlZCAobWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1MUk1ycwkj IHJlYWQtb25seSwgbWxvY2ssIHNoYXJlZCAobWxvY2sgZmFpbCkKIy4vdGVzdC1tbWFwIC1Mcndz CSMgc2hhcmVkIChvaykKIy4vdGVzdC1tbWFwIC1MUnJzCSMgcmVhZC1vbmx5LCBzaGFyZWQgKG9r KQoKIyBQUklWQVRFCiMuL3Rlc3QtbW1hcCAtTE1yd3AJIyBtbG9jaywgcG9wdWxhdGUsIHByaXZh dGUgKG9rKQojLi90ZXN0LW1tYXAgLUxSTXJwCSMgcmVhZC1vbmx5LCBtbG9jaywgcG9wdWxhdGUs IHByaXZhdGUgKG1sb2NrIGZhaWwpCiMuL3Rlc3QtbW1hcCAtTHJ3cAkjIHBvcHVsYXRlLCBwcml2 YXRlIChvaykKIy4vdGVzdC1tbWFwIC1MUnJwCSMgcmVhZC1vbmx5LCBwb3B1bGF0ZSwgcHJpdmF0 ZSAocG9wdWxhdGUgbm8gZWZmZWN0KQojLi90ZXN0LW1tYXAgLUxNcncJIyBtbG9jaywgcHJpdmF0 ZSAob2spCiMuL3Rlc3QtbW1hcCAtTFJNcgkjIHJlYWQtb25seSwgbWxvY2ssIHByaXZhdGUgKG1s b2NrIGZhaWwpCiMuL3Rlc3QtbW1hcCAtTE1TcgkjIHByaXZhdGUsIHJlYWQgYmVmb3JlIG1sb2Nr IChvaykKIy4vdGVzdC1tbWFwIC1McncJIyBwcml2YXRlIChvaykKIy4vdGVzdC1tbWFwIC1MUnIJ IyByZWFkLW9ubHksIHByaXZhdGUgKG9rKQoKI2VjaG8gJ2ZpbGUgbW0vZ3VwLmMgLXAnID4gL3N5 cy9rZXJuZWwvZGVidWcvZHluYW1pY19kZWJ1Zy9jb250cm9sCiNlY2hvICdmaWxlIG1tL2h1Z2Vf bWVtb3J5LmMgLXAnID4gL3N5cy9rZXJuZWwvZGVidWcvZHluYW1pY19kZWJ1Zy9jb250cm9sCiNl 
Y2hvICdmaWxlIG1tL21lbW9yeS5jIC1wJyA+IC9zeXMva2VybmVsL2RlYnVnL2R5bmFtaWNfZGVi dWcvY29udHJvbAojZWNobyAnZmlsZSBmcy9kYXguYyAtcCcgPiAvc3lzL2tlcm5lbC9kZWJ1Zy9k eW5hbWljX2RlYnVnL2NvbnRyb2wK --=-UwOidSeeVOF63N1Bvr6o Content-Disposition: attachment; filename="test-mmap.c" Content-Type: text/x-csrc; name="test-mmap.c"; charset="UTF-8" Content-Transfer-Encoding: base64 I2luY2x1ZGUgPHN5cy90eXBlcy5oPgojaW5jbHVkZSA8c3lzL3N0YXQuaD4KI2luY2x1ZGUgPHN5 cy9tbWFuLmg+CiNpbmNsdWRlIDxzeXMvdGltZS5oPgojaW5jbHVkZSA8c3RyaW5nLmg+CiNpbmNs dWRlIDxmY250bC5oPgojaW5jbHVkZSA8c3RkaW8uaD4KI2luY2x1ZGUgPHN0ZGxpYi5oPgojaW5j bHVkZSA8dW5pc3RkLmg+CgojZGVmaW5lIE1CKGEpCQkoKGEpICogMTAyNFVMICogMTAyNFVMKQoK c3RhdGljIHN0cnVjdCB0aW1ldmFsIHN0YXJ0X3R2LCBzdG9wX3R2OwoKLy8gQ2FsY3VsYXRlIHRo ZSBkaWZmZXJlbmNlIGJldHdlZW4gdHdvIHRpbWUgdmFsdWVzLgp2b2lkIHR2c3ViKHN0cnVjdCB0 aW1ldmFsICp0ZGlmZiwgc3RydWN0IHRpbWV2YWwgKnQxLCBzdHJ1Y3QgdGltZXZhbCAqdDApCnsK CXRkaWZmLT50dl9zZWMgPSB0MS0+dHZfc2VjIC0gdDAtPnR2X3NlYzsKCXRkaWZmLT50dl91c2Vj ID0gdDEtPnR2X3VzZWMgLSB0MC0+dHZfdXNlYzsKCWlmICh0ZGlmZi0+dHZfdXNlYyA8IDApCgkJ dGRpZmYtPnR2X3NlYy0tLCB0ZGlmZi0+dHZfdXNlYyArPSAxMDAwMDAwOwp9CgovLyBTdGFydCB0 aW1pbmcgbm93Lgp2b2lkIHN0YXJ0KCkKewoJKHZvaWQpIGdldHRpbWVvZmRheSgmc3RhcnRfdHYs IChzdHJ1Y3QgdGltZXpvbmUgKikgMCk7Cn0KCi8vIFN0b3AgdGltaW5nIGFuZCByZXR1cm4gcmVh bCB0aW1lIGluIG1pY3Jvc2Vjb25kcy4KdW5zaWduZWQgbG9uZyBsb25nIHN0b3AoKQp7CglzdHJ1 Y3QgdGltZXZhbCB0ZGlmZjsKCgkodm9pZCkgZ2V0dGltZW9mZGF5KCZzdG9wX3R2LCAoc3RydWN0 IHRpbWV6b25lICopIDApOwoJdHZzdWIoJnRkaWZmLCAmc3RvcF90diwgJnN0YXJ0X3R2KTsKCXJl dHVybiAodGRpZmYudHZfc2VjICogMTAwMDAwMCArIHRkaWZmLnR2X3VzZWMpOwp9Cgp2b2lkIHRl c3Rfd3JpdGUodW5zaWduZWQgbG9uZyAqcCwgc2l6ZV90IHNpemUpCnsKCWludCBpOwoJdW5zaWdu ZWQgbG9uZyAqd3AsIHRtcDsKCXVuc2lnbmVkIGxvbmcgbG9uZyB0aW1ldmFsOwoKCXN0YXJ0KCk7 Cglmb3IgKGk9MCwgd3A9cDsgaTwoc2l6ZS9zaXplb2Yod3ApKTsgaSsrKQoJCSp3cCsrID0gMTsK CXRpbWV2YWwgPSBzdG9wKCk7CglwcmludGYoIldyaXRlOiAlMTBsbHUgdXNlY1xuIiwgdGltZXZh bCk7Cn0KCnZvaWQgdGVzdF9yZWFkKHVuc2lnbmVkIGxvbmcgKnAsIHNpemVfdCBzaXplKQp7Cglp 
bnQgaTsKCXVuc2lnbmVkIGxvbmcgKndwLCB0bXA7Cgl1bnNpZ25lZCBsb25nIGxvbmcgdGltZXZh bDsKCglzdGFydCgpOwoJZm9yIChpPTAsIHdwPXA7IGk8KHNpemUvc2l6ZW9mKHdwKSk7IGkrKykK CQl0bXAgPSAqd3ArKzsKCXRpbWV2YWwgPSBzdG9wKCk7CglwcmludGYoIlJlYWQgOiAlMTBsbHUg dXNlY1xuIiwgdGltZXZhbCk7Cn0KCmludCBtYWluKGludCBhcmdjLCBjaGFyICoqYXJndikKewoJ aW50IGZkLCBpLCBvcHQsIHJldDsKCWludCBvZmxhZ3MsIG1wcm90LCBtZmxhZ3MgPSAwOwoJaW50 IGlzX3JlYWRfb25seSA9IDAsIGlzX21sb2NrID0gMCwgaXNfbWxvY2thbGwgPSAwOwoJaW50IG1s b2NrX3NraXAgPSAwLCByZWFkX3Rlc3QgPSAwLCB3cml0ZV90ZXN0ID0gMDsKCXZvaWQgKm1wdHIg PSBOVUxMOwoJdW5zaWduZWQgbG9uZyAqcDsKCXN0cnVjdCBzdGF0IHN0YXQ7CglzaXplX3Qgc2l6 ZSwgY3B5X3NpemU7Cgljb25zdCBjaGFyICpmaWxlX25hbWUgPSBOVUxMOwoKCXdoaWxlICgob3B0 ID0gZ2V0b3B0KGFyZ2MsIGFyZ3YsICJMUk1TQXBzcnciKSkgIT0gLTEpIHsKCQlzd2l0Y2ggKG9w dCkgewoJCWNhc2UgJ0wnOgoJCQlmaWxlX25hbWUgPSAiL21udC9wbWVtMC80R2ZpbGUiOwoJCQli cmVhazsKCQljYXNlICdSJzoKCQkJcHJpbnRmKCI+IG1tYXA6IHJlYWQtb25seVxuIik7CgkJCWlz X3JlYWRfb25seSA9IDE7CgkJCWJyZWFrOwoJCWNhc2UgJ00nOgoJCQlwcmludGYoIj4gbWxvY2tc biIpOwoJCQlpc19tbG9jayA9IDE7CgkJCWJyZWFrOwoJCWNhc2UgJ1MnOgoJCQlwcmludGYoIj4g bWxvY2sgLSBza2lwIGZpcnN0IGl0ZVxuIik7CgkJCW1sb2NrX3NraXAgPSAxOwoJCQlicmVhazsK CQljYXNlICdBJzoKCQkJcHJpbnRmKCI+IG1sb2NrYWxsXG4iKTsKCQkJaXNfbWxvY2thbGwgPSAx OwoJCQlicmVhazsKCQljYXNlICdwJzoKCQkJcHJpbnRmKCI+IE1BUF9QT1BVTEFURVxuIik7CgkJ CW1mbGFncyB8PSBNQVBfUE9QVUxBVEU7CgkJCWJyZWFrOwoJCWNhc2UgJ3MnOgoJCQlwcmludGYo Ij4gTUFQX1NIQVJFRFxuIik7CgkJCW1mbGFncyB8PSBNQVBfU0hBUkVEOwoJCQlicmVhazsKCQlj YXNlICdyJzoKCQkJcHJpbnRmKCI+IHJlYWQtdGVzdFxuIik7CgkJCXJlYWRfdGVzdCA9IDE7CgkJ CWJyZWFrOwoJCWNhc2UgJ3cnOgoJCQlwcmludGYoIj4gd3JpdGUtdGVzdFxuIik7CgkJCXdyaXRl X3Rlc3QgPSAxOwoJCQlicmVhazsKCQl9Cgl9CgoJaWYgKCFmaWxlX25hbWUpIHsKCQlmaWxlX25h bWUgPSAiL21udC9wbWVtMS8zMktmaWxlIjsKCX0KCglpZiAoIShtZmxhZ3MgJiBNQVBfU0hBUkVE KSkgewoJCXByaW50ZigiPiBNQVBfUFJJVkFURVxuIik7CgkJbWZsYWdzIHw9IE1BUF9QUklWQVRF OwoJfQoKCWlmIChpc19yZWFkX29ubHkpIHsKCQlvZmxhZ3MgPSBPX1JET05MWTsKCQltcHJvdCA9 
IFBST1RfUkVBRDsKCX0gZWxzZSB7CgkJb2ZsYWdzID0gT19SRFdSOwoJCW1wcm90ID0gUFJPVF9S RUFEfFBST1RfV1JJVEU7Cgl9CgoJZmQgPSBvcGVuKGZpbGVfbmFtZSwgb2ZsYWdzKTsKCWlmIChm ZCA9PSAtMSkgewoJCXBlcnJvcigib3BlbiBmYWlsZWQiKTsKCQlleGl0KDEpOwoJfQoKCXJldCA9 IGZzdGF0KGZkLCAmc3RhdCk7CglpZiAocmV0IDwgMCkgewoJCXBlcnJvcigiZnN0YXQgZmFpbGVk Iik7CgkJZXhpdCgxKTsKCX0KCXNpemUgPSBzdGF0LnN0X3NpemU7CgoJcHJpbnRmKCI+IG9wZW4g JXMgc2l6ZSAweCV4IGZsYWdzIDB4JXhcbiIsIGZpbGVfbmFtZSwgc2l6ZSwgb2ZsYWdzKTsKCgly ZXQgPSBwb3NpeF9tZW1hbGlnbigmbXB0ciwgTUIoMiksIHNpemUpOwoJaWYgKHJldCA9PTApCgkJ ZnJlZShtcHRyKTsKCglwcmludGYoIj4gbW1hcCBtcHJvdCAweCV4IGZsYWdzIDB4JXhcbiIsIG1w cm90LCBtZmxhZ3MpOwoJcCA9IG1tYXAobXB0ciwgc2l6ZSwgbXByb3QsIG1mbGFncywgZmQsIDB4 MCk7CglpZiAoIXApIHsKCQlwZXJyb3IoIm1tYXAgZmFpbGVkIik7CgkJZXhpdCgxKTsKCX0KCWlm ICgobG9uZyB1bnNpZ25lZClwICYgKE1CKDIpLTEpKQoJCXByaW50ZigiPiBtbWFwOiBOT1QgMk1C IGFsaWduZWQ6IDB4JXBcbiIsIHApOwoJZWxzZQoJCXByaW50ZigiPiBtbWFwOiAyTUIgYWxpZ25l ZDogMHglcFxuIiwgcCk7CgojaWYgMAkvKiBTSVpFIExJTUlUICovCglpZiAoc2l6ZSA+PSBNQigy KSkKCQljcHlfc2l6ZSA9IE1CKDMyKTsKCWVsc2UKI2VuZGlmCgkJY3B5X3NpemUgPSBzaXplOwoK CWZvciAoaT0wOyBpPDM7IGkrKykgewoKCQlpZiAoaXNfbWxvY2sgJiYgIW1sb2NrX3NraXApIHsK CQkJcHJpbnRmKCI+IG1sb2NrIDB4JXBcbiIsIHApOwoJCQlyZXQgPSBtbG9jayhwLCBzaXplKTsK CQkJaWYgKHJldCA8IDApIHsKCQkJCXBlcnJvcigibWxvY2sgZmFpbGVkIik7CgkJCQlleGl0KDEp OwoJCQl9CgkJfSBlbHNlIGlmIChpc19tbG9ja2FsbCkgewoJCQlwcmludGYoIj4gbWxvY2thbGxc biIpOwoJCQlyZXQgPSBtbG9ja2FsbChNQ0xfQ1VSUkVOVHxNQ0xfRlVUVVJFKTsKCQkJaWYgKHJl dCA8IDApIHsKCQkJCXBlcnJvcigibWxvY2thbGwgZmFpbGVkIik7CgkJCQlleGl0KDEpOwoJCQl9 CgkJfQoKCQlwcmludGYoIj09PT09ICVkID09PT09XG4iLCBpKzEpOwoJCWlmICh3cml0ZV90ZXN0 KQoJCQl0ZXN0X3dyaXRlKHAsIGNweV9zaXplKTsKCQlpZiAocmVhZF90ZXN0KQoJCQl0ZXN0X3Jl YWQocCwgY3B5X3NpemUpOwoKCQlpZiAoaXNfbWxvY2sgJiYgIW1sb2NrX3NraXApIHsKCQkJcHJp bnRmKCI+IG11bmxvY2sgMHglcFxuIiwgcCk7CgkJCXJldCA9IG11bmxvY2socCwgc2l6ZSk7CgkJ CWlmIChyZXQgPCAwKSB7CgkJCQlwZXJyb3IoIm11bmxvY2sgZmFpbGVkIik7CgkJCQlleGl0KDEp 
OwoJCQl9CgkJfSBlbHNlIGlmIChpc19tbG9ja2FsbCkgewoJCQlwcmludGYoIj4gbXVubG9ja2Fs bFxuIik7CgkJCXJldCA9IG11bmxvY2thbGwoKTsKCQkJaWYgKHJldCA8IDApIHsKCQkJCXBlcnJv cigibXVubG9ja2FsbCBmYWlsZWQiKTsKCQkJCWV4aXQoMSk7CgkJCX0KCQl9CgoJCS8qIHNraXAs IGlmIHJlcXVlc3RlZCwgb25seSB0aGUgZmlyc3QgaXRlcmF0aW9uICovCgkJbWxvY2tfc2tpcCA9 IDA7Cgl9CgoJcHJpbnRmKCI+IG11bm1hcCAweCVwXG4iLCBwKTsKCW11bm1hcChwLCBzaXplKTsK fQo= --=-UwOidSeeVOF63N1Bvr6o-- From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755329AbbK3WI4 (ORCPT ); Mon, 30 Nov 2015 17:08:56 -0500 Received: from mail-yk0-f179.google.com ([209.85.160.179]:33839 "EHLO mail-yk0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755197AbbK3WIy (ORCPT ); Mon, 30 Nov 2015 17:08:54 -0500 MIME-Version: 1.0 In-Reply-To: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> Date: Mon, 30 Nov 2015 14:08:53 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote: > The following oops was observed when mmap() with MAP_POPULATE > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > expects that a target address has a struct page. 
> > BUG: unable to handle kernel paging request at ffffea0012220000 > follow_trans_huge_pmd+0xba/0x390 > follow_page_mask+0x33d/0x420 > __get_user_pages+0xdc/0x800 > populate_vma_page_range+0xb5/0xe0 > __mm_populate+0xc5/0x150 > vm_mmap_pgoff+0xd5/0xe0 > SyS_mmap_pgoff+0x1c1/0x290 > SyS_mmap+0x1b/0x30 > > Fix it by making the PMD pre-fault handling consistent with PTE. > After pre-faulted in faultin_page(), follow_page_mask() calls > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > and returns with -EEXIST. > > Reported-by: Mauricio Porto > Signed-off-by: Toshi Kani > Cc: Andrew Morton > Cc: Kirill A. Shutemov > Cc: Matthew Wilcox > Cc: Dan Williams > Cc: Ross Zwisler > --- Hey Toshi, I ended up fixing this differently with follow_pmd_devmap() introduced in this series: https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html Does the latest libnvdimm-pending branch [1] pass your test case? [1]: git://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm libnvdimm-pending From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932354AbbLBBYY (ORCPT ); Tue, 1 Dec 2015 20:24:24 -0500 Received: from g9t5008.houston.hp.com ([15.240.92.66]:53786 "EHLO g9t5008.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752542AbbLBBYW (ORCPT ); Tue, 1 Dec 2015 20:24:22 -0500 Message-ID: <1449022764.31589.24.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A. 
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Tue, 01 Dec 2015 19:19:24 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: > On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote: > > The following oops was observed when mmap() with MAP_POPULATE > > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > > expects that a target address has a struct page. > > > > BUG: unable to handle kernel paging request at ffffea0012220000 > > follow_trans_huge_pmd+0xba/0x390 > > follow_page_mask+0x33d/0x420 > > __get_user_pages+0xdc/0x800 > > populate_vma_page_range+0xb5/0xe0 > > __mm_populate+0xc5/0x150 > > vm_mmap_pgoff+0xd5/0xe0 > > SyS_mmap_pgoff+0x1c1/0x290 > > SyS_mmap+0x1b/0x30 > > > > Fix it by making the PMD pre-fault handling consistent with PTE. > > After pre-faulted in faultin_page(), follow_page_mask() calls > > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > > and returns with -EEXIST. > > > > Reported-by: Mauricio Porto > > Signed-off-by: Toshi Kani > > Cc: Andrew Morton > > Cc: Kirill A. Shutemov > > Cc: Matthew Wilcox > > Cc: Dan Williams > > Cc: Ross Zwisler > > --- > > Hey Toshi, > > I ended up fixing this differently with follow_pmd_devmap() introduced > in this series: > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html > > Does the latest libnvdimm-pending branch [1] pass your test case? 
Hi Dan, I ran several test cases, and they all hit the case "pfn not in memmap" in __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, PFN_DEV is set but PFN_MAP is not. I have not looked into why, but I thought I would let you know first. I have also seen the test thread hang at the end sometimes. I also noticed that 'reason' is not set in the case below. if (length < PMD_SIZE || (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) { dax_unmap_atomic(bdev, &dax); goto fallback; } Thanks, -Toshi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752822AbbLBDpF (ORCPT ); Tue, 1 Dec 2015 22:45:05 -0500 Received: from mail-yk0-f169.google.com ([209.85.160.169]:36524 "EHLO mail-yk0-f169.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750918AbbLBDpC (ORCPT ); Tue, 1 Dec 2015 22:45:02 -0500 MIME-Version: 1.0 In-Reply-To: <1449022764.31589.24.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> Date: Tue, 1 Dec 2015 19:45:01 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani wrote: > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: >> On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote: >> > The following oops was observed when mmap() with MAP_POPULATE >> > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() >> > expects that a target address has a struct page.
>> > >> > BUG: unable to handle kernel paging request at ffffea0012220000 >> > follow_trans_huge_pmd+0xba/0x390 >> > follow_page_mask+0x33d/0x420 >> > __get_user_pages+0xdc/0x800 >> > populate_vma_page_range+0xb5/0xe0 >> > __mm_populate+0xc5/0x150 >> > vm_mmap_pgoff+0xd5/0xe0 >> > SyS_mmap_pgoff+0x1c1/0x290 >> > SyS_mmap+0x1b/0x30 >> > >> > Fix it by making the PMD pre-fault handling consistent with PTE. >> > After pre-faulted in faultin_page(), follow_page_mask() calls >> > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() >> > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH >> > and returns with -EEXIST. >> > >> > Reported-by: Mauricio Porto >> > Signed-off-by: Toshi Kani >> > Cc: Andrew Morton >> > Cc: Kirill A. Shutemov >> > Cc: Matthew Wilcox >> > Cc: Dan Williams >> > Cc: Ross Zwisler >> > --- >> >> Hey Toshi, >> >> I ended up fixing this differently with follow_pmd_devmap() introduced >> in this series: >> >> https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html >> >> Does the latest libnvdimm-pending branch [1] pass your test case? > > Hi Dan, > > I ran several test cases, and they all hit the case "pfn not in memmap" in > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, PFN_DEV is > set but PFN_MAP is not. I have not looked into why, but I thought I let you > know first. I've also seen the test thread got hung up at the end sometime. That PFN_MAP flag will not be set by default for NFIT-defined persistent memory. See pmem_should_map_pages() for pmem namespaces that will have it set by default, currently only e820 type-12 memory ranges. NFIT-defined persistent memory can have a memmap array dynamically allocated by setting up a pfn device (similar to setting up a BTT). We don't map it by default because the NFIT may describe hundreds of gigabytes of persistent memory, and the overhead of the memmap may be too large to locate the memmap in RAM.
I have a pending patch in libnvdimm-pending that allows the capacity for the memmap to come from pmem instead of ram: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git/commit/?h=libnvdimm-pending&id=3117a24e07fe > I also noticed that reason is not set in the case below. > > if (length < PMD_SIZE > || (pfn_t_to_pfn(dax.pfn) & PG_PMD_COLOUR)) { > dax_unmap_atomic(bdev, &dax); > goto fallback; > } Thanks, I'll fix that up. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S964933AbbLBR36 (ORCPT ); Wed, 2 Dec 2015 12:29:58 -0500 Received: from mail-yk0-f180.google.com ([209.85.160.180]:35931 "EHLO mail-yk0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933266AbbLBRBh (ORCPT ); Wed, 2 Dec 2015 12:01:37 -0500 MIME-Version: 1.0 In-Reply-To: <1449078237.31589.30.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> Date: Wed, 2 Dec 2015 09:01:36 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: > Oh, I see. I will setup the memmap array and run the tests again. > > But, why does the PMD mapping depend on the memmap array? We have observed > major performance improvement with PMD. This feature should always be enabled > with DAX regardless of the option to allocate the memmap array. 
> Several factors drove this decision. I'm open to considering alternatives, but here's the reasoning: 1/ DAX pmd mappings caused crashes in the get_user_pages path leading to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte mappings don't crash and instead trigger -EFAULT is the _PAGE_SPECIAL pte bit. 2/ To enable get_user_pages for DAX, in both the page and huge-page case, we need a new pte bit _PAGE_DEVMAP. 3/ Given that pte bits are hard to come by, I'm assuming we won't get two, i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. The end result is that DAX pmd mappings must be fully enabled through the get_user_pages paths with _PAGE_DEVMAP or turned off completely. In general I think the "page less" DAX implementation was a good starting point, but we need to shift to page-backed by default until we can teach more of the kernel to operate on bare pfns. That "default" will need to be enforced by userspace tooling. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932985AbbLBQs6 (ORCPT ); Wed, 2 Dec 2015 11:48:58 -0500 Received: from g4t3427.houston.hp.com ([15.201.208.55]:36173 "EHLO g4t3427.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932933AbbLBQs4 (ORCPT ); Wed, 2 Dec 2015 11:48:56 -0500 Message-ID: <1449078237.31589.30.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A.
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Wed, 02 Dec 2015 10:43:57 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote: > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani wrote: > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: > > > On Mon, Nov 23, 2015 at 12:04 PM, Toshi Kani wrote: > > > > The following oops was observed when mmap() with MAP_POPULATE > > > > pre-faulted pmd mappings of a DAX file. follow_trans_huge_pmd() > > > > expects that a target address has a struct page. > > > > > > > > BUG: unable to handle kernel paging request at ffffea0012220000 > > > > follow_trans_huge_pmd+0xba/0x390 > > > > follow_page_mask+0x33d/0x420 > > > > __get_user_pages+0xdc/0x800 > > > > populate_vma_page_range+0xb5/0xe0 > > > > __mm_populate+0xc5/0x150 > > > > vm_mmap_pgoff+0xd5/0xe0 > > > > SyS_mmap_pgoff+0x1c1/0x290 > > > > SyS_mmap+0x1b/0x30 > > > > > > > > Fix it by making the PMD pre-fault handling consistent with PTE. > > > > After pre-faulted in faultin_page(), follow_page_mask() calls > > > > follow_trans_huge_pmd(), which is changed to call follow_pfn_pmd() > > > > for VM_PFNMAP or VM_MIXEDMAP. follow_pfn_pmd() handles FOLL_TOUCH > > > > and returns with -EEXIST. > > > > > > > > Reported-by: Mauricio Porto > > > > Signed-off-by: Toshi Kani > > > > Cc: Andrew Morton > > > > Cc: Kirill A. 
Shutemov > > > > Cc: Matthew Wilcox > > > > Cc: Dan Williams > > > > Cc: Ross Zwisler > > > > --- > > > > > > Hey Toshi, > > > > > > I ended up fixing this differently with follow_pmd_devmap() introduced > > > in this series: > > > > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html > > > > > > Does the latest libnvdimm-pending branch [1] pass your test case? > > > > Hi Dan, > > > > I ran several test cases, and they all hit the case "pfn not in memmap" in > > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, > > PFN_DEV is > > set but PFN_MAP is not. I have not looked into why, but I thought I let you > > know first. I've also seen the test thread got hung up at the end sometime. > > That PFN_MAP flag will not be set by default for NFIT-defined > persistent memory. See pmem_should_map_pages() for pmem namespaces > that will have it set by default, currently only e820 type-12 memory > ranges. > > NFIT-defined persistent memory can have a memmap array dynamically > allocated by setting up a pfn device (similar to setting up a btt). > We don't map it by default because the NFIT may describe hundreds of > gigabytes of persistent and the overhead of the memmap may be too > large to locate the memmap in ram. Oh, I see. I will setup the memmap array and run the tests again. But, why does the PMD mapping depend on the memmap array? We have observed major performance improvement with PMD. This feature should always be enabled with DAX regardless of the option to allocate the memmap array. 
Thanks, -Toshi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756784AbbLBSGj (ORCPT ); Wed, 2 Dec 2015 13:06:39 -0500 Received: from mail-yk0-f170.google.com ([209.85.160.170]:35603 "EHLO mail-yk0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752921AbbLBSGh (ORCPT ); Wed, 2 Dec 2015 13:06:37 -0500 MIME-Version: 1.0 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> Date: Wed, 2 Dec 2015 10:06:36 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams wrote: > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: >> Oh, I see. I will setup the memmap array and run the tests again. >> >> But, why does the PMD mapping depend on the memmap array? We have observed >> major performance improvement with PMD. This feature should always be enabled >> with DAX regardless of the option to allocate the memmap array. >> > > Several factors drove this decision, I'm open to considering > alternatives but here's the reasoning: > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > mappings don't crash and instead trigger -EFAULT is due to the > _PAGE_SPECIAL pte bit. > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > case, we need a new pte bit _PAGE_DEVMAP. > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > i.e. 
both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. Actually, Dave says they aren't that hard to come by for pmds, so we could go add _PMD_SPECIAL if we really wanted to support the limited page-less DAX-pmd case. But I'm still of the opinion that we run away from the page-less case until it can be made a full class citizen with O_DIRECT for pfn support. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759114AbbLBTAw (ORCPT ); Wed, 2 Dec 2015 14:00:52 -0500 Received: from mail-yk0-f181.google.com ([209.85.160.181]:33553 "EHLO mail-yk0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757595AbbLBTAu (ORCPT ); Wed, 2 Dec 2015 14:00:50 -0500 MIME-Version: 1.0 In-Reply-To: <1449084362.31589.37.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> Date: Wed, 2 Dec 2015 11:00:49 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: >> On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams wrote: >> > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: >> > > Oh, I see. I will setup the memmap array and run the tests again. >> > > >> > > But, why does the PMD mapping depend on the memmap array? We have >> > > observed major performance improvement with PMD. 
This feature should >> > > always be enabled with DAX regardless of the option to allocate the memmap >> > > array. >> > > >> > >> > Several factors drove this decision, I'm open to considering >> > alternatives but here's the reasoning: >> > >> > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading >> > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte >> > mappings don't crash and instead trigger -EFAULT is due to the >> > _PAGE_SPECIAL pte bit. >> > >> > 2/ To enable get_user_pages for DAX, in both the page and huge-page >> > case, we need a new pte bit _PAGE_DEVMAP. >> > >> > 3/ Given the pte bits are hard to come I'm assuming we won't get two, >> > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we >> > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. >> >> Actually, Dave says they aren't that hard to come by for pmds, so we >> could go add _PMD_SPECIAL if we really wanted to support the limited >> page-less DAX-pmd case. >> >> But I'm still of the opinion that we run away from the page-less case >> until it can be made a full class citizen with O_DIRECT for pfn >> support. > > I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL can > be substituted by the following check when we do not have the memmap. > > if ((vma->vm_flags & VM_PFNMAP) || > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > This is what I did in this patch for follow_trans_huge_pmd(), although I missed > the pfn_valid() check. That works for __get_user_pages but not __get_user_pages_fast where we don't have access to the vma. 
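A minimal userspace sketch of the check quoted above, and of why it depends on the vma. Everything prefixed toy_ or TOY_ (the flag values, struct toy_vma, the memmap limit and toy_pfn_valid()) is a hypothetical stand-in made up for illustration, not the kernel's definitions; only the shape of the condition comes from the thread.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's VM_PFNMAP / VM_MIXEDMAP bits. */
#define TOY_VM_PFNMAP   0x1UL
#define TOY_VM_MIXEDMAP 0x2UL

/* Hypothetical model: pfns below this limit are covered by a memmap. */
#define TOY_MEMMAP_END  0x100000UL

/* Hypothetical, stripped-down vma carrying only the flags word. */
struct toy_vma {
	unsigned long vm_flags;
};

/* Toy replacement for pfn_valid(): true if the pfn has a struct page. */
bool toy_pfn_valid(unsigned long pfn)
{
	return pfn < TOY_MEMMAP_END;
}

/*
 * Mirrors the condition quoted in the thread: the mapping is treated as
 * page-less if the vma is PFNMAP, or if it is MIXEDMAP and the pfn has
 * no memmap entry behind it.
 */
bool toy_mapping_is_pageless(const struct toy_vma *vma, unsigned long pfn)
{
	if (vma->vm_flags & TOY_VM_PFNMAP)
		return true;
	if ((vma->vm_flags & TOY_VM_MIXEDMAP) && !toy_pfn_valid(pfn))
		return true;
	return false;
}
```

Both inputs to the test are the vma flags and the pfn, so a lockless walker that only sees page-table entries cannot evaluate it as written.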
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1758518AbbLBSbK (ORCPT ); Wed, 2 Dec 2015 13:31:10 -0500 Received: from g2t1383g.austin.hp.com ([15.217.136.92]:10007 "EHLO g2t1383g.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750964AbbLBSbI (ORCPT ); Wed, 2 Dec 2015 13:31:08 -0500 Message-ID: <1449084362.31589.37.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Wed, 02 Dec 2015 12:26:02 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams wrote: > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: > > > Oh, I see. I will setup the memmap array and run the tests again. > > > > > > But, why does the PMD mapping depend on the memmap array? We have > > > observed major performance improvement with PMD. This feature should > > > always be enabled with DAX regardless of the option to allocate the memmap > > > array. > > > > > > > Several factors drove this decision, I'm open to considering > > alternatives but here's the reasoning: > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > > mappings don't crash and instead trigger -EFAULT is due to the > > _PAGE_SPECIAL pte bit. 
> > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > > case, we need a new pte bit _PAGE_DEVMAP. > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. > > Actually, Dave says they aren't that hard to come by for pmds, so we > could go add _PMD_SPECIAL if we really wanted to support the limited > page-less DAX-pmd case. > > But I'm still of the opinion that we run away from the page-less case > until it can be made a full class citizen with O_DIRECT for pfn > support. I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL can be substituted by the following check when we do not have the memmap. if ((vma->vm_flags & VM_PFNMAP) || ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { This is what I did in this patch for follow_trans_huge_pmd(), although I missed the pfn_valid() check. Thanks, -Toshi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755206AbbLBT56 (ORCPT ); Wed, 2 Dec 2015 14:57:58 -0500 Received: from mail-yk0-f170.google.com ([209.85.160.170]:33040 "EHLO mail-yk0-f170.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751413AbbLBT54 (ORCPT ); Wed, 2 Dec 2015 14:57:56 -0500 MIME-Version: 1.0 In-Reply-To: <1449087125.31589.45.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> Date: Wed, 2 Dec 2015 11:57:55 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A. 
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 2, 2015 at 12:12 PM, Toshi Kani wrote: > On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote: >> On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: >> > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: >> > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: >> > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams >> > > > wrote: >> > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: >> > > > > > Oh, I see. I will setup the memmap array and run the tests again. >> > > > > > >> > > > > > But, why does the PMD mapping depend on the memmap array? We have >> > > > > > observed major performance improvement with PMD. This feature >> > > > > > should always be enabled with DAX regardless of the option to >> > > > > > allocate the memmap array. >> > > > > > >> > > > > >> > > > > Several factors drove this decision, I'm open to considering >> > > > > alternatives but here's the reasoning: >> > > > > >> > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading >> > > > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte >> > > > > mappings don't crash and instead trigger -EFAULT is due to the >> > > > > _PAGE_SPECIAL pte bit. >> > > > > >> > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page >> > > > > case, we need a new pte bit _PAGE_DEVMAP. >> > > > > >> > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, >> > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we >> > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. 
>> > > > >> > > > Actually, Dave says they aren't that hard to come by for pmds, so we >> > > > could go add _PMD_SPECIAL if we really wanted to support the limited >> > > > page-less DAX-pmd case. >> > > > >> > > > But I'm still of the opinion that we run away from the page-less case >> > > > until it can be made a full class citizen with O_DIRECT for pfn >> > > > support. >> > > >> > > I may be missing something, but per vm_normal_page(), I think >> > > _PAGE_SPECIAL can be substituted by the following check when we do not >> > > have the memmap. >> > > >> > > if ((vma->vm_flags & VM_PFNMAP) || >> > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { >> > > >> > > This is what I did in this patch for follow_trans_huge_pmd(), although I >> > > missed the pfn_valid() check. >> > >> > That works for __get_user_pages but not __get_user_pages_fast where we >> > don't have access to the vma. >> >> __get_user_page_fast already refers current->mm, so we should be able to get >> the vma, and pass it down to gup_pud_range(). > > Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when the > !pfn_valid() condition is met, so that we do not add the code to the main path > of __get_user_pages_fast. The whole point of __get_user_pages_fast() is to avoid the overhead of taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells __get_user_pages_fast that it needs to fall back to the __get_user_pages slow path.
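The fallback described above can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: the TOY_ bit values, toy_gup_fast() and toy_get_user_page() are hypothetical names invented here, and the real fast path walks page tables rather than taking a single entry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions; the real layout is architecture specific. */
#define TOY_PAGE_PRESENT (1ULL << 0)
#define TOY_PAGE_SPECIAL (1ULL << 1)

/* Which path ended up servicing the request in this toy model. */
enum toy_gup_path { TOY_GUP_FAST, TOY_GUP_SLOW };

/*
 * Lockless fast path: it inspects only the page-table entry itself and
 * never looks at a vma. It gives up if the entry is not present or is
 * marked special.
 */
bool toy_gup_fast(uint64_t entry)
{
	if (!(entry & TOY_PAGE_PRESENT))
		return false;
	if (entry & TOY_PAGE_SPECIAL)
		return false;
	return true;
}

/*
 * Caller: try the fast path first. On failure, fall back to the slow
 * path, which in the kernel is where the mm semaphore is taken and the
 * vma is consulted.
 */
enum toy_gup_path toy_get_user_page(uint64_t entry)
{
	if (toy_gup_fast(entry))
		return TOY_GUP_FAST;
	return TOY_GUP_SLOW;
}
```

In this model the special bit is the only signal the fast path needs, so it can bail out without taking any lock.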
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759370AbbLBTHD (ORCPT ); Wed, 2 Dec 2015 14:07:03 -0500 Received: from g4t3426.houston.hp.com ([15.201.208.54]:30429 "EHLO g4t3426.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756260AbbLBTHA (ORCPT ); Wed, 2 Dec 2015 14:07:00 -0500 Message-ID: <1449086521.31589.39.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Wed, 02 Dec 2015 13:02:01 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams > > > wrote: > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: > > > > > Oh, I see. I will setup the memmap array and run the tests again. > > > > > > > > > > But, why does the PMD mapping depend on the memmap array? We have > > > > > observed major performance improvement with PMD. This feature should > > > > > always be enabled with DAX regardless of the option to allocate the > > > > > memmap > > > > > array. 
> > > > > > > > > > > > > Several factors drove this decision, I'm open to considering > > > > alternatives but here's the reasoning: > > > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > > > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > > > > mappings don't crash and instead trigger -EFAULT is due to the > > > > _PAGE_SPECIAL pte bit. > > > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > > > > case, we need a new pte bit _PAGE_DEVMAP. > > > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. > > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we > > > could go add _PMD_SPECIAL if we really wanted to support the limited > > > page-less DAX-pmd case. > > > > > > But I'm still of the opinion that we run away from the page-less case > > > until it can be made a full class citizen with O_DIRECT for pfn > > > support. > > > > I may be missing something, but per vm_normal_page(), I think _PAGE_SPECIAL > > can > > be substituted by the following check when we do not have the memmap. > > > > if ((vma->vm_flags & VM_PFNMAP) || > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > > > This is what I did in this patch for follow_trans_huge_pmd(), although I > > missed > > the pfn_valid() check. > > That works for __get_user_pages but not __get_user_pages_fast where we > don't have access to the vma. __get_user_page_fast already refers current->mm, so we should be able to get the vma, and pass it down to gup_pud_range(). 
Thanks, -Toshi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754908AbbLBTRH (ORCPT ); Wed, 2 Dec 2015 14:17:07 -0500 Received: from g9t5008.houston.hp.com ([15.240.92.66]:38295 "EHLO g9t5008.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751989AbbLBTRF (ORCPT ); Wed, 2 Dec 2015 14:17:05 -0500 Message-ID: <1449087125.31589.45.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Wed, 02 Dec 2015 13:12:05 -0700 In-Reply-To: <1449086521.31589.39.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote: > On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: > > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: > > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams > > > > wrote: > > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani wrote: > > > > > > Oh, I see. I will setup the memmap array and run the tests again. > > > > > > > > > > > > But, why does the PMD mapping depend on the memmap array? We have > > > > > > observed major performance improvement with PMD. This feature > > > > > > should always be enabled with DAX regardless of the option to > > > > > > allocate the memmap array. 
> > > > > > > > > > > > > > > > Several factors drove this decision, I'm open to considering > > > > > alternatives but here's the reasoning: > > > > > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path leading > > > > > to commit e82c9ed41e8 "dax: disable pmd mappings". The reason pte > > > > > mappings don't crash and instead trigger -EFAULT is due to the > > > > > _PAGE_SPECIAL pte bit. > > > > > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge-page > > > > > case, we need a new pte bit _PAGE_DEVMAP. > > > > > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get two, > > > > > i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. Even if we > > > > > could get a _PAGE_SPECIAL for pmds I'm not in favor of pursuing it. > > > > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we > > > > could go add _PMD_SPECIAL if we really wanted to support the limited > > > > page-less DAX-pmd case. > > > > > > > > But I'm still of the opinion that we run away from the page-less case > > > > until it can be made a full class citizen with O_DIRECT for pfn > > > > support. > > > > > > I may be missing something, but per vm_normal_page(), I think > > > _PAGE_SPECIAL can be substituted by the following check when we do not > > > have the memmap. > > > > > > if ((vma->vm_flags & VM_PFNMAP) || > > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > > > > > This is what I did in this patch for follow_trans_huge_pmd(), although I > > > missed the pfn_valid() check. > > > > That works for __get_user_pages but not __get_user_pages_fast where we > > don't have access to the vma. > > __get_user_page_fast already refers current->mm, so we should be able to get > the vma, and pass it down to gup_pud_range(). 
Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when the !pfn_valid() condition is met, so that we do not add the code to the main path of __get_user_pages_fast. Thanks, -Toshi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757243AbbLBUyL (ORCPT ); Wed, 2 Dec 2015 15:54:11 -0500 Received: from mail-yk0-f182.google.com ([209.85.160.182]:35299 "EHLO mail-yk0-f182.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757191AbbLBUyG (ORCPT ); Wed, 2 Dec 2015 15:54:06 -0500 MIME-Version: 1.0 In-Reply-To: <1449092226.31589.50.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> Date: Wed, 2 Dec 2015 12:54:05 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote: > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: [..] >> The whole point of __get_user_page_fast() is to avoid the overhead of >> taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >> __get_user_pages_fast that it needs to fallback to the >> __get_user_pages slow path. > > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), > instead of VM_BUG_ON. Is pfn_valid() a reliable check? It seems to be based on a max_pfn per node... 
what happens when pmem is located below that point. I haven't been able to convince myself that we won't get false positives, but maybe I'm missing something. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755963AbbLBUmI (ORCPT ); Wed, 2 Dec 2015 15:42:08 -0500 Received: from g4t3426.houston.hp.com ([15.201.208.54]:3620 "EHLO g4t3426.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751775AbbLBUmG (ORCPT ); Wed, 2 Dec 2015 15:42:06 -0500 Message-ID: <1449092226.31589.50.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Wed, 02 Dec 2015 14:37:06 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 12:12 PM, Toshi Kani wrote: > > On Wed, 2015-12-02 at 13:02 -0700, Toshi Kani wrote: > > > On Wed, 2015-12-02 at 11:00 -0800, Dan Williams wrote: > > > > On Wed, Dec 2, 2015 at 11:26 AM, Toshi Kani wrote: > > > > > On Wed, 2015-12-02 at 10:06 -0800, Dan Williams wrote: > > > > > > On Wed, Dec 2, 2015 at 9:01 AM, Dan Williams < > > > > > > dan.j.williams@intel.com> > > > > > > wrote: > > > > > > > On Wed, Dec 2, 2015 at 9:43 AM, Toshi Kani > > > > > > > wrote: > > > > > > > > Oh, I see. 
I will setup the memmap array and run the tests > > > > > > > > again. > > > > > > > > > > > > > > > > But, why does the PMD mapping depend on the memmap array? We > > > > > > > > have observed major performance improvement with PMD. This > > > > > > > > feature should always be enabled with DAX regardless of the > > > > > > > > option to allocate the memmap array. > > > > > > > > > > > > > > > > > > > > > > Several factors drove this decision, I'm open to considering > > > > > > > alternatives but here's the reasoning: > > > > > > > > > > > > > > 1/ DAX pmd mappings caused crashes in the get_user_pages path > > > > > > > leading to commit e82c9ed41e8 "dax: disable pmd mappings". The > > > > > > > reason pte mappings don't crash and instead trigger -EFAULT is due > > > > > > > to the _PAGE_SPECIAL pte bit. > > > > > > > > > > > > > > 2/ To enable get_user_pages for DAX, in both the page and huge > > > > > > > -page case, we need a new pte bit _PAGE_DEVMAP. > > > > > > > > > > > > > > 3/ Given the pte bits are hard to come I'm assuming we won't get > > > > > > > two, i.e. both _PAGE_DEVMAP and a new _PAGE_SPECIAL for pmds. > > > > > > > Even if we could get a _PAGE_SPECIAL for pmds I'm not in favor of > > > > > > > pursuing it. > > > > > > > > > > > > Actually, Dave says they aren't that hard to come by for pmds, so we > > > > > > could go add _PMD_SPECIAL if we really wanted to support the limited > > > > > > page-less DAX-pmd case. > > > > > > > > > > > > But I'm still of the opinion that we run away from the page-less > > > > > > case until it can be made a full class citizen with O_DIRECT for pfn > > > > > > support. > > > > > > > > > > I may be missing something, but per vm_normal_page(), I think > > > > > _PAGE_SPECIAL can be substituted by the following check when we do not > > > > > have the memmap. 
> > > > > > > > > > if ((vma->vm_flags & VM_PFNMAP) || > > > > > ((vma->vm_flags & VM_MIXEDMAP) && (!pfn_valid(pfn)))) { > > > > > > > > > > This is what I did in this patch for follow_trans_huge_pmd(), although > > > > > I missed the pfn_valid() check. > > > > > > > > That works for __get_user_pages but not __get_user_pages_fast where we > > > > don't have access to the vma. > > > > > > __get_user_page_fast already refers current->mm, so we should be able to > > > get the vma, and pass it down to gup_pud_range(). > > > > Alternatively, we can obtain the vma from current->mm in gup_huge_pmd() when > > the !pfn_valid() condition is met, so that we do not add the code to the > > main path of __get_user_pages_fast. > > The whole point of __get_user_page_fast() is to avoid the overhead of > taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells > __get_user_pages_fast that it needs to fallback to the > __get_user_pages slow path. I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), instead of VM_BUG_ON. Thanks, -Toshi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756119AbbLBVAj (ORCPT ); Wed, 2 Dec 2015 16:00:39 -0500 Received: from g2t2352.austin.hp.com ([15.217.128.51]:33271 "EHLO g2t2352.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752536AbbLBVAi (ORCPT ); Wed, 2 Dec 2015 16:00:38 -0500 Message-ID: <1449093339.9855.1.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A. 
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Wed, 02 Dec 2015 14:55:39 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote: > > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > [..] > > > The whole point of __get_user_page_fast() is to avoid the overhead of > > > taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells > > > __get_user_pages_fast that it needs to fallback to the > > > __get_user_pages slow path. > > > > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), > > instead of VM_BUG_ON. > > Is pfn_valid() a reliable check? It seems to be based on a max_pfn > per node... what happens when pmem is located below that point. I > haven't been able to convince myself that we won't get false > positives, but maybe I'm missing something. I believe we use the version of pfn_valid() in linux/mmzone.h. 
Thanks, -Toshi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756261AbbLBWAg (ORCPT ); Wed, 2 Dec 2015 17:00:36 -0500 Received: from mga01.intel.com ([192.55.52.88]:36892 "EHLO mga01.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751078AbbLBWAe (ORCPT ); Wed, 2 Dec 2015 17:00:34 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,375,1444719600"; d="scan'208";a="863369572" Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping To: Dan Williams , Toshi Kani References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" From: Dave Hansen Message-ID: <565F69FE.601@intel.com> Date: Wed, 2 Dec 2015 14:00:30 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12/02/2015 12:54 PM, Dan Williams wrote: > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote: >> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > [..] >>> >> The whole point of __get_user_page_fast() is to avoid the overhead of >>> >> taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >>> >> __get_user_pages_fast that it needs to fallback to the >>> >> __get_user_pages slow path. >> > >> > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), >> > instead of VM_BUG_ON. > Is pfn_valid() a reliable check? 
It seems to be based on a max_pfn > per node... what happens when pmem is located below that point. I > haven't been able to convince myself that we won't get false > positives, but maybe I'm missing something. With sparsemem at least, it makes sure that you're looking at a valid _section_. See the pfn_valid() at ~include/linux/mmzone.h:1222. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755847AbbLBWDs (ORCPT ); Wed, 2 Dec 2015 17:03:48 -0500 Received: from mail-yk0-f178.google.com ([209.85.160.178]:36653 "EHLO mail-yk0-f178.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750798AbbLBWDr (ORCPT ); Wed, 2 Dec 2015 17:03:47 -0500 MIME-Version: 1.0 In-Reply-To: <565F69FE.601@intel.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> <565F69FE.601@intel.com> Date: Wed, 2 Dec 2015 14:03:46 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Dave Hansen Cc: Toshi Kani , Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 2, 2015 at 2:00 PM, Dave Hansen wrote: > On 12/02/2015 12:54 PM, Dan Williams wrote: >> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote: >>> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: >> [..] >>>> >> The whole point of __get_user_page_fast() is to avoid the overhead of >>>> >> taking the mm semaphore to access the vma. 
_PAGE_SPECIAL simply tells >>>> >> __get_user_pages_fast that it needs to fallback to the >>>> >> __get_user_pages slow path. >>> > >>> > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), >>> > instead of VM_BUG_ON. >> Is pfn_valid() a reliable check? It seems to be based on a max_pfn >> per node... what happens when pmem is located below that point. I >> haven't been able to convince myself that we won't get false >> positives, but maybe I'm missing something. > > With sparsemem at least, it makes sure that you're looking at a valid > _section_. See the pfn_valid() at ~include/linux/mmzone.h:1222. At a minimum we would need to add "depends on SPARSEMEM" to "config FS_DAX_PMD". From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756913AbbLBWJO (ORCPT ); Wed, 2 Dec 2015 17:09:14 -0500 Received: from mga02.intel.com ([134.134.136.20]:48784 "EHLO mga02.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755427AbbLBWJL (ORCPT ); Wed, 2 Dec 2015 17:09:11 -0500 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.20,375,1444719600"; d="scan'208";a="699061766" Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping To: Dan Williams References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> <565F69FE.601@intel.com> Cc: Toshi Kani , Andrew Morton , "Kirill A. 
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" From: Dave Hansen Message-ID: <565F6C06.9060208@intel.com> Date: Wed, 2 Dec 2015 14:09:10 -0800 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.3.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 12/02/2015 02:03 PM, Dan Williams wrote: >>> >> Is pfn_valid() a reliable check? It seems to be based on a max_pfn >>> >> per node... what happens when pmem is located below that point. I >>> >> haven't been able to convince myself that we won't get false >>> >> positives, but maybe I'm missing something. >> > >> > With sparsemem at least, it makes sure that you're looking at a valid >> > _section_. See the pfn_valid() at ~include/linux/mmzone.h:1222. > At a minimum we would need to add "depends on SPARSEMEM" to "config FS_DAX_PMD". Yeah, it seems like an awful layering violation. But, sparsemem is turned on everywhere (all the distros/users) that we care about, as far as I know. 
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757338AbbLBXeC (ORCPT ); Wed, 2 Dec 2015 18:34:02 -0500 Received: from mail-yk0-f179.google.com ([209.85.160.179]:36223 "EHLO mail-yk0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1756313AbbLBXd7 (ORCPT ); Wed, 2 Dec 2015 18:33:59 -0500 MIME-Version: 1.0 In-Reply-To: <1449102105.9855.15.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449102105.9855.15.camel@hpe.com> Date: Wed, 2 Dec 2015 15:33:58 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 2, 2015 at 4:21 PM, Toshi Kani wrote: > On Wed, 2015-12-02 at 10:43 -0700, Toshi Kani wrote: >> On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote: >> > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani wrote: >> > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: > : >> > > > >> > > > Hey Toshi, >> > > > >> > > > I ended up fixing this differently with follow_pmd_devmap() introduced >> > > > in this series: >> > > > >> > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html >> > > > >> > > > Does the latest libnvdimm-pending branch [1] pass your test case? >> > > >> > > Hi Dan, >> > > >> > > I ran several test cases, and they all hit the case "pfn not in memmap" in >> > > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, >> > > PFN_DEV is set but PFN_MAP is not. 
I have not looked into why, but I >> > > thought I let you know first. I've also seen the test thread got hung up >> > > at the end sometime. >> > >> > That PFN_MAP flag will not be set by default for NFIT-defined >> > persistent memory. See pmem_should_map_pages() for pmem namespaces >> > that will have it set by default, currently only e820 type-12 memory >> > ranges. >> > >> > NFIT-defined persistent memory can have a memmap array dynamically >> > allocated by setting up a pfn device (similar to setting up a btt). >> > We don't map it by default because the NFIT may describe hundreds of >> > gigabytes of persistent and the overhead of the memmap may be too >> > large to locate the memmap in ram. >> >> Oh, I see. I will setup the memmap array and run the tests again. > > I setup a pfn device, and ran a few test cases again. Yes, it solved the > PFN_MAP issue. However, I am no longer able to allocate FS blocks aligned by > 2MB, so PMD faults fall back to PTE. They are off by 2 pages, which I suspect > due to the pfn metadata. If I pass a 2MB-aligned+2pages virtual address to > mmap(MAP_POPULATE), the mmap() call gets hung up. Ok, I need to switch over from my memmap=ss!nn config. We just need to pad the info block reservation to 2M. As for the MAP_POPULATE hang, I'll take a look. Right now I'm in the process of rebasing the whole set on top of -mm, which has pending THP rework from Kirill. From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1759479AbbLBX0r (ORCPT ); Wed, 2 Dec 2015 18:26:47 -0500 Received: from g4t3425.houston.hp.com ([15.201.208.53]:51664 "EHLO g4t3425.houston.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1757339AbbLBX0n (ORCPT ); Wed, 2 Dec 2015 18:26:43 -0500 Message-ID: <1449102105.9855.15.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A.
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Wed, 02 Dec 2015 17:21:45 -0700 In-Reply-To: <1449078237.31589.30.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, 2015-12-02 at 10:43 -0700, Toshi Kani wrote: > On Tue, 2015-12-01 at 19:45 -0800, Dan Williams wrote: > > On Tue, Dec 1, 2015 at 6:19 PM, Toshi Kani wrote: > > > On Mon, 2015-11-30 at 14:08 -0800, Dan Williams wrote: : > > > > > > > > Hey Toshi, > > > > > > > > I ended up fixing this differently with follow_pmd_devmap() introduced > > > > in this series: > > > > > > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/003033.html > > > > > > > > Does the latest libnvdimm-pending branch [1] pass your test case? > > > > > > Hi Dan, > > > > > > I ran several test cases, and they all hit the case "pfn not in memmap" in > > > __dax_pmd_fault() during mmap(MAP_POPULATE). Looking at the dax.pfn, > > > PFN_DEV is set but PFN_MAP is not. I have not looked into why, but I > > > thought I let you know first. I've also seen the test thread got hung up > > > at the end sometime. > > > > That PFN_MAP flag will not be set by default for NFIT-defined > > persistent memory. See pmem_should_map_pages() for pmem namespaces > > that will have it set by default, currently only e820 type-12 memory > > ranges. > > > > NFIT-defined persistent memory can have a memmap array dynamically > > allocated by setting up a pfn device (similar to setting up a btt). 
> > We don't map it by default because the NFIT may describe hundreds of > > gigabytes of persistent and the overhead of the memmap may be too > > large to locate the memmap in ram. > > Oh, I see. I will setup the memmap array and run the tests again. I set up a pfn device and ran a few test cases again. Yes, it solved the PFN_MAP issue. However, I am no longer able to allocate FS blocks aligned to 2MB, so PMD faults fall back to PTE. They are off by 2 pages, which I suspect is due to the pfn metadata. If I pass a 2MB-aligned+2pages virtual address to mmap(MAP_POPULATE), the mmap() call hangs. Thanks, -Toshi From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1753447AbbLCXnt (ORCPT ); Thu, 3 Dec 2015 18:43:49 -0500 Received: from mail-yk0-f174.google.com ([209.85.160.174]:35556 "EHLO mail-yk0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751141AbbLCXnr (ORCPT ); Thu, 3 Dec 2015 18:43:47 -0500 MIME-Version: 1.0 In-Reply-To: <1449093339.9855.1.camel@hpe.com> References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> <1449093339.9855.1.camel@hpe.com> Date: Thu, 3 Dec 2015 15:43:46 -0800 Message-ID: Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Dan Williams To: Toshi Kani Cc: Andrew Morton , "Kirill A.
Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Dec 2, 2015 at 1:55 PM, Toshi Kani wrote: > On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote: >> On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani wrote: >> > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: >> [..] >> > > The whole point of __get_user_page_fast() is to avoid the overhead of >> > > taking the mm semaphore to access the vma. _PAGE_SPECIAL simply tells >> > > __get_user_pages_fast that it needs to fallback to the >> > > __get_user_pages slow path. >> > >> > I see. Then, I think gup_huge_pmd() can simply return 0 when !pfn_valid(), >> > instead of VM_BUG_ON. >> >> Is pfn_valid() a reliable check? It seems to be based on a max_pfn >> per node... what happens when pmem is located below that point. I >> haven't been able to convince myself that we won't get false >> positives, but maybe I'm missing something. > > I believe we use the version of pfn_valid() in linux/mmzone.h. Talking this over with Dave we came to the conclusion that it would be safer to be explicit about the pmd not being mapped. He points out that unless a platform can guarantee that persistent memory is always section aligned we might get false positive pfn_valid() indications. Given the get_user_pages_fast() path is arch specific we can simply have an arch specific pmd bit and not worry about generically enabling a "pmd special" bit for now. 
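[Editorial sketch, not part of the original message.] The false-positive concern can be illustrated with a toy model of a section-granular validity check. It assumes 4 KiB pages and 128 MiB sections (2^15 pfns per section), and all names (`model_sparsemem`, `model_add_memmap`, `model_pfn_valid`) are hypothetical; this is not the kernel's implementation, only a model of the alignment argument made above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Assumed for illustration: 4 KiB pages, 128 MiB sections => 2^15 pfns each. */
#define MODEL_PFN_SECTION_SHIFT 15
#define MODEL_NR_SECTIONS       64

/* Hypothetical model: one "has a memmap" flag per section, nothing finer. */
struct model_sparsemem {
    bool present[MODEL_NR_SECTIONS];
};

static void model_init(struct model_sparsemem *sm)
{
    memset(sm, 0, sizeof(*sm));
}

/* Mark every section that overlaps [start_pfn, end_pfn) as having a memmap. */
static void model_add_memmap(struct model_sparsemem *sm,
                             uint64_t start_pfn, uint64_t end_pfn)
{
    uint64_t first, last, s;

    assert(start_pfn < end_pfn);
    first = start_pfn >> MODEL_PFN_SECTION_SHIFT;
    last = (end_pfn - 1) >> MODEL_PFN_SECTION_SHIFT;
    assert(last < MODEL_NR_SECTIONS);

    for (s = first; s <= last; s++)
        sm->present[s] = true;
}

/* Section-granular check: answers for the whole section, not for the pfn. */
static bool model_pfn_valid(const struct model_sparsemem *sm, uint64_t pfn)
{
    uint64_t s = pfn >> MODEL_PFN_SECTION_SHIFT;

    return s < MODEL_NR_SECTIONS && sm->present[s];
}
```

If RAM with a memmap covers pfns [0, 40000), sections 0 and 1 are both marked present. A page-less pmem range that starts at pfn 50000 shares section 1 with the tail of RAM, so the model reports it as valid even though nothing was set up for it, whereas a range starting at the section boundary (pfn 65536) is correctly reported as invalid. An explicit per-pmd marker avoids depending on that alignment.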
From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754246AbbLDQAv (ORCPT ); Fri, 4 Dec 2015 11:00:51 -0500 Received: from g2t2355.austin.hp.com ([15.217.128.54]:39707 "EHLO g2t2355.austin.hp.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753309AbbLDQAt (ORCPT ); Fri, 4 Dec 2015 11:00:49 -0500 Message-ID: <1449248149.9855.85.camel@hpe.com> Subject: Re: [PATCH] mm: Fix mmap MAP_POPULATE for DAX pmd mapping From: Toshi Kani To: Dan Williams Cc: Andrew Morton , "Kirill A. Shutemov" , Matthew Wilcox , Ross Zwisler , mauricio.porto@hpe.com, Linux MM , linux-fsdevel , "linux-nvdimm@lists.01.org" , "linux-kernel@vger.kernel.org" Date: Fri, 04 Dec 2015 09:55:49 -0700 In-Reply-To: References: <1448309082-20851-1-git-send-email-toshi.kani@hpe.com> <1449022764.31589.24.camel@hpe.com> <1449078237.31589.30.camel@hpe.com> <1449084362.31589.37.camel@hpe.com> <1449086521.31589.39.camel@hpe.com> <1449087125.31589.45.camel@hpe.com> <1449092226.31589.50.camel@hpe.com> <1449093339.9855.1.camel@hpe.com> Content-Type: text/plain; charset="UTF-8" X-Mailer: Evolution 3.16.5 (3.16.5-3.fc22) Mime-Version: 1.0 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, 2015-12-03 at 15:43 -0800, Dan Williams wrote: > On Wed, Dec 2, 2015 at 1:55 PM, Toshi Kani wrote: > > On Wed, 2015-12-02 at 12:54 -0800, Dan Williams wrote: > > > On Wed, Dec 2, 2015 at 1:37 PM, Toshi Kani > > > wrote: > > > > On Wed, 2015-12-02 at 11:57 -0800, Dan Williams wrote: > > > [..] > > > > > The whole point of __get_user_page_fast() is to avoid the > > > > > overhead of taking the mm semaphore to access the vma. > > > > > _PAGE_SPECIAL simply tells > > > > > __get_user_pages_fast that it needs to fallback to the > > > > > __get_user_pages slow path. > > > > > > > > I see. 
Then, I think gup_huge_pmd() can simply return 0 when > > > > !pfn_valid(), instead of VM_BUG_ON. > > > > > > Is pfn_valid() a reliable check? It seems to be based on a max_pfn > > > per node... what happens when pmem is located below that point. I > > > haven't been able to convince myself that we won't get false > > > positives, but maybe I'm missing something. > > > > I believe we use the version of pfn_valid() in linux/mmzone.h. > > Talking this over with Dave we came to the conclusion that it would be > safer to be explicit about the pmd not being mapped. He points out > that unless a platform can guarantee that persistent memory is always > section aligned we might get false positive pfn_valid() indications. > Given the get_user_pages_fast() path is arch specific we can simply > have an arch specific pmd bit and not worry about generically enabling > a "pmd special" bit for now. Sounds good to me. Thanks! -Toshi