* dax pmd fault handler never returns to userspace @ 2015-11-18 15:53 Jeff Moyer 2015-11-18 15:56 ` Zwisler, Ross 2015-11-18 16:52 ` Dan Williams 0 siblings, 2 replies; 19+ messages in thread From: Jeff Moyer @ 2015-11-18 15:53 UTC (permalink / raw) To: linux-ext4, linux-nvdimm, linux-fsdevel; +Cc: ross.zwisler, Matthew R. Wilcox Hi, When running the nvml library's test suite against an ext4 file system mounted with -o dax, I ran into an issue where many of the tests would simply timeout. The problem appears to be that the pmd fault handler never returns to userspace (the application is doing a memcpy of 512 bytes into pmem). Here's the 'perf report -g' output: - 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back - 88.30% __memmove_ssse3_back - 66.63% page_fault - 66.47% do_page_fault - 66.16% __do_page_fault - 63.38% handle_mm_fault - 61.15% ext4_dax_pmd_fault - 45.04% __dax_pmd_fault - 37.05% vmf_insert_pfn_pmd - track_pfn_insert - 35.58% lookup_memtype - 33.80% pat_pagerange_is_ram - 33.40% walk_system_ram_range - 31.63% find_next_iomem_res 21.78% strcmp And here's 'perf top': Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519 Overhead Shared Object Symbol 22.55% [kernel] [k] strcmp 20.33% [unknown] [k] 0x00007f9f549ef3f3 10.01% [kernel] [k] native_irq_return_iret 9.54% [kernel] [k] find_next_iomem_res 3.00% [jbd2] [k] start_this_handle This is easily reproduced by doing the following: git clone https://github.com/pmem/nvml.git cd nvml make make test cd src/test/blk_non_zero ./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0 I also ran the test suite against xfs, and the problem is not present there. However, I did not verify that the xfs tests were getting pmd faults. I'm happy to help diagnose the problem further, if necessary. Cheers, Jeff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 15:53 dax pmd fault handler never returns to userspace Jeff Moyer @ 2015-11-18 15:56 ` Zwisler, Ross 2015-11-18 16:52 ` Dan Williams 1 sibling, 0 replies; 19+ messages in thread From: Zwisler, Ross @ 2015-11-18 15:56 UTC (permalink / raw) To: jmoyer@redhat.com Cc: linux-ext4@vger.kernel.org, willy@linux.intel.com, linux-nvdimm@ml01.01.org, linux-fsdevel@vger.kernel.org On Wed, 2015-11-18 at 10:53 -0500, Jeff Moyer wrote: > Hi, > > When running the nvml library's test suite against an ext4 file system > mounted with -o dax, I ran into an issue where many of the tests would > simply timeout. The problem appears to be that the pmd fault handler > never returns to userspace (the application is doing a memcpy of 512 > bytes into pmem). Here's the 'perf report -g' output: > > - 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back > - 88.30% __memmove_ssse3_back > - 66.63% page_fault > - 66.47% do_page_fault > - 66.16% __do_page_fault > - 63.38% handle_mm_fault > - 61.15% ext4_dax_pmd_fault > - 45.04% __dax_pmd_fault > - 37.05% vmf_insert_pfn_pmd > - track_pfn_insert > - 35.58% lookup_memtype > - 33.80% pat_pagerange_is_ram > - 33.40% walk_system_ram_range > - 31.63% find_next_iomem_res > 21.78% strcmp > > And here's 'perf top': > > Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519 > Overhead Shared Object Symbol > 22.55% [kernel] [k] strcmp > 20.33% [unknown] [k] 0x00007f9f549ef3f3 > 10.01% [kernel] [k] native_irq_return_iret > 9.54% [kernel] [k] find_next_iomem_res > 3.00% [jbd2] [k] start_this_handle > > This is easily reproduced by doing the following: > > git clone https://github.com/pmem/nvml.git > cd nvml > make > make test > cd src/test/blk_non_zero > ./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0 > > I also ran the test suite against xfs, and the problem is not present > there. However, I did not verify that the xfs tests were getting pmd > faults. > > I'm happy to help diagnose the problem further, if necessary. Thanks for the report, I'll take a look. - Ross ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 15:53 dax pmd fault handler never returns to userspace Jeff Moyer 2015-11-18 15:56 ` Zwisler, Ross @ 2015-11-18 16:52 ` Dan Williams 2015-11-18 17:00 ` Ross Zwisler 1 sibling, 1 reply; 19+ messages in thread From: Dan Williams @ 2015-11-18 16:52 UTC (permalink / raw) To: Jeff Moyer; +Cc: linux-ext4, linux-nvdimm, linux-fsdevel, Ross Zwisler On Wed, Nov 18, 2015 at 7:53 AM, Jeff Moyer <jmoyer@redhat.com> wrote: > Hi, > > When running the nvml library's test suite against an ext4 file system > mounted with -o dax, I ran into an issue where many of the tests would > simply timeout. The problem appears to be that the pmd fault handler > never returns to userspace (the application is doing a memcpy of 512 > bytes into pmem). Here's the 'perf report -g' output: > > - 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back > - 88.30% __memmove_ssse3_back > - 66.63% page_fault > - 66.47% do_page_fault > - 66.16% __do_page_fault > - 63.38% handle_mm_fault > - 61.15% ext4_dax_pmd_fault > - 45.04% __dax_pmd_fault > - 37.05% vmf_insert_pfn_pmd > - track_pfn_insert > - 35.58% lookup_memtype > - 33.80% pat_pagerange_is_ram > - 33.40% walk_system_ram_range > - 31.63% find_next_iomem_res > 21.78% strcmp > > And here's 'perf top': > > Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519 > Overhead Shared Object Symbol > 22.55% [kernel] [k] strcmp > 20.33% [unknown] [k] 0x00007f9f549ef3f3 > 10.01% [kernel] [k] native_irq_return_iret > 9.54% [kernel] [k] find_next_iomem_res > 3.00% [jbd2] [k] start_this_handle > > This is easily reproduced by doing the following: > > git clone https://github.com/pmem/nvml.git > cd nvml > make > make test > cd src/test/blk_non_zero > ./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0 > > I also ran the test suite against xfs, and the problem is not present > there. However, I did not verify that the xfs tests were getting pmd > faults. > > I'm happy to help diagnose the problem further, if necessary. Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal? https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 16:52 ` Dan Williams @ 2015-11-18 17:00 ` Ross Zwisler 2015-11-18 17:43 ` Jeff Moyer 0 siblings, 1 reply; 19+ messages in thread From: Ross Zwisler @ 2015-11-18 17:00 UTC (permalink / raw) To: Dan Williams Cc: Jeff Moyer, linux-fsdevel, linux-nvdimm, linux-ext4, Ross Zwisler On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote: > On Wed, Nov 18, 2015 at 7:53 AM, Jeff Moyer <jmoyer@redhat.com> wrote: > > Hi, > > > > When running the nvml library's test suite against an ext4 file system > > mounted with -o dax, I ran into an issue where many of the tests would > > simply timeout. The problem appears to be that the pmd fault handler > > never returns to userspace (the application is doing a memcpy of 512 > > bytes into pmem). Here's the 'perf report -g' output: > > > > - 88.30% 0.01% blk_non_zero.st libc-2.17.so [.] __memmove_ssse3_back > > - 88.30% __memmove_ssse3_back > > - 66.63% page_fault > > - 66.47% do_page_fault > > - 66.16% __do_page_fault > > - 63.38% handle_mm_fault > > - 61.15% ext4_dax_pmd_fault > > - 45.04% __dax_pmd_fault > > - 37.05% vmf_insert_pfn_pmd > > - track_pfn_insert > > - 35.58% lookup_memtype > > - 33.80% pat_pagerange_is_ram > > - 33.40% walk_system_ram_range > > - 31.63% find_next_iomem_res > > 21.78% strcmp > > > > And here's 'perf top': > > > > Samples: 2M of event 'cycles:pp', Event count (approx.): 56080150519 > > Overhead Shared Object Symbol > > 22.55% [kernel] [k] strcmp > > 20.33% [unknown] [k] 0x00007f9f549ef3f3 > > 10.01% [kernel] [k] native_irq_return_iret > > 9.54% [kernel] [k] find_next_iomem_res > > 3.00% [jbd2] [k] start_this_handle > > > > This is easily reproduced by doing the following: > > > > git clone https://github.com/pmem/nvml.git > > cd nvml > > make > > make test > > cd src/test/blk_non_zero > > ./blk_non_zero.static-nondebug 512 /path/to/ext4/dax/fs/testfile1 c 1073741824 w:0 > > > > I also ran the test suite against xfs, and the problem is not present > > there. However, I did not verify that the xfs tests were getting pmd > > faults. > > > > I'm happy to help diagnose the problem further, if necessary. > > Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal? > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html I was able to reproduce the issue in my setup with v4.3, and the patch from Yigal seems to solve it. Jeff, can you confirm? ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 17:00 ` Ross Zwisler @ 2015-11-18 17:43 ` Jeff Moyer 2015-11-18 18:10 ` Dan Williams 0 siblings, 1 reply; 19+ messages in thread From: Jeff Moyer @ 2015-11-18 17:43 UTC (permalink / raw) To: Ross Zwisler Cc: Dan Williams, linux-fsdevel, linux-nvdimm, linux-ext4, Ross Zwisler Ross Zwisler <ross.zwisler@linux.intel.com> writes: > On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote: >> Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal? >> >> https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html > > I was able to reproduce the issue in my setup with v4.3, and the patch from > Yigal seems to solve it. Jeff, can you confirm? I applied the patch from Yigal and the symptoms persist. Ross, what are you testing on? I'm using an NVDIMM-N. Dan, here's sysrq-l (which is what w used to look like, I think). Only cpu 3 is interesting: [ 825.339264] NMI backtrace for cpu 3 [ 825.356347] CPU: 3 PID: 13555 Comm: blk_non_zero.st Not tainted 4.4.0-rc1+ #17 [ 825.392056] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 06/09/2015 [ 825.424472] task: ffff880465bf6a40 ti: ffff88046133c000 task.ti: ffff88046133c000 [ 825.461480] RIP: 0010:[<ffffffff81329856>] [<ffffffff81329856>] strcmp+0x6/0x30 [ 825.497916] RSP: 0000:ffff88046133fbc8 EFLAGS: 00000246 [ 825.524836] RAX: 0000000000000000 RBX: ffff880c7fffd7c0 RCX: 000000076c800000 [ 825.566847] RDX: 000000076c800fff RSI: ffffffff818ea1c8 RDI: ffffffff818ea1c8 [ 825.605265] RBP: ffff88046133fbc8 R08: 0000000000000001 R09: ffff8804652300c0 [ 825.643628] R10: 00007f1b4fe0b000 R11: ffff880465230228 R12: ffffffff818ea1bd [ 825.681381] R13: 0000000000000001 R14: ffff88046133fc20 R15: 0000000080000200 [ 825.718607] FS: 00007f1b5102d880(0000) GS:ffff88046f8c0000(0000) knlGS:00000000000000 00 [ 825.761663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 825.792213] CR2: 00007f1b4fe0b000 CR3: 000000046b225000 CR4: 00000000001406e0 [ 825.830906] Stack: [ 825.841235] ffff88046133fc10 ffffffff81084610 000000076c800000 000000076c800fff [ 825.879533] 000000076c800fff 00000000ffffffff ffff88046133fc90 ffffffff8106d1d0 [ 825.916774] 000000000000000c ffff88046133fc80 ffffffff81084f0d 000000076c800000 [ 825.953220] Call Trace: [ 825.965386] [<ffffffff81084610>] find_next_iomem_res+0xd0/0x130 [ 825.996804] [<ffffffff8106d1d0>] ? pat_enabled+0x20/0x20 [ 826.024773] [<ffffffff81084f0d>] walk_system_ram_range+0x8d/0xf0 [ 826.055565] [<ffffffff8106d2d8>] pat_pagerange_is_ram+0x78/0xa0 [ 826.088971] [<ffffffff8106d475>] lookup_memtype+0x35/0xc0 [ 826.121385] [<ffffffff8106e33b>] track_pfn_insert+0x2b/0x60 [ 826.154600] [<ffffffff811e5523>] vmf_insert_pfn_pmd+0xb3/0x210 [ 826.187992] [<ffffffff8124acab>] __dax_pmd_fault+0x3cb/0x610 [ 826.221337] [<ffffffffa0769910>] ? ext4_dax_mkwrite+0x20/0x20 [ext4] [ 826.259190] [<ffffffffa0769a4d>] ext4_dax_pmd_fault+0xcd/0x100 [ext4] [ 826.293414] [<ffffffff811b0af7>] handle_mm_fault+0x3b7/0x510 [ 826.323763] [<ffffffff81068f98>] __do_page_fault+0x188/0x3f0 [ 826.358186] [<ffffffff81069230>] do_page_fault+0x30/0x80 [ 826.391212] [<ffffffff8169c148>] page_fault+0x28/0x30 [ 826.420752] Code: 89 e5 74 09 48 83 c2 01 80 3a 00 75 f7 48 83 c6 01 0f b6 4e ff 48 83 c2 01 84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 <84> c0 74 18 48 83 c7 01 0f b6 47 ff 48 83 c6 01 3a 46 ff 74 eb The full output is large (48 cpus), so I'm going to be lazy and not cut-n-paste it here. Cheers, Jeff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 17:43 ` Jeff Moyer @ 2015-11-18 18:10 ` Dan Williams 2015-11-18 18:23 ` Ross Zwisler 2015-11-18 18:30 ` Jeff Moyer 0 siblings, 2 replies; 19+ messages in thread From: Dan Williams @ 2015-11-18 18:10 UTC (permalink / raw) To: Jeff Moyer Cc: Ross Zwisler, linux-fsdevel, linux-nvdimm, linux-ext4, Ross Zwisler On Wed, Nov 18, 2015 at 9:43 AM, Jeff Moyer <jmoyer@redhat.com> wrote: > Ross Zwisler <ross.zwisler@linux.intel.com> writes: > >> On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote: >>> Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal? >>> >>> https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html >> >> I was able to reproduce the issue in my setup with v4.3, and the patch from >> Yigal seems to solve it. Jeff, can you confirm? > > I applied the patch from Yigal and the symptoms persist. Ross, what are > you testing on? I'm using an NVDIMM-N. > > Dan, here's sysrq-l (which is what w used to look like, I think). Only > cpu 3 is interesting: > > [ 825.339264] NMI backtrace for cpu 3 > [ 825.356347] CPU: 3 PID: 13555 Comm: blk_non_zero.st Not tainted 4.4.0-rc1+ #17 > [ 825.392056] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 06/09/2015 > [ 825.424472] task: ffff880465bf6a40 ti: ffff88046133c000 task.ti: ffff88046133c000 > [ 825.461480] RIP: 0010:[<ffffffff81329856>] [<ffffffff81329856>] strcmp+0x6/0x30 > [ 825.497916] RSP: 0000:ffff88046133fbc8 EFLAGS: 00000246 > [ 825.524836] RAX: 0000000000000000 RBX: ffff880c7fffd7c0 RCX: 000000076c800000 > [ 825.566847] RDX: 000000076c800fff RSI: ffffffff818ea1c8 RDI: ffffffff818ea1c8 > [ 825.605265] RBP: ffff88046133fbc8 R08: 0000000000000001 R09: ffff8804652300c0 > [ 825.643628] R10: 00007f1b4fe0b000 R11: ffff880465230228 R12: ffffffff818ea1bd > [ 825.681381] R13: 0000000000000001 R14: ffff88046133fc20 R15: 0000000080000200 > [ 825.718607] FS: 00007f1b5102d880(0000) GS:ffff88046f8c0000(0000) knlGS:00000000000000 > 00 > [ 825.761663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 825.792213] CR2: 00007f1b4fe0b000 CR3: 000000046b225000 CR4: 00000000001406e0 > [ 825.830906] Stack: > [ 825.841235] ffff88046133fc10 ffffffff81084610 000000076c800000 000000076c800fff > [ 825.879533] 000000076c800fff 00000000ffffffff ffff88046133fc90 ffffffff8106d1d0 > [ 825.916774] 000000000000000c ffff88046133fc80 ffffffff81084f0d 000000076c800000 > [ 825.953220] Call Trace: > [ 825.965386] [<ffffffff81084610>] find_next_iomem_res+0xd0/0x130 > [ 825.996804] [<ffffffff8106d1d0>] ? pat_enabled+0x20/0x20 > [ 826.024773] [<ffffffff81084f0d>] walk_system_ram_range+0x8d/0xf0 > [ 826.055565] [<ffffffff8106d2d8>] pat_pagerange_is_ram+0x78/0xa0 > [ 826.088971] [<ffffffff8106d475>] lookup_memtype+0x35/0xc0 > [ 826.121385] [<ffffffff8106e33b>] track_pfn_insert+0x2b/0x60 > [ 826.154600] [<ffffffff811e5523>] vmf_insert_pfn_pmd+0xb3/0x210 > [ 826.187992] [<ffffffff8124acab>] __dax_pmd_fault+0x3cb/0x610 > [ 826.221337] [<ffffffffa0769910>] ? 
ext4_dax_mkwrite+0x20/0x20 [ext4] > [ 826.259190] [<ffffffffa0769a4d>] ext4_dax_pmd_fault+0xcd/0x100 [ext4] > [ 826.293414] [<ffffffff811b0af7>] handle_mm_fault+0x3b7/0x510 > [ 826.323763] [<ffffffff81068f98>] __do_page_fault+0x188/0x3f0 > [ 826.358186] [<ffffffff81069230>] do_page_fault+0x30/0x80 > [ 826.391212] [<ffffffff8169c148>] page_fault+0x28/0x30 > [ 826.420752] Code: 89 e5 74 09 48 83 c2 01 80 3a 00 75 f7 48 83 c6 01 0f b6 4e ff 48 83 > c2 01 84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 <84> c0 74 18 48 83 c7 01 0f > b6 47 ff 48 83 c6 01 3a 46 ff 74 eb Hmm, a loop in the resource sibling list? What does /proc/iomem say? Not related to this bug, but lookup_memtype() looks broken for pmd mappings as we only check for PAGE_SIZE instead of HPAGE_SIZE. Which will cause problems if we're straddling the end of memory. > The full output is large (48 cpus), so I'm going to be lazy and not > cut-n-paste it here. Thanks for that ;-) ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 18:10 ` Dan Williams @ 2015-11-18 18:23 ` Ross Zwisler 2015-11-18 18:32 ` Jeff Moyer 2015-11-18 21:33 ` Toshi Kani 2015-11-18 18:30 ` Jeff Moyer 1 sibling, 2 replies; 19+ messages in thread From: Ross Zwisler @ 2015-11-18 18:23 UTC (permalink / raw) To: Dan Williams Cc: Jeff Moyer, Ross Zwisler, linux-fsdevel, linux-nvdimm, linux-ext4, Ross Zwisler On Wed, Nov 18, 2015 at 10:10:45AM -0800, Dan Williams wrote: > On Wed, Nov 18, 2015 at 9:43 AM, Jeff Moyer <jmoyer@redhat.com> wrote: > > Ross Zwisler <ross.zwisler@linux.intel.com> writes: > > > >> On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote: > >>> Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal? > >>> > >>> https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html > >> > >> I was able to reproduce the issue in my setup with v4.3, and the patch from > >> Yigal seems to solve it. Jeff, can you confirm? > > > > I applied the patch from Yigal and the symptoms persist. Ross, what are > > you testing on? I'm using an NVDIMM-N. > > > > Dan, here's sysrq-l (which is what w used to look like, I think). Only > > cpu 3 is interesting: > > > > [ 825.339264] NMI backtrace for cpu 3 > > [ 825.356347] CPU: 3 PID: 13555 Comm: blk_non_zero.st Not tainted 4.4.0-rc1+ #17 > > [ 825.392056] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 06/09/2015 > > [ 825.424472] task: ffff880465bf6a40 ti: ffff88046133c000 task.ti: ffff88046133c000 > > [ 825.461480] RIP: 0010:[<ffffffff81329856>] [<ffffffff81329856>] strcmp+0x6/0x30 > > [ 825.497916] RSP: 0000:ffff88046133fbc8 EFLAGS: 00000246 > > [ 825.524836] RAX: 0000000000000000 RBX: ffff880c7fffd7c0 RCX: 000000076c800000 > > [ 825.566847] RDX: 000000076c800fff RSI: ffffffff818ea1c8 RDI: ffffffff818ea1c8 > > [ 825.605265] RBP: ffff88046133fbc8 R08: 0000000000000001 R09: ffff8804652300c0 > > [ 825.643628] R10: 00007f1b4fe0b000 R11: ffff880465230228 R12: ffffffff818ea1bd > > [ 825.681381] R13: 0000000000000001 R14: ffff88046133fc20 R15: 0000000080000200 > > [ 825.718607] FS: 00007f1b5102d880(0000) GS:ffff88046f8c0000(0000) knlGS:00000000000000 > > 00 > > [ 825.761663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 825.792213] CR2: 00007f1b4fe0b000 CR3: 000000046b225000 CR4: 00000000001406e0 > > [ 825.830906] Stack: > > [ 825.841235] ffff88046133fc10 ffffffff81084610 000000076c800000 000000076c800fff > > [ 825.879533] 000000076c800fff 00000000ffffffff ffff88046133fc90 ffffffff8106d1d0 > > [ 825.916774] 000000000000000c ffff88046133fc80 ffffffff81084f0d 000000076c800000 > > [ 825.953220] Call Trace: > > [ 825.965386] [<ffffffff81084610>] find_next_iomem_res+0xd0/0x130 > > [ 825.996804] [<ffffffff8106d1d0>] ? pat_enabled+0x20/0x20 > > [ 826.024773] [<ffffffff81084f0d>] walk_system_ram_range+0x8d/0xf0 > > [ 826.055565] [<ffffffff8106d2d8>] pat_pagerange_is_ram+0x78/0xa0 > > [ 826.088971] [<ffffffff8106d475>] lookup_memtype+0x35/0xc0 > > [ 826.121385] [<ffffffff8106e33b>] track_pfn_insert+0x2b/0x60 > > [ 826.154600] [<ffffffff811e5523>] vmf_insert_pfn_pmd+0xb3/0x210 > > [ 826.187992] [<ffffffff8124acab>] __dax_pmd_fault+0x3cb/0x610 > > [ 826.221337] [<ffffffffa0769910>] ? 
ext4_dax_mkwrite+0x20/0x20 [ext4] > > [ 826.259190] [<ffffffffa0769a4d>] ext4_dax_pmd_fault+0xcd/0x100 [ext4] > > [ 826.293414] [<ffffffff811b0af7>] handle_mm_fault+0x3b7/0x510 > > [ 826.323763] [<ffffffff81068f98>] __do_page_fault+0x188/0x3f0 > > [ 826.358186] [<ffffffff81069230>] do_page_fault+0x30/0x80 > > [ 826.391212] [<ffffffff8169c148>] page_fault+0x28/0x30 > > [ 826.420752] Code: 89 e5 74 09 48 83 c2 01 80 3a 00 75 f7 48 83 c6 01 0f b6 4e ff 48 83 > > c2 01 84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 <84> c0 74 18 48 83 c7 01 0f > > b6 47 ff 48 83 c6 01 3a 46 ff 74 eb > > Hmm, a loop in the resource sibling list? > > What does /proc/iomem say? > > Not related to this bug, but lookup_memtype() looks broken for pmd > mappings as we only check for PAGE_SIZE instead of HPAGE_SIZE. Which > will cause problems if we're straddling the end of memory. > > > The full output is large (48 cpus), so I'm going to be lazy and not > > cut-n-paste it here. > > Thanks for that ;-) Yea, my first round of testing was broken, sorry about that. It looks like this test causes the PMD fault handler to be called repeatedly over and over until you kill the userspace process. This doesn't happen for XFS because when using XFS this test doesn't hit PMD faults, only PTE faults. So, looks like a livelock as far as I can tell. Still debugging. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 18:23 ` Ross Zwisler @ 2015-11-18 18:32 ` Jeff Moyer 2015-11-18 18:53 ` Ross Zwisler 2015-11-18 21:33 ` Toshi Kani 1 sibling, 1 reply; 19+ messages in thread From: Jeff Moyer @ 2015-11-18 18:32 UTC (permalink / raw) To: Ross Zwisler Cc: Dan Williams, linux-fsdevel, linux-nvdimm, linux-ext4, Ross Zwisler Ross Zwisler <ross.zwisler@linux.intel.com> writes: > Yea, my first round of testing was broken, sorry about that. > > It looks like this test causes the PMD fault handler to be called repeatedly > over and over until you kill the userspace process. This doesn't happen for > XFS because when using XFS this test doesn't hit PMD faults, only PTE faults. Hmm, I wonder why not? Sounds like that will need investigating as well, right? -Jeff > So, looks like a livelock as far as I can tell. > > Still debugging. Thanks! Jeff ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 18:32 ` Jeff Moyer @ 2015-11-18 18:53 ` Ross Zwisler 2015-11-18 18:58 ` Dan Williams 0 siblings, 1 reply; 19+ messages in thread From: Ross Zwisler @ 2015-11-18 18:53 UTC (permalink / raw) To: Jeff Moyer Cc: Ross Zwisler, Dan Williams, linux-fsdevel, linux-nvdimm, linux-ext4, Ross Zwisler On Wed, Nov 18, 2015 at 01:32:46PM -0500, Jeff Moyer wrote: > Ross Zwisler <ross.zwisler@linux.intel.com> writes: > > > Yea, my first round of testing was broken, sorry about that. > > > > It looks like this test causes the PMD fault handler to be called repeatedly > > over and over until you kill the userspace process. This doesn't happen for > > XFS because when using XFS this test doesn't hit PMD faults, only PTE faults. > > Hmm, I wonder why not? Well, whether or not you get PMDs is dependent on the block allocator for the filesystem. We ask the FS how much space is contiguous via get_blocks(), and if it's less than PMD_SIZE (2 MiB) we fall back to the regular 4k page fault path. This code all lives in __dax_pmd_fault(). There are also a bunch of other reasons why we'd fall back to 4k faults - the virtual address isn't 2 MiB aligned, etc. It's actually pretty hard to get everything right so you actually get PMD faults. Anyway, my guess is that we're failing to meet one of our criteria in XFS, so we just always fall back to PTEs for this test. > Sounds like that will need investigating as well, right? Yep, on it. ^ permalink raw reply [flat|nested] 19+ messages in thread
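The fallback logic described in the message above can be sketched roughly as follows. This is an illustrative stand-alone C fragment, not the kernel's actual __dax_pmd_fault(); the function name, parameters, and the exact set of checks are assumptions made for the sake of the example.

/*
 * Illustrative sketch only.  A PMD (2 MiB) mapping is attempted only when
 * the 2 MiB aligned range around the faulting address is fully covered by
 * the VMA and the filesystem reported at least 2 MiB of contiguous blocks
 * through get_blocks().
 */
#include <stdbool.h>

#define PMD_SZ (2UL << 20)	/* 2 MiB */

static bool pmd_fault_possible(unsigned long fault_addr,
			       unsigned long vma_start, unsigned long vma_end,
			       unsigned long fs_contig_bytes)
{
	unsigned long pmd_addr = fault_addr & ~(PMD_SZ - 1);

	if (pmd_addr < vma_start)		/* aligned start falls outside the VMA */
		return false;
	if (pmd_addr + PMD_SZ > vma_end)	/* aligned end falls outside the VMA */
		return false;
	if (fs_contig_bytes < PMD_SZ)		/* less than 2 MiB contiguous on disk */
		return false;
	return true;				/* otherwise a huge mapping can be tried */
}

In practice there are further conditions (for instance the physical block start must itself be suitably aligned), which is why falling back to 4k faults is the common case.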
* Re: dax pmd fault handler never returns to userspace 2015-11-18 18:53 ` Ross Zwisler @ 2015-11-18 18:58 ` Dan Williams 2015-11-19 22:34 ` Dave Chinner 0 siblings, 1 reply; 19+ messages in thread From: Dan Williams @ 2015-11-18 18:58 UTC (permalink / raw) To: Ross Zwisler, Jeff Moyer, Dan Williams, linux-fsdevel, linux-nvdimm, linux-ext4, Ross Zwisler On Wed, Nov 18, 2015 at 10:53 AM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Wed, Nov 18, 2015 at 01:32:46PM -0500, Jeff Moyer wrote: >> Ross Zwisler <ross.zwisler@linux.intel.com> writes: >> >> > Yea, my first round of testing was broken, sorry about that. >> > >> > It looks like this test causes the PMD fault handler to be called repeatedly >> > over and over until you kill the userspace process. This doesn't happen for >> > XFS because when using XFS this test doesn't hit PMD faults, only PTE faults. >> >> Hmm, I wonder why not? > > Well, whether or not you get PMDs is dependent on the block allocator for the > filesystem. We ask the FS how much space is contiguous via get_blocks(), and > if it's less than PMD_SIZE (2 MiB) we fall back to the regular 4k page fault > path. This code all lives in __dax_pmd_fault(). There are also a bunch of > other reasons why we'd fall back to 4k faults - the virtual address isn't 2 > MiB aligned, etc. It's actually pretty hard to get everything right so you > actually get PMD faults. > > Anyway, my guess is that we're failing to meet one of our criteria in XFS, so > we just always fall back to PTEs for this test. > >> Sounds like that will need investigating as well, right? > > Yep, on it. XFS can do pmd faults just fine, you just need to use fiemap to find a 2MiB aligned physical offset. See the ndctl pmd test I posted. ^ permalink raw reply [flat|nested] 19+ messages in thread
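One way to check the physical alignment Dan mentions is the FIEMAP ioctl. The program below is a minimal sketch written for this thread, not the ndctl pmd test he refers to; it only inspects the first extent of a file.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <linux/fiemap.h>

int main(int argc, char **argv)
{
	struct fiemap *fm;
	unsigned long long phys;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}

	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* header plus room for the first extent */
	fm = calloc(1, sizeof(*fm) + sizeof(struct fiemap_extent));
	if (!fm) {
		perror("calloc");
		return 1;
	}
	fm->fm_length = ~0ULL;		/* map the whole file */
	fm->fm_extent_count = 1;	/* only the first extent is needed */

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0 || fm->fm_mapped_extents == 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}

	phys = fm->fm_extents[0].fe_physical;
	printf("first extent at physical %llu: %s\n", phys,
	       (phys & ((2ULL << 20) - 1)) ? "not 2 MiB aligned" : "2 MiB aligned");

	free(fm);
	close(fd);
	return 0;
}

Note that FIEMAP reports offsets relative to the block device, so the base alignment of the device itself (see Dave's follow-up below) still matters before a 2 MiB aligned file extent translates into a PMD-capable physical address.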
* Re: dax pmd fault handler never returns to userspace 2015-11-18 18:58 ` Dan Williams @ 2015-11-19 22:34 ` Dave Chinner 0 siblings, 0 replies; 19+ messages in thread From: Dave Chinner @ 2015-11-19 22:34 UTC (permalink / raw) To: Dan Williams Cc: Ross Zwisler, Jeff Moyer, linux-fsdevel, linux-nvdimm, linux-ext4, Ross Zwisler On Wed, Nov 18, 2015 at 10:58:29AM -0800, Dan Williams wrote: > On Wed, Nov 18, 2015 at 10:53 AM, Ross Zwisler > <ross.zwisler@linux.intel.com> wrote: > > On Wed, Nov 18, 2015 at 01:32:46PM -0500, Jeff Moyer wrote: > >> Ross Zwisler <ross.zwisler@linux.intel.com> writes: > >> > >> > Yea, my first round of testing was broken, sorry about that. > >> > > >> > It looks like this test causes the PMD fault handler to be called repeatedly > >> > over and over until you kill the userspace process. This doesn't happen for > >> > XFS because when using XFS this test doesn't hit PMD faults, only PTE faults. > >> > >> Hmm, I wonder why not? > > > > Well, whether or not you get PMDs is dependent on the block allocator for the > > filesystem. We ask the FS how much space is contiguous via get_blocks(), and > > if it's less than PMD_SIZE (2 MiB) we fall back to the regular 4k page fault > > path. This code all lives in __dax_pmd_fault(). There are also a bunch of > > other reasons why we'd fall back to 4k faults - the virtual address isn't 2 > > MiB aligned, etc. It's actually pretty hard to get everything right so you > > actually get PMD faults. > > > > Anyway, my guess is that we're failing to meet one of our criteria in XFS, so > > we just always fall back to PTEs for this test. > > > >> Sounds like that will need investigating as well, right? > > > > Yep, on it. > > XFS can do pmd faults just fine, you just need to use fiemap to find a > 2MiB aligned physical offset. See the ndctl pmd test I posted. This comes under the topic of "XFS and Storage Alignment 101". there's nothing new here and it's just like aligning your filesystem to RAID5/6 geometries for optimal sequential IO patterns: # mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0 .... # mount /dev/pmem0 /mnt/xfs # xfs_io -c "extsize 2m" /mnt/xfs And now XFS will allocate strip unit (2MB) aligned extents of 2MB in all files created in that filesystem. Now all you have to care about is correctly aligning the base address of /dev/pmem0 to 2MB so that all the stripe units (and hence file extent allocations) are correctly aligned to the page tables. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 18:23 ` Ross Zwisler 2015-11-18 18:32 ` Jeff Moyer @ 2015-11-18 21:33 ` Toshi Kani 2015-11-18 21:57 ` Dan Williams 1 sibling, 1 reply; 19+ messages in thread From: Toshi Kani @ 2015-11-18 21:33 UTC (permalink / raw) To: Ross Zwisler, Dan Williams Cc: linux-nvdimm, Ross Zwisler, linux-fsdevel, linux-ext4 [-- Attachment #1: Type: text/plain, Size: 4910 bytes --] On Wed, 2015-11-18 at 11:23 -0700, Ross Zwisler wrote: > On Wed, Nov 18, 2015 at 10:10:45AM -0800, Dan Williams wrote: > > On Wed, Nov 18, 2015 at 9:43 AM, Jeff Moyer <jmoyer@redhat.com> wrote: > > > Ross Zwisler <ross.zwisler@linux.intel.com> writes: > > > > > > > On Wed, Nov 18, 2015 at 08:52:59AM -0800, Dan Williams wrote: > > > > > Sysrq-t or sysrq-w dump? Also do you have the locking fix from Yigal? > > > > > > > > > > https://lists.01.org/pipermail/linux-nvdimm/2015-November/002842.html > > > > > > > > I was able to reproduce the issue in my setup with v4.3, and the patch > > > > from > > > > Yigal seems to solve it. Jeff, can you confirm? > > > > > > I applied the patch from Yigal and the symptoms persist. Ross, what are > > > you testing on? I'm using an NVDIMM-N. > > > > > > Dan, here's sysrq-l (which is what w used to look like, I think). Only > > > cpu 3 is interesting: > > > > > > [ 825.339264] NMI backtrace for cpu 3 > > > [ 825.356347] CPU: 3 PID: 13555 Comm: blk_non_zero.st Not tainted 4.4.0 > > > -rc1+ #17 > > > [ 825.392056] Hardware name: HP ProLiant DL380 Gen9, BIOS P89 06/09/2015 > > > [ 825.424472] task: ffff880465bf6a40 ti: ffff88046133c000 task.ti: > > > ffff88046133c000 > > > [ 825.461480] RIP: 0010:[<ffffffff81329856>] [<ffffffff81329856>] > > > strcmp+0x6/0x30 > > > [ 825.497916] RSP: 0000:ffff88046133fbc8 EFLAGS: 00000246 > > > [ 825.524836] RAX: 0000000000000000 RBX: ffff880c7fffd7c0 RCX: > > > 000000076c800000 > > > [ 825.566847] RDX: 000000076c800fff RSI: ffffffff818ea1c8 RDI: > > > ffffffff818ea1c8 > > > [ 825.605265] RBP: ffff88046133fbc8 R08: 0000000000000001 R09: > > > ffff8804652300c0 > > > [ 825.643628] R10: 00007f1b4fe0b000 R11: ffff880465230228 R12: > > > ffffffff818ea1bd > > > [ 825.681381] R13: 0000000000000001 R14: ffff88046133fc20 R15: > > > 0000000080000200 > > > [ 825.718607] FS: 00007f1b5102d880(0000) GS:ffff88046f8c0000(0000) > > > knlGS:00000000000000 > > > 00 > > > [ 825.761663] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 825.792213] CR2: 00007f1b4fe0b000 CR3: 000000046b225000 CR4: > > > 00000000001406e0 > > > [ 825.830906] Stack: > > > [ 825.841235] ffff88046133fc10 ffffffff81084610 000000076c800000 > > > 000000076c800fff > > > [ 825.879533] 000000076c800fff 00000000ffffffff ffff88046133fc90 > > > ffffffff8106d1d0 > > > [ 825.916774] 000000000000000c ffff88046133fc80 ffffffff81084f0d > > > 000000076c800000 > > > [ 825.953220] Call Trace: > > > [ 825.965386] [<ffffffff81084610>] find_next_iomem_res+0xd0/0x130 > > > [ 825.996804] [<ffffffff8106d1d0>] ? pat_enabled+0x20/0x20 > > > [ 826.024773] [<ffffffff81084f0d>] walk_system_ram_range+0x8d/0xf0 > > > [ 826.055565] [<ffffffff8106d2d8>] pat_pagerange_is_ram+0x78/0xa0 > > > [ 826.088971] [<ffffffff8106d475>] lookup_memtype+0x35/0xc0 > > > [ 826.121385] [<ffffffff8106e33b>] track_pfn_insert+0x2b/0x60 > > > [ 826.154600] [<ffffffff811e5523>] vmf_insert_pfn_pmd+0xb3/0x210 > > > [ 826.187992] [<ffffffff8124acab>] __dax_pmd_fault+0x3cb/0x610 > > > [ 826.221337] [<ffffffffa0769910>] ? 
ext4_dax_mkwrite+0x20/0x20 [ext4] > > > [ 826.259190] [<ffffffffa0769a4d>] ext4_dax_pmd_fault+0xcd/0x100 [ext4] > > > [ 826.293414] [<ffffffff811b0af7>] handle_mm_fault+0x3b7/0x510 > > > [ 826.323763] [<ffffffff81068f98>] __do_page_fault+0x188/0x3f0 > > > [ 826.358186] [<ffffffff81069230>] do_page_fault+0x30/0x80 > > > [ 826.391212] [<ffffffff8169c148>] page_fault+0x28/0x30 > > > [ 826.420752] Code: 89 e5 74 09 48 83 c2 01 80 3a 00 75 f7 48 83 c6 01 0f > > > b6 4e ff 48 83 > > > c2 01 84 c9 88 4a ff 75 ed 5d c3 0f 1f 00 55 48 89 e5 eb 04 <84> c0 74 18 > > > 48 83 c7 01 0f > > > b6 47 ff 48 83 c6 01 3a 46 ff 74 eb > > > > Hmm, a loop in the resource sibling list? > > > > What does /proc/iomem say? > > > > Not related to this bug, but lookup_memtype() looks broken for pmd > > mappings as we only check for PAGE_SIZE instead of HPAGE_SIZE. Which > > will cause problems if we're straddling the end of memory. > > > > > The full output is large (48 cpus), so I'm going to be lazy and not > > > cut-n-paste it here. > > > > Thanks for that ;-) > > Yea, my first round of testing was broken, sorry about that. > > It looks like this test causes the PMD fault handler to be called repeatedly > over and over until you kill the userspace process. This doesn't happen for > XFS because when using XFS this test doesn't hit PMD faults, only PTE faults. > > So, looks like a livelock as far as I can tell. > > Still debugging. I am seeing a similar/same problem in my test. I think the problem is that in case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() -> vmf_insert_pfn_pmd(), which is a no-op since the PMD is mapped already. We need WP handling for this PMD map. If it helps, I have attached change for follow_trans_huge_pmd(). I have not tested much, though. Thanks, -Toshi [-- Attachment #2: follow_pfn_pmd.patch --] [-- Type: text/x-patch, Size: 2272 bytes --] --- include/linux/mm.h | 2 ++ mm/gup.c | 24 ++++++++++++++++++++++++ mm/huge_memory.c | 8 ++++++++ 3 files changed, 34 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 00bad77..a427b88 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1084,6 +1084,8 @@ struct zap_details { struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte); +int follow_pfn_pmd(struct vm_area_struct *vma, unsigned long address, + pmd_t *pmd, unsigned int flags); int zap_vma_ptes(struct vm_area_struct *vma, unsigned long address, unsigned long size); diff --git a/mm/gup.c b/mm/gup.c index deafa2c..15135ee 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -34,6 +34,30 @@ static struct page *no_page_table(struct vm_area_struct *vma, return NULL; } +int follow_pfn_pmd(struct vm_area_struct *vma, unsigned long address, + pmd_t *pmd, unsigned int flags) +{ + /* No page to get reference */ + if (flags & FOLL_GET) + return -EFAULT; + + if (flags & FOLL_TOUCH) { + pmd_t entry = *pmd; + + if (flags & FOLL_WRITE) + entry = pmd_mkdirty(entry); + entry = pmd_mkyoung(entry); + + if (!pmd_same(*pmd, entry)) { + set_pmd_at(vma->vm_mm, address, pmd, entry); + update_mmu_cache_pmd(vma, address, pmd); + } + } + + /* Proper page table entry exists, but no corresponding struct page */ + return -EEXIST; +} + static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address, pte_t *pte, unsigned int flags) { diff --git a/mm/huge_memory.c b/mm/huge_memory.c index c29ddeb..41b277a 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1276,6 +1276,7 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, { struct mm_struct *mm = 
vma->vm_mm; struct page *page = NULL; + int ret; assert_spin_locked(pmd_lockptr(mm, pmd)); @@ -1290,6 +1291,13 @@ struct page *follow_trans_huge_pmd(struct vm_area_struct *vma, if ((flags & FOLL_NUMA) && pmd_protnone(*pmd)) goto out; + /* pfn map does not have struct page */ + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) { + ret = follow_pfn_pmd(vma, addr, pmd, flags); + page = ERR_PTR(ret); + goto out; + } + page = pmd_page(*pmd); VM_BUG_ON_PAGE(!PageHead(page), page); if (flags & FOLL_TOUCH) { ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 21:33 ` Toshi Kani @ 2015-11-18 21:57 ` Dan Williams 2015-11-18 22:04 ` Toshi Kani 0 siblings, 1 reply; 19+ messages in thread From: Dan Williams @ 2015-11-18 21:57 UTC (permalink / raw) To: Toshi Kani Cc: Ross Zwisler, linux-nvdimm, Ross Zwisler, linux-fsdevel, linux-ext4 On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > I am seeing a similar/same problem in my test. I think the problem is that in > case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() -> vmf_insert_pfn_pmd(), > which is a no-op since the PMD is mapped already. We need WP handling for this > PMD map. > > If it helps, I have attached change for follow_trans_huge_pmd(). I have not > tested much, though. Interesting, I didn't get this far because my tests were crashing the kernel. I'll add this case to the pmd fault test in ndctl. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 21:57 ` Dan Williams @ 2015-11-18 22:04 ` Toshi Kani 2015-11-19 0:36 ` Ross Zwisler 0 siblings, 1 reply; 19+ messages in thread From: Toshi Kani @ 2015-11-18 22:04 UTC (permalink / raw) To: Dan Williams Cc: Ross Zwisler, linux-nvdimm, Ross Zwisler, linux-fsdevel, linux-ext4 On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote: > On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > I am seeing a similar/same problem in my test. I think the problem is that > > in > > case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() -> > > vmf_insert_pfn_pmd(), > > which is a no-op since the PMD is mapped already. We need WP handling for > > this > > PMD map. > > > > If it helps, I have attached change for follow_trans_huge_pmd(). I have not > > tested much, though. > > Interesting, I didn't get this far because my tests were crashing the > kernel. I'll add this case the pmd fault test in ndctl. I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP fault loop when writing to the range. Thanks, -Toshi ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 22:04 ` Toshi Kani @ 2015-11-19 0:36 ` Ross Zwisler 2015-11-19 0:39 ` Dan Williams ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Ross Zwisler @ 2015-11-19 0:36 UTC (permalink / raw) To: Toshi Kani Cc: Dan Williams, Ross Zwisler, linux-nvdimm, Ross Zwisler, linux-fsdevel, linux-ext4 On Wed, Nov 18, 2015 at 03:04:41PM -0700, Toshi Kani wrote: > On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote: > > On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > > I am seeing a similar/same problem in my test. I think the problem is that > > > in > > > case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() -> > > > vmf_insert_pfn_pmd(), > > > which is a no-op since the PMD is mapped already. We need WP handling for > > > this > > > PMD map. > > > > > > If it helps, I have attached change for follow_trans_huge_pmd(). I have not > > > tested much, though. > > > > Interesting, I didn't get this far because my tests were crashing the > > kernel. I'll add this case the pmd fault test in ndctl. > > I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP > fault loop when writing to the range. Here's a fix - please let me know if this seems incomplete or incorrect for some reason. -- >8 -- >From 02aa9f37d7ec9c0c38413f7e304b2577eb9f974a Mon Sep 17 00:00:00 2001 From: Ross Zwisler <ross.zwisler@linux.intel.com> Date: Wed, 18 Nov 2015 17:15:09 -0700 Subject: [PATCH] mm: Allow DAX PMD mappings to become writeable Prior to this change DAX PMD mappings that were made read-only were never able to be made writable again. This is because the code in insert_pfn_pmd() that calls pmd_mkdirty() and pmd_mkwrite() would skip these calls if the PMD already existed in the page table. Instead, if we are doing a write always mark the PMD entry as dirty and writeable. Without this code we can get into a condition where we mark the PMD as read-only, and then on a subsequent write fault we get into an infinite loop of PMD faults where we try unsuccessfully to make the PMD writeable. Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> --- mm/huge_memory.c | 14 ++++++-------- 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/mm/huge_memory.c b/mm/huge_memory.c index bbac913..1b3df56 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, spinlock_t *ptl; ptl = pmd_lock(mm, pmd); - if (pmd_none(*pmd)) { - entry = pmd_mkhuge(pfn_pmd(pfn, prot)); - if (write) { - entry = pmd_mkyoung(pmd_mkdirty(entry)); - entry = maybe_pmd_mkwrite(entry, vma); - } - set_pmd_at(mm, addr, pmd, entry); - update_mmu_cache_pmd(vma, addr, pmd); + entry = pmd_mkhuge(pfn_pmd(pfn, prot)); + if (write) { + entry = pmd_mkyoung(pmd_mkdirty(entry)); + entry = maybe_pmd_mkwrite(entry, vma); } + set_pmd_at(mm, addr, pmd, entry); + update_mmu_cache_pmd(vma, addr, pmd); spin_unlock(ptl); } -- 2.6.3 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-19 0:36 ` Ross Zwisler @ 2015-11-19 0:39 ` Dan Williams 2015-11-19 1:05 ` Toshi Kani 2015-11-19 1:19 ` Dan Williams 2 siblings, 0 replies; 19+ messages in thread From: Dan Williams @ 2015-11-19 0:39 UTC (permalink / raw) To: Ross Zwisler, Toshi Kani, Dan Williams, linux-nvdimm, Ross Zwisler, linux-fsdevel, linux-ext4 On Wed, Nov 18, 2015 at 4:36 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Wed, Nov 18, 2015 at 03:04:41PM -0700, Toshi Kani wrote: >> On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote: >> > On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <toshi.kani@hpe.com> wrote: >> > > I am seeing a similar/same problem in my test. I think the problem is that >> > > in >> > > case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() -> >> > > vmf_insert_pfn_pmd(), >> > > which is a no-op since the PMD is mapped already. We need WP handling for >> > > this >> > > PMD map. >> > > >> > > If it helps, I have attached change for follow_trans_huge_pmd(). I have not >> > > tested much, though. >> > >> > Interesting, I didn't get this far because my tests were crashing the >> > kernel. I'll add this case the pmd fault test in ndctl. >> >> I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP >> fault loop when writing to the range. > > Here's a fix - please let me know if this seems incomplete or incorrect for > some reason. > > -- >8 -- > From 02aa9f37d7ec9c0c38413f7e304b2577eb9f974a Mon Sep 17 00:00:00 2001 > From: Ross Zwisler <ross.zwisler@linux.intel.com> > Date: Wed, 18 Nov 2015 17:15:09 -0700 > Subject: [PATCH] mm: Allow DAX PMD mappings to become writeable > > Prior to this change DAX PMD mappings that were made read-only were never able > to be made writable again. This is because the code in insert_pfn_pmd() that > calls pmd_mkdirty() and pmd_mkwrite() would skip these calls if the PMD > already existed in the page table. > > Instead, if we are doing a write always mark the PMD entry as dirty and > writeable. Without this code we can get into a condition where we mark the > PMD as read-only, and then on a subsequent write fault we get into an infinite > loop of PMD faults where we try unsuccessfully to make the PMD writeable. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > mm/huge_memory.c | 14 ++++++-------- > 1 file changed, 6 insertions(+), 8 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index bbac913..1b3df56 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, > spinlock_t *ptl; > > ptl = pmd_lock(mm, pmd); > - if (pmd_none(*pmd)) { > - entry = pmd_mkhuge(pfn_pmd(pfn, prot)); > - if (write) { > - entry = pmd_mkyoung(pmd_mkdirty(entry)); > - entry = maybe_pmd_mkwrite(entry, vma); > - } > - set_pmd_at(mm, addr, pmd, entry); > - update_mmu_cache_pmd(vma, addr, pmd); > + entry = pmd_mkhuge(pfn_pmd(pfn, prot)); > + if (write) { > + entry = pmd_mkyoung(pmd_mkdirty(entry)); > + entry = maybe_pmd_mkwrite(entry, vma); > } > + set_pmd_at(mm, addr, pmd, entry); > + update_mmu_cache_pmd(vma, addr, pmd); > spin_unlock(ptl); > } Looks good to me. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-19 0:36 ` Ross Zwisler 2015-11-19 0:39 ` Dan Williams @ 2015-11-19 1:05 ` Toshi Kani 2015-11-19 1:19 ` Dan Williams 2 siblings, 0 replies; 19+ messages in thread From: Toshi Kani @ 2015-11-19 1:05 UTC (permalink / raw) To: Ross Zwisler Cc: Dan Williams, linux-nvdimm, Ross Zwisler, linux-fsdevel, linux-ext4 On Wed, 2015-11-18 at 17:36 -0700, Ross Zwisler wrote: > On Wed, Nov 18, 2015 at 03:04:41PM -0700, Toshi Kani wrote: > > On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote: > > > On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <toshi.kani@hpe.com> wrote: > > > > I am seeing a similar/same problem in my test. I think the problem is > > > > that in case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() -> > > > > vmf_insert_pfn_pmd(), which is a no-op since the PMD is mapped already. > > > > We need WP handling for this PMD map. > > > > > > > > If it helps, I have attached change for follow_trans_huge_pmd(). I have > > > > not tested much, though. > > > > > > Interesting, I didn't get this far because my tests were crashing the > > > kernel. I'll add this case the pmd fault test in ndctl. > > > > I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP > > fault loop when writing to the range. > > Here's a fix - please let me know if this seems incomplete or incorrect for > some reason. My test looks working now. :-) I will do more testing and submit the gup patch as well. Thanks, -Toshi ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-19 0:36 ` Ross Zwisler 2015-11-19 0:39 ` Dan Williams 2015-11-19 1:05 ` Toshi Kani @ 2015-11-19 1:19 ` Dan Williams 2 siblings, 0 replies; 19+ messages in thread From: Dan Williams @ 2015-11-19 1:19 UTC (permalink / raw) To: Ross Zwisler, Toshi Kani, Dan Williams, linux-nvdimm, Ross Zwisler, linux-fsdevel, linux-ext4 On Wed, Nov 18, 2015 at 4:36 PM, Ross Zwisler <ross.zwisler@linux.intel.com> wrote: > On Wed, Nov 18, 2015 at 03:04:41PM -0700, Toshi Kani wrote: >> On Wed, 2015-11-18 at 13:57 -0800, Dan Williams wrote: >> > On Wed, Nov 18, 2015 at 1:33 PM, Toshi Kani <toshi.kani@hpe.com> wrote: >> > > I am seeing a similar/same problem in my test. I think the problem is that >> > > in >> > > case of a WP fault, wp_huge_pmd() -> __dax_pmd_fault() -> >> > > vmf_insert_pfn_pmd(), >> > > which is a no-op since the PMD is mapped already. We need WP handling for >> > > this >> > > PMD map. >> > > >> > > If it helps, I have attached change for follow_trans_huge_pmd(). I have not >> > > tested much, though. >> > >> > Interesting, I didn't get this far because my tests were crashing the >> > kernel. I'll add this case the pmd fault test in ndctl. >> >> I hit this one with mmap(MAP_POPULATE). With this change, I then hit the WP >> fault loop when writing to the range. > > Here's a fix - please let me know if this seems incomplete or incorrect for > some reason. > > -- >8 -- > From 02aa9f37d7ec9c0c38413f7e304b2577eb9f974a Mon Sep 17 00:00:00 2001 > From: Ross Zwisler <ross.zwisler@linux.intel.com> > Date: Wed, 18 Nov 2015 17:15:09 -0700 > Subject: [PATCH] mm: Allow DAX PMD mappings to become writeable > > Prior to this change DAX PMD mappings that were made read-only were never able > to be made writable again. This is because the code in insert_pfn_pmd() that > calls pmd_mkdirty() and pmd_mkwrite() would skip these calls if the PMD > already existed in the page table. > > Instead, if we are doing a write always mark the PMD entry as dirty and > writeable. Without this code we can get into a condition where we mark the > PMD as read-only, and then on a subsequent write fault we get into an infinite > loop of PMD faults where we try unsuccessfully to make the PMD writeable. > > Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > mm/huge_memory.c | 14 ++++++-------- > 1 file changed, 6 insertions(+), 8 deletions(-) > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index bbac913..1b3df56 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -877,15 +877,13 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr, > spinlock_t *ptl; > > ptl = pmd_lock(mm, pmd); > - if (pmd_none(*pmd)) { > - entry = pmd_mkhuge(pfn_pmd(pfn, prot)); > - if (write) { > - entry = pmd_mkyoung(pmd_mkdirty(entry)); > - entry = maybe_pmd_mkwrite(entry, vma); > - } > - set_pmd_at(mm, addr, pmd, entry); > - update_mmu_cache_pmd(vma, addr, pmd); > + entry = pmd_mkhuge(pfn_pmd(pfn, prot)); > + if (write) { > + entry = pmd_mkyoung(pmd_mkdirty(entry)); > + entry = maybe_pmd_mkwrite(entry, vma); > } > + set_pmd_at(mm, addr, pmd, entry); > + update_mmu_cache_pmd(vma, addr, pmd); > spin_unlock(ptl); Hmm other paths that do pmd_mkwrite are using pmdp_set_access_flags() and it's not immediately clear to me why. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: dax pmd fault handler never returns to userspace 2015-11-18 18:10 ` Dan Williams 2015-11-18 18:23 ` Ross Zwisler @ 2015-11-18 18:30 ` Jeff Moyer 1 sibling, 0 replies; 19+ messages in thread From: Jeff Moyer @ 2015-11-18 18:30 UTC (permalink / raw) To: Dan Williams Cc: Ross Zwisler, linux-fsdevel, linux-nvdimm, linux-ext4, Ross Zwisler Dan Williams <dan.j.williams@intel.com> writes: > Hmm, a loop in the resource sibling list? > > What does /proc/iomem say? Inline, below. -Jeff # cat /proc/iomem 00000000-00000fff : reserved 00001000-00092fff : System RAM 00093000-00093fff : reserved 00094000-0009ffff : System RAM 000a0000-000bffff : PCI Bus 0000:00 000c4000-000cbfff : PCI Bus 0000:00 000f0000-000fffff : System ROM 00100000-6b4fcfff : System RAM 01000000-0169edea : Kernel code 0169edeb-01b3507f : Kernel data 01cf5000-0200ffff : Kernel bss 6b4fd000-6b97cfff : reserved 6b97d000-6b97dfff : System RAM 6b97e000-6b9fefff : reserved 6b9ff000-76d48017 : System RAM 76d48018-76d4d457 : System RAM 76d4d458-76d4e017 : System RAM 76d4e018-76d7dc57 : System RAM 76d7dc58-76d7e017 : System RAM 76d7e018-76dadc57 : System RAM 76dadc58-76dae017 : System RAM 76dae018-76dddc57 : System RAM 76dddc58-76dde017 : System RAM 76dde018-76e0dc57 : System RAM 76e0dc58-76e0e017 : System RAM 76e0e018-76e18057 : System RAM 76e18058-76e19017 : System RAM 76e19018-76e21057 : System RAM 76e21058-76e22017 : System RAM 76e22018-76e7aa57 : System RAM 76e7aa58-76e7b017 : System RAM 76e7b018-76ed3a57 : System RAM 76ed3a58-76ed4017 : System RAM 76ed4018-76ef0457 : System RAM 76ef0458-784fefff : System RAM 784ff000-791fefff : reserved 791c9004-791c902f : APEI ERST 791ca000-791d9fff : APEI ERST 791ff000-7b5fefff : ACPI Non-volatile Storage 7b5ff000-7b7fefff : ACPI Tables 7b7ff000-7b7fffff : System RAM 7b800000-7bffffff : RAM buffer 80000000-8fffffff : PCI MMCONFIG 0000 [bus 00-ff] 80000000-8fffffff : reserved 90000000-c7ffbfff : PCI Bus 0000:00 90000000-92afffff : PCI Bus 0000:01 90000000-9000ffff : 0000:01:00.2 91000000-91ffffff : 0000:01:00.1 91000000-91ffffff : mgadrmfb_vram 92000000-927fffff : 0000:01:00.1 92800000-928fffff : 0000:01:00.2 92800000-928fffff : hpilo 92900000-929fffff : 0000:01:00.2 92900000-929fffff : hpilo 92a00000-92a7ffff : 0000:01:00.2 92a00000-92a7ffff : hpilo 92a80000-92a87fff : 0000:01:00.2 92a80000-92a87fff : hpilo 92a88000-92a8bfff : 0000:01:00.1 92a88000-92a8bfff : mgadrmfb_mmio 92a8c000-92a8c0ff : 0000:01:00.2 92a8c000-92a8c0ff : hpilo 92a8d000-92a8d1ff : 0000:01:00.0 92b00000-92bfffff : PCI Bus 0000:02 92b00000-92b3ffff : 0000:02:00.0 92b40000-92b7ffff : 0000:02:00.1 92b80000-92bbffff : 0000:02:00.2 92bc0000-92bfffff : 0000:02:00.3 93000000-950fffff : PCI Bus 0000:04 93000000-937fffff : 0000:04:00.0 93000000-937fffff : bnx2x 93800000-93ffffff : 0000:04:00.0 93800000-93ffffff : bnx2x 94000000-947fffff : 0000:04:00.1 94000000-947fffff : bnx2x 94800000-94ffffff : 0000:04:00.1 94800000-94ffffff : bnx2x 95000000-9500ffff : 0000:04:00.0 95000000-9500ffff : bnx2x 95010000-9501ffff : 0000:04:00.1 95010000-9501ffff : bnx2x 95080000-950fffff : 0000:04:00.0 95100000-951fffff : PCI Bus 0000:02 95100000-9510ffff : 0000:02:00.3 95100000-9510ffff : tg3 95110000-9511ffff : 0000:02:00.3 95110000-9511ffff : tg3 95120000-9512ffff : 0000:02:00.3 95120000-9512ffff : tg3 95130000-9513ffff : 0000:02:00.2 95130000-9513ffff : tg3 95140000-9514ffff : 0000:02:00.2 95140000-9514ffff : tg3 95150000-9515ffff : 0000:02:00.2 95150000-9515ffff : tg3 95160000-9516ffff : 0000:02:00.1 95160000-9516ffff : tg3 95170000-9517ffff : 
0000:02:00.1 95170000-9517ffff : tg3 95180000-9518ffff : 0000:02:00.1 95180000-9518ffff : tg3 95190000-9519ffff : 0000:02:00.0 95190000-9519ffff : tg3 951a0000-951affff : 0000:02:00.0 951a0000-951affff : tg3 951b0000-951bffff : 0000:02:00.0 951b0000-951bffff : tg3 95200000-953fffff : PCI Bus 0000:03 95200000-952fffff : 0000:03:00.0 95200000-952fffff : hpsa 95300000-953003ff : 0000:03:00.0 95300000-953003ff : hpsa 95380000-953fffff : 0000:03:00.0 95400000-954007ff : 0000:00:1f.2 95400000-954007ff : ahci 95401000-954013ff : 0000:00:1d.0 95401000-954013ff : ehci_hcd 95402000-954023ff : 0000:00:1a.0 95402000-954023ff : ehci_hcd 95404000-95404fff : 0000:00:05.4 c7ffc000-c7ffcfff : dmar1 c8000000-fbffbfff : PCI Bus 0000:80 c8000000-c8000fff : 0000:80:05.4 fbffc000-fbffcfff : dmar0 fec00000-fecfffff : PNP0003:00 fec00000-fec003ff : IOAPIC 0 fec01000-fec013ff : IOAPIC 1 fec40000-fec403ff : IOAPIC 2 fed00000-fed003ff : HPET 0 fed00000-fed003ff : PNP0103:00 fed12000-fed1200f : pnp 00:01 fed12010-fed1201f : pnp 00:01 fed1b000-fed1bfff : pnp 00:01 fed1c000-fed3ffff : pnp 00:01 fed1f410-fed1f414 : iTCO_wdt.0.auto fed45000-fed8bfff : pnp 00:01 fee00000-feefffff : pnp 00:01 fee00000-fee00fff : Local APIC ff000000-ffffffff : pnp 00:01 100000000-47fffffff : System RAM 480000000-87fffffff : Persistent Memory 480000000-67fffffff : NVDM002C:00 480000000-67fffffff : btt0.1 680000000-87fffffff : NVDM002C:01 680000000-87fffffff : namespace1.0 880000000-c7fffffff : System RAM c80000000-107fffffff : Persistent Memory c80000000-e7fffffff : NVDM002C:02 c80000000-e7fffffff : namespace2.0 e80000000-107fffffff : NVDM002C:03 e80000000-107fffffff : btt3.1 38000000000-39fffffffff : PCI Bus 0000:00 39fffd00000-39fffefffff : PCI Bus 0000:04 39fffd00000-39fffd7ffff : 0000:04:00.1 39fffd80000-39fffdfffff : 0000:04:00.0 39fffe00000-39fffe1ffff : 0000:04:00.1 39fffe20000-39fffe3ffff : 0000:04:00.0 39ffff00000-39ffff0ffff : 0000:00:14.0 39ffff00000-39ffff0ffff : xhci-hcd 39ffff10000-39ffff13fff : 0000:00:04.7 39ffff10000-39ffff13fff : ioatdma 39ffff14000-39ffff17fff : 0000:00:04.6 39ffff14000-39ffff17fff : ioatdma 39ffff18000-39ffff1bfff : 0000:00:04.5 39ffff18000-39ffff1bfff : ioatdma 39ffff1c000-39ffff1ffff : 0000:00:04.4 39ffff1c000-39ffff1ffff : ioatdma 39ffff20000-39ffff23fff : 0000:00:04.3 39ffff20000-39ffff23fff : ioatdma 39ffff24000-39ffff27fff : 0000:00:04.2 39ffff24000-39ffff27fff : ioatdma 39ffff28000-39ffff2bfff : 0000:00:04.1 39ffff28000-39ffff2bfff : ioatdma 39ffff2c000-39ffff2ffff : 0000:00:04.0 39ffff2c000-39ffff2ffff : ioatdma 39ffff31000-39ffff310ff : 0000:00:1f.3 3a000000000-3bfffffffff : PCI Bus 0000:80 3bffff00000-3bffff03fff : 0000:80:04.7 3bffff00000-3bffff03fff : ioatdma 3bffff04000-3bffff07fff : 0000:80:04.6 3bffff04000-3bffff07fff : ioatdma 3bffff08000-3bffff0bfff : 0000:80:04.5 3bffff08000-3bffff0bfff : ioatdma 3bffff0c000-3bffff0ffff : 0000:80:04.4 3bffff0c000-3bffff0ffff : ioatdma 3bffff10000-3bffff13fff : 0000:80:04.3 3bffff10000-3bffff13fff : ioatdma 3bffff14000-3bffff17fff : 0000:80:04.2 3bffff14000-3bffff17fff : ioatdma 3bffff18000-3bffff1bfff : 0000:80:04.1 3bffff18000-3bffff1bfff : ioatdma 3bffff1c000-3bffff1ffff : 0000:80:04.0 3bffff1c000-3bffff1ffff : ioatdma ^ permalink raw reply [flat|nested] 19+ messages in thread