* synchronous page faults for XFS
@ 2017-08-24 15:22 Christoph Hellwig
  2017-08-24 15:22 ` [PATCH 1/3] iomap: return VM_FAULT_* codes from iomap_page_mkwrite Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Christoph Hellwig @ 2017-08-24 15:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, linux-xfs, Ross Zwisler,
	Dan Williams
Hi Jan,
the three patches below wire up the synchronous page fault code for XFS.
I've only verified it by making sure we always use the synchronous fault
path and then running xfstests, so don't use it for critical applications
just yet :)
This is against the current tree at:
    git://git.kernel.org//pub/scm/linux/kernel/git/jack/linux-fs.git dax_sync_fault
^ permalink raw reply	[flat|nested] 14+ messages in thread* [PATCH 1/3] iomap: return VM_FAULT_* codes from iomap_page_mkwrite 2017-08-24 15:22 synchronous page faults for XFS Christoph Hellwig @ 2017-08-24 15:22 ` Christoph Hellwig 2017-08-24 19:26 ` Ross Zwisler 2017-08-24 15:22 ` [PATCH 2/3] xfs: consolidate the various page fault handlers Christoph Hellwig 2017-08-24 15:22 ` [PATCH 3/3] xfs: support for synchronous DAX faults Christoph Hellwig 2 siblings, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2017-08-24 15:22 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, linux-xfs, Ross Zwisler, Dan Williams Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/iomap.c | 4 ++-- fs/xfs/xfs_file.c | 1 - 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/iomap.c b/fs/iomap.c index 039266128b7f..22ffdfeabeda 100644 --- a/fs/iomap.c +++ b/fs/iomap.c @@ -477,10 +477,10 @@ int iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops) set_page_dirty(page); wait_for_stable_page(page); - return 0; + return VM_FAULT_LOCKED; out_unlock: unlock_page(page); - return ret; + return block_page_mkwrite_return(ret); } EXPORT_SYMBOL_GPL(iomap_page_mkwrite); diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index afa226b8c290..4213f02325a2 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1035,7 +1035,6 @@ xfs_filemap_page_mkwrite( ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, &xfs_iomap_ops); } else { ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops); - ret = block_page_mkwrite_return(ret); } xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); -- 2.11.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 1/3] iomap: return VM_FAULT_* codes from iomap_page_mkwrite 2017-08-24 15:22 ` [PATCH 1/3] iomap: return VM_FAULT_* codes from iomap_page_mkwrite Christoph Hellwig @ 2017-08-24 19:26 ` Ross Zwisler 2017-08-25 7:10 ` Christoph Hellwig 0 siblings, 1 reply; 14+ messages in thread From: Ross Zwisler @ 2017-08-24 19:26 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, linux-fsdevel, linux-nvdimm, linux-xfs, Ross Zwisler, Dan Williams On Thu, Aug 24, 2017 at 05:22:05PM +0200, Christoph Hellwig wrote: > Signed-off-by: Christoph Hellwig <hch@lst.de> This is obviously correct, but the lack of changelog makes it unclear why the change is being made. Does a later patch in the series depend on iomap_page_mkwrite() returning VM_FAULT_* codes? Is this just cleanup so you can hide the block_page_mkwrite_return() call inside of iomap_page_mkwrite()? I think it's the former, so the logic in __xfs_filemap_fault() is simpler? Anyway, the code is correct, so you can add: Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> > --- > fs/iomap.c | 4 ++-- > fs/xfs/xfs_file.c | 1 - > 2 files changed, 2 insertions(+), 3 deletions(-) > > diff --git a/fs/iomap.c b/fs/iomap.c > index 039266128b7f..22ffdfeabeda 100644 > --- a/fs/iomap.c > +++ b/fs/iomap.c > @@ -477,10 +477,10 @@ int iomap_page_mkwrite(struct vm_fault *vmf, const struct iomap_ops *ops) > > set_page_dirty(page); > wait_for_stable_page(page); > - return 0; > + return VM_FAULT_LOCKED; > out_unlock: > unlock_page(page); > - return ret; > + return block_page_mkwrite_return(ret); > } > EXPORT_SYMBOL_GPL(iomap_page_mkwrite); > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index afa226b8c290..4213f02325a2 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -1035,7 +1035,6 @@ xfs_filemap_page_mkwrite( > ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, &xfs_iomap_ops); > } else { > ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops); > - ret = block_page_mkwrite_return(ret); > } > > xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > -- > 2.11.0 > ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 1/3] iomap: return VM_FAULT_* codes from iomap_page_mkwrite 2017-08-24 19:26 ` Ross Zwisler @ 2017-08-25 7:10 ` Christoph Hellwig 0 siblings, 0 replies; 14+ messages in thread From: Christoph Hellwig @ 2017-08-25 7:10 UTC (permalink / raw) To: Ross Zwisler, Christoph Hellwig, Jan Kara, linux-fsdevel, linux-nvdimm, linux-xfs, Dan Williams On Thu, Aug 24, 2017 at 01:26:24PM -0600, Ross Zwisler wrote: > On Thu, Aug 24, 2017 at 05:22:05PM +0200, Christoph Hellwig wrote: > > Signed-off-by: Christoph Hellwig <hch@lst.de> > > This is obviously correct, but the lack of changelog makes it unclear why the > change is being made. Does a later patch in the series depend on > iomap_page_mkwrite() returning VM_FAULT_* codes? Is this just cleanup so you > can hide the block_page_mkwrite_return() call inside of iomap_page_mkwrite()? > > I think it's the former, so the logic in __xfs_filemap_fault() is simpler? Yes. ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 2/3] xfs: consolidate the various page fault handlers 2017-08-24 15:22 synchronous page faults for XFS Christoph Hellwig 2017-08-24 15:22 ` [PATCH 1/3] iomap: return VM_FAULT_* codes from iomap_page_mkwrite Christoph Hellwig @ 2017-08-24 15:22 ` Christoph Hellwig 2017-08-24 21:29 ` Ross Zwisler 2017-08-24 15:22 ` [PATCH 3/3] xfs: support for synchronous DAX faults Christoph Hellwig 2 siblings, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2017-08-24 15:22 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, linux-xfs, Ross Zwisler, Dan Williams Add a new __xfs_filemap_fault helper that implements all four page fault callouts, and make these methods themselves small stubs that set the correct write_fault flag, and exit early for the non-DAX case for the hugepage related ones. Life would be so much simpler if we only had one method for all this. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 120 +++++++++++++++-------------------------------------- fs/xfs/xfs_trace.h | 24 +++++++++-- 2 files changed, 54 insertions(+), 90 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 4213f02325a2..460ca9639e66 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1012,34 +1012,36 @@ xfs_file_llseek( * i_lock (XFS - extent map serialisation) */ -/* - * mmap()d file has taken write protection fault and is being made writable. We - * can set the page state up correctly for a writable page, which means we can - * do correct delalloc accounting (ENOSPC checking!) and unwritten extent - * mapping. - */ STATIC int -xfs_filemap_page_mkwrite( - struct vm_fault *vmf) +__xfs_filemap_fault( + struct vm_fault *vmf, + enum page_entry_size pe_size, + bool write_fault) { struct inode *inode = file_inode(vmf->vma->vm_file); + struct xfs_inode *ip = XFS_I(inode); int ret; - trace_xfs_filemap_page_mkwrite(XFS_I(inode)); + trace_xfs_filemap_fault(ip, pe_size, write_fault); - sb_start_pagefault(inode->i_sb); - file_update_time(vmf->vma->vm_file); - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); + if (write_fault) { + sb_start_pagefault(inode->i_sb); + file_update_time(vmf->vma->vm_file); + } + xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); if (IS_DAX(inode)) { - ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, &xfs_iomap_ops); + ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops); } else { - ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops); + if (write_fault) + ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops); + else + ret = filemap_fault(vmf); } - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); - sb_end_pagefault(inode->i_sb); + if (write_fault) + sb_end_pagefault(inode->i_sb); return ret; } @@ -1047,93 +1049,39 @@ STATIC int xfs_filemap_fault( struct vm_fault *vmf) { - struct inode *inode = file_inode(vmf->vma->vm_file); - int ret; - - trace_xfs_filemap_fault(XFS_I(inode)); - /* DAX can shortcut the normal fault path on write faults! */ - if ((vmf->flags & FAULT_FLAG_WRITE) && IS_DAX(inode)) - return xfs_filemap_page_mkwrite(vmf); - - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); - if (IS_DAX(inode)) - ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, &xfs_iomap_ops); - else - ret = filemap_fault(vmf); - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); - - return ret; + return __xfs_filemap_fault(vmf, PE_SIZE_PTE, + IS_DAX(file_inode(vmf->vma->vm_file)) && + (vmf->flags & FAULT_FLAG_WRITE)); } -/* - * Similar to xfs_filemap_fault(), the DAX fault path can call into here on - * both read and write faults. Hence we need to handle both cases. There is no - * ->huge_mkwrite callout for huge pages, so we have a single function here to - * handle both cases here. @flags carries the information on the type of fault - * occuring. - */ STATIC int xfs_filemap_huge_fault( struct vm_fault *vmf, enum page_entry_size pe_size) { - struct inode *inode = file_inode(vmf->vma->vm_file); - struct xfs_inode *ip = XFS_I(inode); - int ret; - - if (!IS_DAX(inode)) + if (!IS_DAX(file_inode(vmf->vma->vm_file))) return VM_FAULT_FALLBACK; - trace_xfs_filemap_huge_fault(ip); - - if (vmf->flags & FAULT_FLAG_WRITE) { - sb_start_pagefault(inode->i_sb); - file_update_time(vmf->vma->vm_file); - } - - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); - ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops); - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); - - if (vmf->flags & FAULT_FLAG_WRITE) - sb_end_pagefault(inode->i_sb); + /* DAX can shortcut the normal fault path on write faults! */ + return __xfs_filemap_fault(vmf, pe_size, + (vmf->flags & FAULT_FLAG_WRITE)); +} - return ret; +STATIC int +xfs_filemap_page_mkwrite( + struct vm_fault *vmf) +{ + return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); } -/* - * pfn_mkwrite was originally inteneded to ensure we capture time stamp - * updates on write faults. In reality, it's need to serialise against - * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED - * to ensure we serialise the fault barrier in place. - */ static int xfs_filemap_pfn_mkwrite( struct vm_fault *vmf) { - - struct inode *inode = file_inode(vmf->vma->vm_file); - struct xfs_inode *ip = XFS_I(inode); - int ret = VM_FAULT_NOPAGE; - loff_t size; - - trace_xfs_filemap_pfn_mkwrite(ip); - - sb_start_pagefault(inode->i_sb); - file_update_time(vmf->vma->vm_file); - - /* check if the faulting page hasn't raced with truncate */ - xfs_ilock(ip, XFS_MMAPLOCK_SHARED); - size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; - if (vmf->pgoff >= size) - ret = VM_FAULT_SIGBUS; - else if (IS_DAX(inode)) - ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, &xfs_iomap_ops); - xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); - sb_end_pagefault(inode->i_sb); - return ret; - + if (!IS_DAX(file_inode(vmf->vma->vm_file))) + return VM_FAULT_NOPAGE; + return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); } static const struct vm_operations_struct xfs_file_vm_ops = { diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index bcc3cdf8e1c5..3f41d5340eb8 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -689,10 +689,26 @@ DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag); DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag); DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid); -DEFINE_INODE_EVENT(xfs_filemap_fault); -DEFINE_INODE_EVENT(xfs_filemap_huge_fault); -DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite); -DEFINE_INODE_EVENT(xfs_filemap_pfn_mkwrite); +TRACE_EVENT(xfs_filemap_fault, + TP_PROTO(struct xfs_inode *ip, enum page_entry_size pe_size, + bool write_fault), + TP_ARGS(ip, pe_size, write_fault), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(enum page_entry_size, pe_size) + __field(bool, write_fault) + ), + TP_fast_assign( + __entry->dev = VFS_I(ip)->i_sb->s_dev; + __entry->ino = ip->i_ino; + __entry->pe_size = pe_size; + __entry->write_fault = write_fault; + ), + TP_printk("dev %d:%d ino 0x%llx pe_size %d write_fault %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, __entry->pe_size, __entry->write_fault) +) DECLARE_EVENT_CLASS(xfs_iref_class, TP_PROTO(struct xfs_inode *ip, unsigned long caller_ip), -- 2.11.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: consolidate the various page fault handlers 2017-08-24 15:22 ` [PATCH 2/3] xfs: consolidate the various page fault handlers Christoph Hellwig @ 2017-08-24 21:29 ` Ross Zwisler 2017-08-25 7:15 ` Christoph Hellwig 0 siblings, 1 reply; 14+ messages in thread From: Ross Zwisler @ 2017-08-24 21:29 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, linux-fsdevel, linux-nvdimm, linux-xfs, Ross Zwisler, Dan Williams On Thu, Aug 24, 2017 at 05:22:06PM +0200, Christoph Hellwig wrote: > Add a new __xfs_filemap_fault helper that implements all four page fault > callouts, and make these methods themselves small stubs that set the > correct write_fault flag, and exit early for the non-DAX case for the > hugepage related ones. > > Life would be so much simpler if we only had one method for all this. > > Signed-off-by: Christoph Hellwig <hch@lst.de> A few nits and a question below. This looks much simpler and gets rid of a lot of duplication. Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> <> > STATIC int > xfs_filemap_huge_fault( > struct vm_fault *vmf, > enum page_entry_size pe_size) > { > - struct inode *inode = file_inode(vmf->vma->vm_file); > - struct xfs_inode *ip = XFS_I(inode); > - int ret; > - > - if (!IS_DAX(inode)) > + if (!IS_DAX(file_inode(vmf->vma->vm_file))) > return VM_FAULT_FALLBACK; > > - trace_xfs_filemap_huge_fault(ip); > - > - if (vmf->flags & FAULT_FLAG_WRITE) { > - sb_start_pagefault(inode->i_sb); > - file_update_time(vmf->vma->vm_file); > - } > - > - xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > - ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops); > - xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > - > - if (vmf->flags & FAULT_FLAG_WRITE) > - sb_end_pagefault(inode->i_sb); > + /* DAX can shortcut the normal fault path on write faults! */ Does this comment still make sense for the PMD case? Here's what I *think* it means: For DAX write faults we make the PTE writeable in the ->fault() code and don't rely on a follow-up ->page_mkwrite() call. This happens in do_shared_fault(), where we return before the ->page_mkwrite() call because the DAX write fault returns VM_FAULT_NOPAGE. This is in contrast to the normal page fault case where the ->fault() call ends up calling filemap_fault(), which populates the page read-only, and then do_shared_fault() does a follow-up ->page_mkwrite() call which makes the page writable. First, is the above correct? :) If so, I think the comment doesn't make sense for the PMD fault path because in that case we always make the PMD writeable in the first ->huge_fault() call, as there is no follow-up *_mkwrite()? > + return __xfs_filemap_fault(vmf, pe_size, > + (vmf->flags & FAULT_FLAG_WRITE)); > +} > > - return ret; > +STATIC int > +xfs_filemap_page_mkwrite( > + struct vm_fault *vmf) > +{ > + return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); > } > > -/* > - * pfn_mkwrite was originally inteneded to ensure we capture time stamp > - * updates on write faults. In reality, it's need to serialise against > - * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED > - * to ensure we serialise the fault barrier in place. > - */ > static int STATIC > xfs_filemap_pfn_mkwrite( > struct vm_fault *vmf) > { > - > - struct inode *inode = file_inode(vmf->vma->vm_file); > - struct xfs_inode *ip = XFS_I(inode); > - int ret = VM_FAULT_NOPAGE; > - loff_t size; > - > - trace_xfs_filemap_pfn_mkwrite(ip); > - > - sb_start_pagefault(inode->i_sb); > - file_update_time(vmf->vma->vm_file); > - > - /* check if the faulting page hasn't raced with truncate */ > - xfs_ilock(ip, XFS_MMAPLOCK_SHARED); > - size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT; I see that this size check is gone from the new code, and we now rely on the equivalent check in dax_iomap_fault(). Nice. > - if (vmf->pgoff >= size) > - ret = VM_FAULT_SIGBUS; > - else if (IS_DAX(inode)) > - ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, &xfs_iomap_ops); > - xfs_iunlock(ip, XFS_MMAPLOCK_SHARED); > - sb_end_pagefault(inode->i_sb); > - return ret; > - > + if (!IS_DAX(file_inode(vmf->vma->vm_file))) > + return VM_FAULT_NOPAGE; > + return __xfs_filemap_fault(vmf, PE_SIZE_PTE, true); > } > > static const struct vm_operations_struct xfs_file_vm_ops = { > diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h > index bcc3cdf8e1c5..3f41d5340eb8 100644 > --- a/fs/xfs/xfs_trace.h > +++ b/fs/xfs/xfs_trace.h > @@ -689,10 +689,26 @@ DEFINE_INODE_EVENT(xfs_inode_set_cowblocks_tag); > DEFINE_INODE_EVENT(xfs_inode_clear_cowblocks_tag); > DEFINE_INODE_EVENT(xfs_inode_free_cowblocks_invalid); > > -DEFINE_INODE_EVENT(xfs_filemap_fault); > -DEFINE_INODE_EVENT(xfs_filemap_huge_fault); > -DEFINE_INODE_EVENT(xfs_filemap_page_mkwrite); > -DEFINE_INODE_EVENT(xfs_filemap_pfn_mkwrite); > +TRACE_EVENT(xfs_filemap_fault, > + TP_PROTO(struct xfs_inode *ip, enum page_entry_size pe_size, > + bool write_fault), > + TP_ARGS(ip, pe_size, write_fault), > + TP_STRUCT__entry( > + __field(dev_t, dev) > + __field(xfs_ino_t, ino) > + __field(enum page_entry_size, pe_size) > + __field(bool, write_fault) > + ), > + TP_fast_assign( > + __entry->dev = VFS_I(ip)->i_sb->s_dev; > + __entry->ino = ip->i_ino; > + __entry->pe_size = pe_size; > + __entry->write_fault = write_fault; > + ), > + TP_printk("dev %d:%d ino 0x%llx pe_size %d write_fault %d", > + MAJOR(__entry->dev), MINOR(__entry->dev), > + __entry->ino, __entry->pe_size, __entry->write_fault) > +) I wonder if pe_size would be more readable as a string via __print_symbolic() so we see "pe_size PMD" instead of "pe_size 1" from enum page_entry_size? ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: consolidate the various page fault handlers 2017-08-24 21:29 ` Ross Zwisler @ 2017-08-25 7:15 ` Christoph Hellwig 2017-08-28 17:50 ` Ross Zwisler 0 siblings, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2017-08-25 7:15 UTC (permalink / raw) To: Ross Zwisler, Christoph Hellwig, Jan Kara, linux-fsdevel, linux-nvdimm, linux-xfs, Dan Williams On Thu, Aug 24, 2017 at 03:29:42PM -0600, Ross Zwisler wrote: > > + /* DAX can shortcut the normal fault path on write faults! */ > > Does this comment still make sense for the PMD case? I kept the code as before, and also kept the comment intent as-is (although I shortend it a bit). > Here's what I *think* it > means: For DAX write faults we make the PTE writeable in the ->fault() code > and don't rely on a follow-up ->page_mkwrite() call. This happens in > do_shared_fault(), where we return before the ->page_mkwrite() call because > the DAX write fault returns VM_FAULT_NOPAGE. > > This is in contrast to the normal page fault case where the ->fault() call > ends up calling filemap_fault(), which populates the page read-only, and then > do_shared_fault() does a follow-up ->page_mkwrite() call which makes the page > writable. > > First, is the above correct? :) Yes. > If so, I think the comment doesn't make > sense for the PMD fault path because in that case we always make the PMD > writeable in the first ->huge_fault() call, as there is no follow-up > *_mkwrite()? Well, there is non-DAX huge_fault case. It still is different from the normal non-hugepage fault case, so the comment still makes some sense, but maybe I should remove the DAX. That being said if we eventually get huge page page cache support (patches are on the list) I suspect they'll work like the normal fault path, not the DAX path. > > static int > > STATIC We're phasing that out, so I should probably remove it where it's currently newly added/moved in the patch. > I see that this size check is gone from the new code, and we now rely on the > equivalent check in dax_iomap_fault(). Nice. Yes. And I actually used to mention that in the changelog, but it got lost. I'll re-add it. > I wonder if pe_size would be more readable as a string via __print_symbolic() > so we see "pe_size PMD" instead of "pe_size 1" from enum page_entry_size? Probably. I was just lazy :) If I want to go all the way I could also use a __print_flags for the FAULT_FLAG_* flags. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: consolidate the various page fault handlers 2017-08-25 7:15 ` Christoph Hellwig @ 2017-08-28 17:50 ` Ross Zwisler 2017-08-28 17:52 ` Christoph Hellwig 0 siblings, 1 reply; 14+ messages in thread From: Ross Zwisler @ 2017-08-28 17:50 UTC (permalink / raw) To: Christoph Hellwig Cc: Ross Zwisler, Jan Kara, linux-fsdevel, linux-nvdimm, linux-xfs, Dan Williams On Fri, Aug 25, 2017 at 09:15:38AM +0200, Christoph Hellwig wrote: > On Thu, Aug 24, 2017 at 03:29:42PM -0600, Ross Zwisler wrote: <> > > I wonder if pe_size would be more readable as a string via __print_symbolic() > > so we see "pe_size PMD" instead of "pe_size 1" from enum page_entry_size? > > Probably. I was just lazy :) If I want to go all the way I could also > use a __print_flags for the FAULT_FLAG_* flags. Yep - I'm doing that in include/trace/events/fs_dax.h if you want to just copy & paste. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: consolidate the various page fault handlers 2017-08-28 17:50 ` Ross Zwisler @ 2017-08-28 17:52 ` Christoph Hellwig 2017-08-28 20:09 ` Ross Zwisler 0 siblings, 1 reply; 14+ messages in thread From: Christoph Hellwig @ 2017-08-28 17:52 UTC (permalink / raw) To: Ross Zwisler, Christoph Hellwig, Jan Kara, linux-fsdevel, linux-nvdimm, linux-xfs, Dan Williams On Mon, Aug 28, 2017 at 11:50:14AM -0600, Ross Zwisler wrote: > Yep - I'm doing that in include/trace/events/fs_dax.h if you want to just copy > & paste. Oh, nice. But maybe we need the trace point in the core MM instead of in XFS and the DAX code? :) ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 2/3] xfs: consolidate the various page fault handlers 2017-08-28 17:52 ` Christoph Hellwig @ 2017-08-28 20:09 ` Ross Zwisler 0 siblings, 0 replies; 14+ messages in thread From: Ross Zwisler @ 2017-08-28 20:09 UTC (permalink / raw) To: Christoph Hellwig Cc: Ross Zwisler, Jan Kara, linux-fsdevel, linux-nvdimm, linux-xfs, Dan Williams On Mon, Aug 28, 2017 at 07:52:57PM +0200, Christoph Hellwig wrote: > On Mon, Aug 28, 2017 at 11:50:14AM -0600, Ross Zwisler wrote: > > Yep - I'm doing that in include/trace/events/fs_dax.h if you want to just copy > > & paste. > > Oh, nice. But maybe we need the trace point in the core MM instead > of in XFS and the DAX code? :) Yea, there aren't any tracepoints at all in mm/memory.c. I'm not sure if that's because we just haven't gotten around to putting them in, or if for some reason it was intentional to not trace that bit of code? ^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH 3/3] xfs: support for synchronous DAX faults 2017-08-24 15:22 synchronous page faults for XFS Christoph Hellwig 2017-08-24 15:22 ` [PATCH 1/3] iomap: return VM_FAULT_* codes from iomap_page_mkwrite Christoph Hellwig 2017-08-24 15:22 ` [PATCH 2/3] xfs: consolidate the various page fault handlers Christoph Hellwig @ 2017-08-24 15:22 ` Christoph Hellwig 2017-08-24 21:42 ` Ross Zwisler 2017-08-25 0:13 ` Dan Williams 2 siblings, 2 replies; 14+ messages in thread From: Christoph Hellwig @ 2017-08-24 15:22 UTC (permalink / raw) To: Jan Kara; +Cc: linux-fsdevel, linux-nvdimm, linux-xfs, Ross Zwisler, Dan Williams Return IOMAP_F_NEEDDSYNC from xfs_file_iomap_begin() for a synchronous write fault when inode is pinned, and has dirty fields other than the timestamps. In xfs_filemap_huge_fault() we then detect this case and call dax_finish_sync_fault() to make sure all metadata is committed, and to insert the page table entry. Note that this will also dirty corresponding radix tree entry which is what we want - fsync(2) will still provide data integrity guarantees for applications not using userspace flushing. And applications using userspace flushing can avoid calling fsync(2) and thus avoid the performance overhead. Signed-off-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 6 +++++- fs/xfs/xfs_iomap.c | 5 +++++ 2 files changed, 10 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 460ca9639e66..5c3597b7dbf9 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -1031,7 +1031,11 @@ __xfs_filemap_fault( xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); if (IS_DAX(inode)) { - ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops); + pfn_t pfn; + + ret = dax_iomap_fault(vmf, pe_size, &pfn, &xfs_iomap_ops); + if (ret & VM_FAULT_NEEDDSYNC) + ret = dax_finish_sync_fault(vmf, pe_size, pfn); } else { if (write_fault) ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops); diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c index 813394c62849..e7c762c8fe27 100644 --- a/fs/xfs/xfs_iomap.c +++ b/fs/xfs/xfs_iomap.c @@ -33,6 +33,7 @@ #include "xfs_error.h" #include "xfs_trans.h" #include "xfs_trans_space.h" +#include "xfs_inode_item.h" #include "xfs_iomap.h" #include "xfs_trace.h" #include "xfs_icache.h" @@ -1085,6 +1086,10 @@ xfs_file_iomap_begin( trace_xfs_iomap_found(ip, offset, length, 0, &imap); } + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) + iomap->flags |= IOMAP_F_NEEDDSYNC; + xfs_bmbt_to_iomap(ip, iomap, &imap); /* optionally associate a dax device with the iomap bdev */ -- 2.11.0 ^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] xfs: support for synchronous DAX faults 2017-08-24 15:22 ` [PATCH 3/3] xfs: support for synchronous DAX faults Christoph Hellwig @ 2017-08-24 21:42 ` Ross Zwisler 2017-08-25 0:13 ` Dan Williams 1 sibling, 0 replies; 14+ messages in thread From: Ross Zwisler @ 2017-08-24 21:42 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, linux-fsdevel, linux-nvdimm, linux-xfs, Ross Zwisler, Dan Williams On Thu, Aug 24, 2017 at 05:22:07PM +0200, Christoph Hellwig wrote: > Return IOMAP_F_NEEDDSYNC from xfs_file_iomap_begin() for a synchronous > write fault when inode is pinned, and has dirty fields other than the > timestamps. In xfs_filemap_huge_fault() we then detect this case and __xfs_filemap_fault() Otherwise this looks good to me: Reviewed-by: Ross Zwisler <ross.zwisler@linux.intel.com> ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] xfs: support for synchronous DAX faults 2017-08-24 15:22 ` [PATCH 3/3] xfs: support for synchronous DAX faults Christoph Hellwig 2017-08-24 21:42 ` Ross Zwisler @ 2017-08-25 0:13 ` Dan Williams 2017-08-25 6:50 ` Christoph Hellwig 1 sibling, 1 reply; 14+ messages in thread From: Dan Williams @ 2017-08-25 0:13 UTC (permalink / raw) To: Christoph Hellwig Cc: Jan Kara, linux-fsdevel, linux-nvdimm@lists.01.org, linux-xfs, Ross Zwisler On Thu, Aug 24, 2017 at 8:22 AM, Christoph Hellwig <hch@lst.de> wrote: > Return IOMAP_F_NEEDDSYNC from xfs_file_iomap_begin() for a synchronous > write fault when inode is pinned, and has dirty fields other than the > timestamps. In xfs_filemap_huge_fault() we then detect this case and > call dax_finish_sync_fault() to make sure all metadata is committed, and > to insert the page table entry. > > Note that this will also dirty corresponding radix tree entry which is > what we want - fsync(2) will still provide data integrity guarantees for > applications not using userspace flushing. And applications using > userspace flushing can avoid calling fsync(2) and thus avoid the > performance overhead. > > Signed-off-by: Christoph Hellwig <hch@lst.de> > --- > fs/xfs/xfs_file.c | 6 +++++- > fs/xfs/xfs_iomap.c | 5 +++++ > 2 files changed, 10 insertions(+), 1 deletion(-) > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c > index 460ca9639e66..5c3597b7dbf9 100644 > --- a/fs/xfs/xfs_file.c > +++ b/fs/xfs/xfs_file.c > @@ -1031,7 +1031,11 @@ __xfs_filemap_fault( > > xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED); > if (IS_DAX(inode)) { > - ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops); > + pfn_t pfn; > + > + ret = dax_iomap_fault(vmf, pe_size, &pfn, &xfs_iomap_ops); > + if (ret & VM_FAULT_NEEDDSYNC) > + ret = dax_finish_sync_fault(vmf, pe_size, pfn); > } else { > if (write_fault) > ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops); > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c > index 813394c62849..e7c762c8fe27 100644 > --- a/fs/xfs/xfs_iomap.c > +++ b/fs/xfs/xfs_iomap.c > @@ -33,6 +33,7 @@ > #include "xfs_error.h" > #include "xfs_trans.h" > #include "xfs_trans_space.h" > +#include "xfs_inode_item.h" > #include "xfs_iomap.h" > #include "xfs_trace.h" > #include "xfs_icache.h" > @@ -1085,6 +1086,10 @@ xfs_file_iomap_begin( > trace_xfs_iomap_found(ip, offset, length, 0, &imap); > } > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) && > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP)) > + iomap->flags |= IOMAP_F_NEEDDSYNC; > + So this behaves differently than ext4 that returns EOPNOTSUP in the !DAX case. Are we generally documenting MAP_SYNC to be ignored in the pagecache case? Or should we explicitly fail MAP_SYNC for the !DAX case on all filesystems? Another option is to finish block allocations at fault time in the pagecache+MAP_SYNC case, but still require fsync to writeback dirty pages, but that seems pointless. Whatever we do I think all implementations should agree. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [PATCH 3/3] xfs: support for synchronous DAX faults 2017-08-25 0:13 ` Dan Williams @ 2017-08-25 6:50 ` Christoph Hellwig 0 siblings, 0 replies; 14+ messages in thread From: Christoph Hellwig @ 2017-08-25 6:50 UTC (permalink / raw) To: Dan Williams Cc: Christoph Hellwig, Jan Kara, linux-fsdevel, linux-nvdimm@lists.01.org, linux-xfs, Ross Zwisler On Thu, Aug 24, 2017 at 05:13:54PM -0700, Dan Williams wrote: > > So this behaves differently than ext4 that returns EOPNOTSUP in the !DAX case. > > Are we generally documenting MAP_SYNC to be ignored in the pagecache > case? Or should we explicitly fail MAP_SYNC for the !DAX case on all > filesystems? It's just me being lazy for now until we've settled on the exact mmap interface. With your new ->mmap signature we can do proper flags checking, and I would add it. But with only Jans patches it seems like we'd silently support MAP_SYNC for all other file systems anyway. > Another option is to finish block allocations at fault time in the > pagecache+MAP_SYNC case, but still require fsync to writeback dirty > pages, but that seems pointless. Agreed. > Whatever we do I think all implementations should agree. Sure. ^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2017-08-28 20:09 UTC | newest] Thread overview: 14+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-08-24 15:22 synchronous page faults for XFS Christoph Hellwig 2017-08-24 15:22 ` [PATCH 1/3] iomap: return VM_FAULT_* codes from iomap_page_mkwrite Christoph Hellwig 2017-08-24 19:26 ` Ross Zwisler 2017-08-25 7:10 ` Christoph Hellwig 2017-08-24 15:22 ` [PATCH 2/3] xfs: consolidate the various page fault handlers Christoph Hellwig 2017-08-24 21:29 ` Ross Zwisler 2017-08-25 7:15 ` Christoph Hellwig 2017-08-28 17:50 ` Ross Zwisler 2017-08-28 17:52 ` Christoph Hellwig 2017-08-28 20:09 ` Ross Zwisler 2017-08-24 15:22 ` [PATCH 3/3] xfs: support for synchronous DAX faults Christoph Hellwig 2017-08-24 21:42 ` Ross Zwisler 2017-08-25 0:13 ` Dan Williams 2017-08-25 6:50 ` Christoph Hellwig
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).