* [PATCH v2 0/2] ext4: fix DAX dma vs truncate/hole-punch @ 2018-06-27 21:22 Ross Zwisler [not found] ` <20180627212252.31032-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 0 siblings, 1 reply; 26+ messages in thread From: Ross Zwisler @ 2018-06-27 21:22 UTC (permalink / raw) To: Jan Kara, Dan Williams, Dave Chinner, Darrick J. Wong, Christoph Hellwig, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Jeff Moyer, linux-ext4-u79uwXL29TY76Z2rM5mHXA This series from Dan: https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html added synchronization between DAX dma and truncate/hole-punch in XFS. This short series adds analogous support to ext4. I've added calls to ext4_break_layouts() everywhere that ext4 removes blocks from an inode's map. The timings in XFS are such that it's difficult to hit this race. Dan was able to show the race by manually introducing delays in the direct I/O path. For ext4, though, its trivial to hit this race, and a hit will result in a trigger of this WARN_ON_ONCE() in dax_disassociate_entry(): WARN_ON_ONCE(trunc && page_ref_count(page) > 1); I've made an xfstest which tests all the paths where we now call ext4_break_layouts(). Each of the four paths easily hits this race many times in my test setup with the xfstest. You can find that test here: https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html With these patches applied, I've still seen occasional hits of the above WARN_ON_ONCE(), which tells me that we still have some work to do. I'll continue looking at these more rare hits. --- Changes in v2: * A little cleanup to each patch as suggested by Jan. * Removed the ext4_break_layouts() call in ext4_truncate_failed_write() and added a comment instead. (Jan) * Added reviewed-by tags from Jan. Ross Zwisler (2): dax: dax_layout_busy_page() warn on !exceptional ext4: handle layout changes to pinned DAX mappings fs/dax.c | 10 +++++++++- fs/ext4/ext4.h | 1 + fs/ext4/extents.c | 12 ++++++++++++ fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/truncate.h | 4 ++++ 5 files changed, 72 insertions(+), 1 deletion(-) -- 2.14.4 ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180627212252.31032-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>]
* [PATCH v2 1/2] dax: dax_layout_busy_page() warn on !exceptional [not found] ` <20180627212252.31032-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> @ 2018-06-27 21:22 ` Ross Zwisler [not found] ` <20180627212252.31032-2-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-06-27 21:22 ` [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings Ross Zwisler 1 sibling, 1 reply; 26+ messages in thread From: Ross Zwisler @ 2018-06-27 21:22 UTC (permalink / raw) To: Jan Kara, Dan Williams, Dave Chinner, Darrick J. Wong, Christoph Hellwig, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Jeff Moyer, linux-ext4-u79uwXL29TY76Z2rM5mHXA Inodes using DAX should only ever have exceptional entries in their page caches. Make this clear by warning if the iteration in dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a comment for the pagevec_release() call which only deals with struct page pointers. Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/dax.c | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/fs/dax.c b/fs/dax.c index 641192808bb6..897b51e41d8f 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -566,7 +566,8 @@ struct page *dax_layout_busy_page(struct address_space *mapping) if (index >= end) break; - if (!radix_tree_exceptional_entry(pvec_ent)) + if (WARN_ON_ONCE( + !radix_tree_exceptional_entry(pvec_ent))) continue; xa_lock_irq(&mapping->i_pages); @@ -578,6 +579,13 @@ struct page *dax_layout_busy_page(struct address_space *mapping) if (page) break; } + + /* + * We don't expect normal struct page entries to exist in our + * tree, but we keep these pagevec calls so that this code is + * consistent with the common pattern for handling pagevecs + * throughout the kernel. + */ pagevec_remove_exceptionals(&pvec); pagevec_release(&pvec); index++; -- 2.14.4 ^ permalink raw reply related [flat|nested] 26+ messages in thread
[parent not found: <20180627212252.31032-2-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>]
* Re: [PATCH v2 1/2] dax: dax_layout_busy_page() warn on !exceptional [not found] ` <20180627212252.31032-2-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> @ 2018-07-02 22:15 ` Theodore Y. Ts'o [not found] ` <20180702221503.GA12830-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org> 0 siblings, 1 reply; 26+ messages in thread From: Theodore Y. Ts'o @ 2018-07-02 22:15 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Darrick J. Wong, Dave Chinner, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Wed, Jun 27, 2018 at 03:22:51PM -0600, Ross Zwisler wrote: > Inodes using DAX should only ever have exceptional entries in their page > caches. Make this clear by warning if the iteration in > dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a > comment for the pagevec_release() call which only deals with struct page > pointers. > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> Thanks, applied (to the ext4 tree). If someone thinks they should go in via some other tree, holler. - Ted ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180702221503.GA12830-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org>]
* Re: [PATCH v2 1/2] dax: dax_layout_busy_page() warn on !exceptional [not found] ` <20180702221503.GA12830-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org> @ 2018-07-03 15:41 ` Ross Zwisler [not found] ` <20180703154137.GB13019-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 0 siblings, 1 reply; 26+ messages in thread From: Ross Zwisler @ 2018-07-03 15:41 UTC (permalink / raw) To: Theodore Y. Ts'o Cc: Jan Kara, Darrick J. Wong, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Mon, Jul 02, 2018 at 06:15:03PM -0400, Theodore Y. Ts'o wrote: > On Wed, Jun 27, 2018 at 03:22:51PM -0600, Ross Zwisler wrote: > > Inodes using DAX should only ever have exceptional entries in their page > > caches. Make this clear by warning if the iteration in > > dax_layout_busy_page() ever sees a non-exceptional entry, and by adding a > > comment for the pagevec_release() call which only deals with struct page > > pointers. > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > Thanks, applied (to the ext4 tree). If someone thinks they should go > in via some other tree, holler. > > - Ted Hey Ted, It looks like you only picked up patch 1/2? (I'm looking at the 'dev' branch in your repo.) Was that intentional? You can find the final version of the 2nd patch here: https://lists.01.org/pipermail/linux-nvdimm/2018-July/016602.html Thanks, - Ross ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180703154137.GB13019-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>]
* Re: [PATCH v2 1/2] dax: dax_layout_busy_page() warn on !exceptional [not found] ` <20180703154137.GB13019-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> @ 2018-07-03 17:44 ` Theodore Y. Ts'o 0 siblings, 0 replies; 26+ messages in thread From: Theodore Y. Ts'o @ 2018-07-03 17:44 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Darrick J. Wong, Dave Chinner, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Tue, Jul 03, 2018 at 09:41:37AM -0600, Ross Zwisler wrote: > > It looks like you only picked up patch 1/2? (I'm looking at the 'dev' branch > in your repo.) Was that intentional? Actually, it was a mistake, in that if you looked at the commit, it's currently an empty commit. The patch failed to apply because the ext4 tree is still based on v4.17-rc4. My current plan is to hold the two patches until I get the current patch of fixes pushed to Linus (probably in the next day or two; I'll drop the empty commit before I send a pull request to reduce confusion). I'll then reset the ext4 tree to be based on v4.17 (or possibly v4.18-rcX if that is necessary) and then apply the two patches in this series. Apologies for the confusion.... - Ted ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180627212252.31032-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-06-27 21:22 ` [PATCH v2 1/2] dax: dax_layout_busy_page() warn on !exceptional Ross Zwisler @ 2018-06-27 21:22 ` Ross Zwisler [not found] ` <20180627212252.31032-3-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 1 sibling, 1 reply; 26+ messages in thread From: Ross Zwisler @ 2018-06-27 21:22 UTC (permalink / raw) To: Jan Kara, Dan Williams, Dave Chinner, Darrick J. Wong, Christoph Hellwig, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Jeff Moyer, linux-ext4-u79uwXL29TY76Z2rM5mHXA Follow the lead of xfs_break_dax_layouts() and add synchronization between operations in ext4 which remove blocks from an inode (hole punch, truncate down, etc.) and pages which are pinned due to DAX DMA operations. Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> --- fs/ext4/ext4.h | 1 + fs/ext4/extents.c | 12 ++++++++++++ fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/truncate.h | 4 ++++ 4 files changed, 63 insertions(+) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 0b127853c584..34bccd64d83d 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); extern int ext4_inode_attach_jinode(struct inode *inode); extern int ext4_can_truncate(struct inode *inode); extern int ext4_truncate(struct inode *); +extern int ext4_break_layouts(struct inode *); extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length); extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks); extern void ext4_set_inode_flags(struct inode *); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 0057fe3f248d..a6aef06f455b 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t offset, * released from page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) { + up_write(&EXT4_I(inode)->i_mmap_sem); + goto out_mutex; + } + ret = ext4_update_disksize_before_punch(inode, offset, len); if (ret) { up_write(&EXT4_I(inode)->i_mmap_sem); @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) * page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) + goto out_mmap; + /* * Need to round down offset to be aligned with page size boundary * for page size > block size. diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 2ea07efbe016..fadb8ecacb1e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4193,6 +4193,39 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset, return 0; } +static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock) +{ + *did_unlock = true; + up_write(&ei->i_mmap_sem); + schedule(); + down_write(&ei->i_mmap_sem); +} + +int ext4_break_layouts(struct inode *inode) +{ + struct ext4_inode_info *ei = EXT4_I(inode); + struct page *page; + bool retry; + int error; + + if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem))) + return -EINVAL; + + do { + retry = false; + page = dax_layout_busy_page(inode->i_mapping); + if (!page) + return 0; + + error = ___wait_var_event(&page->_refcount, + atomic_read(&page->_refcount) == 1, + TASK_INTERRUPTIBLE, 0, 0, + ext4_wait_dax_page(ei, &retry)); + } while (error == 0 && retry); + + return error; +} + /* * ext4_punch_hole: punches a hole in a file by releasing the blocks * associated with the given offset and length @@ -4266,6 +4299,11 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) * page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) + goto out_dio; + first_block_offset = round_up(offset, sb->s_blocksize); last_block_offset = round_down((offset + length), sb->s_blocksize) - 1; @@ -5554,6 +5592,14 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) ext4_wait_for_tail_page_commit(inode); } down_write(&EXT4_I(inode)->i_mmap_sem); + + rc = ext4_break_layouts(inode); + if (rc) { + up_write(&EXT4_I(inode)->i_mmap_sem); + error = rc; + goto err_out; + } + /* * Truncate pagecache after we've waited for commit * in data=journal mode to make pages freeable. diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h index 0cb13badf473..bcbe3668c1d4 100644 --- a/fs/ext4/truncate.h +++ b/fs/ext4/truncate.h @@ -11,6 +11,10 @@ */ static inline void ext4_truncate_failed_write(struct inode *inode) { + /* + * We don't need to call ext4_break_layouts() because the blocks we + * are truncating were never visible to userspace. + */ down_write(&EXT4_I(inode)->i_mmap_sem); truncate_inode_pages(inode->i_mapping, inode->i_size); ext4_truncate(inode); -- 2.14.4 ^ permalink raw reply related [flat|nested] 26+ messages in thread
[parent not found: <20180627212252.31032-3-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>]
* Re: [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180627212252.31032-3-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> @ 2018-06-29 12:02 ` Lukas Czerner [not found] ` <20180629120223.oaslngsvspnwf4ae-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2018-07-02 17:29 ` [PATCH v3 " Ross Zwisler 1 sibling, 1 reply; 26+ messages in thread From: Lukas Czerner @ 2018-06-29 12:02 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Darrick J. Wong, Dave Chinner, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Wed, Jun 27, 2018 at 03:22:52PM -0600, Ross Zwisler wrote: > Follow the lead of xfs_break_dax_layouts() and add synchronization between > operations in ext4 which remove blocks from an inode (hole punch, truncate > down, etc.) and pages which are pinned due to DAX DMA operations. > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > --- > fs/ext4/ext4.h | 1 + > fs/ext4/extents.c | 12 ++++++++++++ > fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > fs/ext4/truncate.h | 4 ++++ > 4 files changed, 63 insertions(+) > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > index 0b127853c584..34bccd64d83d 100644 > --- a/fs/ext4/ext4.h > +++ b/fs/ext4/ext4.h > @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); > extern int ext4_inode_attach_jinode(struct inode *inode); > extern int ext4_can_truncate(struct inode *inode); > extern int ext4_truncate(struct inode *); > +extern int ext4_break_layouts(struct inode *); > extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length); > extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks); > extern void ext4_set_inode_flags(struct inode *); > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > index 0057fe3f248d..a6aef06f455b 100644 > --- a/fs/ext4/extents.c > +++ b/fs/ext4/extents.c > @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t offset, > * released from page cache. > */ > down_write(&EXT4_I(inode)->i_mmap_sem); > + > + ret = ext4_break_layouts(inode); > + if (ret) { > + up_write(&EXT4_I(inode)->i_mmap_sem); > + goto out_mutex; > + } > + > ret = ext4_update_disksize_before_punch(inode, offset, len); > if (ret) { > up_write(&EXT4_I(inode)->i_mmap_sem); > @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) > * page cache. > */ > down_write(&EXT4_I(inode)->i_mmap_sem); > + > + ret = ext4_break_layouts(inode); > + if (ret) > + goto out_mmap; Hi, don't we need to do the same for ext4_insert_range() since we're about to truncate_pagecache() as well ? /thinking out loud/ Xfs seems to do this before every fallocate operation, but in ext4 it does not seem to be needed at least for simply allocating falocate... Thanks! -Lukas > + > /* > * Need to round down offset to be aligned with page size boundary > * for page size > block size. > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 2ea07efbe016..fadb8ecacb1e 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -4193,6 +4193,39 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset, > return 0; > } > > +static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock) > +{ > + *did_unlock = true; > + up_write(&ei->i_mmap_sem); > + schedule(); > + down_write(&ei->i_mmap_sem); > +} > + > +int ext4_break_layouts(struct inode *inode) > +{ > + struct ext4_inode_info *ei = EXT4_I(inode); > + struct page *page; > + bool retry; > + int error; > + > + if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem))) > + return -EINVAL; > + > + do { > + retry = false; > + page = dax_layout_busy_page(inode->i_mapping); > + if (!page) > + return 0; > + > + error = ___wait_var_event(&page->_refcount, > + atomic_read(&page->_refcount) == 1, > + TASK_INTERRUPTIBLE, 0, 0, > + ext4_wait_dax_page(ei, &retry)); > + } while (error == 0 && retry); > + > + return error; > +} > + > /* > * ext4_punch_hole: punches a hole in a file by releasing the blocks > * associated with the given offset and length > @@ -4266,6 +4299,11 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) > * page cache. > */ > down_write(&EXT4_I(inode)->i_mmap_sem); > + > + ret = ext4_break_layouts(inode); > + if (ret) > + goto out_dio; > + > first_block_offset = round_up(offset, sb->s_blocksize); > last_block_offset = round_down((offset + length), sb->s_blocksize) - 1; > > @@ -5554,6 +5592,14 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) > ext4_wait_for_tail_page_commit(inode); > } > down_write(&EXT4_I(inode)->i_mmap_sem); > + > + rc = ext4_break_layouts(inode); > + if (rc) { > + up_write(&EXT4_I(inode)->i_mmap_sem); > + error = rc; > + goto err_out; > + } > + > /* > * Truncate pagecache after we've waited for commit > * in data=journal mode to make pages freeable. > diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h > index 0cb13badf473..bcbe3668c1d4 100644 > --- a/fs/ext4/truncate.h > +++ b/fs/ext4/truncate.h > @@ -11,6 +11,10 @@ > */ > static inline void ext4_truncate_failed_write(struct inode *inode) > { > + /* > + * We don't need to call ext4_break_layouts() because the blocks we > + * are truncating were never visible to userspace. > + */ > down_write(&EXT4_I(inode)->i_mmap_sem); > truncate_inode_pages(inode->i_mapping, inode->i_size); > ext4_truncate(inode); > -- > 2.14.4 > ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180629120223.oaslngsvspnwf4ae-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180629120223.oaslngsvspnwf4ae-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2018-06-29 15:13 ` Ross Zwisler [not found] ` <20180629151300.GA3006-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-06-30 1:12 ` Dave Chinner 1 sibling, 1 reply; 26+ messages in thread From: Ross Zwisler @ 2018-06-29 15:13 UTC (permalink / raw) To: Lukas Czerner Cc: Jan Kara, Darrick J. Wong, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Fri, Jun 29, 2018 at 02:02:23PM +0200, Lukas Czerner wrote: > On Wed, Jun 27, 2018 at 03:22:52PM -0600, Ross Zwisler wrote: > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > --- > > fs/ext4/ext4.h | 1 + > > fs/ext4/extents.c | 12 ++++++++++++ > > fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > fs/ext4/truncate.h | 4 ++++ > > 4 files changed, 63 insertions(+) > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > > index 0b127853c584..34bccd64d83d 100644 > > --- a/fs/ext4/ext4.h > > +++ b/fs/ext4/ext4.h > > @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); > > extern int ext4_inode_attach_jinode(struct inode *inode); > > extern int ext4_can_truncate(struct inode *inode); > > extern int ext4_truncate(struct inode *); > > +extern int ext4_break_layouts(struct inode *); > > extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length); > > extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks); > > extern void ext4_set_inode_flags(struct inode *); > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > > index 0057fe3f248d..a6aef06f455b 100644 > > --- a/fs/ext4/extents.c > > +++ b/fs/ext4/extents.c > > @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > * released from page cache. > > */ > > down_write(&EXT4_I(inode)->i_mmap_sem); > > + > > + ret = ext4_break_layouts(inode); > > + if (ret) { > > + up_write(&EXT4_I(inode)->i_mmap_sem); > > + goto out_mutex; > > + } > > + > > ret = ext4_update_disksize_before_punch(inode, offset, len); > > if (ret) { > > up_write(&EXT4_I(inode)->i_mmap_sem); > > @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) > > * page cache. > > */ > > down_write(&EXT4_I(inode)->i_mmap_sem); > > + > > + ret = ext4_break_layouts(inode); > > + if (ret) > > + goto out_mmap; > > Hi, > > don't we need to do the same for ext4_insert_range() since we're about > to truncate_pagecache() as well ? > > /thinking out loud/ > Xfs seems to do this before every fallocate operation, but in ext4 > it does not seem to be needed at least for simply allocating falocate... I saw the case in ext4_insert_range(), and decided that we didn't need to worry about synchronizing with DAX because no blocks were being removed from the inode's extent map. IIUC the truncate_pagecache() call is needed because we are unmapping and removing any page cache mappings for the part of the file after the insert because those blocks are now at a different offset in the inode. Because at the end of the operation we haven't removed any DAX pages from the inode, we have nothing that we need to synchronize. Hmm, unless this is a failure case we care about fixing? 1) schedule I/O via O_DIRECT to page X 2) fallocate(FALLOC_FL_INSERT_RANGE) to block < X, shifting X to a larger offset 3) O_DIRECT I/O from 1) completes, but ends up writing into the *new* block that resides at X - the I/O from 1) completes In this case the user is running I/O and issuing the fallocate at the same time, and the sequencing could have worked out that #1 and #2 were reversed, giving you the same behavior. IMO this seems fine and that we shouldn't have the DAX synchronization call in ext4_insert_range(), but I'm happy to add it if I'm wrong. ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180629151300.GA3006-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>]
* Re: [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180629151300.GA3006-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> @ 2018-07-02 7:34 ` Jan Kara 2018-07-02 7:59 ` Lukas Czerner 1 sibling, 0 replies; 26+ messages in thread From: Jan Kara @ 2018-07-02 7:34 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, Darrick J. Wong, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner, Lukas Czerner, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Fri 29-06-18 09:13:00, Ross Zwisler wrote: > On Fri, Jun 29, 2018 at 02:02:23PM +0200, Lukas Czerner wrote: > > On Wed, Jun 27, 2018 at 03:22:52PM -0600, Ross Zwisler wrote: > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > --- > > > fs/ext4/ext4.h | 1 + > > > fs/ext4/extents.c | 12 ++++++++++++ > > > fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > > fs/ext4/truncate.h | 4 ++++ > > > 4 files changed, 63 insertions(+) > > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > > > index 0b127853c584..34bccd64d83d 100644 > > > --- a/fs/ext4/ext4.h > > > +++ b/fs/ext4/ext4.h > > > @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); > > > extern int ext4_inode_attach_jinode(struct inode *inode); > > > extern int ext4_can_truncate(struct inode *inode); > > > extern int ext4_truncate(struct inode *); > > > +extern int ext4_break_layouts(struct inode *); > > > extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length); > > > extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks); > > > extern void ext4_set_inode_flags(struct inode *); > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > > > index 0057fe3f248d..a6aef06f455b 100644 > > > --- a/fs/ext4/extents.c > > > +++ b/fs/ext4/extents.c > > > @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > > * released from page cache. > > > */ > > > down_write(&EXT4_I(inode)->i_mmap_sem); > > > + > > > + ret = ext4_break_layouts(inode); > > > + if (ret) { > > > + up_write(&EXT4_I(inode)->i_mmap_sem); > > > + goto out_mutex; > > > + } > > > + > > > ret = ext4_update_disksize_before_punch(inode, offset, len); > > > if (ret) { > > > up_write(&EXT4_I(inode)->i_mmap_sem); > > > @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) > > > * page cache. > > > */ > > > down_write(&EXT4_I(inode)->i_mmap_sem); > > > + > > > + ret = ext4_break_layouts(inode); > > > + if (ret) > > > + goto out_mmap; > > > > Hi, > > > > don't we need to do the same for ext4_insert_range() since we're about > > to truncate_pagecache() as well ? > > > > /thinking out loud/ > > Xfs seems to do this before every fallocate operation, but in ext4 > > it does not seem to be needed at least for simply allocating falocate... > > I saw the case in ext4_insert_range(), and decided that we didn't need to > worry about synchronizing with DAX because no blocks were being removed from > the inode's extent map. IIUC the truncate_pagecache() call is needed because > we are unmapping and removing any page cache mappings for the part of the file > after the insert because those blocks are now at a different offset in the > inode. Because at the end of the operation we haven't removed any DAX pages > from the inode, we have nothing that we need to synchronize. > > Hmm, unless this is a failure case we care about fixing? > 1) schedule I/O via O_DIRECT to page X > 2) fallocate(FALLOC_FL_INSERT_RANGE) to block < X, shifting X to a larger > offset > 3) O_DIRECT I/O from 1) completes, but ends up writing into the *new* block > that resides at X - the I/O from 1) completes > > In this case the user is running I/O and issuing the fallocate at the same > time, and the sequencing could have worked out that #1 and #2 were reversed, > giving you the same behavior. IMO this seems fine and that we shouldn't have > the DAX synchronization call in ext4_insert_range(), but I'm happy to add it > if I'm wrong. I also don't see a way how ext4_insert_range() not calling ext4_break_layouts() could lead to corruption. However the whole operation with splitting (and possible zeroing out) of extents, truncation of extents beyond EOF, etc. is complex enough that I'm not sure I've thought through all the corner cases :-) So it at least deserves a comment why ext4_break_layouts() is not necessary here (also as a reminder in case we change the implementation and it would be suddently needed). Honza -- Jan Kara <jack-IBi9RG/b67k@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180629151300.GA3006-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-07-02 7:34 ` Jan Kara @ 2018-07-02 7:59 ` Lukas Czerner [not found] ` <20180702075948.i4aqjg5rrorwoxqj-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 1 sibling, 1 reply; 26+ messages in thread From: Lukas Czerner @ 2018-07-02 7:59 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Darrick J. Wong, Dave Chinner, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Fri, Jun 29, 2018 at 09:13:00AM -0600, Ross Zwisler wrote: > On Fri, Jun 29, 2018 at 02:02:23PM +0200, Lukas Czerner wrote: > > On Wed, Jun 27, 2018 at 03:22:52PM -0600, Ross Zwisler wrote: > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > --- > > > fs/ext4/ext4.h | 1 + > > > fs/ext4/extents.c | 12 ++++++++++++ > > > fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > > fs/ext4/truncate.h | 4 ++++ > > > 4 files changed, 63 insertions(+) > > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > > > index 0b127853c584..34bccd64d83d 100644 > > > --- a/fs/ext4/ext4.h > > > +++ b/fs/ext4/ext4.h > > > @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); > > > extern int ext4_inode_attach_jinode(struct inode *inode); > > > extern int ext4_can_truncate(struct inode *inode); > > > extern int ext4_truncate(struct inode *); > > > +extern int ext4_break_layouts(struct inode *); > > > extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length); > > > extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks); > > > extern void ext4_set_inode_flags(struct inode *); > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > > > index 0057fe3f248d..a6aef06f455b 100644 > > > --- a/fs/ext4/extents.c > > > +++ b/fs/ext4/extents.c > > > @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > > * released from page cache. > > > */ > > > down_write(&EXT4_I(inode)->i_mmap_sem); > > > + > > > + ret = ext4_break_layouts(inode); > > > + if (ret) { > > > + up_write(&EXT4_I(inode)->i_mmap_sem); > > > + goto out_mutex; > > > + } > > > + > > > ret = ext4_update_disksize_before_punch(inode, offset, len); > > > if (ret) { > > > up_write(&EXT4_I(inode)->i_mmap_sem); > > > @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) > > > * page cache. > > > */ > > > down_write(&EXT4_I(inode)->i_mmap_sem); > > > + > > > + ret = ext4_break_layouts(inode); > > > + if (ret) > > > + goto out_mmap; > > > > Hi, > > > > don't we need to do the same for ext4_insert_range() since we're about > > to truncate_pagecache() as well ? > > > > /thinking out loud/ > > Xfs seems to do this before every fallocate operation, but in ext4 > > it does not seem to be needed at least for simply allocating falocate... > > I saw the case in ext4_insert_range(), and decided that we didn't need to > worry about synchronizing with DAX because no blocks were being removed from > the inode's extent map. IIUC the truncate_pagecache() call is needed because > we are unmapping and removing any page cache mappings for the part of the file > after the insert because those blocks are now at a different offset in the > inode. Because at the end of the operation we haven't removed any DAX pages > from the inode, we have nothing that we need to synchronize. > > Hmm, unless this is a failure case we care about fixing? > 1) schedule I/O via O_DIRECT to page X > 2) fallocate(FALLOC_FL_INSERT_RANGE) to block < X, shifting X to a larger > offset > 3) O_DIRECT I/O from 1) completes, but ends up writing into the *new* block > that resides at X - the I/O from 1) completes > > In this case the user is running I/O and issuing the fallocate at the same > time, and the sequencing could have worked out that #1 and #2 were reversed, > giving you the same behavior. IMO this seems fine and that we shouldn't have > the DAX synchronization call in ext4_insert_range(), but I'm happy to add it > if I'm wrong. Hi, I think you're right, this case might mot matter much. I am just worried about unforeseen consequences of changing the layout with dax pages mapped. I guess we can also add this later fi we discover anything. You can add Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> Thanks! -Lukas ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180702075948.i4aqjg5rrorwoxqj-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180702075948.i4aqjg5rrorwoxqj-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2018-07-02 16:27 ` Ross Zwisler 0 siblings, 0 replies; 26+ messages in thread From: Ross Zwisler @ 2018-07-02 16:27 UTC (permalink / raw) To: Lukas Czerner Cc: Jan Kara, Darrick J. Wong, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Mon, Jul 02, 2018 at 09:59:48AM +0200, Lukas Czerner wrote: > On Fri, Jun 29, 2018 at 09:13:00AM -0600, Ross Zwisler wrote: > > On Fri, Jun 29, 2018 at 02:02:23PM +0200, Lukas Czerner wrote: > > > On Wed, Jun 27, 2018 at 03:22:52PM -0600, Ross Zwisler wrote: > > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > > --- > > > > fs/ext4/ext4.h | 1 + > > > > fs/ext4/extents.c | 12 ++++++++++++ > > > > fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > > > fs/ext4/truncate.h | 4 ++++ > > > > 4 files changed, 63 insertions(+) > > > > > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > > > > index 0b127853c584..34bccd64d83d 100644 > > > > --- a/fs/ext4/ext4.h > > > > +++ b/fs/ext4/ext4.h > > > > @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); > > > > extern int ext4_inode_attach_jinode(struct inode *inode); > > > > extern int ext4_can_truncate(struct inode *inode); > > > > extern int ext4_truncate(struct inode *); > > > > +extern int ext4_break_layouts(struct inode *); > > > > extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length); > > > > extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks); > > > > extern void ext4_set_inode_flags(struct inode *); > > > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > > > > index 0057fe3f248d..a6aef06f455b 100644 > > > > --- a/fs/ext4/extents.c > > > > +++ b/fs/ext4/extents.c > > > > @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > > > * released from page cache. > > > > */ > > > > down_write(&EXT4_I(inode)->i_mmap_sem); > > > > + > > > > + ret = ext4_break_layouts(inode); > > > > + if (ret) { > > > > + up_write(&EXT4_I(inode)->i_mmap_sem); > > > > + goto out_mutex; > > > > + } > > > > + > > > > ret = ext4_update_disksize_before_punch(inode, offset, len); > > > > if (ret) { > > > > up_write(&EXT4_I(inode)->i_mmap_sem); > > > > @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) > > > > * page cache. > > > > */ > > > > down_write(&EXT4_I(inode)->i_mmap_sem); > > > > + > > > > + ret = ext4_break_layouts(inode); > > > > + if (ret) > > > > + goto out_mmap; > > > > > > Hi, > > > > > > don't we need to do the same for ext4_insert_range() since we're about > > > to truncate_pagecache() as well ? > > > > > > /thinking out loud/ > > > Xfs seems to do this before every fallocate operation, but in ext4 > > > it does not seem to be needed at least for simply allocating falocate... > > > > I saw the case in ext4_insert_range(), and decided that we didn't need to > > worry about synchronizing with DAX because no blocks were being removed from > > the inode's extent map. IIUC the truncate_pagecache() call is needed because > > we are unmapping and removing any page cache mappings for the part of the file > > after the insert because those blocks are now at a different offset in the > > inode. Because at the end of the operation we haven't removed any DAX pages > > from the inode, we have nothing that we need to synchronize. > > > > Hmm, unless this is a failure case we care about fixing? > > 1) schedule I/O via O_DIRECT to page X > > 2) fallocate(FALLOC_FL_INSERT_RANGE) to block < X, shifting X to a larger > > offset > > 3) O_DIRECT I/O from 1) completes, but ends up writing into the *new* block > > that resides at X - the I/O from 1) completes > > > > In this case the user is running I/O and issuing the fallocate at the same > > time, and the sequencing could have worked out that #1 and #2 were reversed, > > giving you the same behavior. IMO this seems fine and that we shouldn't have > > the DAX synchronization call in ext4_insert_range(), but I'm happy to add it > > if I'm wrong. > > Hi, > > I think you're right, this case might mot matter much. I am just worried > about unforeseen consequences of changing the layout with dax pages > mapped. I guess we can also add this later fi we discover anything. > > You can add > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > Thanks! > -Lukas Thank you for the review. I'll add a comment to help explain my reasoning, as Jan suggested. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180629120223.oaslngsvspnwf4ae-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2018-06-29 15:13 ` Ross Zwisler @ 2018-06-30 1:12 ` Dave Chinner 1 sibling, 0 replies; 26+ messages in thread From: Dave Chinner @ 2018-06-30 1:12 UTC (permalink / raw) To: Lukas Czerner Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Darrick J. Wong, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Fri, Jun 29, 2018 at 02:02:23PM +0200, Lukas Czerner wrote: > On Wed, Jun 27, 2018 at 03:22:52PM -0600, Ross Zwisler wrote: > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > --- > > fs/ext4/ext4.h | 1 + > > fs/ext4/extents.c | 12 ++++++++++++ > > fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ > > fs/ext4/truncate.h | 4 ++++ > > 4 files changed, 63 insertions(+) > > > > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h > > index 0b127853c584..34bccd64d83d 100644 > > --- a/fs/ext4/ext4.h > > +++ b/fs/ext4/ext4.h > > @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); > > extern int ext4_inode_attach_jinode(struct inode *inode); > > extern int ext4_can_truncate(struct inode *inode); > > extern int ext4_truncate(struct inode *); > > +extern int ext4_break_layouts(struct inode *); > > extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length); > > extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks); > > extern void ext4_set_inode_flags(struct inode *); > > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c > > index 0057fe3f248d..a6aef06f455b 100644 > > --- a/fs/ext4/extents.c > > +++ b/fs/ext4/extents.c > > @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t offset, > > * released from page cache. > > */ > > down_write(&EXT4_I(inode)->i_mmap_sem); > > + > > + ret = ext4_break_layouts(inode); > > + if (ret) { > > + up_write(&EXT4_I(inode)->i_mmap_sem); > > + goto out_mutex; > > + } > > + > > ret = ext4_update_disksize_before_punch(inode, offset, len); > > if (ret) { > > up_write(&EXT4_I(inode)->i_mmap_sem); > > @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) > > * page cache. > > */ > > down_write(&EXT4_I(inode)->i_mmap_sem); > > + > > + ret = ext4_break_layouts(inode); > > + if (ret) > > + goto out_mmap; > > Hi, > > don't we need to do the same for ext4_insert_range() since we're about > to truncate_pagecache() as well ? > > /thinking out loud/ > Xfs seems to do this before every fallocate operation, but in ext4 > it does not seem to be needed at least for simply allocating falocate... The PNFS client may have mapped a hole over the range we are about to allocate. Hence it's mapping will be stale, and it needs to remap the range. i.e. any change to the extent list needs to break the lease of any client that may have a cached layout on that inode. i.e. it's not just extent removal that we are protecting against - we are trying to ensure clients using leases to access blocks directly remain coherent with the filesystem's extent map. Hence any potential change to the extent map requires breaking the lease.... Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180627212252.31032-3-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-06-29 12:02 ` Lukas Czerner @ 2018-07-02 17:29 ` Ross Zwisler [not found] ` <20180702172912.329-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 1 sibling, 1 reply; 26+ messages in thread From: Ross Zwisler @ 2018-07-02 17:29 UTC (permalink / raw) To: Jan Kara, Dan Williams, Dave Chinner, Darrick J. Wong, Christoph Hellwig, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Jeff Moyer, linux-ext4-u79uwXL29TY76Z2rM5mHXA Follow the lead of xfs_break_dax_layouts() and add synchronization between operations in ext4 which remove blocks from an inode (hole punch, truncate down, etc.) and pages which are pinned due to DAX DMA operations. Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> --- Changes since v2: * Added a comment to ext4_insert_range() explaining why we don't call ext4_break_layouts(). (Jan) * Added Lukas's reviewed-by. Thanks again to Lukas & Jan for their reviews. --- fs/ext4/ext4.h | 1 + fs/ext4/extents.c | 17 +++++++++++++++++ fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++ fs/ext4/truncate.h | 4 ++++ 4 files changed, 68 insertions(+) diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 0b127853c584..34bccd64d83d 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -2460,6 +2460,7 @@ extern int ext4_get_inode_loc(struct inode *, struct ext4_iloc *); extern int ext4_inode_attach_jinode(struct inode *inode); extern int ext4_can_truncate(struct inode *inode); extern int ext4_truncate(struct inode *); +extern int ext4_break_layouts(struct inode *); extern int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length); extern int ext4_truncate_restart_trans(handle_t *, struct inode *, int nblocks); extern void ext4_set_inode_flags(struct inode *); diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index 0057fe3f248d..2d0f7e942b05 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4820,6 +4820,13 @@ static long ext4_zero_range(struct file *file, loff_t offset, * released from page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) { + up_write(&EXT4_I(inode)->i_mmap_sem); + goto out_mutex; + } + ret = ext4_update_disksize_before_punch(inode, offset, len); if (ret) { up_write(&EXT4_I(inode)->i_mmap_sem); @@ -5493,6 +5500,11 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len) * page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) + goto out_mmap; + /* * Need to round down offset to be aligned with page size boundary * for page size > block size. @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) LLONG_MAX); if (ret) goto out_mmap; + /* + * We don't need to call ext4_break_layouts() because we aren't + * removing any blocks from the inode. We are just changing their + * offset by inserting a hole. + */ truncate_pagecache(inode, ioffset); credits = ext4_writepage_trans_blocks(inode); diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 2ea07efbe016..fadb8ecacb1e 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -4193,6 +4193,39 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset, return 0; } +static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock) +{ + *did_unlock = true; + up_write(&ei->i_mmap_sem); + schedule(); + down_write(&ei->i_mmap_sem); +} + +int ext4_break_layouts(struct inode *inode) +{ + struct ext4_inode_info *ei = EXT4_I(inode); + struct page *page; + bool retry; + int error; + + if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem))) + return -EINVAL; + + do { + retry = false; + page = dax_layout_busy_page(inode->i_mapping); + if (!page) + return 0; + + error = ___wait_var_event(&page->_refcount, + atomic_read(&page->_refcount) == 1, + TASK_INTERRUPTIBLE, 0, 0, + ext4_wait_dax_page(ei, &retry)); + } while (error == 0 && retry); + + return error; +} + /* * ext4_punch_hole: punches a hole in a file by releasing the blocks * associated with the given offset and length @@ -4266,6 +4299,11 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length) * page cache. */ down_write(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_break_layouts(inode); + if (ret) + goto out_dio; + first_block_offset = round_up(offset, sb->s_blocksize); last_block_offset = round_down((offset + length), sb->s_blocksize) - 1; @@ -5554,6 +5592,14 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr) ext4_wait_for_tail_page_commit(inode); } down_write(&EXT4_I(inode)->i_mmap_sem); + + rc = ext4_break_layouts(inode); + if (rc) { + up_write(&EXT4_I(inode)->i_mmap_sem); + error = rc; + goto err_out; + } + /* * Truncate pagecache after we've waited for commit * in data=journal mode to make pages freeable. diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h index 0cb13badf473..bcbe3668c1d4 100644 --- a/fs/ext4/truncate.h +++ b/fs/ext4/truncate.h @@ -11,6 +11,10 @@ */ static inline void ext4_truncate_failed_write(struct inode *inode) { + /* + * We don't need to call ext4_break_layouts() because the blocks we + * are truncating were never visible to userspace. + */ down_write(&EXT4_I(inode)->i_mmap_sem); truncate_inode_pages(inode->i_mapping, inode->i_size); ext4_truncate(inode); -- 2.14.4 ^ permalink raw reply related [flat|nested] 26+ messages in thread
[parent not found: <20180702172912.329-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>]
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180702172912.329-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> @ 2018-07-04 0:49 ` Dave Chinner 2018-07-04 12:27 ` Jan Kara 0 siblings, 1 reply; 26+ messages in thread From: Dave Chinner @ 2018-07-04 0:49 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Darrick J. Wong, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > Follow the lead of xfs_break_dax_layouts() and add synchronization between > operations in ext4 which remove blocks from an inode (hole punch, truncate > down, etc.) and pages which are pinned due to DAX DMA operations. > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > --- > > Changes since v2: > * Added a comment to ext4_insert_range() explaining why we don't call > ext4_break_layouts(). (Jan) Which I think is wrong and will cause data corruption. > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > LLONG_MAX); > if (ret) > goto out_mmap; > + /* > + * We don't need to call ext4_break_layouts() because we aren't > + * removing any blocks from the inode. We are just changing their > + * offset by inserting a hole. > + */ The entire point of these leases is so that a thrid party can directly access the blocks underlying the file. That means they are keeping their own file offset<->disk block mapping internally, and they are assuming that it is valid for as long as they hold the lease. If the filesystem modifies the extent map - even something like a shift here which changes the offset<->disk block mapping - the userspace app now has a stale mapping and so the lease *must be broken* to tell it that it's mappings are now stale and it needs to refetch them. If the app doesn't refetch it's mappings after something like a shift, it will be reading and writing data from the wrong file offset, and that will lead to the app silently corrupting it's data. IOWs, layouts need to be broken by any operation that modifies the extent map in any way, not just those operations that remove blocks. Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings 2018-07-04 0:49 ` Dave Chinner @ 2018-07-04 12:27 ` Jan Kara [not found] ` <20180704122723.lup2wovzb6u6ta6v-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 26+ messages in thread From: Jan Kara @ 2018-07-04 12:27 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Darrick J. Wong, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Wed 04-07-18 10:49:23, Dave Chinner wrote: > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > --- > > > > Changes since v2: > > * Added a comment to ext4_insert_range() explaining why we don't call > > ext4_break_layouts(). (Jan) > > Which I think is wrong and will cause data corruption. > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > > LLONG_MAX); > > if (ret) > > goto out_mmap; > > + /* > > + * We don't need to call ext4_break_layouts() because we aren't > > + * removing any blocks from the inode. We are just changing their > > + * offset by inserting a hole. > > + */ > > The entire point of these leases is so that a thrid party can > directly access the blocks underlying the file. That means they are > keeping their own file offset<->disk block mapping internally, and > they are assuming that it is valid for as long as they hold the > lease. If the filesystem modifies the extent map - even something > like a shift here which changes the offset<->disk block mapping - > the userspace app now has a stale mapping and so the lease *must be > broken* to tell it that it's mappings are now stale and it needs to > refetch them. Well, ext4 has no real concept of leases and no pNFS support. And DAX requirements wrt consistency are much weaker than those of pNFS. This is mostly caused by the fact that calls like invalidate_mapping_pages() will flush offset<->pfn mappings DAX maintains in the radix tree automatically (similarly as it happens when page cache is used). What Ross did just keeps ext4 + DAX behave similarly as ext4 + page cache does - i.e., if you use mmaped file as a buffer e.g. for direct IO and mix your direct IO with extent manipulations on that buffer file, you will get inconsistent results. With page cache, pages you use as buffers will get detached from the inode during extent manipulations and discarded once you drop your page references. With DAX, changes will land at a different offset of the file than you might have thought. But that is the same as if we just waited for the IO to complete (which is what ext4_break_layouts() effectively does) and then shifted those blocks. So the biggest problem I can see here is that ext4_break_layouts() is a misnomer as it promises more than the function actually does (wait for page references to be dropped). If we called it like ext4_dax_unmap_pages(), things would be clearer I guess. Ross? Honza -- Jan Kara <jack-IBi9RG/b67k@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180704122723.lup2wovzb6u6ta6v-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180704122723.lup2wovzb6u6ta6v-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> @ 2018-07-04 23:54 ` Dave Chinner 2018-07-05 3:59 ` Darrick J. Wong 0 siblings, 1 reply; 26+ messages in thread From: Dave Chinner @ 2018-07-04 23:54 UTC (permalink / raw) To: Jan Kara Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Darrick J. Wong, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > --- > > > > > > Changes since v2: > > > * Added a comment to ext4_insert_range() explaining why we don't call > > > ext4_break_layouts(). (Jan) > > > > Which I think is wrong and will cause data corruption. > > > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > > > LLONG_MAX); > > > if (ret) > > > goto out_mmap; > > > + /* > > > + * We don't need to call ext4_break_layouts() because we aren't > > > + * removing any blocks from the inode. We are just changing their > > > + * offset by inserting a hole. > > > + */ > > > > The entire point of these leases is so that a thrid party can > > directly access the blocks underlying the file. That means they are > > keeping their own file offset<->disk block mapping internally, and > > they are assuming that it is valid for as long as they hold the > > lease. If the filesystem modifies the extent map - even something > > like a shift here which changes the offset<->disk block mapping - > > the userspace app now has a stale mapping and so the lease *must be > > broken* to tell it that it's mappings are now stale and it needs to > > refetch them. > > Well, ext4 has no real concept of leases and no pNFS support. And DAX > requirements wrt consistency are much weaker than those of pNFS. This is > mostly caused by the fact that calls like invalidate_mapping_pages() will > flush offset<->pfn mappings DAX maintains in the radix tree automatically > (similarly as it happens when page cache is used). I'm more concerned about apps that use file leases behaving the same way, not just the pNFS stuff. if we are /delegating file layouts/ to 3rd parties, then all filesystems *need* to behave the same way. We've already defined those semantics with XFS - every time the filesystem changes an extent layout in any way it needs to break existing layout delegations... > What Ross did just keeps ext4 + DAX behave similarly as ext4 + page cache > does Sure. But the issue I'm raising is that ext4 is not playing by the same extent layout delegation rules that XFS has already defined for 3rd party use. i.e. don't fuck up layout delegation behaviour consistency right from the start just because "<this subset of functionality> is all we need right now for ext4". All the filesystems should implement the same semantics and behaviour right from the start, otherwise we're just going to make life a misery for anyone who tries to use layout delegations in future. Haven't we learnt this lesson the hard way enough times already? Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings 2018-07-04 23:54 ` Dave Chinner @ 2018-07-05 3:59 ` Darrick J. Wong 2018-07-05 16:53 ` Ross Zwisler 2018-07-05 20:40 ` Dan Williams 0 siblings, 2 replies; 26+ messages in thread From: Darrick J. Wong @ 2018-07-05 3:59 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Christoph Hellwig, linux-ext4-u79uwXL29TY76Z2rM5mHXA On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: > On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > --- > > > > > > > > Changes since v2: > > > > * Added a comment to ext4_insert_range() explaining why we don't call > > > > ext4_break_layouts(). (Jan) > > > > > > Which I think is wrong and will cause data corruption. > > > > > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > > > > LLONG_MAX); > > > > if (ret) > > > > goto out_mmap; > > > > + /* > > > > + * We don't need to call ext4_break_layouts() because we aren't > > > > + * removing any blocks from the inode. We are just changing their > > > > + * offset by inserting a hole. > > > > + */ Does calling ext4_break_layouts from insert range not work? It's my understanding that file leases work are a mechanism for the filesystem to delegate some of its authority over physical space mappings to "client" software. AFAICT it's used for mmap'ing pmem directly into userspace and exporting space on shared storage over pNFS. Some day we might use the same mechanism for the similar things that RDMA does, or the swapfile code since that's essentially how it works today. The other part of these file leases is that the filesystem revokes them any time it wants to perform a mapping operation on a file. This breaks my mental model of how leases work, and if you commit to this for ext4 then I'll have to remember that leases are different between xfs and ext4. Worse, since the reason for skipping ext4_break_layouts seems to be the implementation detail that "DAX won't care", then someone else wiring up pNFS/future RDMA/whatever will also have to remember to put it back into ext4 or else kaboom. Granted, Dave said all these things already, but I actually feel strongly enough to reiterate. --D > > > > > > The entire point of these leases is so that a thrid party can > > > directly access the blocks underlying the file. That means they are > > > keeping their own file offset<->disk block mapping internally, and > > > they are assuming that it is valid for as long as they hold the > > > lease. If the filesystem modifies the extent map - even something > > > like a shift here which changes the offset<->disk block mapping - > > > the userspace app now has a stale mapping and so the lease *must be > > > broken* to tell it that it's mappings are now stale and it needs to > > > refetch them. > > > > Well, ext4 has no real concept of leases and no pNFS support. And DAX > > requirements wrt consistency are much weaker than those of pNFS. This is > > mostly caused by the fact that calls like invalidate_mapping_pages() will > > flush offset<->pfn mappings DAX maintains in the radix tree automatically > > (similarly as it happens when page cache is used). > > I'm more concerned about apps that use file leases behaving the same > way, not just the pNFS stuff. if we are /delegating file layouts/ to > 3rd parties, then all filesystems *need* to behave the same way. > We've already defined those semantics with XFS - every time the > filesystem changes an extent layout in any way it needs to break > existing layout delegations... > > > What Ross did just keeps ext4 + DAX behave similarly as ext4 + page cache > > does > > Sure. But the issue I'm raising is that ext4 is not playing by the > same extent layout delegation rules that XFS has already defined for > 3rd party use. > > i.e. don't fuck up layout delegation behaviour consistency right > from the start just because "<this subset of functionality> is all > we need right now for ext4". All the filesystems should implement > the same semantics and behaviour right from the start, otherwise > we're just going to make life a misery for anyone who tries to use > layout delegations in future. > > Haven't we learnt this lesson the hard way enough times already? > > Cheers, > > Dave. > > -- > Dave Chinner > david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings 2018-07-05 3:59 ` Darrick J. Wong @ 2018-07-05 16:53 ` Ross Zwisler [not found] ` <20180705165310.GB22200-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-07-05 20:40 ` Dan Williams 1 sibling, 1 reply; 26+ messages in thread From: Ross Zwisler @ 2018-07-05 16:53 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner, Christoph Hellwig, linux-ext4-u79uwXL29TY76Z2rM5mHXA On Wed, Jul 04, 2018 at 08:59:52PM -0700, Darrick J. Wong wrote: > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: > > On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > > > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > > > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > --- > > > > > > > > > > Changes since v2: > > > > > * Added a comment to ext4_insert_range() explaining why we don't call > > > > > ext4_break_layouts(). (Jan) > > > > > > > > Which I think is wrong and will cause data corruption. > > > > > > > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > > > > > LLONG_MAX); > > > > > if (ret) > > > > > goto out_mmap; > > > > > + /* > > > > > + * We don't need to call ext4_break_layouts() because we aren't > > > > > + * removing any blocks from the inode. We are just changing their > > > > > + * offset by inserting a hole. > > > > > + */ > > Does calling ext4_break_layouts from insert range not work? > > It's my understanding that file leases work are a mechanism for the > filesystem to delegate some of its authority over physical space > mappings to "client" software. AFAICT it's used for mmap'ing pmem > directly into userspace and exporting space on shared storage over > pNFS. Some day we might use the same mechanism for the similar things > that RDMA does, or the swapfile code since that's essentially how it > works today. > > The other part of these file leases is that the filesystem revokes them > any time it wants to perform a mapping operation on a file. This breaks > my mental model of how leases work, and if you commit to this for ext4 > then I'll have to remember that leases are different between xfs and > ext4. Worse, since the reason for skipping ext4_break_layouts seems to > be the implementation detail that "DAX won't care", then someone else > wiring up pNFS/future RDMA/whatever will also have to remember to put it > back into ext4 or else kaboom. > > Granted, Dave said all these things already, but I actually feel > strongly enough to reiterate. Jan, would you like me to call ext4_break_layouts() in ext4_insert_range() to keep the lease mechanism consistent between ext4 and XFS, or would you prefer the s/ext4_break_layouts()/ext4_dax_unmap_pages()/ rename? ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180705165310.GB22200-VuQAYsv1563Yd54FQh9/CA@public.gmane.org>]
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180705165310.GB22200-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> @ 2018-07-09 12:33 ` Jan Kara [not found] ` <20180709123347.nw3ixr64prgk7sxz-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> 0 siblings, 1 reply; 26+ messages in thread From: Jan Kara @ 2018-07-09 12:33 UTC (permalink / raw) To: Ross Zwisler Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Darrick J. Wong, Dave Chinner, linux-ext4-u79uwXL29TY76Z2rM5mHXA, Christoph Hellwig On Thu 05-07-18 10:53:10, Ross Zwisler wrote: > On Wed, Jul 04, 2018 at 08:59:52PM -0700, Darrick J. Wong wrote: > > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: > > > On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > > > > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > > > > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > > > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > > > > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > > --- > > > > > > > > > > > > Changes since v2: > > > > > > * Added a comment to ext4_insert_range() explaining why we don't call > > > > > > ext4_break_layouts(). (Jan) > > > > > > > > > > Which I think is wrong and will cause data corruption. > > > > > > > > > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > > > > > > LLONG_MAX); > > > > > > if (ret) > > > > > > goto out_mmap; > > > > > > + /* > > > > > > + * We don't need to call ext4_break_layouts() because we aren't > > > > > > + * removing any blocks from the inode. We are just changing their > > > > > > + * offset by inserting a hole. > > > > > > + */ > > > > Does calling ext4_break_layouts from insert range not work? > > > > It's my understanding that file leases work are a mechanism for the > > filesystem to delegate some of its authority over physical space > > mappings to "client" software. AFAICT it's used for mmap'ing pmem > > directly into userspace and exporting space on shared storage over > > pNFS. Some day we might use the same mechanism for the similar things > > that RDMA does, or the swapfile code since that's essentially how it > > works today. > > > > The other part of these file leases is that the filesystem revokes them > > any time it wants to perform a mapping operation on a file. This breaks > > my mental model of how leases work, and if you commit to this for ext4 > > then I'll have to remember that leases are different between xfs and > > ext4. Worse, since the reason for skipping ext4_break_layouts seems to > > be the implementation detail that "DAX won't care", then someone else > > wiring up pNFS/future RDMA/whatever will also have to remember to put it > > back into ext4 or else kaboom. > > > > Granted, Dave said all these things already, but I actually feel > > strongly enough to reiterate. > > Jan, would you like me to call ext4_break_layouts() in ext4_insert_range() to > keep the lease mechanism consistent between ext4 and XFS, or would you prefer > the s/ext4_break_layouts()/ext4_dax_unmap_pages()/ rename? Let's just call it from ext4_insert_range(). I think the simple semantics Dave and Darrick defend is more maintainable and insert range isn't really performance critical operation. The question remains whether equivalent of BREAK_UNMAP is really required also for allocation of new blocks using fallocate. Because that doesn't really seem fundamentally different from normal write which uses BREAK_WRITE for xfs_break_layouts(). And that it more often used operation so bothering with GUP synchronization when not needed could hurt there. Dave, Darrick? Honza -- Jan Kara <jack-IBi9RG/b67k@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180709123347.nw3ixr64prgk7sxz-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org>]
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180709123347.nw3ixr64prgk7sxz-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> @ 2018-07-09 16:23 ` Darrick J. Wong 2018-07-09 19:49 ` Jan Kara 0 siblings, 1 reply; 26+ messages in thread From: Darrick J. Wong @ 2018-07-09 16:23 UTC (permalink / raw) To: Jan Kara Cc: linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner, Christoph Hellwig, linux-ext4-u79uwXL29TY76Z2rM5mHXA On Mon, Jul 09, 2018 at 02:33:47PM +0200, Jan Kara wrote: > On Thu 05-07-18 10:53:10, Ross Zwisler wrote: > > On Wed, Jul 04, 2018 at 08:59:52PM -0700, Darrick J. Wong wrote: > > > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: > > > > On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > > > > > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > > > > > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > > > > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > > > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > > > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > > > > > > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > > > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > > > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > > > --- > > > > > > > > > > > > > > Changes since v2: > > > > > > > * Added a comment to ext4_insert_range() explaining why we don't call > > > > > > > ext4_break_layouts(). (Jan) > > > > > > > > > > > > Which I think is wrong and will cause data corruption. > > > > > > > > > > > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > > > > > > > LLONG_MAX); > > > > > > > if (ret) > > > > > > > goto out_mmap; > > > > > > > + /* > > > > > > > + * We don't need to call ext4_break_layouts() because we aren't > > > > > > > + * removing any blocks from the inode. We are just changing their > > > > > > > + * offset by inserting a hole. > > > > > > > + */ > > > > > > Does calling ext4_break_layouts from insert range not work? > > > > > > It's my understanding that file leases work are a mechanism for the > > > filesystem to delegate some of its authority over physical space > > > mappings to "client" software. AFAICT it's used for mmap'ing pmem > > > directly into userspace and exporting space on shared storage over > > > pNFS. Some day we might use the same mechanism for the similar things > > > that RDMA does, or the swapfile code since that's essentially how it > > > works today. > > > > > > The other part of these file leases is that the filesystem revokes them > > > any time it wants to perform a mapping operation on a file. This breaks > > > my mental model of how leases work, and if you commit to this for ext4 > > > then I'll have to remember that leases are different between xfs and > > > ext4. Worse, since the reason for skipping ext4_break_layouts seems to > > > be the implementation detail that "DAX won't care", then someone else > > > wiring up pNFS/future RDMA/whatever will also have to remember to put it > > > back into ext4 or else kaboom. > > > > > > Granted, Dave said all these things already, but I actually feel > > > strongly enough to reiterate. > > > > Jan, would you like me to call ext4_break_layouts() in ext4_insert_range() to > > keep the lease mechanism consistent between ext4 and XFS, or would you prefer > > the s/ext4_break_layouts()/ext4_dax_unmap_pages()/ rename? > > Let's just call it from ext4_insert_range(). I think the simple semantics > Dave and Darrick defend is more maintainable and insert range isn't really > performance critical operation. > > The question remains whether equivalent of BREAK_UNMAP is really required > also for allocation of new blocks using fallocate. Because that doesn't > really seem fundamentally different from normal write which uses > BREAK_WRITE for xfs_break_layouts(). And that it more often used operation > so bothering with GUP synchronization when not needed could hurt there. > Dave, Darrick? Hmm, IIRC BREAK_UNMAP is supposed to be for callers who are going to remove (or move) mappings that already exist, so that the caller blocks until the lessee acknowledges that they've forgotten all the mappings they knew about. So I /think/ for fallocate mode 0 I think this could be BREAK_WRITE instead of _UNMAP, though (at least for xfs) the other modes all need _UNMAP. Side question: in xfs_file_aio_write_checks, do we need to do BREAK_UNMAP if is possible that writeback will end up performing a copy write? Granted, the pnfs export and dax stuff don't support reflink or cow so I guess this is an academic question for now... --D > Honza > -- > Jan Kara <jack-IBi9RG/b67k@public.gmane.org> > SUSE Labs, CR ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings 2018-07-09 16:23 ` Darrick J. Wong @ 2018-07-09 19:49 ` Jan Kara 0 siblings, 0 replies; 26+ messages in thread From: Jan Kara @ 2018-07-09 19:49 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw, Dave Chinner, Christoph Hellwig, linux-ext4-u79uwXL29TY76Z2rM5mHXA On Mon 09-07-18 09:23:38, Darrick J. Wong wrote: > On Mon, Jul 09, 2018 at 02:33:47PM +0200, Jan Kara wrote: > > On Thu 05-07-18 10:53:10, Ross Zwisler wrote: > > > On Wed, Jul 04, 2018 at 08:59:52PM -0700, Darrick J. Wong wrote: > > > > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: > > > > > On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > > > > > > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > > > > > > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > > > > > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > > > > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > > > > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > > > > > > > > > > > > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > > > > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > > > > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > > > > > > --- > > > > > > > > > > > > > > > > Changes since v2: > > > > > > > > * Added a comment to ext4_insert_range() explaining why we don't call > > > > > > > > ext4_break_layouts(). (Jan) > > > > > > > > > > > > > > Which I think is wrong and will cause data corruption. > > > > > > > > > > > > > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > > > > > > > > LLONG_MAX); > > > > > > > > if (ret) > > > > > > > > goto out_mmap; > > > > > > > > + /* > > > > > > > > + * We don't need to call ext4_break_layouts() because we aren't > > > > > > > > + * removing any blocks from the inode. We are just changing their > > > > > > > > + * offset by inserting a hole. > > > > > > > > + */ > > > > > > > > Does calling ext4_break_layouts from insert range not work? > > > > > > > > It's my understanding that file leases work are a mechanism for the > > > > filesystem to delegate some of its authority over physical space > > > > mappings to "client" software. AFAICT it's used for mmap'ing pmem > > > > directly into userspace and exporting space on shared storage over > > > > pNFS. Some day we might use the same mechanism for the similar things > > > > that RDMA does, or the swapfile code since that's essentially how it > > > > works today. > > > > > > > > The other part of these file leases is that the filesystem revokes them > > > > any time it wants to perform a mapping operation on a file. This breaks > > > > my mental model of how leases work, and if you commit to this for ext4 > > > > then I'll have to remember that leases are different between xfs and > > > > ext4. Worse, since the reason for skipping ext4_break_layouts seems to > > > > be the implementation detail that "DAX won't care", then someone else > > > > wiring up pNFS/future RDMA/whatever will also have to remember to put it > > > > back into ext4 or else kaboom. > > > > > > > > Granted, Dave said all these things already, but I actually feel > > > > strongly enough to reiterate. > > > > > > Jan, would you like me to call ext4_break_layouts() in ext4_insert_range() to > > > keep the lease mechanism consistent between ext4 and XFS, or would you prefer > > > the s/ext4_break_layouts()/ext4_dax_unmap_pages()/ rename? > > > > Let's just call it from ext4_insert_range(). I think the simple semantics > > Dave and Darrick defend is more maintainable and insert range isn't really > > performance critical operation. > > > > The question remains whether equivalent of BREAK_UNMAP is really required > > also for allocation of new blocks using fallocate. Because that doesn't > > really seem fundamentally different from normal write which uses > > BREAK_WRITE for xfs_break_layouts(). And that it more often used operation > > so bothering with GUP synchronization when not needed could hurt there. > > Dave, Darrick? > > Hmm, IIRC BREAK_UNMAP is supposed to be for callers who are going to > remove (or move) mappings that already exist, so that the caller blocks > until the lessee acknowledges that they've forgotten all the mappings > they knew about. So I /think/ for fallocate mode 0 I think this could > be BREAK_WRITE instead of _UNMAP, though (at least for xfs) the other > modes all need _UNMAP. OK, so we are on the same page here :) > Side question: in xfs_file_aio_write_checks, do we need to do > BREAK_UNMAP if is possible that writeback will end up performing a copy > write? Granted, the pnfs export and dax stuff don't support reflink or > cow so I guess this is an academic question for now... My naive understanding would be that yes, BREAK_UNMAP is needed in such case... Honza -- Jan Kara <jack-IBi9RG/b67k@public.gmane.org> SUSE Labs, CR ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings 2018-07-05 3:59 ` Darrick J. Wong 2018-07-05 16:53 ` Ross Zwisler @ 2018-07-05 20:40 ` Dan Williams [not found] ` <CAPcyv4jSNh95XUPh4ZzguKmcJpgNG7AG5_9=+gbLEjsaZUTq4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 26+ messages in thread From: Dan Williams @ 2018-07-05 20:40 UTC (permalink / raw) To: Darrick J. Wong Cc: Jan Kara, linux-nvdimm, Dave Chinner, linux-ext4, Christoph Hellwig On Wed, Jul 4, 2018 at 8:59 PM, Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: >> On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: >> > On Wed 04-07-18 10:49:23, Dave Chinner wrote: >> > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: >> > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between >> > > > operations in ext4 which remove blocks from an inode (hole punch, truncate >> > > > down, etc.) and pages which are pinned due to DAX DMA operations. >> > > > >> > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> >> > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> >> > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> >> > > > --- >> > > > >> > > > Changes since v2: >> > > > * Added a comment to ext4_insert_range() explaining why we don't call >> > > > ext4_break_layouts(). (Jan) >> > > >> > > Which I think is wrong and will cause data corruption. >> > > >> > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) >> > > > LLONG_MAX); >> > > > if (ret) >> > > > goto out_mmap; >> > > > + /* >> > > > + * We don't need to call ext4_break_layouts() because we aren't >> > > > + * removing any blocks from the inode. We are just changing their >> > > > + * offset by inserting a hole. >> > > > + */ > > Does calling ext4_break_layouts from insert range not work? > > It's my understanding that file leases work are a mechanism for the > filesystem to delegate some of its authority over physical space > mappings to "client" software. AFAICT it's used for mmap'ing pmem > directly into userspace and exporting space on shared storage over > pNFS. Some day we might use the same mechanism for the similar things > that RDMA does, or the swapfile code since that's essentially how it > works today. > > The other part of these file leases is that the filesystem revokes them > any time it wants to perform a mapping operation on a file. This breaks > my mental model of how leases work, and if you commit to this for ext4 > then I'll have to remember that leases are different between xfs and > ext4. Worse, since the reason for skipping ext4_break_layouts seems to > be the implementation detail that "DAX won't care", then someone else > wiring up pNFS/future RDMA/whatever will also have to remember to put it > back into ext4 or else kaboom. > > Granted, Dave said all these things already, but I actually feel > strongly enough to reiterate. This patch kit is only for the DAX fix, this isn't full layout lease support. Even XFS is special casing unmap with the BREAK_UNMAP flag. So ext4 is achieving feature parity for BREAK_UNMAP, just not BREAK_WRITE, yet. ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <CAPcyv4jSNh95XUPh4ZzguKmcJpgNG7AG5_9=+gbLEjsaZUTq4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <CAPcyv4jSNh95XUPh4ZzguKmcJpgNG7AG5_9=+gbLEjsaZUTq4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2018-07-05 23:29 ` Dave Chinner 2018-07-06 5:08 ` Dan Williams 2018-07-09 9:59 ` Lukas Czerner 0 siblings, 2 replies; 26+ messages in thread From: Dave Chinner @ 2018-07-05 23:29 UTC (permalink / raw) To: Dan Williams Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, linux-ext4, Christoph Hellwig On Thu, Jul 05, 2018 at 01:40:17PM -0700, Dan Williams wrote: > On Wed, Jul 4, 2018 at 8:59 PM, Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: > >> On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > >> > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > >> > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > >> > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > >> > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > >> > > > down, etc.) and pages which are pinned due to DAX DMA operations. > >> > > > > >> > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > >> > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > >> > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > >> > > > --- > >> > > > > >> > > > Changes since v2: > >> > > > * Added a comment to ext4_insert_range() explaining why we don't call > >> > > > ext4_break_layouts(). (Jan) > >> > > > >> > > Which I think is wrong and will cause data corruption. > >> > > > >> > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > >> > > > LLONG_MAX); > >> > > > if (ret) > >> > > > goto out_mmap; > >> > > > + /* > >> > > > + * We don't need to call ext4_break_layouts() because we aren't > >> > > > + * removing any blocks from the inode. We are just changing their > >> > > > + * offset by inserting a hole. > >> > > > + */ > > > > Does calling ext4_break_layouts from insert range not work? > > > > It's my understanding that file leases work are a mechanism for the > > filesystem to delegate some of its authority over physical space > > mappings to "client" software. AFAICT it's used for mmap'ing pmem > > directly into userspace and exporting space on shared storage over > > pNFS. Some day we might use the same mechanism for the similar things > > that RDMA does, or the swapfile code since that's essentially how it > > works today. > > > > The other part of these file leases is that the filesystem revokes them > > any time it wants to perform a mapping operation on a file. This breaks > > my mental model of how leases work, and if you commit to this for ext4 > > then I'll have to remember that leases are different between xfs and > > ext4. Worse, since the reason for skipping ext4_break_layouts seems to > > be the implementation detail that "DAX won't care", then someone else > > wiring up pNFS/future RDMA/whatever will also have to remember to put it > > back into ext4 or else kaboom. > > > > Granted, Dave said all these things already, but I actually feel > > strongly enough to reiterate. > > This patch kit is only for the DAX fix, this isn't full layout lease > support. Even XFS is special casing unmap with the BREAK_UNMAP flag. > So ext4 is achieving feature parity for BREAK_UNMAP, just not > BREAK_WRITE, yet. BREAK_UNMAP is issued unconditionally by XFS for all fallocate operations. There is no special except for extent shifting (up or down) in XFS as this patch set is making for ext4. IOWs, this patchset does not implement BREAK_UNMAP with the same semantics as XFS. Cheers, Dave. -- Dave Chinner david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings 2018-07-05 23:29 ` Dave Chinner @ 2018-07-06 5:08 ` Dan Williams 2018-07-09 9:59 ` Lukas Czerner 1 sibling, 0 replies; 26+ messages in thread From: Dan Williams @ 2018-07-06 5:08 UTC (permalink / raw) To: Dave Chinner Cc: ext4 hackers, Darrick J. Wong, Jan Kara, Christoph Hellwig, linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org On Thu, Jul 5, 2018 at 4:29 PM Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org> wrote: > On Thu, Jul 05, 2018 at 01:40:17PM -0700, Dan Williams wrote: > > On Wed, Jul 4, 2018 at 8:59 PM, Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> > wrote: > > > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: > > >> On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > > >> > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > > >> > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > >> > > > Follow the lead of xfs_break_dax_layouts() and add > synchronization between > > >> > > > operations in ext4 which remove blocks from an inode (hole > punch, truncate > > >> > > > down, etc.) and pages which are pinned due to DAX DMA > operations. > > >> > > > > > >> > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > >> > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > >> > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > >> > > > --- > > >> > > > > > >> > > > Changes since v2: > > >> > > > * Added a comment to ext4_insert_range() explaining why we > don't call > > >> > > > ext4_break_layouts(). (Jan) > > >> > > > > >> > > Which I think is wrong and will cause data corruption. > > >> > > > > >> > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode > *inode, loff_t offset, loff_t len) > > >> > > > LLONG_MAX); > > >> > > > if (ret) > > >> > > > goto out_mmap; > > >> > > > + /* > > >> > > > + * We don't need to call ext4_break_layouts() because > we aren't > > >> > > > + * removing any blocks from the inode. We are just > changing their > > >> > > > + * offset by inserting a hole. > > >> > > > + */ > > > > > > Does calling ext4_break_layouts from insert range not work? > > > > > > It's my understanding that file leases work are a mechanism for the > > > filesystem to delegate some of its authority over physical space > > > mappings to "client" software. AFAICT it's used for mmap'ing pmem > > > directly into userspace and exporting space on shared storage over > > > pNFS. Some day we might use the same mechanism for the similar things > > > that RDMA does, or the swapfile code since that's essentially how it > > > works today. > > > > > > The other part of these file leases is that the filesystem revokes them > > > any time it wants to perform a mapping operation on a file. This > breaks > > > my mental model of how leases work, and if you commit to this for ext4 > > > then I'll have to remember that leases are different between xfs and > > > ext4. Worse, since the reason for skipping ext4_break_layouts seems to > > > be the implementation detail that "DAX won't care", then someone else > > > wiring up pNFS/future RDMA/whatever will also have to remember to put > it > > > back into ext4 or else kaboom. > > > > > > Granted, Dave said all these things already, but I actually feel > > > strongly enough to reiterate. > > > > This patch kit is only for the DAX fix, this isn't full layout lease > > support. Even XFS is special casing unmap with the BREAK_UNMAP flag. > > So ext4 is achieving feature parity for BREAK_UNMAP, just not > > BREAK_WRITE, yet. > > BREAK_UNMAP is issued unconditionally by XFS for all fallocate > operations. There is no special except for extent shifting (up or > down) in XFS as this patch set is making for ext4. IOWs, this > patchset does not implement BREAK_UNMAP with the same semantics as > XFS. Ah true, I see that now. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings 2018-07-05 23:29 ` Dave Chinner 2018-07-06 5:08 ` Dan Williams @ 2018-07-09 9:59 ` Lukas Czerner [not found] ` <20180709095907.i3mnyodvn6gpcidt-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 1 sibling, 1 reply; 26+ messages in thread From: Lukas Czerner @ 2018-07-09 9:59 UTC (permalink / raw) To: Dave Chinner Cc: Jan Kara, linux-nvdimm, Darrick J. Wong, linux-ext4, Christoph Hellwig On Fri, Jul 06, 2018 at 09:29:34AM +1000, Dave Chinner wrote: > On Thu, Jul 05, 2018 at 01:40:17PM -0700, Dan Williams wrote: > > On Wed, Jul 4, 2018 at 8:59 PM, Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: > > >> On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > > >> > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > > >> > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > >> > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > >> > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > >> > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > >> > > > > > >> > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > >> > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > >> > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > >> > > > --- > > >> > > > > > >> > > > Changes since v2: > > >> > > > * Added a comment to ext4_insert_range() explaining why we don't call > > >> > > > ext4_break_layouts(). (Jan) > > >> > > > > >> > > Which I think is wrong and will cause data corruption. > > >> > > > > >> > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > > >> > > > LLONG_MAX); > > >> > > > if (ret) > > >> > > > goto out_mmap; > > >> > > > + /* > > >> > > > + * We don't need to call ext4_break_layouts() because we aren't > > >> > > > + * removing any blocks from the inode. We are just changing their > > >> > > > + * offset by inserting a hole. > > >> > > > + */ > > > > > > Does calling ext4_break_layouts from insert range not work? > > > > > > It's my understanding that file leases work are a mechanism for the > > > filesystem to delegate some of its authority over physical space > > > mappings to "client" software. AFAICT it's used for mmap'ing pmem > > > directly into userspace and exporting space on shared storage over > > > pNFS. Some day we might use the same mechanism for the similar things > > > that RDMA does, or the swapfile code since that's essentially how it > > > works today. > > > > > > The other part of these file leases is that the filesystem revokes them > > > any time it wants to perform a mapping operation on a file. This breaks > > > my mental model of how leases work, and if you commit to this for ext4 > > > then I'll have to remember that leases are different between xfs and > > > ext4. Worse, since the reason for skipping ext4_break_layouts seems to > > > be the implementation detail that "DAX won't care", then someone else > > > wiring up pNFS/future RDMA/whatever will also have to remember to put it > > > back into ext4 or else kaboom. > > > > > > Granted, Dave said all these things already, but I actually feel > > > strongly enough to reiterate. > > > > This patch kit is only for the DAX fix, this isn't full layout lease > > support. Even XFS is special casing unmap with the BREAK_UNMAP flag. > > So ext4 is achieving feature parity for BREAK_UNMAP, just not > > BREAK_WRITE, yet. > > BREAK_UNMAP is issued unconditionally by XFS for all fallocate > operations. There is no special except for extent shifting (up or > down) in XFS as this patch set is making for ext4. IOWs, this > patchset does not implement BREAK_UNMAP with the same semantics as > XFS. If anything this is very usefull discussion ( at least for me ) and what I do take away from it is that there is no documentation, nor specification of the leases nor BREAK_UNMAP nor BREAK_WRITE. grep -iR -e break_layout -e BREAK_UNMAP -e BREAK_WRITE Documentation/* Maybe someone with a good understanding of how this stuff is supposed to be done could write it down so filesystem devs can make it behave the same. Thanks! -Lukas > > Cheers, > > Dave. > -- > Dave Chinner > david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <20180709095907.i3mnyodvn6gpcidt-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org>]
* Re: [PATCH v3 2/2] ext4: handle layout changes to pinned DAX mappings [not found] ` <20180709095907.i3mnyodvn6gpcidt-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> @ 2018-07-09 16:18 ` Darrick J. Wong 0 siblings, 0 replies; 26+ messages in thread From: Darrick J. Wong @ 2018-07-09 16:18 UTC (permalink / raw) To: Lukas Czerner Cc: Jan Kara, linux-nvdimm, Dave Chinner, Christoph Hellwig, linux-ext4 On Mon, Jul 09, 2018 at 11:59:07AM +0200, Lukas Czerner wrote: > On Fri, Jul 06, 2018 at 09:29:34AM +1000, Dave Chinner wrote: > > On Thu, Jul 05, 2018 at 01:40:17PM -0700, Dan Williams wrote: > > > On Wed, Jul 4, 2018 at 8:59 PM, Darrick J. Wong <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote: > > > > On Thu, Jul 05, 2018 at 09:54:14AM +1000, Dave Chinner wrote: > > > >> On Wed, Jul 04, 2018 at 02:27:23PM +0200, Jan Kara wrote: > > > >> > On Wed 04-07-18 10:49:23, Dave Chinner wrote: > > > >> > > On Mon, Jul 02, 2018 at 11:29:12AM -0600, Ross Zwisler wrote: > > > >> > > > Follow the lead of xfs_break_dax_layouts() and add synchronization between > > > >> > > > operations in ext4 which remove blocks from an inode (hole punch, truncate > > > >> > > > down, etc.) and pages which are pinned due to DAX DMA operations. > > > >> > > > > > > >> > > > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> > > > >> > > > Reviewed-by: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org> > > > >> > > > Reviewed-by: Lukas Czerner <lczerner-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> > > > >> > > > --- > > > >> > > > > > > >> > > > Changes since v2: > > > >> > > > * Added a comment to ext4_insert_range() explaining why we don't call > > > >> > > > ext4_break_layouts(). (Jan) > > > >> > > > > > >> > > Which I think is wrong and will cause data corruption. > > > >> > > > > > >> > > > @@ -5651,6 +5663,11 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len) > > > >> > > > LLONG_MAX); > > > >> > > > if (ret) > > > >> > > > goto out_mmap; > > > >> > > > + /* > > > >> > > > + * We don't need to call ext4_break_layouts() because we aren't > > > >> > > > + * removing any blocks from the inode. We are just changing their > > > >> > > > + * offset by inserting a hole. > > > >> > > > + */ > > > > > > > > Does calling ext4_break_layouts from insert range not work? > > > > > > > > It's my understanding that file leases work are a mechanism for the > > > > filesystem to delegate some of its authority over physical space > > > > mappings to "client" software. AFAICT it's used for mmap'ing pmem > > > > directly into userspace and exporting space on shared storage over > > > > pNFS. Some day we might use the same mechanism for the similar things > > > > that RDMA does, or the swapfile code since that's essentially how it > > > > works today. > > > > > > > > The other part of these file leases is that the filesystem revokes them > > > > any time it wants to perform a mapping operation on a file. This breaks > > > > my mental model of how leases work, and if you commit to this for ext4 > > > > then I'll have to remember that leases are different between xfs and > > > > ext4. Worse, since the reason for skipping ext4_break_layouts seems to > > > > be the implementation detail that "DAX won't care", then someone else > > > > wiring up pNFS/future RDMA/whatever will also have to remember to put it > > > > back into ext4 or else kaboom. > > > > > > > > Granted, Dave said all these things already, but I actually feel > > > > strongly enough to reiterate. > > > > > > This patch kit is only for the DAX fix, this isn't full layout lease > > > support. Even XFS is special casing unmap with the BREAK_UNMAP flag. > > > So ext4 is achieving feature parity for BREAK_UNMAP, just not > > > BREAK_WRITE, yet. > > > > BREAK_UNMAP is issued unconditionally by XFS for all fallocate > > operations. There is no special except for extent shifting (up or > > down) in XFS as this patch set is making for ext4. IOWs, this > > patchset does not implement BREAK_UNMAP with the same semantics as > > XFS. > > If anything this is very usefull discussion ( at least for me ) and what > I do take away from it is that there is no documentation, nor > specification of the leases nor BREAK_UNMAP nor BREAK_WRITE. > > grep -iR -e break_layout -e BREAK_UNMAP -e BREAK_WRITE Documentation/* > > Maybe someone with a good understanding of how this stuff is supposed to > be done could write it down so filesystem devs can make it behave the > same. Dan? :) IIRC, BREAK_WRITE means "terminate all leases immediately" as the caller prepares to write to a file range (which may or may not involve adding more mappings), whereas BREAK_UNMAP means "terminate all leases and wait until the lessee acknowledges" as the caller prepares to remove (or move) file extent mappings. --D > Thanks! > -Lukas > > > > > > Cheers, > > > > Dave. > > -- > > Dave Chinner > > david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org ^ permalink raw reply [flat|nested] 26+ messages in thread
end of thread, other threads:[~2018-07-09 19:49 UTC | newest] Thread overview: 26+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2018-06-27 21:22 [PATCH v2 0/2] ext4: fix DAX dma vs truncate/hole-punch Ross Zwisler [not found] ` <20180627212252.31032-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-06-27 21:22 ` [PATCH v2 1/2] dax: dax_layout_busy_page() warn on !exceptional Ross Zwisler [not found] ` <20180627212252.31032-2-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-07-02 22:15 ` Theodore Y. Ts'o [not found] ` <20180702221503.GA12830-AKGzg7BKzIDYtjvyW6yDsg@public.gmane.org> 2018-07-03 15:41 ` Ross Zwisler [not found] ` <20180703154137.GB13019-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-07-03 17:44 ` Theodore Y. Ts'o 2018-06-27 21:22 ` [PATCH v2 2/2] ext4: handle layout changes to pinned DAX mappings Ross Zwisler [not found] ` <20180627212252.31032-3-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-06-29 12:02 ` Lukas Czerner [not found] ` <20180629120223.oaslngsvspnwf4ae-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2018-06-29 15:13 ` Ross Zwisler [not found] ` <20180629151300.GA3006-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-07-02 7:34 ` Jan Kara 2018-07-02 7:59 ` Lukas Czerner [not found] ` <20180702075948.i4aqjg5rrorwoxqj-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2018-07-02 16:27 ` Ross Zwisler 2018-06-30 1:12 ` Dave Chinner 2018-07-02 17:29 ` [PATCH v3 " Ross Zwisler [not found] ` <20180702172912.329-1-ross.zwisler-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-07-04 0:49 ` Dave Chinner 2018-07-04 12:27 ` Jan Kara [not found] ` <20180704122723.lup2wovzb6u6ta6v-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> 2018-07-04 23:54 ` Dave Chinner 2018-07-05 3:59 ` Darrick J. Wong 2018-07-05 16:53 ` Ross Zwisler [not found] ` <20180705165310.GB22200-VuQAYsv1563Yd54FQh9/CA@public.gmane.org> 2018-07-09 12:33 ` Jan Kara [not found] ` <20180709123347.nw3ixr64prgk7sxz-4I4JzKEfoa/jFM9bn6wA6Q@public.gmane.org> 2018-07-09 16:23 ` Darrick J. Wong 2018-07-09 19:49 ` Jan Kara 2018-07-05 20:40 ` Dan Williams [not found] ` <CAPcyv4jSNh95XUPh4ZzguKmcJpgNG7AG5_9=+gbLEjsaZUTq4g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2018-07-05 23:29 ` Dave Chinner 2018-07-06 5:08 ` Dan Williams 2018-07-09 9:59 ` Lukas Czerner [not found] ` <20180709095907.i3mnyodvn6gpcidt-bi+AKbBUZKY6gyzm1THtWbp2dZbC/Bob@public.gmane.org> 2018-07-09 16:18 ` Darrick J. Wong
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).