public inbox for linux-xfs@vger.kernel.org
* [PATCH v2 0/7] xfs: invalidation and related fixes for v3.17-rc3
@ 2014-08-28 11:49 Dave Chinner
  2014-08-28 11:49 ` [PATCH 1/7] xfs: don't dirty buffers beyond EOF Dave Chinner
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 11:49 UTC (permalink / raw)
  To: xfs

Hi folks,

These are the patches that I'm planning to send to Linus for
3.17-rc3. I've been testing them all week, and I'll push them out
into a topic branch tomorrow morning for a linux-next build before
pushing them to Linus.

Only the first patch doesn't have a review; there was a bit of
discussion about the problem that the patch fixes but no changes
have really been suggested from that discussion. I've fixed the one
problem that was pointed out, so please have a look over it...

I also added Brian's eofblocks trimming for collapse range fix to
the set as it fixed another problem I was tripping over in testing
and it's simple enough to consider a valid -rc fix.

Cheers,

Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [PATCH 1/7] xfs: don't dirty buffers beyond EOF
  2014-08-28 11:49 [PATCH v2 0/7] xfs: invalidation and related fixes for v3.17-rc3 Dave Chinner
@ 2014-08-28 11:49 ` Dave Chinner
  2014-08-28 13:34   ` Brian Foster
  2014-08-29  0:39   ` [PATCH 1/7] " Christoph Hellwig
  2014-08-28 11:49 ` [PATCH 2/7] xfs: don't zero partial page cache pages during O_DIRECT writes Dave Chinner
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 11:49 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:

1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)

where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.

The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?

Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
what?

OK, set_buffer_dirty() is called on all buffers from
__set_page_buffers_dirty(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Note: the solution the other filesystems and the generic block code
use, marking the buffers clean in ->writepage, does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.

cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 11e9b4c..4c79f90 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1753,11 +1753,69 @@ xfs_vm_readpages(
 	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
 }
 
+/*
+ * This is basically a copy of __set_page_dirty_buffers() with one
+ * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
+ * dirty, we'll never be able to clean them because we don't write buffers
+ * beyond EOF, and that means we can't invalidate pages that span EOF
+ * that have been marked dirty. Further, the dirty state can leak into
+ * the file interior if the file is extended, resulting in all sorts of
+ * bad things happening as the state does not match the underlying data.
+ */
+STATIC int
+xfs_vm_set_page_dirty(
+	struct page		*page)
+{
+	struct address_space	*mapping = page->mapping;
+	struct inode		*inode = mapping->host;
+	loff_t			end_offset;
+	loff_t			offset;
+	int			newly_dirty;
+
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	end_offset = i_size_read(inode);
+	offset = end_offset & PAGE_CACHE_MASK;
+
+	spin_lock(&mapping->private_lock);
+	if (page_has_buffers(page)) {
+		struct buffer_head *head = page_buffers(page);
+		struct buffer_head *bh = head;
+
+		do {
+			if (offset < end_offset)
+				set_buffer_dirty(bh);
+			bh = bh->b_this_page;
+			offset += 1 << inode->i_blkbits;
+		} while (bh != head);
+	}
+	newly_dirty = !TestSetPageDirty(page);
+	spin_unlock(&mapping->private_lock);
+
+	if (newly_dirty) {
+		/* sigh - __set_page_dirty() is static, so copy it here, too */
+		unsigned long flags;
+
+		spin_lock_irqsave(&mapping->tree_lock, flags);
+		if (page->mapping) {	/* Race with truncate? */
+			WARN_ON_ONCE(!PageUptodate(page));
+			account_page_dirtied(page, mapping);
+			radix_tree_tag_set(&mapping->page_tree,
+					page_index(page), PAGECACHE_TAG_DIRTY);
+		}
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	}
+	return newly_dirty;
+}
+
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readpages		= xfs_vm_readpages,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
+	.set_page_dirty		= xfs_vm_set_page_dirty,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
 	.write_begin		= xfs_vm_write_begin,
-- 
2.0.0


* [PATCH 2/7] xfs: don't zero partial page cache pages during O_DIRECT writes
  2014-08-28 11:49 [PATCH v2 0/7] xfs: invalidation and related fixes for v3.17-rc3 Dave Chinner
  2014-08-28 11:49 ` [PATCH 1/7] xfs: don't dirty buffers beyond EOF Dave Chinner
@ 2014-08-28 11:49 ` Dave Chinner
  2014-08-29  0:39   ` Christoph Hellwig
  2014-08-28 11:49 ` [PATCH 3/7] " Dave Chinner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 11:49 UTC (permalink / raw)
  To: xfs

From: Chris Mason <clm@fb.com>

xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads.  This is different from the other filesystems, which
only invalidate pages during DIO writes.

truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page.  This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.

Buffered reads will then find an up-to-date page full of zeros instead
of the data actually on disk.

This patch fixes things by using invalidate_inode_pages2_range
instead.  It preserves the page cache invalidation, but won't zero
any pages.

[dchinner: catch error and warn if it fails. Comment.]

cc: stable@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_file.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 076b170..827cfb2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -296,7 +296,16 @@ xfs_file_read_iter(
 				xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
 				return ret;
 			}
-			truncate_pagecache_range(VFS_I(ip), pos, -1);
+
+			/*
+			 * Invalidate whole pages. This can return an error if
+			 * we fail to invalidate a page, but this should never
+			 * happen on XFS. Warn if it does fail.
+			 */
+			ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+						pos >> PAGE_CACHE_SHIFT, -1);
+			WARN_ON_ONCE(ret);
+			ret = 0;
 		}
 		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
 	}
-- 
2.0.0


* [PATCH 3/7] xfs: don't zero partial page cache pages during O_DIRECT writes
  2014-08-28 11:49 [PATCH v2 0/7] xfs: invalidation and related fixes for v3.17-rc3 Dave Chinner
  2014-08-28 11:49 ` [PATCH 1/7] xfs: don't dirty buffers beyond EOF Dave Chinner
  2014-08-28 11:49 ` [PATCH 2/7] xfs: don't zero partial page cache pages during O_DIRECT writes Dave Chinner
@ 2014-08-28 11:49 ` Dave Chinner
  2014-08-29  0:39   ` Christoph Hellwig
  2014-08-28 11:49 ` [PATCH 4/7] xfs: use ranged writeback and invalidation for direct IO Dave Chinner
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 11:49 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.

This patch fixes things by using invalidate_inode_pages2_range
instead.  It preserves the page cache invalidation, but won't zero
any pages.

cc: stable@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_file.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 827cfb2..19917fa 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -644,7 +644,15 @@ xfs_file_dio_aio_write(
 						    pos, -1);
 		if (ret)
 			goto out;
-		truncate_pagecache_range(VFS_I(ip), pos, -1);
+		/*
+		 * Invalidate whole pages. This can return an error if
+		 * we fail to invalidate a page, but this should never
+		 * happen on XFS. Warn if it does fail.
+		 */
+		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+						pos >> PAGE_CACHE_SHIFT, -1);
+		WARN_ON_ONCE(ret);
+		ret = 0;
 	}
 
 	/*
-- 
2.0.0


* [PATCH 4/7] xfs: use ranged writeback and invalidation for direct IO
  2014-08-28 11:49 [PATCH v2 0/7] xfs: invalidation and related fixes for v3.17-rc3 Dave Chinner
                   ` (2 preceding siblings ...)
  2014-08-28 11:49 ` [PATCH 3/7] " Dave Chinner
@ 2014-08-28 11:49 ` Dave Chinner
  2014-08-29  0:40   ` Christoph Hellwig
  2014-08-28 11:49 ` [PATCH 5/7] xfs: don't log inode unless extent shift makes extent modifications Dave Chinner
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 11:49 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Now that we are not doing silly things with dirtying buffers beyond
EOF and are using invalidation correctly, we can finally reduce the
ranges of writeback and invalidation used by direct IO to match that
of the IO being issued.

Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 19917fa..de5368c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -291,7 +291,7 @@ xfs_file_read_iter(
 		if (inode->i_mapping->nrpages) {
 			ret = filemap_write_and_wait_range(
 							VFS_I(ip)->i_mapping,
-							pos, -1);
+							pos, pos + size - 1);
 			if (ret) {
 				xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
 				return ret;
@@ -303,7 +303,8 @@ xfs_file_read_iter(
 			 * happen on XFS. Warn if it does fail.
 			 */
 			ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
-						pos >> PAGE_CACHE_SHIFT, -1);
+					pos >> PAGE_CACHE_SHIFT,
+					(pos + size - 1) >> PAGE_CACHE_SHIFT);
 			WARN_ON_ONCE(ret);
 			ret = 0;
 		}
@@ -641,7 +642,7 @@ xfs_file_dio_aio_write(
 
 	if (mapping->nrpages) {
 		ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-						    pos, -1);
+						    pos, pos + count - 1);
 		if (ret)
 			goto out;
 		/*
@@ -650,7 +651,8 @@ xfs_file_dio_aio_write(
 		 * happen on XFS. Warn if it does fail.
 		 */
 		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
-						pos >> PAGE_CACHE_SHIFT, -1);
+					pos >> PAGE_CACHE_SHIFT,
+					(pos + count - 1) >> PAGE_CACHE_SHIFT);
 		WARN_ON_ONCE(ret);
 		ret = 0;
 	}
-- 
2.0.0


* [PATCH 5/7] xfs: don't log inode unless extent shift makes extent modifications
  2014-08-28 11:49 [PATCH v2 0/7] xfs: invalidation and related fixes for v3.17-rc3 Dave Chinner
                   ` (3 preceding siblings ...)
  2014-08-28 11:49 ` [PATCH 4/7] xfs: use ranged writeback and invalidation for direct IO Dave Chinner
@ 2014-08-28 11:49 ` Dave Chinner
  2014-08-29  0:41   ` Christoph Hellwig
  2014-08-28 11:49 ` [PATCH 6/7] xfs: xfs_file_collapse_range is delalloc challenged Dave Chinner
  2014-08-28 11:49 ` [PATCH 7/7] xfs: trim eofblocks before collapse range Dave Chinner
  6 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 11:49 UTC (permalink / raw)
  To: xfs

From: Brian Foster <bfoster@redhat.com>

The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining extents downward.

The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.

Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 3a6a700..e5c2518 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5422,7 +5422,7 @@ xfs_bmap_shift_extents(
 	struct xfs_bmap_free	*flist,
 	int			num_exts)
 {
-	struct xfs_btree_cur		*cur;
+	struct xfs_btree_cur		*cur = NULL;
 	struct xfs_bmbt_rec_host	*gotp;
 	struct xfs_bmbt_irec            got;
 	struct xfs_bmbt_irec		left;
@@ -5433,7 +5433,7 @@ xfs_bmap_shift_extents(
 	int				error = 0;
 	int				i;
 	int				whichfork = XFS_DATA_FORK;
-	int				logflags;
+	int				logflags = 0;
 	xfs_filblks_t			blockcount = 0;
 	int				total_extents;
 
@@ -5476,16 +5476,11 @@ xfs_bmap_shift_extents(
 		}
 	}
 
-	/* We are going to change core inode */
-	logflags = XFS_ILOG_CORE;
 	if (ifp->if_flags & XFS_IFBROOT) {
 		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
 		cur->bc_private.b.firstblock = *firstblock;
 		cur->bc_private.b.flist = flist;
 		cur->bc_private.b.flags = 0;
-	} else {
-		cur = NULL;
-		logflags |= XFS_ILOG_DEXT;
 	}
 
 	/*
@@ -5543,11 +5538,14 @@ xfs_bmap_shift_extents(
 			blockcount = left.br_blockcount +
 				got.br_blockcount;
 			xfs_iext_remove(ip, *current_ext, 1, 0);
+			logflags |= XFS_ILOG_CORE;
 			if (cur) {
 				error = xfs_btree_delete(cur, &i);
 				if (error)
 					goto del_cursor;
 				XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
+			} else {
+				logflags |= XFS_ILOG_DEXT;
 			}
 			XFS_IFORK_NEXT_SET(ip, whichfork,
 				XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
@@ -5573,6 +5571,7 @@ xfs_bmap_shift_extents(
 			got.br_startoff = startoff;
 		}
 
+		logflags |= XFS_ILOG_CORE;
 		if (cur) {
 			error = xfs_bmbt_update(cur, got.br_startoff,
 						got.br_startblock,
@@ -5580,6 +5579,8 @@ xfs_bmap_shift_extents(
 						got.br_state);
 			if (error)
 				goto del_cursor;
+		} else {
+			logflags |= XFS_ILOG_DEXT;
 		}
 
 		(*current_ext)++;
@@ -5595,6 +5596,7 @@ del_cursor:
 		xfs_btree_del_cursor(cur,
 			error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
 
-	xfs_trans_log_inode(tp, ip, logflags);
+	if (logflags)
+		xfs_trans_log_inode(tp, ip, logflags);
 	return error;
 }
-- 
2.0.0


* [PATCH 6/7] xfs: xfs_file_collapse_range is delalloc challenged
  2014-08-28 11:49 [PATCH v2 0/7] xfs: invalidation and related fixes for v3.17-rc3 Dave Chinner
                   ` (4 preceding siblings ...)
  2014-08-28 11:49 ` [PATCH 5/7] xfs: don't log inode unless extent shift makes extent modifications Dave Chinner
@ 2014-08-28 11:49 ` Dave Chinner
  2014-08-29  0:41   ` Christoph Hellwig
  2014-08-28 11:49 ` [PATCH 7/7] xfs: trim eofblocks before collapse range Dave Chinner
  6 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 11:49 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

If we have delalloc extents on a file before we run a collapse range
operation, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.

However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....

And when writeback runs, it can convert delalloc extents in the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.

The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.

Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index aa70620..283e20c 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1444,6 +1444,19 @@ xfs_collapse_file_space(
 	start_fsb = XFS_B_TO_FSB(mp, offset + len);
 	shift_fsb = XFS_B_TO_FSB(mp, len);
 
+	/*
+	 * writeback the entire file to prevent concurrent writeback of ranges
+	 * outside the collapsing region from changing the extent list.
+	 *
+	 * XXX: This is a temporary fix until the extent shift loop below is
+	 * converted to use offsets and lookups within the ILOCK rather than
+	 * carrying around the index into the extent list for the next
+	 * iteration.
+	 */
+	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+	if (error)
+		return error;
+
 	error = xfs_free_file_space(ip, offset, len);
 	if (error)
 		return error;
-- 
2.0.0


* [PATCH 7/7] xfs: trim eofblocks before collapse range
  2014-08-28 11:49 [PATCH v2 0/7] xfs: invalidation and related fixes for v3.17-rc3 Dave Chinner
                   ` (5 preceding siblings ...)
  2014-08-28 11:49 ` [PATCH 6/7] xfs: xfs_file_collapse_range is delalloc challenged Dave Chinner
@ 2014-08-28 11:49 ` Dave Chinner
  2014-08-29  0:42   ` Christoph Hellwig
  6 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 11:49 UTC (permalink / raw)
  To: xfs

From: Brian Foster <bfoster@redhat.com>

xfs_collapse_file_space() currently writes back the entire file
undergoing collapse range to settle things down for the extent shift
algorithm. While this prevents changes to the extent list during the
collapse operation, the writeback itself is not enough to prevent
unnecessary collapse failures.

The current shift algorithm uses the extent index to iterate the in-core
extent list. If a post-eof delalloc extent persists after the writeback
(e.g., a prior zero range op where the end of the range aligns with eof
can separate the post-eof blocks such that they are not written back and
converted), xfs_bmap_shift_extents() becomes confused over the encoded
br_startblock value and fails the collapse.

As with the full writeback, this is a temporary fix until the algorithm
is improved to cope with a volatile extent list and avoid attempts to
shift post-eof extents.

Signed-off-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 283e20c..c692260 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1445,8 +1445,10 @@ xfs_collapse_file_space(
 	shift_fsb = XFS_B_TO_FSB(mp, len);
 
 	/*
-	 * writeback the entire file to prevent concurrent writeback of ranges
-	 * outside the collapsing region from changing the extent list.
+	 * Writeback the entire file and force remove any post-eof blocks. The
+	 * writeback prevents changes to the extent list via concurrent
+	 * writeback and the eofblocks trim prevents the extent shift algorithm
+	 * from running into a post-eof delalloc extent.
 	 *
 	 * XXX: This is a temporary fix until the extent shift loop below is
 	 * converted to use offsets and lookups within the ILOCK rather than
@@ -1456,6 +1458,11 @@ xfs_collapse_file_space(
 	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
 	if (error)
 		return error;
+	if (xfs_can_free_eofblocks(ip, true)) {
+		error = xfs_free_eofblocks(mp, ip, false);
+		if (error)
+			return error;
+	}
 
 	error = xfs_free_file_space(ip, offset, len);
 	if (error)
-- 
2.0.0


* Re: [PATCH 1/7] xfs: don't dirty buffers beyond EOF
  2014-08-28 11:49 ` [PATCH 1/7] xfs: don't dirty buffers beyond EOF Dave Chinner
@ 2014-08-28 13:34   ` Brian Foster
  2014-08-28 22:37     ` Dave Chinner
  2014-08-29  0:39   ` [PATCH 1/7] " Christoph Hellwig
  1 sibling, 1 reply; 20+ messages in thread
From: Brian Foster @ 2014-08-28 13:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Aug 28, 2014 at 09:49:05PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> generic/263 is failing fsx at this point with a page spanning
> EOF that cannot be invalidated. The operations are:
> 
> 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
> 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
> 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
> 
> where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
> write attempts to invalidate the cached page over this range, it
> fails with -EBUSY and so any attempt to do page invalidation fails.
> 
> The real question is this: Why can't that page be invalidated after
> it has been written to disk and cleaned?
> 
> Well, there's data on the first two buffers in the page (1k block
> size, 4k page), but the third buffer on the page (i.e. beyond EOF)
> is failing drop_buffers because it's bh->b_state == 0x3, which is
> BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
> what?
> 
> OK, set_buffer_dirty() is called on all buffers from
> __set_page_buffers_dirty(), regardless of whether the buffer is
> beyond EOF or not, which means that when we get to ->writepage,
> we have buffers marked dirty beyond EOF that we need to clean.
> So, we need to implement our own .set_page_dirty method that
> doesn't dirty buffers beyond EOF.
> 
> This is messy because the buffer code is not meant to be shared
> and it has interesting locking issues on the buffer dirty bits.
> So just copy and paste it and then modify it to suit what we need.
> 
> Note: the solution the other filesystems and the generic block code
> use, marking the buffers clean in ->writepage, does not work for XFS.
> It still leaves dirty buffers beyond EOF and invalidations still
> fail. Hence rather than play whack-a-mole, this patch simply
> prevents those buffers from being dirtied in the first place.
> 
> cc: <stable@kernel.org>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 58 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 11e9b4c..4c79f90 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1753,11 +1753,69 @@ xfs_vm_readpages(
>  	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
>  }
>  
> +/*
> + * This is basically a copy of __set_page_dirty_buffers() with one
> + * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
> + * dirty, we'll never be able to clean them because we don't write buffers
> + * beyond EOF, and that means we can't invalidate pages that span EOF
> + * that have been marked dirty. Further, the dirty state can leak into
> + * the file interior if the file is extended, resulting in all sorts of
> + * bad things happening as the state does not match the underlying data.
> + */
> +STATIC int
> +xfs_vm_set_page_dirty(
> +	struct page		*page)
> +{
> +	struct address_space	*mapping = page->mapping;
> +	struct inode		*inode = mapping->host;
> +	loff_t			end_offset;
> +	loff_t			offset;
> +	int			newly_dirty;
> +
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	end_offset = i_size_read(inode);
> +	offset = end_offset & PAGE_CACHE_MASK;

Is this what you intended to do here?

	offset = page_offset(page);

Brian

> +
> +	spin_lock(&mapping->private_lock);
> +	if (page_has_buffers(page)) {
> +		struct buffer_head *head = page_buffers(page);
> +		struct buffer_head *bh = head;
> +
> +		do {
> +			if (offset < end_offset)
> +				set_buffer_dirty(bh);
> +			bh = bh->b_this_page;
> +			offset += 1 << inode->i_blkbits;
> +		} while (bh != head);
> +	}
> +	newly_dirty = !TestSetPageDirty(page);
> +	spin_unlock(&mapping->private_lock);
> +
> +	if (newly_dirty) {
> +		/* sigh - __set_page_dirty() is static, so copy it here, too */
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&mapping->tree_lock, flags);
> +		if (page->mapping) {	/* Race with truncate? */
> +			WARN_ON_ONCE(!PageUptodate(page));
> +			account_page_dirtied(page, mapping);
> +			radix_tree_tag_set(&mapping->page_tree,
> +					page_index(page), PAGECACHE_TAG_DIRTY);
> +		}
> +		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +	}
> +	return newly_dirty;
> +}
> +
>  const struct address_space_operations xfs_address_space_operations = {
>  	.readpage		= xfs_vm_readpage,
>  	.readpages		= xfs_vm_readpages,
>  	.writepage		= xfs_vm_writepage,
>  	.writepages		= xfs_vm_writepages,
> +	.set_page_dirty		= xfs_vm_set_page_dirty,
>  	.releasepage		= xfs_vm_releasepage,
>  	.invalidatepage		= xfs_vm_invalidatepage,
>  	.write_begin		= xfs_vm_write_begin,
> -- 
> 2.0.0
> 

* Re: [PATCH 1/7] xfs: don't dirty buffers beyond EOF
  2014-08-28 13:34   ` Brian Foster
@ 2014-08-28 22:37     ` Dave Chinner
  2014-08-28 23:49       ` [PATCH 1/7 v2] " Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 22:37 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Thu, Aug 28, 2014 at 09:34:57AM -0400, Brian Foster wrote:
> On Thu, Aug 28, 2014 at 09:49:05PM +1000, Dave Chinner wrote:
> > +/*
> > + * This is basically a copy of __set_page_dirty_buffers() with one
> > + * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
> > + * dirty, we'll never be able to clean them because we don't write buffers
> > + * beyond EOF, and that means we can't invalidate pages that span EOF
> > + * that have been marked dirty. Further, the dirty state can leak into
> > + * the file interior if the file is extended, resulting in all sorts of
> > + * bad things happening as the state does not match the underlying data.
> > + */
> > +STATIC int
> > +xfs_vm_set_page_dirty(
> > +	struct page		*page)
> > +{
> > +	struct address_space	*mapping = page->mapping;
> > +	struct inode		*inode = mapping->host;
> > +	loff_t			end_offset;
> > +	loff_t			offset;
> > +	int			newly_dirty;
> > +
> > +	if (unlikely(!mapping))
> > +		return !TestSetPageDirty(page);
> > +
> > +	end_offset = i_size_read(inode);
> > +	offset = end_offset & PAGE_CACHE_MASK;
> 
> Is this what you intended to do here?
> 
> 	offset = page_offset(page);

Yup, that's a bug. Which points out just how important the buffer
dirty flag is (not) to XFS, doesn't it?

I'll post a fixed patch in a few minutes after it's run a few
tens of millions fsx ops...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* [PATCH 1/7 v2] xfs: don't dirty buffers beyond EOF
  2014-08-28 22:37     ` Dave Chinner
@ 2014-08-28 23:49       ` Dave Chinner
  2014-08-29 12:13         ` Brian Foster
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-28 23:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs


From: Dave Chinner <dchinner@redhat.com>

generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:

1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)

where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so any attempt to do page invalidation fails.

The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?

Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
what?

OK, set_buffer_dirty() is called on all buffers from
__set_page_dirty_buffers(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

Note: the solution other filesystems and the generic block code use,
marking the buffers clean in ->writepage, does not work for XFS.
It still leaves dirty buffers beyond EOF and invalidations still
fail. Hence rather than play whack-a-mole, this patch simply
prevents those buffers from being dirtied in the first place.

cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---

v2: fix page offset calculation. passed 61 million fsx ops before
hitting an unrelated problem in xfs_zero_file_space(), so no
difference to the result with this updated patch.

 fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 11e9b4c..9bd2f53 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1753,11 +1753,69 @@ xfs_vm_readpages(
 	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
 }
 
+/*
+ * This is basically a copy of __set_page_dirty_buffers() with one
+ * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
+ * dirty, we'll never be able to clean them because we don't write buffers
+ * beyond EOF, and that means we can't invalidate pages that span EOF
+ * that have been marked dirty. Further, the dirty state can leak into
+ * the file interior if the file is extended, resulting in all sorts of
+ * bad things happening as the state does not match the underlying data.
+ */
+STATIC int
+xfs_vm_set_page_dirty(
+	struct page		*page)
+{
+	struct address_space	*mapping = page->mapping;
+	struct inode		*inode = mapping->host;
+	loff_t			end_offset;
+	loff_t			offset;
+	int			newly_dirty;
+
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	end_offset = i_size_read(inode);
+	offset = page_offset(page);
+
+	spin_lock(&mapping->private_lock);
+	if (page_has_buffers(page)) {
+		struct buffer_head *head = page_buffers(page);
+		struct buffer_head *bh = head;
+
+		do {
+			if (offset < end_offset)
+				set_buffer_dirty(bh);
+			bh = bh->b_this_page;
+			offset += 1 << inode->i_blkbits;
+		} while (bh != head);
+	}
+	newly_dirty = !TestSetPageDirty(page);
+	spin_unlock(&mapping->private_lock);
+
+	if (newly_dirty) {
+		/* sigh - __set_page_dirty() is static, so copy it here, too */
+		unsigned long flags;
+
+		spin_lock_irqsave(&mapping->tree_lock, flags);
+		if (page->mapping) {	/* Race with truncate? */
+			WARN_ON_ONCE(!PageUptodate(page));
+			account_page_dirtied(page, mapping);
+			radix_tree_tag_set(&mapping->page_tree,
+					page_index(page), PAGECACHE_TAG_DIRTY);
+		}
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	}
+	return newly_dirty;
+}
+
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readpages		= xfs_vm_readpages,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
+	.set_page_dirty		= xfs_vm_set_page_dirty,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
 	.write_begin		= xfs_vm_write_begin,


* Re: [PATCH 1/7] xfs: don't dirty buffers beyond EOF
  2014-08-28 11:49 ` [PATCH 1/7] xfs: don't dirty buffers beyond EOF Dave Chinner
  2014-08-28 13:34   ` Brian Foster
@ 2014-08-29  0:39   ` Christoph Hellwig
  2014-08-29  0:53     ` Dave Chinner
  1 sibling, 1 reply; 20+ messages in thread
From: Christoph Hellwig @ 2014-08-29  0:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Not sure if my 10 cents are worth anything given that I haven't spent
much time with this code recently, but I feel very uneasy diverging
from the generic path in this area.


* Re: [PATCH 2/7] xfs: don't zero partial page cache pages during O_DIRECT writes
  2014-08-28 11:49 ` [PATCH 2/7] xfs: don't zero partial page cache pages during O_DIRECT writes Dave Chinner
@ 2014-08-29  0:39   ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-08-29  0:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Aug 28, 2014 at 09:49:06PM +1000, Dave Chinner wrote:
> From: Chris Mason <clm@fb.com>
> 
> xfs is using truncate_pagecache_range to invalidate the page cache
> during DIO reads.  This is different from the other filesystems who
> only invalidate pages during DIO writes.
> 
> truncate_pagecache_range is meant to be used when we are freeing the
> underlying data structs from disk, so it will zero any partial
> ranges in the page.  This means a DIO read can zero out part of the
> page cache page, and it is possible the page will stay in cache.
> 
> Buffered reads will find an up-to-date page with zeros instead of
> the data actually on disk.
> 
> This patch fixes things by using invalidate_inode_pages2_range
> instead.  It preserves the page cache invalidation, but won't zero
> any pages.
> 
> [dchinner: catch error and warn if it fails. Comment.]

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 3/7] xfs: don't zero partial page cache pages during O_DIRECT writes
  2014-08-28 11:49 ` [PATCH 3/7] " Dave Chinner
@ 2014-08-29  0:39   ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-08-29  0:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Aug 28, 2014 at 09:49:07PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Similar to direct IO reads, direct IO writes are using
> truncate_pagecache_range to invalidate the page cache. This is
> incorrect due to the sub-block zeroing in the page cache that
> truncate_pagecache_range() triggers.
> 
> This patch fixes things by using invalidate_inode_pages2_range
> instead.  It preserves the page cache invalidation, but won't zero
> any pages.

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 4/7] xfs: use ranged writeback and invalidation for direct IO
  2014-08-28 11:49 ` [PATCH 4/7] xfs: use ranged writeback and invalidation for direct IO Dave Chinner
@ 2014-08-29  0:40   ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-08-29  0:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Aug 28, 2014 at 09:49:08PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we are no longer dirtying buffers beyond EOF and are using
> invalidation correctly, we can finally reduce the ranges of writeback
> and invalidation used by direct IO to match those of the IO being
> issued.
> 
> Bring the writeback and invalidation ranges back to match the
> generic direct IO code - this will greatly reduce the perturbation
> of cached data when direct IO and buffered IO are mixed, but still
> provide the same buffered vs direct IO coherency behaviour we
> currently have.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks correct, although I wonder if it's really worth the risk exposing
us to the crazy details of the page cache invalidation range notations..

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 5/7] xfs: don't log inode unless extent shift makes extent modifications
  2014-08-28 11:49 ` [PATCH 5/7] xfs: don't log inode unless extent shift makes extent modifications Dave Chinner
@ 2014-08-29  0:41   ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-08-29  0:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 6/7] xfs: xfs_file_collapse_range is delalloc challenged
  2014-08-28 11:49 ` [PATCH 6/7] xfs: xfs_file_collapse_range is delalloc challenged Dave Chinner
@ 2014-08-29  0:41   ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-08-29  0:41 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Aug 28, 2014 at 09:49:10PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we have delalloc extents on a file before we run a collapse range
> operation, we sync the range that we are going to collapse to
> convert delalloc extents in that region to real extents to simplify
> the shift operation.
> 
> However, the shift operation then assumes that the extent list is
> not going to change as it iterates over the extent list moving
> things about. Unfortunately, this isn't true because we can't hold
> the ILOCK over all the operations. We can prevent new IO from
> modifying the extent list by holding the IOLOCK, but that doesn't
> prevent writeback from running....
> 
> And when writeback runs, it can convert delalloc extents in the
> range of the file prior to the region being collapsed, and this
> changes the indexes of all the extents in the file. That causes the
> collapse range operation to Go Bad.
> 
> The right fix is to rewrite the extent shift operation not to be
> dependent on the extent list not changing across the entire
> operation, but this is a fairly significant piece of work to do.
> Hence, as a short-term workaround for the problem, sync the entire
> file before starting a collapse operation to remove all delalloc
> ranges from the file and so avoid the problem of concurrent
> writeback changing the extent list.
> 
> Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 7/7] xfs: trim eofblocks before collapse range
  2014-08-28 11:49 ` [PATCH 7/7] xfs: trim eofblocks before collapse range Dave Chinner
@ 2014-08-29  0:42   ` Christoph Hellwig
  0 siblings, 0 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-08-29  0:42 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 1/7] xfs: don't dirty buffers beyond EOF
  2014-08-29  0:39   ` [PATCH 1/7] " Christoph Hellwig
@ 2014-08-29  0:53     ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-08-29  0:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Aug 28, 2014 at 05:39:11PM -0700, Christoph Hellwig wrote:
> Not sure if my 10 cents are worth anything given that I haven't spent
> much time with this code recently, but I feel very uneasy diverging
> from the generic path in this area.

I can't see how we have any other choice right now. We're caught
between a rock and a hard place - XFS uses bufferheads differently
to all other filesystems (esp. w.r.t. to EOF block zeroing
behaviour), and so changing behaviour in the generic code to suit
XFS is likely to introduce subtle data corruption bugs in other
filesystems.

I think the best thing we can do is move away from bufferheads in
XFS. We've already got lots of hacky code to manage
bufferhead/extent state coherency and so the sooner we get rid of
bufferheads the sooner that crap goes away, too.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/7 v2] xfs: don't dirty buffers beyond EOF
  2014-08-28 23:49       ` [PATCH 1/7 v2] " Dave Chinner
@ 2014-08-29 12:13         ` Brian Foster
  0 siblings, 0 replies; 20+ messages in thread
From: Brian Foster @ 2014-08-29 12:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Fri, Aug 29, 2014 at 09:49:32AM +1000, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> generic/263 is failing fsx at this point with a page spanning
> EOF that cannot be invalidated. The operations are:
> 
> 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
> 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
> 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
> 
> where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
> write attempts to invalidate the cached page over this range, it
> fails with -EBUSY and so any attempt to do page invalidation fails.
> 
> The real question is this: Why can't that page be invalidated after
> it has been written to disk and cleaned?
> 
> Well, there's data on the first two buffers in the page (1k block
> size, 4k page), but the third buffer on the page (i.e. beyond EOF)
> is failing drop_buffers because it's bh->b_state == 0x3, which is
> BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
> what?
> 
> OK, set_buffer_dirty() is called on all buffers from
> __set_page_dirty_buffers(), regardless of whether the buffer is
> beyond EOF or not, which means that when we get to ->writepage,
> we have buffers marked dirty beyond EOF that we need to clean.
> So, we need to implement our own .set_page_dirty method that
> doesn't dirty buffers beyond EOF.
> 
> This is messy because the buffer code is not meant to be shared
> and it has interesting locking issues on the buffer dirty bits.
> So just copy and paste it and then modify it to suit what we need.
> 
> Note: the solution other filesystems and the generic block code use,
> marking the buffers clean in ->writepage, does not work for XFS.
> It still leaves dirty buffers beyond EOF and invalidations still
> fail. Hence rather than play whack-a-mole, this patch simply
> prevents those buffers from being dirtied in the first place.
> 
> cc: <stable@kernel.org>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> 
> v2: fix page offset calculation. passed 61 million fsx ops before
> hitting an unrelated problem in xfs_zero_file_space(), so no
> difference to the result with this updated patch.
> 
>  fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 58 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 11e9b4c..9bd2f53 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1753,11 +1753,69 @@ xfs_vm_readpages(
>  	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
>  }
>  
> +/*
> + * This is basically a copy of __set_page_dirty_buffers() with one
> + * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
> + * dirty, we'll never be able to clean them because we don't write buffers
> + * beyond EOF, and that means we can't invalidate pages that span EOF
> + * that have been marked dirty. Further, the dirty state can leak into
> + * the file interior if the file is extended, resulting in all sorts of
> + * bad things happening as the state does not match the unerlying data.
							   underlying

I tend to agree with Christoph in that it would be nice if this was
handled generically one way or another. That said, I understand not
wanting to tweak behavior for other filesystems. You mention that the
trajectory for XFS is to kill the use of buffer heads, I suppose that
means this code is hopefully short-lived and probably less likely
subject to problems due to changes in the core code. Given that and the
fact that it looks correct at this point:

Reviewed-by: Brian Foster <bfoster@redhat.com>

Though it would be nice to see a small addition to the comment above to
state that explicitly. E.g., 'XXX this code should die when buffer heads
in XFS die...' or something along those lines... thanks.

Brian

> + */
> +STATIC int
> +xfs_vm_set_page_dirty(
> +	struct page		*page)
> +{
> +	struct address_space	*mapping = page->mapping;
> +	struct inode		*inode = mapping->host;
> +	loff_t			end_offset;
> +	loff_t			offset;
> +	int			newly_dirty;
> +
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	end_offset = i_size_read(inode);
> +	offset = page_offset(page);
> +
> +	spin_lock(&mapping->private_lock);
> +	if (page_has_buffers(page)) {
> +		struct buffer_head *head = page_buffers(page);
> +		struct buffer_head *bh = head;
> +
> +		do {
> +			if (offset < end_offset)
> +				set_buffer_dirty(bh);
> +			bh = bh->b_this_page;
> +			offset += 1 << inode->i_blkbits;
> +		} while (bh != head);
> +	}
> +	newly_dirty = !TestSetPageDirty(page);
> +	spin_unlock(&mapping->private_lock);
> +
> +	if (newly_dirty) {
> +		/* sigh - __set_page_dirty() is static, so copy it here, too */
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&mapping->tree_lock, flags);
> +		if (page->mapping) {	/* Race with truncate? */
> +			WARN_ON_ONCE(!PageUptodate(page));
> +			account_page_dirtied(page, mapping);
> +			radix_tree_tag_set(&mapping->page_tree,
> +					page_index(page), PAGECACHE_TAG_DIRTY);
> +		}
> +		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +	}
> +	return newly_dirty;
> +}
> +
>  const struct address_space_operations xfs_address_space_operations = {
>  	.readpage		= xfs_vm_readpage,
>  	.readpages		= xfs_vm_readpages,
>  	.writepage		= xfs_vm_writepage,
>  	.writepages		= xfs_vm_writepages,
> +	.set_page_dirty		= xfs_vm_set_page_dirty,
>  	.releasepage		= xfs_vm_releasepage,
>  	.invalidatepage		= xfs_vm_invalidatepage,
>  	.write_begin		= xfs_vm_write_begin,
> 

end of thread, other threads:[~2014-08-29 12:13 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz, follow: Atom feed)
-- links below jump to the message on this page --
2014-08-28 11:49 [PATCH v2 0/7] xfs: invalidation and related fixes for v3.17-rc3 Dave Chinner
2014-08-28 11:49 ` [PATCH 1/7] xfs: don't dirty buffers beyond EOF Dave Chinner
2014-08-28 13:34   ` Brian Foster
2014-08-28 22:37     ` Dave Chinner
2014-08-28 23:49       ` [PATCH 1/7 v2] " Dave Chinner
2014-08-29 12:13         ` Brian Foster
2014-08-29  0:39   ` [PATCH 1/7] " Christoph Hellwig
2014-08-29  0:53     ` Dave Chinner
2014-08-28 11:49 ` [PATCH 2/7] xfs: don't zero partial page cache pages during O_DIRECT writes Dave Chinner
2014-08-29  0:39   ` Christoph Hellwig
2014-08-28 11:49 ` [PATCH 3/7] " Dave Chinner
2014-08-29  0:39   ` Christoph Hellwig
2014-08-28 11:49 ` [PATCH 4/7] xfs: use ranged writeback and invalidation for direct IO Dave Chinner
2014-08-29  0:40   ` Christoph Hellwig
2014-08-28 11:49 ` [PATCH 5/7] xfs: don't log inode unless extent shift makes extent modifications Dave Chinner
2014-08-29  0:41   ` Christoph Hellwig
2014-08-28 11:49 ` [PATCH 6/7] xfs: xfs_file_collapse_range is delalloc challenged Dave Chinner
2014-08-29  0:41   ` Christoph Hellwig
2014-08-28 11:49 ` [PATCH 7/7] xfs: trim eofblocks before collapse range Dave Chinner
2014-08-29  0:42   ` Christoph Hellwig
