public inbox for linux-xfs@vger.kernel.org
* [PATCH] xfs: fix zero range i_size problems
@ 2014-10-02 13:12 Brian Foster
  2014-10-06 13:11 ` Brian Foster
  0 siblings, 1 reply; 4+ messages in thread
From: Brian Foster @ 2014-10-02 13:12 UTC (permalink / raw)
  To: xfs

The zero range operation is analogous to fallocate with the exception of
converting the range to zeroes. I.e., it attempts to allocate zeroed
blocks over the range specified by the caller. The XFS implementation
kills all delalloc blocks currently over the aligned range, converts the
range to allocated zero blocks (unwritten extents) and handles the
partial pages at the ends of the range by sending writes through the
pagecache.

The current implementation suffers from several problems associated with
inode size. If the aligned range covers an extending I/O, said I/O is
discarded and an inode size update from a previous write never makes it
to disk. Further, if an unaligned zero range extends beyond eof, the
page write induced for the partial end page can itself increase the
inode size, even if the zero range request is not supposed to update
i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).

Update xfs_zero_file_space() to forego using buffered I/O for partial
pages. Instead, convert the block aligned (rather than page aligned)
range to unwritten extents, update the partial blocks on disk and use
existing pagecache functionality to ensure cached pages are consistent
with data on disk.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

This passes xfstests and survived an overnight fsx run (+300m ops) on a
1k fsb fs. The tradeoff that I've observed from this implementation is
that fzero is now also likely to trigger the problem described in the
recently posted xfs/053 test:

http://oss.sgi.com/archives/xfs/2014-09/msg00473.html

This is slightly unfortunate, but as I understand it that problem is
more fundamentally associated with extent conversion and writeback vs.
any particular fs operation that might expose it.

Brian

 fs/xfs/xfs_bmap_util.c | 110 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 78 insertions(+), 32 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index d8b77b5..f2d58e2 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1364,7 +1364,17 @@ xfs_free_file_space(
 	goto out;
 }
 
-
+/*
+ * Preallocate and zero a range of a file. This mechanism has the allocation
+ * semantics of fallocate and in addition converts data in the range to zeroes.
+ * This is done using unwritten extent conversion for complete blocks within the
+ * range. Partial start/end blocks cannot be converted to unwritten as they
+ * contain valid data. Therefore, partial blocks are preallocated and explicitly
+ * zeroed on-disk.
+ *
+ * Note that all allocation occurs in file sequential order (e.g., partial
+ * start, middle, partial end) to help the allocator create contiguous extents.
+ */
 int
 xfs_zero_file_space(
 	struct xfs_inode	*ip,
@@ -1372,63 +1382,99 @@ xfs_zero_file_space(
 	xfs_off_t		len)
 {
 	struct xfs_mount	*mp = ip->i_mount;
-	uint			granularity;
+	uint			blksize;
 	xfs_off_t		start_boundary;
 	xfs_off_t		end_boundary;
 	int			error;
+	loff_t			eof;
+	xfs_off_t		endoffset;
 
 	trace_xfs_zero_file_space(ip);
 
-	granularity = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
+	blksize = 1 << mp->m_sb.sb_blocklog;
 
 	/*
-	 * Round the range of extents we are going to convert inwards.  If the
-	 * offset is aligned, then it doesn't get changed so we zero from the
-	 * start of the block offset points to.
+	 * Align the range inward to blocksize. This represents the range that
+	 * can be converted to unwritten extents.
 	 */
-	start_boundary = round_up(offset, granularity);
-	end_boundary = round_down(offset + len, granularity);
+	start_boundary = round_up(offset, blksize);
+	end_boundary = round_down(offset + len, blksize);
 
 	ASSERT(start_boundary >= offset);
 	ASSERT(end_boundary <= offset + len);
 
-	if (start_boundary < end_boundary - 1) {
+	/*
+	 * Flush the eof page before we start so we don't lose an i_size update
+	 * due to discarding an appending I/O in cache.
+	 */
+	eof = round_down(i_size_read(VFS_I(ip)) - 1, PAGE_CACHE_SIZE);
+	if (eof >= start_boundary && eof <= end_boundary)
+		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, eof, -1);
+
+	/*
+	 * Writeback the partial pages if either the start or end is not page
+	 * aligned to prevent writeback races with partial block zeroing.
+	 * truncate_pagecache_range() handles partial page zeroing if the pages
+	 * are cached.
+	 */
+	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(offset + len)) {
+		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, offset,
+					     start_boundary);
+		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, end_boundary,
+					     offset + len);
+	}
+	truncate_pagecache_range(VFS_I(ip), offset, offset + len - 1);
+
+	/* handle a partial start block */
+	if (offset < start_boundary) {
+		/* don't go past the end offset if it's within the block */
+		endoffset = min_t(xfs_off_t, start_boundary, offset + len);
+
+		/* preallocate the block and zero the partial range on disk */
+		error = xfs_alloc_file_space(ip, offset,
+				start_boundary - offset, XFS_BMAPI_PREALLOC);
+		if (error)
+			goto out;
+		error = xfs_zero_remaining_bytes(ip, offset, endoffset - 1);
+		if (error)
+			goto out;
+
+		/* if we've hit the end offset, we're done */
+		if (endoffset == offset + len)
+			goto out;
+	}
+
+	/* handle complete blocks */
+	if (end_boundary > start_boundary) {
 		/*
-		 * punch out delayed allocation blocks and the page cache over
-		 * the conversion range
+		 * Punch out any delalloc extents first. We don't need the data
+		 * and this is more efficient than converting them all to
+		 * written->unwritten.
 		 */
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		error = xfs_bmap_punch_delalloc_range(ip,
 				XFS_B_TO_FSBT(mp, start_boundary),
-				XFS_B_TO_FSB(mp, end_boundary - start_boundary));
+				XFS_B_TO_FSB(mp,
+					     end_boundary - start_boundary));
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		truncate_pagecache_range(VFS_I(ip), start_boundary,
-					 end_boundary - 1);
 
-		/* convert the blocks */
+		/* convert the range to unwritten extents */
 		error = xfs_alloc_file_space(ip, start_boundary,
 					end_boundary - start_boundary - 1,
 					XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT);
 		if (error)
 			goto out;
+	}
 
-		/* We've handled the interior of the range, now for the edges */
-		if (start_boundary != offset) {
-			error = xfs_iozero(ip, offset, start_boundary - offset);
-			if (error)
-				goto out;
-		}
-
-		if (end_boundary != offset + len)
-			error = xfs_iozero(ip, end_boundary,
-					   offset + len - end_boundary);
-
-	} else {
-		/*
-		 * It's either a sub-granularity range or the range spanned lies
-		 * partially across two adjacent blocks.
-		 */
-		error = xfs_iozero(ip, offset, len);
+	/* handle a partial end block */
+	if (offset + len > end_boundary) {
+		error = xfs_alloc_file_space(ip, end_boundary,
+				offset + len - end_boundary,
+				XFS_BMAPI_PREALLOC);
+		if (error)
+			goto out;
+		error = xfs_zero_remaining_bytes(ip, end_boundary,
+						 offset + len - 1);
 	}
 
 out:
-- 
1.8.3.1

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: [PATCH] xfs: fix zero range i_size problems
  2014-10-02 13:12 [PATCH] xfs: fix zero range i_size problems Brian Foster
@ 2014-10-06 13:11 ` Brian Foster
  2014-10-06 20:29   ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Brian Foster @ 2014-10-06 13:11 UTC (permalink / raw)
  To: xfs

On Thu, Oct 02, 2014 at 09:12:13AM -0400, Brian Foster wrote:
> The zero range operation is analogous to fallocate with the exception of
> converting the range to zeroes. E.g., it attempts to allocate zeroed
> blocks over the range specified by the caller. The XFS implementation
> kills all delalloc blocks currently over the aligned range, converts the
> range to allocated zero blocks (unwritten extents) and handles the
> partial pages at the ends of the range by sending writes through the
> pagecache.
> 
> The current implementation suffers from several problems associated with
> inode size. If the aligned range covers an extending I/O, said I/O is
> discarded and an inode size update from a previous write never makes it
> to disk. Further, if an unaligned zero range extends beyond eof, the
> page write induced for the partial end page can itself increase the
> inode size, even if the zero range request is not supposed to update
> i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
> 
> Update xfs_zero_file_space() to forego using buffered I/O for partial
> pages. Instead, convert the block aligned (rather than page aligned)
> range to unwritten extents, update the partial blocks on disk and use
> existing pagecache functionality to ensure cached pages are consistent
> with data on disk.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> This passes xfstests and survived an overnight fsx run (+300m ops) on a
> 1k fsb fs. The tradeoff that I've observed from this implementation is
> that fzero is now also likely to trigger the problem described in the
> recently posted xfs/053 test:
> 
> http://oss.sgi.com/archives/xfs/2014-09/msg00473.html
> 
> This is slightly unfortunate, but as I understand it that problem is
> more fundamentally associated with extent conversion and writeback vs.
> any particular fs operation that might expose it.
> 
> Brian
> 

I spent the past few days trying to reproduce and nail down some of the
delalloc related asserts reproduced by generic/269 and discovered that
zero range is at least one cause of the problem. This patch actually
addresses that particular problem, yet introduces another similar
variant of the problem.

The original cause is "leakage" of speculative preallocation blocks into
the addressable range of the file (within eof) without any pages
associated with the delalloc extent. This means that it isn't
necessarily sufficient to writeback a particular delalloc range to
convert it. This sets a landmine for direct I/O or inode eviction to
step on when it comes time to perform I/O over the range or clean the
inode.

For example:

# xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" /mnt/file
wrote 131072/131072 bytes at offset 0
128 KiB, 32 ops; 0.0000 sec (786.164 MiB/sec and 201257.8616 ops/sec)
# xfs_io -d -c "pread 128k 128k" /mnt/file 
Segmentation fault

The first command creates a delalloc extent with preallocation and then
runs a non-file-size-extending zero range beyond the end of the file
with an unaligned end offset (note that doing the zero range within the
same process is required to ensure prealloc isn't trimmed by close()).
As discussed above, this results in an internal I/O that increases
i_size and leaks the previous preallocation within eof. The subsequent
command simply attempts a delalloc read across where the preallocation
would have started. It attempts a writeback of the range where the
delalloc exists, but there are no pages in the particular range of I/O
to drive extent conversion and hence the request ultimately falls over.
I haven't looked into the xfs_io crash, but the intent of the command is
to reproduce the direct I/O BUG():

kernel: kernel BUG at fs/xfs/xfs_aops.c:1392!

I don't reproduce the umount problem with this same sequence. I suspect
this is because a writeback over the broader range of the file (e.g.,
0-128k) will ultimately convert the extent. I suspect some sequence that
can create contiguous but separate delalloc extents would lead to the
problem on umount (e.g., a delalloc extent that has no page cache
coverage whatsoever). I'm pretty sure I've seen this before when playing
around with collapse, but I can't quite recall how it occurs. This is
important to dig into as well, since anything that can lead to this naked
delalloc extent scenario can cause this problem.

The patch here addresses the zero range problem simply by avoiding
xfs_iozero(). It introduces a similar problem...

>  fs/xfs/xfs_bmap_util.c | 110 +++++++++++++++++++++++++++++++++++--------------
>  1 file changed, 78 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index d8b77b5..f2d58e2 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1364,7 +1364,17 @@ xfs_free_file_space(
>  	goto out;
>  }
>  
> -
> +/*
> + * Preallocate and zero a range of a file. This mechanism has the allocation
> + * semantics of fallocate and in addition converts data in the range to zeroes.
> + * This is done using unwritten extent conversion for complete blocks within the
> + * range. Partial start/end blocks cannot be converted to unwritten as they
> + * contain valid data. Therefore, partial blocks are preallocated and explicitly
> + * zeroed on-disk.
> + *
> + * Note that all allocation occurs in file sequential order (e.g., partial
> + * start, middle, partial end) to help the allocator create contiguous extents.
> + */
>  int
>  xfs_zero_file_space(
>  	struct xfs_inode	*ip,
> @@ -1372,63 +1382,99 @@ xfs_zero_file_space(
>  	xfs_off_t		len)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> -	uint			granularity;
> +	uint			blksize;
>  	xfs_off_t		start_boundary;
>  	xfs_off_t		end_boundary;
>  	int			error;
> +	loff_t			eof;
> +	xfs_off_t		endoffset;
>  
>  	trace_xfs_zero_file_space(ip);
>  
> -	granularity = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
> +	blksize = 1 << mp->m_sb.sb_blocklog;
>  
>  	/*
> -	 * Round the range of extents we are going to convert inwards.  If the
> -	 * offset is aligned, then it doesn't get changed so we zero from the
> -	 * start of the block offset points to.
> +	 * Align the range inward to blocksize. This represents the range that
> +	 * can be converted to unwritten extents.
>  	 */
> -	start_boundary = round_up(offset, granularity);
> -	end_boundary = round_down(offset + len, granularity);
> +	start_boundary = round_up(offset, blksize);
> +	end_boundary = round_down(offset + len, blksize);
>  
>  	ASSERT(start_boundary >= offset);
>  	ASSERT(end_boundary <= offset + len);
>  
> -	if (start_boundary < end_boundary - 1) {
> +	/*
> +	 * Flush the eof page before we start so we don't lose an i_size update
> +	 * due to discarding an appending I/O in cache.
> +	 */
> +	eof = round_down(i_size_read(VFS_I(ip)) - 1, PAGE_CACHE_SIZE);
> +	if (eof >= start_boundary && eof <= end_boundary)
> +		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, eof, -1);
> +
> +	/*
> +	 * Writeback the partial pages if either the start or end is not page
> +	 * aligned to prevent writeback races with partial block zeroing.
> +	 * truncate_pagecache_range() handles partial page zeroing if the pages
> +	 * are cached.
> +	 */
> +	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(offset + len)) {
> +		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, offset,
> +					     start_boundary);
> +		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, end_boundary,
> +					     offset + len);
> +	}
> +	truncate_pagecache_range(VFS_I(ip), offset, offset + len - 1);

... here because the request can still fail with ENOSPC during block
allocation after we have truncated pagecache for the entire range but
before we might have killed the delalloc blocks in the range. This leads
to the same delalloc blocks not covered by pagecache problem.

I suspect the fix is to truncate the pagecache in smaller portions as we
proceed through the function, but I'll have to give it a try. I *think*
it's important to handle the truncate before making any extent
modifications lest weird things happen with writeback.

Brian

> +
> +	/* handle a partial start block */
> +	if (offset < start_boundary) {
> +		/* don't go past the end offset if it's within the block */
> +		endoffset = min_t(xfs_off_t, start_boundary, offset + len);
> +
> +		/* preallocate the block and zero the partial range on disk */
> +		error = xfs_alloc_file_space(ip, offset,
> +				start_boundary - offset, XFS_BMAPI_PREALLOC);
> +		if (error)
> +			goto out;
> +		error = xfs_zero_remaining_bytes(ip, offset, endoffset - 1);
> +		if (error)
> +			goto out;
> +
> +		/* if we've hit the end offset, we're done */
> +		if (endoffset == offset + len)
> +			goto out;
> +	}
> +
> +	/* handle complete blocks */
> +	if (end_boundary > start_boundary) {
>  		/*
> -		 * punch out delayed allocation blocks and the page cache over
> -		 * the conversion range
> +		 * Punch out any delalloc extents first. We don't need the data
> +		 * and this is more efficient than converting them all to
> +		 * written->unwritten.
>  		 */
>  		xfs_ilock(ip, XFS_ILOCK_EXCL);
>  		error = xfs_bmap_punch_delalloc_range(ip,
>  				XFS_B_TO_FSBT(mp, start_boundary),
> -				XFS_B_TO_FSB(mp, end_boundary - start_boundary));
> +				XFS_B_TO_FSB(mp,
> +					     end_boundary - start_boundary));
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		truncate_pagecache_range(VFS_I(ip), start_boundary,
> -					 end_boundary - 1);
>  
> -		/* convert the blocks */
> +		/* convert the range to unwritten extents */
>  		error = xfs_alloc_file_space(ip, start_boundary,
>  					end_boundary - start_boundary - 1,
>  					XFS_BMAPI_PREALLOC | XFS_BMAPI_CONVERT);
>  		if (error)
>  			goto out;
> +	}
>  
> -		/* We've handled the interior of the range, now for the edges */
> -		if (start_boundary != offset) {
> -			error = xfs_iozero(ip, offset, start_boundary - offset);
> -			if (error)
> -				goto out;
> -		}
> -
> -		if (end_boundary != offset + len)
> -			error = xfs_iozero(ip, end_boundary,
> -					   offset + len - end_boundary);
> -
> -	} else {
> -		/*
> -		 * It's either a sub-granularity range or the range spanned lies
> -		 * partially across two adjacent blocks.
> -		 */
> -		error = xfs_iozero(ip, offset, len);
> +	/* handle a partial end block */
> +	if (offset + len > end_boundary) {
> +		error = xfs_alloc_file_space(ip, end_boundary,
> +				offset + len - end_boundary,
> +				XFS_BMAPI_PREALLOC);
> +		if (error)
> +			goto out;
> +		error = xfs_zero_remaining_bytes(ip, end_boundary,
> +						 offset + len - 1);
>  	}
>  
>  out:
> -- 
> 1.8.3.1
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] xfs: fix zero range i_size problems
  2014-10-06 13:11 ` Brian Foster
@ 2014-10-06 20:29   ` Dave Chinner
  2014-10-06 22:09     ` Brian Foster
  0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2014-10-06 20:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Mon, Oct 06, 2014 at 09:11:11AM -0400, Brian Foster wrote:
> On Thu, Oct 02, 2014 at 09:12:13AM -0400, Brian Foster wrote:
> > The zero range operation is analogous to fallocate with the exception of
> > converting the range to zeroes. E.g., it attempts to allocate zeroed
> > blocks over the range specified by the caller. The XFS implementation
> > kills all delalloc blocks currently over the aligned range, converts the
> > range to allocated zero blocks (unwritten extents) and handles the
> > partial pages at the ends of the range by sending writes through the
> > pagecache.
> > 
> > The current implementation suffers from several problems associated with
> > inode size. If the aligned range covers an extending I/O, said I/O is
> > discarded and an inode size update from a previous write never makes it
> > to disk. Further, if an unaligned zero range extends beyond eof, the
> > page write induced for the partial end page can itself increase the
> > inode size, even if the zero range request is not supposed to update
> > i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
> > 
> > Update xfs_zero_file_space() to forego using buffered I/O for partial
> > pages. Instead, convert the block aligned (rather than page aligned)
> > range to unwritten extents, update the partial blocks on disk and use
> > existing pagecache functionality to ensure cached pages are consistent
> > with data on disk.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > This passes xfstests and survived an overnight fsx run (+300m ops) on a
> > 1k fsb fs. The tradeoff that I've observed from this implementation is
> > that fzero is now also likely to trigger the problem described in the
> > recently posted xfs/053 test:
> > 
> > http://oss.sgi.com/archives/xfs/2014-09/msg00473.html
> > 
> > This is slightly unfortunate, but as I understand it that problem is
> > more fundamentally associated with extent conversion and writeback vs.
> > any particular fs operation that might expose it.
> > 
> > Brian
> > 
> 
> I spent the past few days trying to reproduce and nail down some of the
> delalloc related asserts reproduced by generic/269 and discovered that
> zero range is at least one cause of the problem. This patch actually
> addresses that particular problem, yet introduces another similar
> variant of the problem.
> 
> The original cause is "leakage" of speculative preallocation blocks into
> the addressable range of the file (within eof) without any pages
> associated with the delalloc extent. This means that it isn't
> necessarily sufficient to writeback a particular delalloc range to
> convert it. This sets a landmine for direct I/O or inode eviction to
> step on when it comes time to perform I/O over the range or clean the
> inode.
> 
> For example:
> 
> # xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" /mnt/file
> wrote 131072/131072 bytes at offset 0
> 128 KiB, 32 ops; 0.0000 sec (786.164 MiB/sec and 201257.8616 ops/sec)
> # xfs_io -d -c "pread 128k 128k" /mnt/file 
> Segmentation fault
> 
> The first command creates a delalloc extent with preallocation and then
> runs a non-file-size-extending zero range beyond the end of the file
> with an unaligned end offset (note that doing the zero range within the
> same process is required to ensure prealloc isn't trimmed by close()).
> As discussed above, this results in an internal I/O that increases
> i_size and leaks the previous preallocation within eof.

This is a bug within the zero range code - we should never try to
sub-block zeroing beyond EOF unless we are extending the file size
and need to zero blocks/prealloc that already exists. Extending the
file should only be done through xfs_setattr_size(). If the range to
be zeroed beyond EOF is smaller than a block, then we should just
prealloc the entire block because the EOF extension (whenever it is
done) will do IO that results in the block being fully zeroed,
anyway.

> The patch here addresses the zero range problem simply by avoiding
> xfs_iozero(). It introduces a similar problem...

Right, same conclusion, different reasoning. ;)

> > +	/*
> > +	 * Writeback the partial pages if either the start or end is not page
> > +	 * aligned to prevent writeback races with partial block zeroing.
> > +	 * truncate_pagecache_range() handles partial page zeroing if the pages
> > +	 * are cached.
> > +	 */
> > +	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(offset + len)) {
> > +		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, offset,
> > +					     start_boundary);
> > +		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, end_boundary,
> > +					     offset + len);
> > +	}
> > +	truncate_pagecache_range(VFS_I(ip), offset, offset + len - 1);
> 
> ... here because the request can still fail with ENOSPC during block
> allocation after we have truncated pagecache for the entire range but
> before we might have killed the delalloc blocks in the range. This leads
> to the same delalloc blocks not covered by pagecache problem.

Another reason for punching the delalloc blocks first. ;)

> I suspect the fix is to truncate the pagecache in smaller portions as we
> proceed through the function, but I'll have to give it a try. I *think*
> it's important to handle the truncate before making any extent
> modifications lest weird things happen with writeback.

Or we could just punch the delalloc blocks on failure. i.e. punching
delalloc blocks is used only for error handling (like in the buffer
write error handling case). It will at least "zero" the delalloc
portions of the range on failure...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [PATCH] xfs: fix zero range i_size problems
  2014-10-06 20:29   ` Dave Chinner
@ 2014-10-06 22:09     ` Brian Foster
  0 siblings, 0 replies; 4+ messages in thread
From: Brian Foster @ 2014-10-06 22:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Tue, Oct 07, 2014 at 07:29:21AM +1100, Dave Chinner wrote:
> On Mon, Oct 06, 2014 at 09:11:11AM -0400, Brian Foster wrote:
> > On Thu, Oct 02, 2014 at 09:12:13AM -0400, Brian Foster wrote:
> > > The zero range operation is analogous to fallocate with the exception of
> > > converting the range to zeroes. E.g., it attempts to allocate zeroed
> > > blocks over the range specified by the caller. The XFS implementation
> > > kills all delalloc blocks currently over the aligned range, converts the
> > > range to allocated zero blocks (unwritten extents) and handles the
> > > partial pages at the ends of the range by sending writes through the
> > > pagecache.
> > > 
> > > The current implementation suffers from several problems associated with
> > > inode size. If the aligned range covers an extending I/O, said I/O is
> > > discarded and an inode size update from a previous write never makes it
> > > to disk. Further, if an unaligned zero range extends beyond eof, the
> > > page write induced for the partial end page can itself increase the
> > > inode size, even if the zero range request is not supposed to update
> > > i_size (via KEEP_SIZE, similar to an fallocate beyond EOF).
> > > 
> > > Update xfs_zero_file_space() to forego using buffered I/O for partial
> > > pages. Instead, convert the block aligned (rather than page aligned)
> > > range to unwritten extents, update the partial blocks on disk and use
> > > existing pagecache functionality to ensure cached pages are consistent
> > > with data on disk.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > > 
> > > This passes xfstests and survived an overnight fsx run (+300m ops) on a
> > > 1k fsb fs. The tradeoff that I've observed from this implementation is
> > > that fzero is now also likely to trigger the problem described in the
> > > recently posted xfs/053 test:
> > > 
> > > http://oss.sgi.com/archives/xfs/2014-09/msg00473.html
> > > 
> > > This is slightly unfortunate, but as I understand it that problem is
> > > more fundamentally associated with extent conversion and writeback vs.
> > > any particular fs operation that might expose it.
> > > 
> > > Brian
> > > 
> > 
> > I spent the past few days trying to reproduce and nail down some of the
> > delalloc related asserts reproduced by generic/269 and discovered that
> > zero range is at least one cause of the problem. This patch actually
> > addresses that particular problem, yet introduces another similar
> > variant of the problem.
> > 
> > The original cause is "leakage" of speculative preallocation blocks into
> > the addressable range of the file (within eof) without any pages
> > associated with the delalloc extent. This means that it isn't
> > necessarily sufficient to writeback a particular delalloc range to
> > convert it. This sets a landmine for direct I/O or inode eviction to
> > step on when it comes time to perform I/O over the range or clean the
> > inode.
> > 
> > For example:
> > 
> > # xfs_io -fc "pwrite 0 128k" -c "fzero -k 1m 54321" /mnt/file
> > wrote 131072/131072 bytes at offset 0
> > 128 KiB, 32 ops; 0.0000 sec (786.164 MiB/sec and 201257.8616 ops/sec)
> > # xfs_io -d -c "pread 128k 128k" /mnt/file 
> > Segmentation fault
> > 
> > The first command creates a delalloc extent with preallocation and then
> > runs a non-file-size-extending zero range beyond the end of the file
> > with an unaligned end offset (note that doing the zero range within the
> > same process is required to ensure prealloc isn't trimmed by close()).
> > As discussed above, this results in an internal I/O that increases
> > i_size and leaks the previous preallocation within eof.
> 
> This is a bug within the zero range code - we should never try to
> sub-block zeroing beyond EOF unless we are extending the file size
> and need to zero blocks/prealloc that already exists. Extending the
> file should only be done through xfs_setattr_size(). If the range to
> be zeroed beyond EOF is smaller than a block, then we should just
> prealloc the entire block because the EOF extension (whenever it is
> done) will do IO that results in the block being fully zeroed,
> anyway.
> 

Right, the original purpose of this patch was to address that problem.
The way it fell out, I didn't end up using xfs_iozero() at all.
Technically, it still has the potential to do sub-block zeroing
post-eof, but it doesn't matter because it isn't doing buffered I/O.
Sure, it's pointless in the post-eof case, but not problematic and not
something I thought worth complicating the code over. FWIW, I also don't
really see any reason to reintroduce xfs_iozero() into this rework.

The larger point is that the current, broken zero range seems to also be
a primary cause of the umount and direct I/O delalloc asserts that I see
via generic/269.

BTW, any thoughts on how we could end up with contiguous but independent
delalloc extents in the in-core list?

> > The patch here addresses the zero range problem simply by avoiding
> > xfs_iozero(). It introduces a similar problem...
> 
> Right, same conclusion, different reasoning. ;)
> 
> > > +	/*
> > > +	 * Writeback the partial pages if either the start or end is not page
> > > +	 * aligned to prevent writeback races with partial block zeroing.
> > > +	 * truncate_pagecache_range() handles partial page zeroing if the pages
> > > +	 * are cached.
> > > +	 */
> > > +	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(offset + len)) {
> > > +		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, offset,
> > > +					     start_boundary);
> > > +		filemap_write_and_wait_range(VFS_I(ip)->i_mapping, end_boundary,
> > > +					     offset + len);
> > > +	}
> > > +	truncate_pagecache_range(VFS_I(ip), offset, offset + len - 1);
> > 
> > ... here because the request can still fail with ENOSPC during block
> > allocation after we have truncated pagecache for the entire range but
> > before we might have killed the delalloc blocks in the range. This leads
> > to the same delalloc blocks not covered by pagecache problem.
> 
> Another reason for punching the delalloc blocks first. ;)
> 
> > I suspect the fix is to truncate the pagecache in smaller portions as we
> > proceed through the function, but I'll have to give it a try. I *think*
> > it's important to handle the truncate before making any extent
> > modifications lest weird things happen with writeback.
> 
> Or we could just punch the delalloc blocks on failure. i.e. punching
> delalloc blocks is used only for error handling (like in the buffer
> write error handling case). It will at least "zero" the delalloc
> portions of the range on failure...
> 

I started reordering things and whatnot and noticed that much of this
code is duplicated in xfs_free_file_space(). IOW, we can probably get
this behavior with a significantly simpler implementation that
basically consists of a page cache truncate, a call to
xfs_free_file_space() and a call to xfs_alloc_file_space().

That unnecessarily punches out all extents rather than just the delalloc
extents, but also naturally leaves the file in an expected state (the
range is zeroed) if the preallocation fails anywhere along the way. I'm
wondering if it's worth it. Thoughts?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2014-10-06 22:10 UTC | newest]

Thread overview: 4+ messages
-- links below jump to the message on this page --
2014-10-02 13:12 [PATCH] xfs: fix zero range i_size problems Brian Foster
2014-10-06 13:11 ` Brian Foster
2014-10-06 20:29   ` Dave Chinner
2014-10-06 22:09     ` Brian Foster
