Re: [PATCH 2/6] xfs: write failure beyond EOF truncates too much data

From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Al@disappointment.disaster, Viro@disappointment.disaster,
	viro@ZenIV.linux.org.uk, xfs@oss.sgi.com
Subject: Re: [PATCH 2/6] xfs: write failure beyond EOF truncates too much data
Date: Sat, 29 Mar 2014 11:14:19 -0400	[thread overview]
Message-ID: <20140329151419.GA33827@bfoster.bfoster> (raw)
In-Reply-To: <1395396710-3824-3-git-send-email-david@fromorbit.com>

On Fri, Mar 21, 2014 at 09:11:46PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we fail a write beyond EOF and have to handle it in
> xfs_vm_write_begin(), we truncate the inode back to the current inode
> size. This doesn't take into account the fact that we may have
> already made successful writes to the same page (in the case of block
> size < page size) and hence we can truncate the page cache away from
> blocks with valid data in them. If these blocks are delayed
> allocation blocks, we now have a mismatch between the page cache and
> the extent tree, and this will trigger - at minimum - a delayed
> block count mismatch assert when the inode is evicted from the cache.
> We can also trip over it when block mapping for direct IO - this is
> the most common symptom seen from fsx and fsstress when run from
> xfstests.
> 
> Fix it by only truncating away the exact range we are updating state
> for in this write_begin call.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e810243..6b4ecc8 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
>  	status = __block_write_begin(page, pos, len, xfs_get_blocks);
>  	if (unlikely(status)) {
>  		struct inode	*inode = mapping->host;
> +		size_t		isize = i_size_read(inode);
>  
>  		xfs_vm_write_failed(inode, page, pos, len);
>  		unlock_page(page);
>  
> -		if (pos + len > i_size_read(inode))
> -			truncate_pagecache(inode, i_size_read(inode));

>From what I can see, we have a write_begin/copy/write_end sequence per
page and the inode size is updated down in the write_end path. If the
write fails at write_begin time, then we haven't copied any data and the
inode size hasn't changed.

The intent of the original code looks like it wants to break off any
blocks that might have been set up beyond EOF before the write happened
to fail. Could you elaborate on how this can kill blocks that might have
been written successfully? Doesn't this only affect post-EOF metadata?

> +		/*
> +		 * If the write is beyond EOF, we only want to kill blocks
> +		 * allocated in this write, not blocks that were previously
> +		 * written successfully.
> +		 */
> +		if (pos + len > isize)
> +			truncate_pagecache_range(inode, pos, pos + len);

So the cache truncate now covers the range of the write. What happens if
the part of the write is an overwrite? This seems similar to the problem
you've described above... should the truncate start at isize instead?

Brian

>  
>  		page_cache_release(page);
>  		page = NULL;
> -- 
> 1.9.0
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs