All of lore.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Eryu Guan <eguan@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2] xfs: truncate pagecache before writeback in xfs_setattr_size()
Date: Thu, 2 Nov 2017 08:39:13 -0400	[thread overview]
Message-ID: <20171102123912.GA16645@bfoster.bfoster> (raw)
In-Reply-To: <20171101140540.21542-1-eguan@redhat.com>

On Wed, Nov 01, 2017 at 10:05:40PM +0800, Eryu Guan wrote:
> On truncate down, if new size is not block size aligned, we zero the
> rest of block to avoid exposing stale data to user, and
> iomap_truncate_page() skips zeroing if the range is already in
> unwritten state or a hole. Then we writeback from on-disk i_size to
> the new size if this range hasn't been written to disk yet, and
> truncate page cache beyond new EOF and set in-core i_size.
> 
> The problem is that we could write data between di_size and newsize
> before removing the page cache beyond newsize, as the extents may
> still be in unwritten state right after a buffer write. As such, the
> page of data that newsize lies in has not been zeroed by page cache
> invalidation before it is written, and xfs_do_writepage() hasn't
> triggered it's "zero data beyond EOF" case because we haven't
> updated in-core i_size yet. Then a subsequent mmap read could see
> non-zeros past EOF.
> 
> I occasionally see this in fsx runs in fstests generic/112, a
> simplified fsx operation sequence is like (assuming 4k block size
> xfs):
> 
>   fallocate 0x0 0x1000 0x0 keep_size
>   write 0x0 0x1000 0x0
>   truncate 0x0 0x800 0x1000
>   punch_hole 0x0 0x800 0x800
>   mapread 0x0 0x800 0x800
> 
> where fallocate allocates unwritten extent but doesn't update
> i_size, buffer write populates the page cache and extent is still
> unwritten, truncate skips zeroing page past new EOF and writes the
> page to disk, punch_hole invalidates the page cache, at last mapread
> reads the block back and sees non-zero beyond EOF.
> 
> Fix it by moving truncate_setsize() to before writeback so the page
> cache invalidation zeros the partial page at the new EOF. This also
> triggers "zero data beyond EOF" in xfs_do_writepage() at writeback
> time, because newsize has been set and page straddles the newsize.
> 
> Also fixed the wrong 'end' param of filemap_write_and_wait_range()
> call while we're at it, the 'end' is inclusive and should be
> 'newsize - 1'.
> 
> Suggested-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Eryu Guan <eguan@redhat.com>
> ---
> 

Thanks for the nice explanation. This looks Ok to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

> v2:
> - fix the bug by moving truncate_setsize() before writeback as suggested
>   by Dave
> - update summary and commit log accordingly, with some words stolen from
>   Dave :)
> 
> v1: https://www.spinics.net/lists/linux-xfs/msg12321.html
> 
>  fs/xfs/xfs_iops.c | 36 ++++++++++++++++++++----------------
>  1 file changed, 20 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 17081c77ef86..f24e5b6cfc86 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -886,22 +886,6 @@ xfs_setattr_size(
>  		return error;
>  
>  	/*
> -	 * We are going to log the inode size change in this transaction so
> -	 * any previous writes that are beyond the on disk EOF and the new
> -	 * EOF that have not been written out need to be written here.  If we
> -	 * do not write the data out, we expose ourselves to the null files
> -	 * problem. Note that this includes any block zeroing we did above;
> -	 * otherwise those blocks may not be zeroed after a crash.
> -	 */
> -	if (did_zeroing ||
> -	    (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
> -		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> -						      ip->i_d.di_size, newsize);
> -		if (error)
> -			return error;
> -	}
> -
> -	/*
>  	 * We've already locked out new page faults, so now we can safely remove
>  	 * pages from the page cache knowing they won't get refaulted until we
>  	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
> @@ -917,9 +901,29 @@ xfs_setattr_size(
>  	 * user visible changes). There's not much we can do about this, except
>  	 * to hope that the caller sees ENOMEM and retries the truncate
>  	 * operation.
> +	 *
> +	 * And we update in-core i_size and truncate page cache beyond newsize
> +	 * before writeback the [di_size, newsize] range, so we're guaranteed
> +	 * not to write stale data past the new EOF on truncate down.
>  	 */
>  	truncate_setsize(inode, newsize);
>  
> +	/*
> +	 * We are going to log the inode size change in this transaction so
> +	 * any previous writes that are beyond the on disk EOF and the new
> +	 * EOF that have not been written out need to be written here.  If we
> +	 * do not write the data out, we expose ourselves to the null files
> +	 * problem. Note that this includes any block zeroing we did above;
> +	 * otherwise those blocks may not be zeroed after a crash.
> +	 */
> +	if (did_zeroing ||
> +	    (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
> +		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> +						ip->i_d.di_size, newsize - 1);
> +		if (error)
> +			return error;
> +	}
> +
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
>  	if (error)
>  		return error;
> -- 
> 2.13.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

      parent reply	other threads:[~2017-11-02 12:39 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-01 14:05 [PATCH v2] xfs: truncate pagecache before writeback in xfs_setattr_size() Eryu Guan
2017-11-02  0:46 ` Dave Chinner
2017-11-02 12:39 ` Brian Foster [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20171102123912.GA16645@bfoster.bfoster \
    --to=bfoster@redhat.com \
    --cc=eguan@redhat.com \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.