From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mx1.redhat.com ([209.132.183.28]:41414 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752688AbdKBMjO (ORCPT ); Thu, 2 Nov 2017 08:39:14 -0400 Received: from smtp.corp.redhat.com (int-mx03.intmail.prod.int.phx2.redhat.com [10.5.11.13]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id B5C4D81DF7 for ; Thu, 2 Nov 2017 12:39:14 +0000 (UTC) Date: Thu, 2 Nov 2017 08:39:13 -0400 From: Brian Foster Subject: Re: [PATCH v2] xfs: truncate pagecache before writeback in xfs_setattr_size() Message-ID: <20171102123912.GA16645@bfoster.bfoster> References: <20171101140540.21542-1-eguan@redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20171101140540.21542-1-eguan@redhat.com> Sender: linux-xfs-owner@vger.kernel.org List-ID: List-Id: xfs To: Eryu Guan Cc: linux-xfs@vger.kernel.org On Wed, Nov 01, 2017 at 10:05:40PM +0800, Eryu Guan wrote: > On truncate down, if new size is not block size aligned, we zero the > rest of block to avoid exposing stale data to user, and > iomap_truncate_page() skips zeroing if the range is already in > unwritten state or a hole. Then we writeback from on-disk i_size to > the new size if this range hasn't been written to disk yet, and > truncate page cache beyond new EOF and set in-core i_size. > > The problem is that we could write data between di_size and newsize > before removing the page cache beyond newsize, as the extents may > still be in unwritten state right after a buffer write. As such, the > page of data that newsize lies in has not been zeroed by page cache > invalidation before it is written, and xfs_do_writepage() hasn't > triggered it's "zero data beyond EOF" case because we haven't > updated in-core i_size yet. Then a subsequent mmap read could see > non-zeros past EOF. > > I occasionally see this in fsx runs in fstests generic/112, a > simplified fsx operation sequence is like (assuming 4k block size > xfs): > > fallocate 0x0 0x1000 0x0 keep_size > write 0x0 0x1000 0x0 > truncate 0x0 0x800 0x1000 > punch_hole 0x0 0x800 0x800 > mapread 0x0 0x800 0x800 > > where fallocate allocates unwritten extent but doesn't update > i_size, buffer write populates the page cache and extent is still > unwritten, truncate skips zeroing page past new EOF and writes the > page to disk, punch_hole invalidates the page cache, at last mapread > reads the block back and sees non-zero beyond EOF. > > Fix it by moving truncate_setsize() to before writeback so the page > cache invalidation zeros the partial page at the new EOF. This also > triggers "zero data beyond EOF" in xfs_do_writepage() at writeback > time, because newsize has been set and page straddles the newsize. > > Also fixed the wrong 'end' param of filemap_write_and_wait_range() > call while we're at it, the 'end' is inclusive and should be > 'newsize - 1'. > > Suggested-by: Dave Chinner > Signed-off-by: Eryu Guan > --- > Thanks for the nice explanation. This looks Ok to me: Reviewed-by: Brian Foster > v2: > - fix the bug by moving truncate_setsize() before writeback as suggested > by Dave > - update summary and commit log accordingly, with some words stolen from > Dave :) > > v1: https://www.spinics.net/lists/linux-xfs/msg12321.html > > fs/xfs/xfs_iops.c | 36 ++++++++++++++++++++---------------- > 1 file changed, 20 insertions(+), 16 deletions(-) > > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index 17081c77ef86..f24e5b6cfc86 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -886,22 +886,6 @@ xfs_setattr_size( > return error; > > /* > - * We are going to log the inode size change in this transaction so > - * any previous writes that are beyond the on disk EOF and the new > - * EOF that have not been written out need to be written here. If we > - * do not write the data out, we expose ourselves to the null files > - * problem. Note that this includes any block zeroing we did above; > - * otherwise those blocks may not be zeroed after a crash. > - */ > - if (did_zeroing || > - (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { > - error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, > - ip->i_d.di_size, newsize); > - if (error) > - return error; > - } > - > - /* > * We've already locked out new page faults, so now we can safely remove > * pages from the page cache knowing they won't get refaulted until we > * drop the XFS_MMAP_EXCL lock after the extent manipulations are > @@ -917,9 +901,29 @@ xfs_setattr_size( > * user visible changes). There's not much we can do about this, except > * to hope that the caller sees ENOMEM and retries the truncate > * operation. > + * > + * And we update in-core i_size and truncate page cache beyond newsize > + * before writeback the [di_size, newsize] range, so we're guaranteed > + * not to write stale data past the new EOF on truncate down. > */ > truncate_setsize(inode, newsize); > > + /* > + * We are going to log the inode size change in this transaction so > + * any previous writes that are beyond the on disk EOF and the new > + * EOF that have not been written out need to be written here. If we > + * do not write the data out, we expose ourselves to the null files > + * problem. Note that this includes any block zeroing we did above; > + * otherwise those blocks may not be zeroed after a crash. > + */ > + if (did_zeroing || > + (newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) { > + error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, > + ip->i_d.di_size, newsize - 1); > + if (error) > + return error; > + } > + > error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); > if (error) > return error; > -- > 2.13.6 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html