From: Zhang Yi <yi.zhang@huaweicloud.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org,
	djwong@kernel.org, hch@infradead.org, brauner@kernel.org,
	chandanbabu@kernel.org, jack@suse.cz, yi.zhang@huawei.com,
	chengzhihao1@huawei.com, yukuai3@huawei.com
Subject: Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
Date: Wed, 22 May 2024 09:57:13 +0800
Message-ID: <122ab6ed-147b-517c-148d-7cb35f7f888b@huaweicloud.com>
In-Reply-To: <ZkwJJuFCV+WQLl40@dread.disaster.area>

On 2024/5/21 10:38, Dave Chinner wrote:
> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When truncating a realtime file to a shorter, unaligned size,
>> xfs_setattr_size() only flushes the EOF page before zeroing out, and
>> xfs_truncate_page() also only zeroes the EOF block. This could expose
>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>> a write operation").
>>
>> Suppose sb_rextsize is bigger than one block and we have a realtime
>> inode that contains a long enough written extent. If we do an unaligned
>> truncate into the middle of this extent, xfs_itruncate_extents() could
>> split the extent and align its tail to sb_rextsize, so there may be
>> more than one block between the end of the file and the end of the
>> extent. Since xfs_truncate_page() only zeroes the trailing portion of
>> the i_blocksize() value, it may leave some blocks containing stale data
>> that could be exposed if a later append write extends the file far
>> enough past them.
>>
>> xfs_truncate_page() should flush and zero out the entire rtextsize range,
>> and make sure the entire zeroed range has been flushed to disk before
>> updating the inode size.
>>
>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>>  fs/xfs/xfs_iops.c  | 10 ----------
>>  2 files changed, 31 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 4958cc3337bc..fc379450fe74 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>>  	loff_t			pos,
>>  	bool			*did_zero)
>>  {
>> +	struct xfs_mount	*mp = ip->i_mount;
>>  	struct inode		*inode = VFS_I(ip);
>>  	unsigned int		blocksize = i_blocksize(inode);
>> +	int			error;
>> +
>> +	if (XFS_IS_REALTIME_INODE(ip))
>> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>> +
>> +	/*
>> +	 * iomap won't detect a dirty page over an unwritten block (or a
>> +	 * cow block over a hole) and subsequently skips zeroing the
>> +	 * newly post-EOF portion of the page. Flush the new EOF to
>> +	 * convert the block before the pagecache truncate.
>> +	 */
>> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
>> +					     roundup_64(pos, blocksize));
>> +	if (error)
>> +		return error;
>>  
>>  	if (IS_DAX(inode))
>> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
>> -					&xfs_dax_write_iomap_ops);
>> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
>> -				   &xfs_buffered_write_iomap_ops);
>> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
>> +					  &xfs_dax_write_iomap_ops);
>> +	else
>> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
>> +					    &xfs_buffered_write_iomap_ops);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 * Write back path won't write dirty blocks post EOF folio,
>> +	 * flush the entire zeroed range before updating the inode
>> +	 * size.
>> +	 */
>> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
>> +					    roundup_64(pos, blocksize));
>>  }
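
For illustration, the difference between the old and the new zeroing
range can be worked out in userspace. This is only a sketch of the
arithmetic, not kernel code: roundup_u64() and tail_len() are
hypothetical helpers, and the 4096-byte block size, sb_rextsize of 4
blocks and new EOF of 10000 used in the note below are assumed values.

```c
#include <stdint.h>

/* Round x up to the next multiple of y; y need not be a power of two. */
uint64_t roundup_u64(uint64_t x, uint64_t y)
{
	return ((x + y - 1) / y) * y;
}

/*
 * Bytes between pos and the end of the allocation unit that contains
 * it (0 if pos is already aligned). This is the tail that has to hold
 * zeroes after a truncate down to pos.
 */
uint64_t tail_len(uint64_t pos, uint64_t unit)
{
	return roundup_u64(pos, unit) - pos;
}
```

With those assumed values, tail_len(10000, 4096) is 2288 and
tail_len(10000, 4 * 4096) is 6384, so zeroing only the EOF block would
leave 4096 bytes of the realtime extent untouched.
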
> 
> Ok, this means we do -three- blocking writebacks through this path
> instead of one or maybe two.
> 
> We already know that this existing blocking writeback case for dirty
> pages over unwritten extents is a significant performance issue for
> some workloads. I have a fix in progress for iomap to handle this
> case without requiring blocking writeback to be done to convert the
> extent to written before we do the truncate.
> 
> Regardless, I think this whole "truncate is allocation unit size
> aware" algorithm is largely unworkable without a rewrite. What XFS
> needs to do on truncate *down* before we start the truncate
> transaction is pretty simple:
> 
> 	- ensure that the new EOF extent tail contains zeroes
> 	- ensure that the range from the existing ip->i_disk_size to
> 	  the new EOF is on disk so data vs metadata ordering is
> 	  correct for crash recovery purposes.
> 
> What this patch does to achieve that is:
> 
> 	1. blocking writeback to clean dirty unwritten/cow blocks at
> 	the new EOF.
> 	2. iomap_truncate_page() writes zeroes into the page cache,
> 	which dirties the pages we just cleaned at the new EOF.
> 	3. blocking writeback to clean the dirty blocks at the new
> 	EOF.
> 	4. truncate_setsize() then writes zeros to partial folios at
> 	the new EOF, dirtying the EOF page again.
> 	5. blocking writeback to clean dirty blocks from the current
> 	on-disk size to the new EOF.
> 
> This is pretty crazy when you stop and think about it. We're writing
> the same EOF block -three- times. The first data write gets
> overwritten by zeroes on the second write, and the third write
> writes the same zeroes as the second write. There are two redundant
> *blocking* writes in this process.

Yes, this is indeed a performance disaster, and iomap_zero_range()
should be aware of dirty pages. I had the same problem when developing
the buffered iomap conversion on ext4.

> 
> We can do all this with a single writeback operation if we are a
> little bit smarter about the order of operations we perform and we
> are a little bit smarter in iomap about zeroing dirty pages in the
> page cache:
> 
> 	1. change iomap_zero_range() to do the right thing with
> 	dirty unwritten and cow extents (the patch I've been working
> 	on).
> 
> 	2. pass the range to be zeroed into iomap_truncate_page()
> 	(the fundamental change being made here).
> 
> 	3. zero the required range *through the page cache*
> 	(iomap_zero_range() already does this).
> 
> 	4. write back the XFS inode from ip->i_disk_size to the end
> 	of the range zeroed by iomap_truncate_page()
> 	(xfs_setattr_size() already does this).
> 
> 	5. i_size_write(newsize);
> 
> 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
> 	the page cache beyond the new EOF without doing any zeroing
> 	as we've already done all the zeroing needed to the page
> 	cache through iomap_truncate_page().
> 
> 
> The patch I'm working on for step 1 is below. It still needs to be
> extended to handle the cow case, but I'm unclear on how to exercise
> that case so I haven't written the code to do it. The rest of it is
> just rearranging the code that we already use just to get the order
> of operations right. The only notable change in behaviour is using
> invalidate_inode_pages2_range() instead of truncate_pagecache(),
> because we don't want the EOF page to be dirtied again once we've
> already written zeroes to disk....
> 

Indeed, this sounds like the best solution. Since Darrick recommended
that we fix the stale data exposure issue on realtime inodes by
converting the tail extent to unwritten, I suppose we could do this
after fixing that problem.
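
To make the cost difference concrete, here is a toy userspace model
that counts how often the EOF block goes to disk under each ordering.
It is a sketch under stated assumptions, not kernel code: struct
eof_block, flush() and zero_in_pagecache() are hypothetical stand-ins,
and it assumes the block at the new EOF starts out dirty and that
iomap_zero_range() can already zero a dirty block in place.

```c
#include <stdbool.h>

/* Toy model of the page cache block at the new EOF; not kernel code. */
struct eof_block {
	bool dirty;
	unsigned int disk_writes;	/* blocking writebacks that hit the block */
};

/* Blocking writeback: a dirty block costs one disk write and becomes clean. */
void flush(struct eof_block *b)
{
	if (b->dirty) {
		b->disk_writes++;
		b->dirty = false;
	}
}

/* Zeroing through the page cache redirties the block. */
void zero_in_pagecache(struct eof_block *b)
{
	b->dirty = true;
}

/* Order of operations in this patch, starting from dirty data at the new EOF. */
unsigned int writes_with_patch(void)
{
	struct eof_block b = { .dirty = true, .disk_writes = 0 };

	flush(&b);		/* flush before zeroing */
	zero_in_pagecache(&b);	/* iomap_truncate_page() */
	flush(&b);		/* flush the zeroed range */
	zero_in_pagecache(&b);	/* truncate_setsize() zeroes the partial folio */
	flush(&b);		/* flush from the on-disk size to the new EOF */
	return b.disk_writes;
}

/* Proposed order: zero the dirty block in place, then write back once. */
unsigned int writes_with_reordering(void)
{
	struct eof_block b = { .dirty = true, .disk_writes = 0 };

	zero_in_pagecache(&b);	/* zeroing that copes with dirty blocks */
	flush(&b);		/* the single writeback up to the zeroed range */
	/* i_size_write() and the invalidation dirty nothing in this model */
	return b.disk_writes;
}
```

In this model writes_with_patch() counts three disk writes and
writes_with_reordering() counts one, which matches the count in the
quoted analysis.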

Thanks,
Yi.

