From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: redirty eof folio on truncate to avoid filemap flush
Date: Sat, 29 Oct 2022 08:30:14 +1100 [thread overview]
Message-ID: <20221028213014.GD3600936@dread.disaster.area> (raw)
In-Reply-To: <Y1we59XylviZs+Ry@bfoster>
On Fri, Oct 28, 2022 at 02:26:47PM -0400, Brian Foster wrote:
> On Fri, Oct 28, 2022 at 09:11:09AM -0400, Brian Foster wrote:
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >
> > Here's a quick prototype of "option 3" described in my previous mail.
> > This has been spot tested and confirmed to prevent the original stale
> > data exposure problem. More thorough regression testing is still
> > required. Barring unforeseen issues with that, however, I think this is
> > tentatively my new preferred option. The primary reason for that is it
> > avoids looking at extent state and is more in line with what iomap based
> > zeroing should be doing more generically.
> >
> > Because of that, I think this provides a bit more opportunity for follow
> > on fixes (there are other truncate/zeroing problems I've come across
> > during this investigation that still need fixing), cleanup and
> > consolidation of the zeroing code. For example, I think the trajectory
> > of this could look something like:
> >
> > - Genericize a bit more to handle all truncates.
> > - Repurpose iomap_truncate_page() (currently only used by XFS) into a
> > unique implementation from zero range that does explicit zeroing
> > instead of relying on pagecache truncate.
> > - Refactor XFS ranged zeroing to an abstraction that uses a combination
> > of iomap_zero_range() and the new iomap_truncate_page().
> >
>
> After playing with this and thinking a bit more about the above, I think
> I managed to come up with an iomap_truncate_page() prototype that DTRT
> based on this. Only spot tested so far, needs to pass iomap_flags to the
> other bmbt_to_iomap() calls to handle the cow fork, undoubtedly has
> other bugs/warts, etc. etc. This is just a quick prototype to
> demonstrate the idea, which is essentially to check dirty state along
> with extent state while under lock and transfer that state back to iomap
> so it can decide whether it can shortcut or forcibly perform the zero.
>
> In a nutshell, IOMAP_TRUNC_PAGE asks the fs to check dirty state while
> under lock and implies that the range is sub-block (single page).
> IOMAP_F_TRUNC_PAGE on the imap informs iomap that the range was in fact
> dirty, so perform the zero via buffered write regardless of extent
> state.
I'd much prefer we fix this in the iomap infrastructure - failing to
zero dirty data in memory over an unwritten extent isn't an XFS bug,
so we shouldn't be working around it in XFS like we did previously.
I don't think this should be call "IOMAP_TRUNC_PAGE", though,
because that indicates the caller context, not what we are asking
the internal iomap code to do. What we are really asking is for
iomap_zero_iter() to do is zero the page cache if it exists in
memory, otherwise ignore unwritten/hole pages. Hence I think a name
like IOMAP_ZERO_PAGECACHE is more appropriate,
>
> Brian
>
> --- 8< ---
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 91ee0b308e13..14a9734b2838 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -899,7 +899,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> loff_t written = 0;
>
> /* already zeroed? we're done. */
> - if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> + if ((srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN) &&
> + !(srcmap->flags & IOMAP_F_TRUNC_PAGE))
> return length;
Why even involve the filesystem in this? We can do this directly
in iomap_zero_iter() with:
if ((srcmap->type == IOMAP_HOLE)
return;
if (srcmap->type == IOMAP_UNWRITTEN) {
if (!(iter->flags & IOMAP_ZERO_PAGECACHE))
return;
if (!filemap_range_needs_writeback(inode->i_mapping,
iomap->offset, iomap->offset + iomap->length))
return;
}
It probably also warrants a coment that a clean folio over EOF on an
unwritten extent already contains zeros, so we're only interested in
folios that *have been dirtied* over this extent. If it's under
writeback, we should still be zeroing because it will shortly
contain real data on disk and so it needs to be zeroed and
redirtied....
> @@ -916,6 +917,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> if (bytes > folio_size(folio) - offset)
> bytes = folio_size(folio) - offset;
>
> + trace_printk("%d: ino 0x%lx offset 0x%lx bytes 0x%lx\n",
> + __LINE__, folio->mapping->host->i_ino, offset, bytes);
> folio_zero_range(folio, offset, bytes);
> folio_mark_accessed(folio);
>
> @@ -933,6 +936,17 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> return written;
> }
>
> +static int
> +__iomap_zero_range(struct iomap_iter *iter, bool *did_zero,
> + const struct iomap_ops *ops)
> +{
> + int ret;
> +
> + while ((ret = iomap_iter(iter, ops)) > 0)
> + iter->processed = iomap_zero_iter(iter, did_zero);
> + return ret;
> +}
I'd just leave this simple loop open coded in the two callers.
> +
> int
> iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> const struct iomap_ops *ops)
> @@ -943,11 +957,8 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> .len = len,
> .flags = IOMAP_ZERO,
> };
> - int ret;
>
> - while ((ret = iomap_iter(&iter, ops)) > 0)
> - iter.processed = iomap_zero_iter(&iter, did_zero);
> - return ret;
> + return __iomap_zero_range(&iter, did_zero, ops);
> }
> EXPORT_SYMBOL_GPL(iomap_zero_range);
>
> @@ -957,11 +968,17 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> {
> unsigned int blocksize = i_blocksize(inode);
> unsigned int off = pos & (blocksize - 1);
> + struct iomap_iter iter = {
> + .inode = inode,
> + .pos = pos,
> + .len = blocksize - off,
> + .flags = IOMAP_ZERO | IOMAP_TRUNC_PAGE,
> + };
>
> /* Block boundary? Nothing to do */
> if (!off)
> return 0;
> - return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
> + return __iomap_zero_range(&iter, did_zero, ops);
> }
> EXPORT_SYMBOL_GPL(iomap_truncate_page);
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 07da03976ec1..16d9b838e82d 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -915,6 +915,7 @@ xfs_buffered_write_iomap_begin(
> int allocfork = XFS_DATA_FORK;
> int error = 0;
> unsigned int lockmode = XFS_ILOCK_EXCL;
> + u16 iomap_flags = 0;
>
> if (xfs_is_shutdown(mp))
> return -EIO;
> @@ -942,6 +943,10 @@ xfs_buffered_write_iomap_begin(
> if (error)
> goto out_unlock;
>
> + if ((flags & IOMAP_TRUNC_PAGE) &&
> + filemap_range_needs_writeback(VFS_I(ip)->i_mapping, offset, offset))
> + iomap_flags |= IOMAP_F_TRUNC_PAGE;
As per above, I don't think we should be putting this check in the
filesystem. That simplifies this a lot as filesystems don't need to
know anything about how iomap manages the page cache for the
filesystem...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2022-10-28 21:30 UTC|newest]
Thread overview: 14+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-28 13:04 [PATCH RFC 0/2] xfs: optimize truncate cache flushing Brian Foster
2022-10-28 13:04 ` [PATCH RFC 1/2] xfs: lift truncate iomap zeroing into a new helper Brian Foster
2022-10-28 13:04 ` [PATCH RFC 2/2] xfs: optimize eof page flush for iomap zeroing on truncate Brian Foster
2022-10-28 13:11 ` [PATCH] xfs: redirty eof folio on truncate to avoid filemap flush Brian Foster
2022-10-28 18:26 ` Brian Foster
2022-10-28 21:30 ` Dave Chinner [this message]
2022-10-28 23:49 ` Darrick J. Wong
2022-10-29 22:01 ` Dave Chinner
2022-11-02 8:15 ` Christoph Hellwig
2022-11-03 14:53 ` Brian Foster
2022-11-03 22:25 ` Dave Chinner
2022-11-04 18:22 ` Brian Foster
2022-11-02 8:25 ` Christoph Hellwig
2022-10-28 21:35 ` kernel test robot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20221028213014.GD3600936@dread.disaster.area \
--to=david@fromorbit.com \
--cc=bfoster@redhat.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.