From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 4/7] xfs: buffered write failure should not truncate the page cache
Date: Wed, 2 Nov 2022 09:41:30 -0700
Message-ID: <Y2KdumAbAF0mV0sh@magnolia>
In-Reply-To: <20221101003412.3842572-5-david@fromorbit.com>
On Tue, Nov 01, 2022 at 11:34:09AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> xfs_buffered_write_iomap_end() currently invalidates the page cache
> over the unused range of the delalloc extent it allocated. While the
> write allocated the delalloc extent, it does not own it exclusively
> as the write does not hold any locks that prevent either writeback
> or mmap page faults from changing either the page cache or the
> extent state backing this range.
>
> Whilst xfs_bmap_punch_delalloc_range() already handles races in
> extent conversion - it will only punch out delalloc extents and it
> ignores any other type of extent - the page cache truncate does not
> discriminate between data written by this write or some other task.
> As a result, truncating the page cache can result in data corruption
> if the write races with mmap modifications to the file over the same
> range.
>
> generic/346 exercises this workload, and if we randomly fail writes
> (as will happen when iomap gets stale iomap detection later in the
> patchset), the page cache truncation randomly corrupts the file data
> because it removes data written by mmap() in the same page as the
> write() that failed.
>
> Hence we do not want to punch out the page cache over the range of
> the extent we failed to write to - what we actually need to do is
> detect the ranges that have dirty data in cache over them and *not
> punch them out*.
Same dumb question as hch -- why do we need to punch out the nondirty
pagecache after a failed write? If the folios are uptodate then we're
evicting cache unnecessarily, and if they're !uptodate can't we let
reclaim do the dirty work for us?
I don't know if there are hysterical raisins for this or if the goal is
to undo memory consumption after a write failure? If we're stale-ing
the write because the iomapping changed, why not leave the folio where
it is, refresh the iomapping, and come back to (possibly?) the same
folio?
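Totally untested sketch of what I mean -- the helper names here are
made up purely to illustrate the idea, they aren't real functions:

	/*
	 * On write begin, grab whatever folio is already in the cache
	 * at this offset -- quite possibly the one the failed write
	 * touched.
	 */
	folio = grab_cache_folio_at(mapping, pos);	/* hypothetical */

	/*
	 * If the cached iomapping has gone stale underneath us, don't
	 * truncate or punch anything; drop the folio lock and bounce
	 * out so the caller can refresh the mapping and retry the copy
	 * into (quite possibly) this same folio.
	 */
	if (!iomapping_still_valid(iter, folio)) {	/* hypothetical */
		folio_unlock(folio);
		folio_put(folio);
		return -ESTALE;
	}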
--D
> To do this, we have to walk the page cache over the range of the
> delalloc extent we want to remove. This is made complex by the fact
> we have to handle partially up-to-date folios correctly and this can
> happen even when the FSB size == PAGE_SIZE because we now support
> multi-page folios in the page cache.
>
> Because we are only interested in discovering the edges of data
> ranges in the page cache (i.e. hole-data boundaries) we can make use
> of mapping_seek_hole_data() to find those transitions in the page
> cache. As we hold the invalidate_lock, we know that the boundaries
> are not going to change while we walk the range. This interface is
> also byte-based and is sub-page block aware, so we can find the data
> ranges in the cache based on byte offsets rather than page, folio or
> fs block sized chunks. This greatly simplifies the logic of finding
> dirty cached ranges in the page cache.
>
> Once we've identified a range that contains cached data, we can then
> iterate the range folio by folio. This allows us to determine if the
> data is dirty and hence perform the correct delalloc extent punching
> operations. The seek interface we use to iterate data ranges will
> give us sub-folio start/end granularity, so we may end up looking up
> the same folio multiple times as the seek interface iterates across
> each discontiguous data region in the folio.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
> fs/xfs/xfs_iomap.c | 151 ++++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 141 insertions(+), 10 deletions(-)
>
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 7bb55dbc19d3..2d48fcc7bd6f 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1134,6 +1134,146 @@ xfs_buffered_write_delalloc_punch(
> end_fsb - start_fsb);
> }
>
> +/*
> + * Scan the data range passed to us for dirty page cache folios. If we find a
> + * dirty folio, punch out the preceding range and update the offset from which
> + * the next punch will start.
> + *
> + * We can punch out clean pages because they either contain data that has been
> + * written back - in which case the delalloc punch over that range is a no-op -
> + * or they were instantiated by read faults, in which case they contain zeroes
> + * and we can remove the delalloc backing range; any new writes to those pages
> + * will do the normal hole filling operation...
> + *
> + * This makes the logic simple: we only need to keep the delalloc extents
> + * over the dirty ranges of the page cache.
> + */
> +static int
> +xfs_buffered_write_delalloc_scan(
> + struct inode *inode,
> + loff_t *punch_start_byte,
> + loff_t start_byte,
> + loff_t end_byte)
> +{
> + loff_t offset = start_byte;
> +
> + while (offset < end_byte) {
> + struct folio *folio;
> +
> + /* grab locked folio */
> + folio = filemap_lock_folio(inode->i_mapping, offset >> PAGE_SHIFT);
> + if (!folio) {
> + offset = ALIGN_DOWN(offset, PAGE_SIZE) + PAGE_SIZE;
> + continue;
> + }
> +
> + /* if dirty, punch up to offset */
> + if (folio_test_dirty(folio)) {
> + if (offset > *punch_start_byte) {
> + int error;
> +
> + error = xfs_buffered_write_delalloc_punch(inode,
> + *punch_start_byte, offset);
> + if (error) {
> + folio_unlock(folio);
> + folio_put(folio);
> + return error;
> + }
> + }
> +
> + /*
> + * Make sure the next punch start is correctly bound to
> + * the end of this data range, not the end of the folio.
> + */
> + *punch_start_byte = min_t(loff_t, end_byte,
> + folio_next_index(folio) << PAGE_SHIFT);
> + }
> +
> + /* move offset to start of next folio in range */
> + offset = folio_next_index(folio) << PAGE_SHIFT;
> + folio_unlock(folio);
> + folio_put(folio);
> + }
> + return 0;
> +}
> +
> +/*
> + * Punch out all the delalloc blocks in the range given except for those that
> + * have dirty data still pending in the page cache - those are going to be
> + * written and so must still retain the delalloc backing for writeback.
> + *
> + * As we are scanning the page cache for data, we don't need to reimplement the
> + * wheel - mapping_seek_hole_data() does exactly what we need to identify the
> + * start and end of data ranges correctly even for sub-folio block sizes. This
> + * byte range based iteration is especially convenient because it means we don't
> + * have to care about variable size folios, nor where the start or end of the
> + * data range lies within a folio, whether they lie within the same folio, or
> + * even whether there are multiple discontiguous data ranges within the folio.
> + */
> +static int
> +xfs_buffered_write_delalloc_release(
> + struct inode *inode,
> + loff_t start_byte,
> + loff_t end_byte)
> +{
> + loff_t punch_start_byte = start_byte;
> + int error = 0;
> +
> + /*
> + * Lock the mapping to avoid races with page faults re-instantiating
> + * folios and dirtying them via ->page_mkwrite whilst we walk the
> + * cache and perform delalloc extent removal. Failing to do this can
> + * leave dirty pages with no space reservation in the cache.
> + */
> + filemap_invalidate_lock(inode->i_mapping);
> + while (start_byte < end_byte) {
> + loff_t data_end;
> +
> + start_byte = mapping_seek_hole_data(inode->i_mapping,
> + start_byte, end_byte, SEEK_DATA);
> + /*
> + * If there is no more data to scan, all that is left is to
> + * punch out the remaining range.
> + */
> + if (start_byte == -ENXIO || start_byte == end_byte)
> + break;
> + if (start_byte < 0) {
> + error = start_byte;
> + goto out_unlock;
> + }
> + ASSERT(start_byte >= punch_start_byte);
> + ASSERT(start_byte < end_byte);
> +
> + /*
> + * We find the end of this contiguous cached data range by
> + * seeking from start_byte to the beginning of the next hole.
> + */
> + data_end = mapping_seek_hole_data(inode->i_mapping, start_byte,
> + end_byte, SEEK_HOLE);
> + if (data_end < 0) {
> + error = data_end;
> + goto out_unlock;
> + }
> + ASSERT(data_end > start_byte);
> + ASSERT(data_end <= end_byte);
> +
> + error = xfs_buffered_write_delalloc_scan(inode,
> + &punch_start_byte, start_byte, data_end);
> + if (error)
> + goto out_unlock;
> +
> + /* The next data search starts at the end of this one. */
> + start_byte = data_end;
> + }
> +
> + if (punch_start_byte < end_byte)
> + error = xfs_buffered_write_delalloc_punch(inode,
> + punch_start_byte, end_byte);
> +out_unlock:
> + filemap_invalidate_unlock(inode->i_mapping);
> + return error;
> +}
> +
> static int
> xfs_buffered_write_iomap_end(
> struct inode *inode,
> @@ -1179,16 +1319,7 @@ xfs_buffered_write_iomap_end(
> if (start_byte >= end_byte)
> return 0;
>
> - /*
> - * Lock the mapping to avoid races with page faults re-instantiating
> - * folios and dirtying them via ->page_mkwrite between the page cache
> - * truncation and the delalloc extent removal. Failing to do this can
> - * leave dirty pages with no space reservation in the cache.
> - */
> - filemap_invalidate_lock(inode->i_mapping);
> - truncate_pagecache_range(inode, start_byte, end_byte - 1);
> - error = xfs_buffered_write_delalloc_punch(inode, start_byte, end_byte);
> - filemap_invalidate_unlock(inode->i_mapping);
> + error = xfs_buffered_write_delalloc_release(inode, start_byte, end_byte);
> if (error && !xfs_is_shutdown(mp)) {
> xfs_alert(mp, "%s: unable to clean up ino 0x%llx",
> __func__, XFS_I(inode)->i_ino);
> --
> 2.37.2
>